Polynomial Trendlines
By Tim on Tuesday, March 29 2011, 11:16 - Development - Permalink
I recently had to generate a polynomial trendline for some data I had. Microsoft Excel can do it (as can some other applications) but I needed to do it "by hand" (actually I did it in Java...)
Partly for my own benefit, and partly for the benefit of the
google Duck Duck Go, I'm
going to write down how I did it.
So, let's start at the beginning.
Why a trendline
I had plotted a line graph of the data I had. What I needed was a curve that approximated the line. There are 2 purposes for that
- To demonstrate what the overall pattern of the data is - is the line going up or down? Does it cycle or is it moving in 1 consistent direction?
- To make (loose) predictions about what will happen in the future. If the line is going down, when does it hit zero? If it's going up, at what rate is it moving?
What I needed
What we need then, is a formula f(x) that produces a y value. So that, for any given x (horizontal) value, we can generate a y value. Where the x value is in the range of the data we already have, we want the f(x) to produce a y value that is close to the data we have (i.e. we minimises the error).
A polynomial trendline
For a polynomial trendline, we want to find a formula in the form:
a{0} + a{1}x + a{2}x^2 + ... + a{n-1}x^n-1 + a{n}x^n = y (Where
a{n} is the nth coefficient and x^n is x
raised to the power n)
So, a polynomial of order 2 would be: a + (b * x) + (c * x * x) =
y
We need to find out the "best" values for a, b, and
c, so that we plug in a value for x and determine an
approximation for y.
Finding the coefficients
For a polynomial of order n, we have n+1 coefficients to
work out.
Because we have n+1 unknowns, we need to use n+1 different
points from our data in order to find a formula.
So, to find the formula for an order 2 polynomial, we need 3 points from our data to help us find a, b, and c. Let's do a simple example by hand, and then we can look at how to solve the general case.
Let's assume our three points are (0,0), (1,2),
(2,1). We use those values for x and y to
produce 3 simultaneous equations:
a + b * 0 + c * 0 = 0a + b * 1 + c * 1 = 2a + b * 2 + c * 4 = 1
Simplifying, we get
a = 0a + b + c = 2a + 2b + 4c = 1
Which means
a = 0b = 2 - cb = (1 - 4c)/2
Thus
2 - c = (1-4c)/24 - 2c = 1 - 4c2c = -3c = -3/2
And
b = 2 - -3/2b = 7/2
So
a=0b=7/2c=-3/2
And our final formula is (7x)/2 - (3x^2)/2 = y
So, we can predict that if x = 3, then y = (7*3)/2 -
(3*3*3)/2 = 21/2 - 27/2 = -3
The general case
The general case for an order n polynomial, with n+1
unknowns, and n+1 sample points, is a series of n+1
simultaneous equations with n+1 coefficients.
We can solve that using Cramer's Rule.
In Java, I used JAMA
to implement Cramer's rule, which is actually quite easy.
For a set of data points (x{0}, y{0}) , (x{1}, y{1})
, (x{2}, y{2}) , ... , (x{n}, y{n}), the coefficient
matrix is
| x{0}^0 x{0}^1 x{0}^2 ... x{0}^n |
| x{1}^0 x{1}^1 x{1}^2 ... x{1}^n |
| x{2}^0 x{2}^1 x{2}^2 ... x{2}^n |
| ...
| x{n}^0 x{n}^1 x{n}^2 ... x{n}^n |
and the column matrix is
| y{0} |
| y{1} |
| y{2} |
| ...
| y{n} |
Solving with Cramer's rule gives us
| a{0} |
| a{1} |
| a{2} |
| ...
| a{n} |
And for any given x, we can find y by calculating
a{0}x^0 + a{1}x^1 + a{2}x^2 + ... + a{n}x^n
Which data points to use
We can pick any set of data points, and that will give us an equation, but we want the best fit. So how do we calculate that?
In my case, I simply picked random points.
I repeated that 25000 times, and then chose the equation that had the lowest
error (using the sum of squares).
What order polynomial
In theory the higher the order the better the fit - that is, the
error will be smaller. (This is true, because if a lower order equation would
fit better, then we will simply find that a{n} is 0 when
we solve our simultaneous equations).
But we're not simply looking for a better fit - we're also looking for something that visually demonstrates the trend of the data and makes helpful predictions about future data. That means that a lower order equation (which actually has a higher error) is more useful to us.
In my case, after plotting curves up to order 5, I determined that the equation for order 2 better suited my needs - so I could probably have saved myself some trouble with Cramer's rule and implemented the simpler case.