Algebra
Textbooks
Boundless Algebra
Linear Functions
Applications of Linear Functions
Concept Version 7
Created by Boundless

Fitting a Curve

Curve fitting with a line attempts to draw a line so that it "best fits" all of the data.

Learning Objective

  • Model a set of data points as a line using the least squares approximation


Key Points

    • Curve fitting finds a curve that best fits the data. The fitted curve supports rough assumptions about how the data are distributed and predictions about future data points.
    • Linear regression attempts to graph a line that best fits the data.
    • Ordinary least squares approximation is a type of linear regression that minimizes the sum of the squares of the differences between the approximated values (from the line) and the actual values.
    • The slope of the line that approximates $n$ data points is given by $m=\frac{\sum_{i=1}^{n}x_{i}y_{i}-\frac{1}{n}\sum_{i=1}^{n}x_{i}\sum_{j=1}^{n}y_{j}}{\sum_{i=1}^{n}(x_{i}^{2})-\frac{1}{n}(\sum_{i=1}^{n}x_{i})^{2}}$.
    • The $y$-intercept of the line that approximates $n$ data points is given by: $b= \displaystyle{\frac{1}{n} \sum_{i=1}^{n} y_{i} - m \frac{1}{n} \sum_{i=1}^{n} x_{i} = \bar{y} - m \bar{x}}$

Terms

  • curve fitting

    The process of constructing a curve, or a mathematical function, that has the best fit to a series of data points, possibly subject to constraints.

  • outlier

    A value in a statistical sample that does not fit the pattern describing most of the other data points.

  • least squares approximation

    An attempt to minimize the sum of the squared distances between the predicted points and the actual points.

  • linear regression

    An approach to modeling the linear relationship between a dependent variable, $y$, and an independent variable, $x$.


Full Text

Curve Fitting

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a greater degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.

In this section, we fit only lines to data points. Polynomial functions, circles, piecewise functions, and many other kinds of functions can also be fit to data, and curve fitting is heavily used in statistics.

Linear Regression Formula

Linear regression is an approach to modeling the linear relationship between a dependent variable, $y$, and an independent variable, $x$. With linear regression, a line in slope-intercept form, $y=mx+b$, is found that "best fits" the data.

The simplest and perhaps most common linear regression model is the ordinary least squares approximation. This approximation chooses the line that minimizes the sum of the squared vertical distances between the line and the data points.
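In symbols, and assuming the usual convention that each distance is measured vertically (along the $y$-axis) from a data point to the line, the quantity being minimized can be written as follows. The name $S$ is introduced here only for illustration.

$\displaystyle S(m,b)=\sum_{i=1}^{n}\left(y_{i}-(mx_{i}+b)\right)^{2}$

Each term is the square of the gap between an actual value $y_{i}$ and the value $mx_{i}+b$ predicted by the line. The slope and intercept that make $S$ as small as possible are the ones given by the formulas that follow.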

$\displaystyle m=\frac{\sum_{i=1}^{n}x_{i}y_{i}-\frac{1}{n}\sum_{i=1}^{n}x_{i}\sum_{j=1}^{n}y_{j}}{\sum_{i=1}^{n}(x_{i}^{2})-\frac{1}{n}(\sum_{i=1}^{n}x_{i})^{2}}$

To find the slope of the line of best fit, calculate the following:

  1. The sum of the products of the $x$- and $y$-coordinates, $\sum_{i=1}^{n}x_{i}y_{i}$.
  2. The sum of the $x$-coordinates, $\sum_{i=1}^{n}x_{i}$.
  3. The sum of the $y$-coordinates, $\sum_{j=1}^{n}y_{j}$.
  4. The sum of the squares of the $x$-coordinates, $\sum_{i=1}^{n}(x_{i}^{2})$.
  5. The square of the sum of the $x$-coordinates, $(\sum_{i=1}^{n}x_{i})^{2}$.
  6. The quotient of the numerator and denominator of the slope formula.

$\displaystyle \begin{aligned} b&= \frac{1}{n} \sum_{i=1}^{n} y_{i} - m \frac{1}{n} \sum_{i=1}^{n} x_{i} \\ &= \bar{y} - m \bar{x} \end{aligned}$

To find the $y$-intercept ($b$), calculate using the following steps:

  1. The average of the $y$-coordinates. Let $\bar{y}$, pronounced $y$-bar, represent the mean (or average) $y$ value of all the data points: $\bar y =\frac{1}{n}\sum_{i=1}^{n} y_{i}$.
  2. The average of the $x$-coordinates. Similarly, $\bar{x}$, pronounced $x$-bar, is the mean (or average) $x$ value of all the data points: $\bar x=\frac{1}{n}\sum_{i=1}^{n} x_{i}$.
  3. Substitute these values, along with the slope $m$, into the formula above: $b=\bar{y} - m \bar{x}$.

Using these values of $m$ and $b$ we now have a line that approximates the points on the graph.  
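The two formulas above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the function name `least_squares_line` and the choice to pass the data as a list of `(x, y)` pairs are assumptions made for this example.

```python
def least_squares_line(points):
    """Return (m, b) for the least squares line y = m*x + b through `points`.

    `points` is a list of (x, y) pairs; the name and signature are
    illustrative choices, not part of the original text.
    """
    n = len(points)
    sum_xy = sum(x * y for x, y in points)   # sum of the products x_i * y_i
    sum_x = sum(x for x, _ in points)        # sum of the x-coordinates
    sum_y = sum(y for _, y in points)        # sum of the y-coordinates
    sum_x2 = sum(x * x for x, _ in points)   # sum of the squares of the x-coordinates

    denominator = sum_x2 - (sum_x ** 2) / n
    if denominator == 0:
        # Happens when every x-coordinate is the same (a vertical arrangement).
        raise ValueError("all x-coordinates are equal; the slope is undefined")

    m = (sum_xy - (sum_x * sum_y) / n) / denominator
    b = sum_y / n - m * (sum_x / n)          # b = y-bar - m * x-bar
    return m, b


# Points that lie exactly on y = 2x + 1 should give back m = 2 and b = 1.
m, b = least_squares_line([(0, 1), (1, 3), (2, 5)])
print(m, b)
```

Checking the function on points that already lie on a known line is a quick way to confirm that the sums have been combined correctly.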

Using the Least Squares Approximation

Example: Find the least squares line for the following data, then graph it along with the points.

The data consist of $n=8$ points: $(-1,0),(0,0),(1,1),(2,2),(3,1),(4,2.5),(5,3)$ and $(6,4)$.

Example Points

The points are graphed in a scatterplot fashion.

First, find the slope $(m)$ and $y$-intercept $(b)$ that best approximate this data, using the equations from the prior section:

To find the slope, calculate:

  1. The sum of the products of the $x$- and $y$-coordinates $\sum_{i=1}^{n}x_{i}y_{i}$.
  2. The sum of the $x$-coordinates $\sum_{i=1}^{n}x_{i}$.
  3. The sum of the $y$-coordinates $\sum_{i=1}^{n}y_{i}$.

$\displaystyle \begin{aligned} \sum_{i=1}^{n}x_{i}y_{i}&=0+0+1+4+3+10+15+24\\&=57 \end{aligned}$

$\displaystyle \begin{aligned} \sum_{i=1}^{n}x_{i}&=-1+0+1+2+3+4+5+6\\&=20 \end{aligned}$

$\displaystyle \begin{aligned} \sum_{i=1}^{n}y_{i}&=0+0+1+2+1+2.5+3+4\\&=13.5 \end{aligned}$

$\displaystyle m=\frac{\sum_{i=1}^{n}x_{i}y_{i}-\frac{1}{n}\sum_{i=1}^{n}x_{i}\sum_{j=1}^{n}y_{j}}{\sum_{i=1}^{n}(x_{i}^{2})-\frac{1}{n}(\sum_{i=1}^{n}x_{i})^{2}}$

4. Calculate the numerator: the sum of the products of the $x$- and $y$-coordinates minus one-eighth of the product of the sum of the $x$-coordinates and the sum of the $y$-coordinates:

$\displaystyle \sum_{i=1}^{n}x_{i}y_{i}-\frac{1}{n}\sum_{i=1}^{n}x_{i}\sum_{j=1}^{n}y_{j}$

The numerator in the slope equation is:

 $\displaystyle 57-\frac{1}{8}(20)(13.5)=23.25$

5. Calculate the denominator: the sum of the squares of the $x$-coordinates minus one-eighth of the square of the sum of the $x$-coordinates:

 $\displaystyle \sum_{i=1}^{n}(x_{i}^{2})-\frac{1}{n}(\sum_{i=1}^{n}x_{i})^{2}$

$\displaystyle \begin{aligned} \sum_{i=1}^{n}(x_{i}^{2})&=1+0+1+4+9+16+25+36\\&=92 \end{aligned}$

The denominator is $92-\frac{1}{8}(20)^{2}=92-50=42$ and the slope is the quotient of the numerator and denominator: $\frac{23.25}{42}\approx0.554.$

Now for the $y$-intercept ($b$): the average of the $x$-coordinates is one-eighth of their sum, $\bar{x}=\frac{20}{8}=2.5$, and the average of the $y$-coordinates is one-eighth of their sum, $\bar{y}=\frac{13.5}{8}=1.6875$.

Therefore, using $b=\frac{1}{n} \sum_{i=1}^{n} y_{i} - m \frac{1}{n} \sum_{i=1}^{n} x_{i}=\bar{y}-m\bar{x}$:

 $\displaystyle b\approx1.6875-0.554(2.5)=0.3025.$

Our final equation is therefore $y=0.554x+0.3025$, and this line is graphed along with the points.

Least Squares Fit Line

The line found by the least squares approximation, $y = 0.554x+0.3025$. Notice 4 points are above the line, and 4 points are below the line. 
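The arithmetic in this example can be checked with a short Python sketch. The variable names are chosen for this illustration only. One point worth noticing: the code carries the unrounded slope into the intercept, which gives an intercept near 0.3036, while the value 0.3025 in the text comes from using the slope already rounded to 0.554.

```python
# The eight data points from the example.
points = [(-1, 0), (0, 0), (1, 1), (2, 2), (3, 1), (4, 2.5), (5, 3), (6, 4)]
n = len(points)

sum_xy = sum(x * y for x, y in points)   # 57
sum_x = sum(x for x, _ in points)        # 20
sum_y = sum(y for _, y in points)        # 13.5
sum_x2 = sum(x * x for x, _ in points)   # 92

numerator = sum_xy - sum_x * sum_y / n   # 57 - (1/8)(20)(13.5) = 23.25
denominator = sum_x2 - sum_x ** 2 / n    # 92 - (1/8)(20)^2 = 42

m = numerator / denominator              # slope, unrounded
b = sum_y / n - m * sum_x / n            # intercept: y-bar - m * x-bar

# Count how many points fall above and below the fitted line.
above = sum(1 for x, y in points if y > m * x + b)
below = sum(1 for x, y in points if y < m * x + b)

print(f"m = {m:.4f}, b = {b:.4f}")
print(f"points above: {above}, points below: {below}")
```

The count of points above and below the line matches the observation in the figure caption, although an even split is not something a least squares fit guarantees in general.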

Outliers and Least Squares Regression

A point that lies far from the approximating line can skew the results and make the fit much worse. For instance, suppose that in our original example, instead of the point $(-1,0)$, we have $(-1,6)$.

Using the same calculations as above with the new point, the results are $m\approx0.0536$ and $b\approx2.3035$, which give the new equation $y=0.0536x+2.3035$.
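The effect of the single changed point can be reproduced with a small Python sketch. The helper name `fit_line` is hypothetical and simply applies the slope and intercept formulas from earlier in this section. Because the code does not round the slope before computing the intercept, its last digit can differ slightly from the hand calculation.

```python
def fit_line(points):
    """Return (m, b) of the least squares line for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    m = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b = sum_y / n - m * sum_x / n
    return m, b


original = [(-1, 0), (0, 0), (1, 1), (2, 2), (3, 1), (4, 2.5), (5, 3), (6, 4)]
# Replace the first point (-1, 0) with the outlier (-1, 6); keep the rest.
with_outlier = [(-1, 6)] + original[1:]

m0, b0 = fit_line(original)
m1, b1 = fit_line(with_outlier)

print(f"original:     m = {m0:.4f}, b = {b0:.4f}")
print(f"with outlier: m = {m1:.4f}, b = {b1:.4f}")
```

Changing one of the eight points is enough to flatten the slope from roughly 0.55 to roughly 0.05, which shows how sensitive the squared-distance criterion is to a single distant point.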

Looking at the points and line in the new figure below, this new line does not fit the data well, due to the outlier $(-1,6)$. Likewise, fitting a linear model to data that follow a quadratic, cubic, or other non-linear pattern, or to data with many outliers or errors, can produce a poor approximation.

Outlier Approximated Line

Here is the approximated line given the new outlier point at (-1, 6).

Except where noted, content and user contributions on this site are licensed under CC BY-SA 4.0 with attribution required.