In this chapter we investigated techniques to describe the association between two quantitative variables.
Typically we begin by graphing the bivariate data, which is the data that results from measuring two quantitative variables on each individual. Our goal is to see whether an association exists between the two variables and, if it does, to describe that association.
A scatterplot is used to graphically display quantitative bivariate data. Construct the scatterplot by placing one variable on the horizontal axis and the other on the vertical axis. If there is a clear choice of an explanatory and a response variable, then plot the explanatory variable on the horizontal axis and the response variable on the vertical axis. Each individual has a measurement value for x and y. The individual’s information is plotted in the x-y plane as an ordered pair (x,y).
In practice, statistical software such as CrunchIt! is used to make the graph, so we instead focus on describing the relationship in terms of direction, form, and strength. Direction is positive (negative) if smaller values of x tend to correspond with smaller (larger) values of y and larger values of x tend to correspond with larger (smaller) values of y. Form describes the general shape of the graph, and strength refers to how much scatter there is about the form. A scatterplot may also be helpful in identifying outliers, which are points whose x or y (or both) values fall well outside the pattern of the rest of the data.
In addition to using graphical procedures to describe the linear relationship between two quantitative variables, we use a numerical calculation called the correlation coefficient. This is a measure of the strength and direction of the linear relationship, and it is a number between −1 and 1, inclusive. If r is positive (negative), then there is a positive (negative) association between x and y. The closer r is to 1 or −1, the stronger the linear relationship between x and y. When r is zero, there is no linear association between x and y.
The formula to compute r is \(r = \frac{1}{n-1}\sum{\left(\frac{x-\overline{x}}{s_x} \right)\left(\frac{y-\overline{y}}{s_y} \right)}\). Typically, however, we use statistical software to perform this tedious computation.
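As a quick sketch, the formula can be computed directly in Python with NumPy and cross-checked against a built-in routine. The small data set here is made up purely for illustration:

```python
import numpy as np

# Hypothetical data for illustration: x and y measured on five individuals
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 61.0, 70.0, 77.0])

n = len(x)
# Sample standard deviations (ddof=1 gives the n - 1 denominator)
sx, sy = x.std(ddof=1), y.std(ddof=1)

# r is the average (over n - 1) of the products of the z-scores
r = np.sum(((x - x.mean()) / sx) * ((y - y.mean()) / sy)) / (n - 1)

# Cross-check against NumPy's built-in correlation coefficient
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(round(r, 4))  # → 0.9811, a strong positive linear association
```

Because each factor is a z-score, r does not depend on the units of x or y.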
When graphical and numerical methods suggest that there is a sufficiently strong linear relationship between the two quantitative variables, our next step is to find a regression line to model the relationship. A regression line predicts values of the response variable, y, for given values of the explanatory variable, x. The predicted value of y is denoted as \( \hat{y}\) to distinguish it from the observed value of y.
A regression line is of the form \( \hat{y} = a + bx\), where a is the y-intercept and b is the slope. Although statistical software typically reports this least squares regression line, a and b can be found using the formulas \(b = r\left(\frac{s_{y}}{s_{x}}\right) \) and \(a = \overline{y} - b\,\overline{x} \).
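These two formulas can be sketched in a few lines and cross-checked against a standard least squares routine; the data below is the same made-up example used for the correlation calculation:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 61.0, 70.0, 77.0])

r = np.corrcoef(x, y)[0, 1]
b = r * (y.std(ddof=1) / x.std(ddof=1))  # slope: b = r * (s_y / s_x)
a = y.mean() - b * x.mean()              # intercept: a = y-bar - b * x-bar

# Cross-check against NumPy's degree-1 least squares fit
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(b, slope) and np.isclose(a, intercept)
print(round(a, 2), round(b, 2))  # → 46.0 6.0
```

Note that the formula for a guarantees the regression line passes through the point \((\overline{x}, \overline{y})\).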
The stronger the relationship between x and y, the better the fit of the regression line. In practice, a data set is almost never perfectly linear, so there will be prediction error when modeling data. For a given data point, the residual is the difference between the observed value y and the predicted value \( \hat{y}\), that is, \(y - \hat{y}\). A residual plot displays the x-coordinate and corresponding residual for each value of x in the data set.
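Residuals are straightforward to compute once the line is fit. A minimal sketch, continuing the same illustrative data:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 61.0, 70.0, 77.0])

b, a = np.polyfit(x, y, 1)  # polyfit returns (slope, intercept)
y_hat = a + b * x           # predicted values from the regression line
residuals = y - y_hat       # observed minus predicted

# For a least squares fit, the residuals always sum to (essentially) zero
assert np.isclose(residuals.sum(), 0.0)
print(residuals)  # → [ 0.  2. -3.  0.  1.]
```

Plotting these residuals against x (for example with matplotlib) would show whether any systematic pattern remains, which would suggest the linear form is inadequate.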
The square of the correlation coefficient, \(r^2\), has a meaningful interpretation; it represents the fraction of variability observed in the y-values that can be explained by the least squares regression line.
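This interpretation can be verified numerically: the fraction of the total variability in y not left in the residuals equals the square of r. A sketch with the same illustrative data:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 61.0, 70.0, 77.0])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)      # variability left in the residuals
ss_tot = np.sum((y - y.mean()) ** 2)   # total variability in y
r_squared = 1 - ss_res / ss_tot        # fraction of variability explained

# This matches squaring the correlation coefficient directly
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
print(round(r_squared, 3))  # → 0.963
```

Here about 96% of the variability in the y-values is explained by the line, consistent with the strong correlation computed earlier.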
Finally, when modeling data, be mindful to use the least squares regression line to make predictions only for values of the explanatory variable that are within the range of the given data. Using the model to make predictions outside that range is called extrapolation and can lead to large errors.