Day 6: Prediction
With regression, you can make predictions. Accurate predictions require a strong correlation. When there is a strong correlation between two variables (positive or negative), you can make accurate predictions from one to the other. If sales and time are highly correlated, you can predict what sales will be in the future…or in the past. You can enhance the sharpness of an image by predicting what greater detail would look like (filling in the spaces between the dots with predicted values). Of course, the accuracy of your predictions depends on the strength of the correlation. Weak correlations produce lousy predictions.
An extension of the correlation, a regression allows you to compare how your data looks to a specific model: a straight line. Instead of using a normal curve (bell-shaped hump) as a standard, regression draws a straight line through the data. The more linear your data, the better it will fit the regression model. Once a line of regression is drawn, it can be used to make specific predictions. You can predict how many shoes people will buy based on how many hats they buy, assuming there is a strong correlation between the two variables.
Just as a correlation can be seen in a scatterplot, a regression can be represented graphically too. A regression looks like a single straight line drawn through as many points on the scatterplot as possible. If your data points all fell on a straight line (extremely unlikely), the relationship between the two variables would be perfectly linear.
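Drawing that line is exactly what a least-squares fit does. Here is a minimal sketch using NumPy; the practice-hours and test-score numbers are invented for illustration:

```python
# A sketch of fitting a regression line to scatterplot data.
# The data values below are made up for illustration.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])          # e.g., hours of practice
y = np.array([52, 55, 61, 60, 68, 70, 74, 79])  # e.g., test scores

# np.polyfit with degree 1 returns the slope and intercept of the
# straight line drawn "through as many points as possible"
slope, intercept = np.polyfit(x, y, 1)

# Predict Y for a new X by going to the line, not the raw data
predicted = slope * 9 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 1))
```

The closer the points hug the line, the better the linear model fits and the better these predictions will be.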
Most likely, there will be a cluster or cloud of data points. If the scatterplot is all cloud and no trend, a regression line won’t help…you wouldn’t know where to draw it: all lines would be equally bad.
But if the scatterplot reveals a general trend, some lines will obviously be better than others. In essence, you try to draw a line that follows the trend but divides or balances the data points equally.
In a positive linear trend, the regression line will start in the bottom left part of the scatterplot and go toward the top right part of the figure. It won’t hit all of the data points but it will hit most or come close to them.
You can use either variable as a predictor. The choice is yours. But the results most likely won't be the same unless the correlation between the two variables is perfect (either +1 or -1). So it matters which variable is selected as the predictor and which is characterized as the criterion (outcome variable).
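You can see this asymmetry directly: the line for predicting Y from X and the line for predicting X from Y have different slopes unless r is exactly +1 or -1. A small sketch with made-up numbers:

```python
# Sketch: predicting Y from X gives a different line than predicting
# X from Y, unless the correlation is perfect. Data are made up.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.0, 5.0, 4.0, 8.0, 9.0])

r = np.corrcoef(x, y)[0, 1]

# Slope for predicting Y from X: r * (sd_y / sd_x)
slope_y_on_x = r * y.std() / x.std()

# Slope of the X-from-Y line, re-expressed on the same (X, Y) axes:
# the reciprocal of r * (sd_x / sd_y)
slope_x_on_y = 1 / (r * x.std() / y.std())

print(round(slope_y_on_x, 3), round(slope_x_on_y, 3))
# The two slopes differ; they would match only if r were +1 or -1.
```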
Predicting also assumes that the relationship between the two variables is strong. A weak correlation will produce a poor line of prediction. Only strong (positive or negative) correlations will produce accurate predictions.
A regression allows you to see if the data looks like a straight line. Obviously, if your data is cyclical, a straight line won't represent it very well. But if there is a positive or negative trend, a straight line is a good model. It is not so much that we apply the model to the data; it's more like we collect the data and ask if it looks like this model (linear), that model (circular or cyclic), or that model (chance).
If the data approximates a straight line, you can then use that information to predict what will happen in the future. Predicting the future assumes, of course, that conditions remain the same. The stock market is hard to predict because it keeps changing: up and down, slowly up, quickly down. It's too erratic to predict its future, particularly in the short run.
If you roll a bowling ball down a lane and measure the angle it is traveling, you can predict where the ball will hit when it reaches the pins. The size, temperature and shape of the bowling lane are assumed to remain constant for the entire trip, so a linear model would work well with this data. If you use the same ball on a grass lane which has dips and bulges, the conditions are not constant enough to accurately predict its path.
A regression is composed of three primary characteristics. Any two of these three can be used to draw a regression line: pivot point, slope and intercept.
First, the regression line always goes through the pivot point: the place where the mean of X and the mean of Y meet. This is reasonable, since the best prediction of a variable (knowing nothing else about it) is its mean. Since the mean is a good measure of central tendency (where everyone is hanging out), it is a good anchor for the line.
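This pivot-point property is easy to check numerically. A short sketch, with invented values:

```python
# Sketch: the fitted line always passes through (mean of X, mean of Y).
# The data values are invented for illustration.
import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])
y = np.array([2.0, 3.0, 6.0, 7.0, 9.0])

slope, intercept = np.polyfit(x, y, 1)

# Plug the mean of X into the line; the prediction is the mean of Y
at_mean_x = slope * x.mean() + intercept
print(np.isclose(at_mean_x, y.mean()))  # True
```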
Second, a regression line has slope. For every change in X, slope indicates the change in Y. If the correlation between X and Y is perfect and both variables are expressed in standard scores, the slope will be 1; every time X gets larger by 1, Y will get larger by 1. Slope indicates the rate of change in Y, given a change of 1 in X.
Third, a regression line has a Y intercept: the place where the regression line crosses the Y axis. Think of it as the intersection between the sloping regression line and the vertical axis.
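Slope and intercept can both be built from the pieces already described: the slope is the correlation times the ratio of the standard deviations, and the intercept follows from forcing the line through the pivot point. A sketch with made-up numbers:

```python
# Sketch: slope = r * (sd_y / sd_x); intercept = mean_y - slope * mean_x
# (the line must pass through the pivot point). Data are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()            # change in Y per 1-unit change in X
intercept = y.mean() - slope * x.mean()  # where the line crosses the Y axis

# These formulas reproduce the least-squares fit exactly
fit_slope, fit_intercept = np.polyfit(x, y, 1)
print(np.allclose([slope, intercept], [fit_slope, fit_intercept]))  # True
```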
Regression means to go back to something. We can regress to our childhood; regress out of a building (leave the way we came in). Or regress back to the line of prediction. Instead of looking at the underlying data points, we use the line we’ve created to make predictions. Instead of relying on real data, we regress to our prediction line.
There are two major determinants of a prediction’s accuracy: (a) the amount of variance the predictor shares with the criterion and (b) the amount of dispersion in the criterion.
Taking them in order: if the correlation between the two variables is not strong, it is very difficult to predict from one to the other. In a strong positive correlation, you know that when X is low, Y is low. Knowing where one variable is makes it easy to estimate the general location of the other variable.
A good measure of predictability, therefore, is the coefficient of determination (calculated by squaring r). R-squared (r²) indicates how much the two variables have in common. If r² is close to 1, there is a lot of overlap between the variables and it becomes quite easy to predict one from the other.
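Computing r² takes one extra line beyond the correlation itself. Here is a sketch using the earlier hats-and-shoes example; the purchase counts are hypothetical:

```python
# Sketch: squaring r gives the coefficient of determination, the share
# of variance the predictor has in common with the criterion.
# The purchase counts below are hypothetical.
import numpy as np

hats = np.array([1, 2, 2, 3, 4, 5, 6, 7])
shoes = np.array([2, 3, 4, 4, 6, 7, 7, 9])

r = np.corrcoef(hats, shoes)[0, 1]
r_squared = r ** 2

# An r close to 1 makes r-squared close to 1: lots of overlap,
# so predicting shoes from hats is easy
print(round(r, 2), round(r_squared, 2))
```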
Even when the correlation is perfect, however, predictions are limited by the amount of dispersion in the criterion. Think of it this way: if everyone has the same score (or nearly so), it is easy to predict that score, particularly if the variable is correlated with another variable. But if everyone has a different score (lots of dispersion from the mean), guessing the correct value is difficult.
The standard error of estimate (see) takes both of these factors into consideration and produces a standard deviation of error around the prediction line. A prediction is presented as plus or minus its see.
The true score of a prediction will be within 1 standard error of estimate of the regression line 68% of the time. If the predicted score is 15 (just to pick a number), we’re 68% sure that the real score is 15 plus or minus 3 (or whatever the see is).
Similarly, we’re 95% sure that the real score falls within two standard errors of estimate of the regression line (15 plus or minus 6). And we’re 99.7% sure that the real score falls within 3 see of the prediction (15 plus or minus 9).
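The see is just the standard deviation of the errors around the regression line. A minimal sketch with invented data (note: this version averages the squared errors over n; some texts divide by n − 2 instead):

```python
# Sketch: the standard error of estimate (see) is the standard
# deviation of the errors around the regression line.
# Data values are invented; squared errors are averaged over n here,
# though some texts divide by n - 2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.0, 4.0, 7.0, 8.0, 8.0])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
errors = y - predicted

# Square root of the mean squared error around the line
see = np.sqrt(np.mean(errors ** 2))

# A prediction is reported plus or minus its see: about 68% of true
# scores fall within 1 see of the line, 95% within 2, 99.7% within 3
guess = slope * 7 + intercept
print(round(guess, 1), "plus or minus", round(see, 2))
```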