When we see one little dot moving along a screen and then another starts to move at just the right time, we are built to see that as a causal relationship. We see the first dot pushing the second dot. We're built that way. We want causes.
But sometimes, where we think we see causation, in reality it turns out to be a correlation.
Correlation does not imply causation. Let me say it a different way. Correlation doesn't even imply causation. If you have found a correlational relationship between two variables, you know nothing, nothing, nothing about the causal relationship between those two variables.
Let's look at an example. Are tall people more successful? If their success happens independently of their being tall, that is, they're both tall and successful just by chance, then height and success in this limited example are merely correlated. But if there was evidence that tall people were successful because of their height, this would be a causal relationship.
As we know, correlation does not imply causation. Rather, a causal relationship is a special kind of correlational relationship. Too often, we see correlation and assume causation.
I have a folder that is full of hundreds of articles I've clipped over the years in which the headline claims that there's a causal relationship between A and B, and the study is correlational. The reasoning is that just because there's this correlational connection between these two phenomena, A must cause B.
In fact, usually they start out with a preconception, with a belief that A causes B. Then they do a study collecting those numbers and they see that nice positive association between the two sets of numbers and they go, you see? A causes B. Drinking diet soda causes heart problems. This is a very, very, very serious mistake.
Let's start with how we determine correlation. To observe correlational relationships between two things, all you have to do is to make multiple measurements of two variables over time. We can learn about relationships between the two variables by observing patterns of variation in a series of measurements.
Let's take a hypothetical example. Foot size and speed. If every time a foot increases in size by an inch, top running speed increases by one mile per hour, we call the relationship between the two variables a perfect positive correlation.
A perfect correlation occurs when every time one variable changes by a fixed amount, the other variable also changes by a fixed amount.
We track correlation with the correlation coefficient, represented by a lower case r. This is a measure of the direction and the strength of a correlation. In this case, a perfect positive correlation, the correlation coefficient would equal 1. If every time foot size increases by one inch, top speed decreases by one mile per hour, then we are seeing a perfect negative correlation and the correlation coefficient equals negative 1.
A perfect negative correlation occurs when an increase of one unit in one variable is always associated with the same fixed decrease in another variable. For example, if it were the case that for every hour of partying you did on Friday night, your test score decreased by five points, these two things would be perfectly negatively correlated.
But what about if things aren't so perfect, like in normal life? If, as foot size increases inch by inch, top speed doesn't increase or decrease systematically, we are looking at two uncorrelated variables, and the value of r, the correlation coefficient, is 0. We can see how slight changes in the value of r reflect shifts in the strength of the correlation.
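To make those three cases concrete, here is a minimal sketch that computes r for the foot size and speed example. It assumes r is the Pearson correlation coefficient, which is the usual meaning but is not named in the lecture. The function name `pearson_r` and all of the measurements are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations: positive when the two variables move together.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations scale the result into the range -1 to 1.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)


# Made-up measurements: foot size in inches, top speed in miles per hour.
foot_size = [9, 10, 11, 12, 13]
speed_rising = [10, 11, 12, 13, 14]     # +1 mph for every extra inch
speed_falling = [14, 13, 12, 11, 10]    # -1 mph for every extra inch
speed_unrelated = [12, 10, 14, 10, 12]  # no systematic change with foot size

r_positive = pearson_r(foot_size, speed_rising)
r_negative = pearson_r(foot_size, speed_falling)
r_zero = pearson_r(foot_size, speed_unrelated)

print(r_positive, r_negative, r_zero)
```

The three results should come out at essentially 1, negative 1, and 0. Real data almost never lands exactly on those values, which is why r is read as a strength somewhere between them.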
When two variables, A and B, are correlated, it's tempting to conclude that A is causing B. But there are two other possibilities. One possibility is that B is causing A. That's called the bi-directionality problem. And another possibility is that C is causing A and B. That's called the third variable problem. Correlation does not even imply causation.
Take this example. A study finds a negative correlation between time spent watching TV and performance on memory tasks. The more time someone spent indoors watching TV, the worse their memory seemed to become. These variables appear to be negatively correlated. But why?
One possibility is that watching TV itself causes decreases in memory performance. Another is that decreases in memory performance somehow cause more TV to be watched. But are we forgetting to get outside and get exercise?
It turns out that exercise could be the missing part of the equation, known as the third variable. Exercise may lead us to get outside, and so to watch less TV, and it could also lead us to improve our memory. A and B may not be causally related at all. They may both be causally dependent on C.
We must always be aware that the causal relationship between two variables cannot be inferred from the naturally occurring correlation between them. There could always be a third variable lurking, causing the changes in both.
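The third variable problem can be sketched with a small simulation. Everything here is assumed for illustration: the numbers, the variable names, and the idea that exercise is the only thing driving both TV time and memory. In this made-up world TV has no direct effect on memory at all, yet the two still come out negatively correlated.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

exercise, tv, memory = [], [], []
for _ in range(5000):
    c = rng.uniform(0, 10)              # C: weekly hours of exercise
    a = 25 - 1.5 * c + rng.gauss(0, 2)  # A: TV hours depend only on C plus noise
    b = 50 + 3.0 * c + rng.gauss(0, 5)  # B: memory score depends only on C plus noise
    exercise.append(c)
    tv.append(a)
    memory.append(b)

# Across everyone, TV and memory look strongly negatively correlated.
r_overall = pearson_r(tv, memory)

# Hold C roughly constant: keep only people who exercise 4 to 5 hours a week.
band = [(a, b) for c, a, b in zip(exercise, tv, memory) if 4 <= c < 5]
r_within_band = pearson_r([a for a, _ in band], [b for _, b in band])

print(r_overall, r_within_band)
```

Once exercise is held roughly constant, the correlation between TV and memory should shrink toward zero, which is what you would expect if C was doing all the work.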
So how do you get from a correlational relationship to causation? You get from one to the other with a kind of research that's called the experiment.
So now, instead of watching TV, we want to test the effects of the other variable, exercise, and see whether a causal relationship exists between exercise and improved memory. To do this, we need to set up an experiment that equalizes every variable except exercise. To control for variations in test subjects, we would divide them into an experimental group and a control group in a process called random assignment.
But of course, it's important that they not know which of those two groups they're in, lest their expectations influence their behavior.
Just think to yourself how useless an experimental finding would be if we allowed people to sign up for any condition they wanted. All the lazy people would go into the low exercise condition, all the athletic people into the high exercise condition.
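Here is a rough sketch of why random assignment matters, compared with letting people sign up for whichever condition they want. The fitness scores are simulated, and the assumption that the fitter half would choose the high exercise condition is a deliberate exaggeration of the scenario just described.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical pre-existing fitness scores for 1,000 volunteers.
fitness = [rng.gauss(50, 10) for _ in range(1000)]

# Random assignment: shuffle the volunteers, then split down the middle.
shuffled = fitness[:]
rng.shuffle(shuffled)
random_high, random_low = shuffled[:500], shuffled[500:]

# Self-selection (assumed extreme case): the fitter half picks high exercise.
ranked = sorted(fitness)
chosen_low, chosen_high = ranked[:500], ranked[500:]

# How different are the two groups before the experiment even starts?
gap_random = mean(random_high) - mean(random_low)
gap_self_selected = mean(chosen_high) - mean(chosen_low)

print(gap_random, gap_self_selected)
```

With random assignment the two groups should start out nearly identical in fitness. With self-selection they differ by a wide margin before anyone has exercised at all, so any later difference in memory could not be pinned on the exercise.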
If we suspect that some third variable may be playing an important role, we can use a matched samples technique. And by that we simply mean that every person in the experimental group is equated with another individual in the control group on a particular dimension.
So for example, if we're studying the relationship between exercise and memory, we might want to make sure that for every 18-year-old in the control group, we have one 18-year-old in the experimental group. For every person who's overweight in the control group, we have somebody who's equally overweight in the experimental group. Each person in one group is matched with one person in the other. Therefore, on average, the two groups will be equal in weight, height, age, and any other variable we have matched on.
Although the matched samples technique is useful, it does not control for the third variable entirely. It controls for a particular third variable, not all third variables.
Using our example, even if we matched weight, height, and age, we may miss other variables such as a pre-existing medical condition. For this reason, random assignment is preferable, because it gives the two groups a better chance of being the same on average for all variables.
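The limits of matching can be sketched with a toy example. The participants, their ages, the `has_condition` flag, and the helper `match_on_age` are all hypothetical, and the closest-age rule used here is just one simple way to pair people up, not the only one.

```python
# Made-up participants; names, ages, and conditions are purely illustrative.
experimental = [
    {"name": "E1", "age": 18, "has_condition": True},
    {"name": "E2", "age": 25, "has_condition": True},
    {"name": "E3", "age": 40, "has_condition": False},
]
control_pool = [
    {"name": "C1", "age": 18, "has_condition": False},
    {"name": "C2", "age": 26, "has_condition": False},
    {"name": "C3", "age": 41, "has_condition": False},
    {"name": "C4", "age": 60, "has_condition": True},
]


def match_on_age(group, pool):
    """For each person in group, pick the closest-age person still left in pool."""
    available = list(pool)
    matches = []
    for person in group:
        closest = min(available, key=lambda c: abs(c["age"] - person["age"]))
        available.remove(closest)  # each control can be used only once
        matches.append(closest)
    return matches


control = match_on_age(experimental, control_pool)

mean_age_experimental = sum(p["age"] for p in experimental) / len(experimental)
mean_age_control = sum(p["age"] for p in control) / len(control)
conditions_experimental = sum(p["has_condition"] for p in experimental)
conditions_control = sum(p["has_condition"] for p in control)

print(mean_age_experimental, mean_age_control)
print(conditions_experimental, conditions_control)
```

The two groups end up almost the same age, because that is what we matched on. But two of the three people in the experimental group have a medical condition and nobody in the matched control group does, because nothing in the matching looked at that variable.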
Back to our example. We're measuring exercise against memory performance, manipulating the amount of exercise in our experimental group while keeping it constant in our control group. When we compare these measurements, we are computing the difference in average memory test scores between the two groups.
Because we have controlled the variables other than exercise, we never have to ask whether a third variable is causing something. We already know what's causing the change: our manipulation of the amount of exercise. Once this stronger correlation appears, we could gather more evidence in support of a causal relationship through repeated trials and more manipulation of variables.
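Putting the pieces together, here is a minimal sketch of the whole experiment as a simulation. The baseline scores are simulated, and the size of the exercise effect (`ASSUMED_EFFECT`) is a number chosen for illustration, not a finding from any study.

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the run is repeatable

ASSUMED_EFFECT = 5.0  # made-up boost in memory score from the exercise program

# Each baseline score stands in for everything about a person we did not measure.
baseline = [rng.gauss(50, 10) for _ in range(2000)]

# Random assignment: shuffle, then split into two equal groups.
rng.shuffle(baseline)
experimental_baseline, control_baseline = baseline[:1000], baseline[1000:]

# Manipulate exercise in the experimental group only; the control group is unchanged.
experimental_scores = [score + ASSUMED_EFFECT for score in experimental_baseline]
control_scores = control_baseline

# The comparison we care about: the difference in average memory test scores.
difference = mean(experimental_scores) - mean(control_scores)

print(difference)
```

The difference between the group averages should land close to the effect we built in. Since random assignment balanced everything else on average, the manipulation is the only systematic difference left between the two groups.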
While manipulating test subjects and variables allows us to study causal relationships, there are ethical considerations we must take into account when designing experiments. Like all scientists, psychologists are bound to a strict code of ethics that governs research. There are many variables that would be either unethical or impossible to manipulate. In those instances, we can still make valuable observations about behavior by measuring correlation between variables, understanding that this does not imply causation.