Measurements, valid and invalid

No one would object to using a tape measure reading in centimeters to measure the length of a bed. Many people object to using SAT scores to measure readiness for college. Let’s shortcut that debate: just measure the height in inches of all applicants and accept the tallest. Bad idea, you say. Why? Because height has nothing to do with being prepared for college. In more formal language, height is not a valid measure of a student’s academic background.

Figure 8.1 The unemployment rate from August 1991 to July 1994. The gap shows the effect of a change in how the government measures unemployment.

Valid measurement

A variable is a valid measure of a property if it is relevant or appropriate as a representation of that property.

It is valid to measure length with a tape measure. It isn’t valid to measure a student’s readiness for college by recording her height. The BLS unemployment rate is a valid measure, even though changes in the official definitions would give a somewhat different measure. Let’s think about measures, both valid and invalid, in some other settings.

EXAMPLE 4 Measuring highway safety

Roads got better. Speed limits increased. Big SUVs and crossovers have replaced some cars, while smaller cars and hybrid vehicles have replaced others. Enforcement campaigns reduced drunk driving. How did highway safety change between 2007 and 2012 in this changing environment?

We could just count deaths from motor vehicles. The Fatality Analysis Reporting System says there were 41,259 deaths in 2007 and 33,561 deaths five years later in 2012. The number of deaths decreased. These numbers alone show progress. However, we need to keep in mind other things that happened during this same time frame to determine how much progress has been made. For example, the number of licensed drivers rose from 206 million in 2007 to 212 million in 2012. The number of miles that people drove decreased from 3031 billion to 2969 billion during this same time period. If more people drive fewer miles, should we expect more or fewer deaths? The count of deaths alone is not a valid measure of highway safety. So what should we use instead?

Rather than a count, we should use a rate. The number of deaths per mile driven takes into account how much driving people actually do, which changes from year to year. In 2012, vehicles drove 2,969,000,000,000 miles in the United States. Because this number is so large, it is usual to measure safety by deaths per 100 million miles driven rather than deaths per mile. For 2012, this death rate is

$$\frac{\text{motor vehicle deaths}}{\text{100s of millions of miles driven}} = \frac{33{,}561}{29{,}690} \approx 1.1$$

The death rate fell from 1.4 deaths per 100 million miles in 2007 to 1.1 in 2012. That’s a decrease: using these rounded rates, there were about 21% fewer deaths per mile driven in 2012 than in 2007. Driving became safer during this time period even though there were more drivers on the roads.
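The arithmetic behind these rates is easy to check. The short Python sketch below uses only the death counts and mileage figures quoted in this example; the variable and function names are our own choices for illustration. It also shows that the percent decrease depends on whether we round the rates before comparing them.

```python
# Figures quoted in Example 4 (Fatality Analysis Reporting System counts
# and total miles driven). Names below are illustrative choices.
deaths = {2007: 41_259, 2012: 33_561}
miles_driven = {2007: 3_031_000_000_000, 2012: 2_969_000_000_000}

def deaths_per_100_million_miles(year):
    """Death rate: deaths divided by hundreds of millions of miles driven."""
    hundreds_of_millions = miles_driven[year] / 100_000_000
    return deaths[year] / hundreds_of_millions

rate_2007 = deaths_per_100_million_miles(2007)
rate_2012 = deaths_per_100_million_miles(2012)

# Percent decrease computed two ways: from the rates rounded to one
# decimal place (as reported in the text) and from the unrounded rates.
rounded_2007, rounded_2012 = round(rate_2007, 1), round(rate_2012, 1)
pct_drop_rounded = 100 * (rounded_2007 - rounded_2012) / rounded_2007
pct_drop_unrounded = 100 * (rate_2007 - rate_2012) / rate_2007

print(f"2007 rate: {rate_2007:.2f}   2012 rate: {rate_2012:.2f}")
print(f"Decrease from rounded rates:   {pct_drop_rounded:.0f}%")
print(f"Decrease from unrounded rates: {pct_drop_unrounded:.0f}%")
```

The rounded rates give the decrease of about 21% reported above. The unrounded rates give a somewhat smaller decrease, a reminder that rounding before computing a percent change can shift the answer.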

Rates and counts

Often, a rate (a fraction, proportion, or percentage) at which something occurs is a more valid measure than a simple count of occurrences.

NOW IT’S YOUR TURN

8.1 Driver fatigue. A researcher studied the number of traffic accidents that were attributed to driver fatigue at different times of the day. He noticed that the number of accidents was higher in the late afternoon (between 5 and 6 P.M.) than in the early afternoon (between 1 and 2 P.M.). He concluded that driver fatigue plays a more prominent role in traffic accidents in the late afternoon than in the early afternoon. Do you think this conclusion is justified?

Using height to measure readiness for college and using counts when rates are needed are examples of clearly invalid measures. The tougher questions concern measures that are neither clearly invalid nor obviously valid.

EXAMPLE 5 Achievement tests

When you take a chemistry exam, you hope that it will ask you about the main points of material listed in the course syllabus. If it does, the exam is a valid measure of how much you know about the course material. The College Board, which administers the SAT, also offers Advanced Placement (AP) exams in a variety of disciplines. These AP exams are not very controversial. Experts can judge validity by comparing the test questions with the syllabus of material the questions are supposed to cover.

EXAMPLE 6 IQ tests

Psychologists would like to measure aspects of the human personality that can’t be observed directly, such as “intelligence” or “authoritarian personality.” Does an IQ test measure intelligence? Some psychologists say Yes rather loudly. There is such a thing as general intelligence, they argue, and the various standard IQ tests do measure it, though not perfectly. Other experts say No equally loudly. There is no single intelligence, just a variety of mental abilities (for example, logical, linguistic, spatial, musical, kinesthetic, interpersonal, and intrapersonal) that no one instrument can measure.

The disagreement over the validity of IQ tests is rooted in disagreement over the nature of intelligence. If we can’t agree on exactly what intelligence is, we can’t agree on how to measure it.

Statistics is little help in these examples. The examples start with an idea like “knowledge of chemistry” or “intelligence.” If the idea is vague, validity becomes a matter of opinion. However, statistics can help a lot if we refine the idea of validity a bit.

EXAMPLE 7 The SAT again

“SAT bias will illegally cheat thousands of young women out of college admissions and scholarship aid they have earned by superior classroom performance.” That’s what the organization FairTest said when the 1999 SAT scores were released. The gender gap was larger on the math part of the test, where women averaged 495 and men averaged 531. Fifteen years later, in 2014, the gap remained. Among high school seniors, women averaged 499 and men 530 on the math part of the test. The federal Office for Civil Rights says that tests on which women and minorities score lower are discriminatory.

The College Board, which administers the SAT, replies that there are many reasons some groups have lower average scores than others. For example, more women than men from families with low incomes and little education sign up for the SAT. Students whose parents have low incomes and little education have, on the average, fewer advantages at home and in school than richer students. They have lower SAT scores because their backgrounds have not prepared them as well for college. The mere fact of lower scores doesn’t imply that the test is not valid.

Is the SAT a valid measure of readiness for college? “Readiness for college academic work” is a vague concept that probably combines inborn intelligence (whatever we decide that is), learned knowledge, study and test-taking skills, and motivation to work at academic subjects. Opinions will always differ about whether SAT scores (or any other measure) accurately reflect this vague concept.

Instead, we ask a simpler and more easily answered question: do SAT scores help predict students’ success in college? Success in college is a clear concept, measured by whether students graduate and by their college grades. Students with high SAT scores are more likely to graduate and earn (on the average) higher grades than students with low SAT scores. We say that SAT scores have predictive validity as measures of readiness for college. This is the only kind of validity that data can assess directly.

What can’t be measured matters

One member of the young Edmonton Oilers hockey team of 1981 finished last in almost everything one can measure: strength, speed, reflexes, eyesight. That was Wayne Gretzky, soon to be known as “the Great One.” He broke the National Hockey League scoring record that year, then scored yet more points in seven different seasons. Somehow the physical measurements didn’t catch what made Gretzky the best hockey player ever. Not everything that matters can be measured.

Predictive validity

A measurement of a property has predictive validity if it can be used to predict success on tasks that are related to the property measured.

Predictive validity is the clearest and most useful form of validity from the statistical viewpoint. “Do SAT scores help predict college grades?” is a much clearer question than “Do IQ test scores measure intelligence?” However, predictive validity is not a yes-or-no idea. We must ask how accurately SAT scores predict college grades. Moreover, we must ask for what groups the SAT has predictive validity. It is possible, for example, that the SAT predicts college grades well for men but not for women. There are statistical ways to describe “how accurately.” The Statistical Controversies feature in this chapter asks you to think about these issues.

STATISTICAL CONTROVERSIES

SAT Exams in College Admissions

Colleges use a variety of measures to make admissions decisions. The student’s record in high school is the most important, but SAT scores do matter, especially at selective colleges. The SAT has the advantage of being a national test. An A in algebra means different things in different high schools, but an SAT Math score of 625 means the same thing everywhere. The SAT can’t measure willingness to work hard or creativity, so it won’t predict college performance exactly, but most colleges have long found it helpful.

The accompanying table gives some results about how well SAT scores predict first-year college grades from a sample of 151,316 students in 2006. The numbers in the table say what percentage of the variation among students in college grades can be predicted by SAT scores (Critical Reading, Writing, and Math tests combined), by high school grades, and by SAT and high school grades together. An entry of 0% would mean no predictive validity, and 100% would mean predictions were always exactly correct.
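To see where a number like “percentage of the variation predicted” can come from, here is a minimal Python sketch. It assumes the percentage is the squared correlation between the predictor and the outcome, which is the usual meaning when a single variable is used for straight-line prediction; the text does not say how the table’s entries were computed. The six students below are invented for illustration and are not meant to be realistic.

```python
from math import sqrt

# HYPOTHETICAL data, invented for illustration only (not from the study).
sat_scores = [1000, 1100, 1200, 1300, 1400, 1500]
college_gpa = [2.6, 3.1, 2.8, 3.4, 3.2, 3.7]

def correlation(x, y):
    """Correlation r between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = correlation(sat_scores, college_gpa)

# The squared correlation, as a percentage, is the share of the variation
# in grades that a straight-line prediction from SAT scores accounts for.
percent_predicted = 100 * r ** 2

print(f"r = {r:.2f}, percent of variation predicted = {percent_predicted:.0f}%")
```

A result of 0% would mean the scores tell us nothing about grades, and 100% would mean the grades fall exactly on a straight line when plotted against the scores.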

How well do you think SAT scores predict first-year college grades? Should SAT scores be used in deciding college admissions?

                 All institutions   Private institutions   Public institutions
SAT              28%                32%                    27%
School grades    29%                30%                    28%
Both together    38%                42%                    37%
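One way to start thinking about the table is to ask how much SAT scores add once high school grades are already known, and how much variation remains unpredicted either way. The Python sketch below does that arithmetic on the table’s entries; the dictionary layout and names are our own, chosen for illustration.

```python
# Percent of variation in first-year college grades predicted,
# copied from the table above.
predicted = {
    "All institutions":     {"SAT": 28, "School grades": 29, "Both together": 38},
    "Private institutions": {"SAT": 32, "School grades": 30, "Both together": 42},
    "Public institutions":  {"SAT": 27, "School grades": 28, "Both together": 37},
}

gain_from_sat = {}   # percentage points added by SAT beyond school grades
unpredicted = {}     # variation left unpredicted even with both measures

for group, pct in predicted.items():
    gain_from_sat[group] = pct["Both together"] - pct["School grades"]
    unpredicted[group] = 100 - pct["Both together"]
    print(f"{group}: SAT adds {gain_from_sat[group]} points; "
          f"{unpredicted[group]}% of the variation remains unpredicted")
```

Whether a gain of this size justifies using the SAT in admissions is a matter of judgment, which is exactly what the questions above ask you to consider.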