10.2 Assessing Intelligence

intelligence test a method for assessing an individual’s mental aptitudes and comparing them with those of others, using numerical scores.

achievement test a test designed to assess what a person has learned.

10-4 What is an intelligence test, and what is the difference between achievement and aptitude tests?

aptitude test a test designed to predict a person’s future performance; aptitude is the capacity to learn.

An intelligence test assesses people’s mental abilities and compares them with others, using numerical scores. How do we design such tests, and what makes them credible? Consider why psychologists created tests of mental abilities and how they have used them.

By this point in your life, you’ve faced dozens of ability tests: school tests of basic reading and math skills, course exams, intelligence tests, driver’s license exams. Psychologists classify such tests as either achievement tests, intended to reflect what you have learned, or aptitude tests, intended to predict your ability to learn a new skill. Exams covering what you have learned in this course are achievement tests. A college entrance exam, which seeks to predict your ability to do college work, is an aptitude test—a “thinly disguised intelligence test,” says Howard Gardner (1999). Indeed, report Meredith Frey and Douglas Detterman (2004), total scores on the U.S. SAT have correlated +.82 with general intelligence scores in a national sample of 14- to 21-year-olds (FIGURE 10.3).

Figure 10.3
Close cousins: Aptitude and intelligence scores A scatterplot shows the close correlation that has existed between intelligence scores and verbal and quantitative SAT scores. (Data from Frey and Detterman, 2004.)


Early and Modern Tests of Mental Abilities

10-5 When and why were intelligence tests created, and how do today’s tests differ from early intelligence tests?

Some societies concern themselves with promoting the collective welfare of the family, community, and society. Other societies emphasize individual opportunity. Plato, a pioneer of the individualist tradition, wrote more than 2000 years ago in The Republic that “no two persons are born exactly alike; but each differs from the other in natural endowments, one being suited for one occupation and the other for another.” As heirs to Plato’s individualism, people in Western societies have pondered how and why individuals differ in mental ability.

Francis Galton: Belief in Hereditary Genius

Western attempts to assess such differences began in earnest with English scientist Francis Galton (1822–1911), who was fascinated with measuring human traits. When his cousin Charles Darwin proposed that nature selects successful traits through the survival of the fittest, Galton wondered if it might be possible to measure “natural ability” and to encourage those of high ability to mate with one another. At the 1884 London Health Exhibition, more than 10,000 visitors received his assessment of their “intellectual strengths” based on such things as reaction time, sensory acuity, muscular power, and body proportions. But alas, on these measures, well-regarded adults and students did not outscore others. Nor did the measures correlate with each other.

Although Galton’s quest for a simple intelligence measure failed, he gave us some statistical techniques that we still use (as well as the phrase nature and nurture). And his persistent belief in the inheritance of genius—reflected in his book, Hereditary Genius—illustrates an important lesson from both the history of intelligence research and the history of science: Although science itself strives for objectivity, individual scientists are affected by their own assumptions and attitudes.

Alfred Binet: Predicting School Achievement

Alfred Binet (1857–1911) “Some recent philosophers have given their moral approval to the deplorable verdict that an individual’s intelligence is a fixed quantity, one which cannot be augmented. We must protest and act against this brutal pessimism” (Binet, 1909, p. 141).

mental age a measure of intelligence test performance devised by Binet; the chronological age that most typically corresponds to a given level of performance. Thus, a child who does as well as an average 8-year-old is said to have a mental age of 8.

Modern intelligence testing traces its birth to early twentieth-century France, where a new law required all children to attend school. French officials knew that some children, including many newcomers to Paris, would struggle and need special classes. But how could the schools make fair judgments about children’s learning potential? Teachers might assess children who had little prior education as slow learners. Or they might sort children into classes on the basis of their social backgrounds. To minimize such bias, France’s minister of public education gave Alfred Binet and others, including Théodore Simon, the task of studying this problem.

In 1905, Binet and Simon first presented their work under the archaic title, “New Methods for Diagnosing the Idiot, the Imbecile, and the Moron” (Nicolas & Levine, 2012). They began by assuming that all children follow the same course of intellectual development but that some develop more rapidly. On tests, therefore, a “dull” child should score much like a typical younger child, and a “bright” child like a typical older child. Thus, their goal became measuring each child’s mental age, the level of performance typically associated with a certain chronological age. The average 9-year-old, then, has a mental age of 9. Children with below-average mental ages, such as 9-year-olds who perform at the level of typical 7-year-olds, would struggle with age-appropriate schoolwork. A 9-year-old who performs at the level of typical 11-year-olds should find schoolwork easy.

To measure mental age, Binet and Simon theorized that mental aptitude, like athletic aptitude, is a general capacity that shows up in various ways. They tested a variety of reasoning and problem-solving questions on Binet’s two daughters, and then on “bright” and “backward” Parisian schoolchildren. The items they developed eventually predicted how well French children would handle their schoolwork.


“The IQ test was invented to predict academic performance, nothing else. If we wanted something that would predict life success, we’d have to invent another test completely.”

Social psychologist Robert Zajonc (1984b)

Binet and Simon made no assumptions concerning why a particular child was slow, average, or precocious. Binet personally leaned toward an environmental explanation. To raise the capacities of low-scoring children, he recommended “mental orthopedics” that would help develop their attention span and self-discipline. He believed his intelligence test did not measure inborn intelligence as a scale measures weight. Rather, it had a single practical purpose: to identify French schoolchildren needing special attention. Binet hoped his test would be used to improve children’s education, but he also feared it would be used to label children and limit their opportunities (Gould, 1981).

RETRIEVAL PRACTICE

  • What did Binet hope to achieve by establishing a child’s mental age?

Binet hoped that a child’s mental age (the age that typically corresponds to a given level of performance) would help identify appropriate school placements.

Stanford-Binet the widely used American revision (by Terman at Stanford University) of Binet’s original intelligence test.

Lewis Terman: The Innate IQ

intelligence quotient (IQ) defined originally as the ratio of mental age (ma) to chronological age (ca) multiplied by 100 (thus, IQ = ma/ca × 100). On contemporary intelligence tests, the average performance for a given age is assigned a score of 100.

Binet’s fears were realized soon after his death in 1911, when others adapted his tests for use as a numerical measure of inherited intelligence. This began when Stanford University professor Lewis Terman (1877–1956) found that the Paris-developed questions and age norms worked poorly with California schoolchildren. Adapting some of Binet’s original items, adding others, and establishing new age norms, Terman extended the upper end of the test’s range from teenagers to “superior adults.” He also gave his revision the name it retains today—the Stanford-Binet.

From such tests, German psychologist William Stern derived the famous intelligence quotient, or IQ. The IQ is simply a person’s mental age divided by chronological age and multiplied by 100 to get rid of the decimal point:

IQ = (mental age [ma] / chronological age [ca]) × 100

Thus, an average child, whose mental and chronological ages are the same, has an IQ of 100. But an 8-year-old who answers questions as would a typical 10-year-old has an IQ of 125.

The original IQ formula worked fairly well for children but not for adults. (Should a 40-year-old who does as well on the test as an average 20-year-old be assigned an IQ of only 50?) Most current intelligence tests, including the Stanford-Binet, no longer compute an IQ in this manner (though the term IQ still lingers as a shorthand expression for “intelligence test score”). Instead, they represent the test-taker’s performance relative to the average performance of others the same age. This average performance is arbitrarily assigned a score of 100, and about two-thirds of all test-takers fall between 85 and 115.
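
To make the contrast concrete, here is a minimal sketch in Python (not from the text): the ratio function implements Stern’s formula, and the deviation function assumes the modern convention of scoring the age-group average as 100 with (as on the Wechsler scales) a standard deviation of 15. The raw-score numbers are hypothetical.

```python
# Two ways to compute an intelligence score. The ratio formula is
# Stern's historical one; the deviation version reflects the modern
# convention (age-group mean scored as 100, standard deviation 15).

def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Stern's original formula: IQ = ma / ca * 100."""
    return mental_age / chronological_age * 100

def deviation_iq(raw_score: float, age_mean: float, age_sd: float) -> float:
    """Modern approach: rescale performance relative to same-age peers."""
    return 100 + 15 * (raw_score - age_mean) / age_sd

print(ratio_iq(10, 8))    # the 8-year-old above: 125.0
print(ratio_iq(20, 40))   # the 40-year-old problem case: 50.0

# Hypothetical raw scores: 62 items correct, against an age-group
# mean of 50 with a standard deviation of 8.
print(deviation_iq(62, 50, 8))   # 122.5
```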

Terman (1916, p. 4) promoted the widespread use of intelligence testing to “take account of the inequalities of children in original endowment” by assessing their “vocational fitness.” In sympathy with Francis Galton’s eugenics—the much-criticized nineteenth-century movement that proposed measuring human traits and using the results to encourage only smart and fit people to reproduce—Terman envisioned that the use of intelligence tests would “ultimately result in curtailing the reproduction of feeble-mindedness and in the elimination of an enormous amount of crime, pauperism, and industrial inefficiency” (p. 7).

With Terman’s help, the U.S. government developed new tests to evaluate both newly arriving immigrants and World War I army recruits—the world’s first mass administration of an intelligence test. To some psychologists, the results indicated the inferiority of people not sharing their Anglo-Saxon heritage. Such findings were part of the cultural climate that led to a 1924 immigration law that reduced Southern and Eastern European immigration quotas to less than a fifth of those for Northern and Western Europe.


Binet probably would have been horrified that his test had been adapted and used to draw such conclusions. Indeed, such sweeping judgments became an embarrassment to most of those who championed testing. Even Terman came to appreciate that test scores reflected not only people’s innate mental abilities but also their education, native language, and familiarity with the culture assumed by the test. Abuses of the early intelligence tests serve to remind us that science can be value-laden. Behind a screen of scientific objectivity, ideology sometimes lurks.

RETRIEVAL PRACTICE

  • What is the IQ of a 4-year-old with a mental age of 5?

125 (5 ÷ 4 × 100 = 125)

David Wechsler: Separate Scores for Separate Skills

Psychologist David Wechsler created what is now the most widely used individual intelligence test, the Wechsler Adult Intelligence Scale (WAIS), together with a version for school-age children (the Wechsler Intelligence Scale for Children [WISC]), and another for preschool children (Evers et al., 2012). The latest (2008) edition of the WAIS consists of 15 subtests, including similarities (reasoning about what two objects or concepts have in common), vocabulary, block design (assembling patterned blocks to match a design), and letter–number sequencing.

Wechsler Adult Intelligence Scale (WAIS) the WAIS and its companion versions for children are the most widely used intelligence tests; contain verbal and performance (nonverbal) subtests.

The WAIS yields not only an overall intelligence score, as does the Stanford-Binet, but also separate scores for verbal comprehension, perceptual organization, working memory, and processing speed. Striking differences among these scores can provide clues to cognitive strengths or weaknesses. For example, a low verbal comprehension score combined with high scores on other subtests could indicate a reading or language disability. Other comparisons can help a psychologist or psychiatrist establish a rehabilitation plan for a stroke patient. In such ways, these tests help realize Binet’s aim: to identify opportunities for improvement and strengths that teachers and others can build upon. Such uses are possible, of course, only when we can trust the test results.

Matching patterns Block design puzzles test visual abstract processing ability. Wechsler’s individually administered intelligence test comes in forms suited for adults and children.

Question

How did today’s most widely used intelligence tests develop, from Binet’s early work to the modern Wechsler scales?
Possible sample answer: The earliest test was created by Alfred Binet, with Théodore Simon. Because the Paris-developed questions worked poorly with California schoolchildren, Lewis Terman revised the test to create an American version, called the Stanford-Binet. Using Binet’s concept of mental age, William Stern devised the intelligence quotient (IQ), a single number that indicated a child’s performance relative to others his or her age. This formula did not work well for adults. David Wechsler created separate intelligence tests for adults, children, and preschool children. These tests yield not only an overall intelligence score, but also separate scores for verbal comprehension, perceptual organization, working memory, and processing speed.

RETRIEVAL PRACTICE

  • An employer with a pool of applicants for a single available position is interested in testing each applicant’s potential. To help her decide whom she should hire, she should use an ______________ (achievement/aptitude) test. That same employer wishing to test the effectiveness of a new, on-the-job training program would be wise to use an ______________ (achievement/aptitude) test.

aptitude; achievement

Principles of Test Construction

10-6 What is a normal curve, and what does it mean to say that a test has been standardized and is reliable and valid?

To be widely accepted, a psychological test must meet three criteria: It must be standardized, reliable, and valid. The Stanford-Binet and Wechsler tests meet these requirements.

standardization defining uniform testing procedures and meaningful scores by comparison with the performance of a pretested group.


normal curve the bell-shaped curve that describes the distribution of many physical and psychological attributes. Most scores fall near the average, and fewer and fewer scores lie near the extremes.

Standardization

The number of questions you answer correctly on an intelligence test would reveal almost nothing. To know how well you performed, you would need some basis for comparison. That’s why test-makers give new tests to a representative sample of people. The scores from this pretested group become the basis for future comparisons. If you later take the test following the same procedures, your score will be meaningful when compared with others. This process is called standardization.

If we construct a graph of test-takers’ scores, the scores typically form a bell-shaped pattern called the normal curve. No matter what attributes we measure—height, weight, or mental aptitude—people’s scores tend to form a bell curve. The highest point is the midpoint, or the average score. On an intelligence test, we give this average score a value of 100 (FIGURE 10.4). Moving out from the average, toward either extreme, we find fewer and fewer people. For both the Stanford-Binet and Wechsler tests, a person’s score indicates whether that person’s performance fell above or below the average. A performance higher than all but 2 percent of all scores earns an intelligence score of 130. A performance lower than 98 percent of all scores earns an intelligence score of 70.

Figure 10.4
The normal curve Scores on aptitude tests tend to form a normal, or bell-shaped, curve around an average score. For the Wechsler scale, for example, the average score is 100.
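
For readers who want to verify these percentile claims, here is a short check using Python’s standard library, assuming the mean-100, standard-deviation-15 scale that the Wechsler tests use.

```python
# Checking the normal-curve figures above, assuming the Wechsler
# convention of mean 100 and standard deviation 15.
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Share of test-takers scoring between 85 and 115 (about two-thirds):
print(iq.cdf(115) - iq.cdf(85))   # ~0.68

# A score of 130 exceeds all but about 2 percent of scores:
print(1 - iq.cdf(130))            # ~0.02

# A score of 70 falls below about 98 percent of scores:
print(iq.cdf(70))                 # ~0.02
```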

To keep the average score near 100, the Stanford-Binet and Wechsler scales are periodically restandardized. If you took the WAIS, Fourth Edition, recently, your performance was compared with a standardization sample who took the test during 2007, not to David Wechsler’s initial 1930s sample. If you compared the performance of the most recent standardization sample with that of the 1930s sample, do you suppose you would find rising or declining test performance? Amazingly—given that college entrance aptitude scores have sometimes dropped, such as during the 1960s and 1970s—intelligence test performance has improved. This worldwide phenomenon is called the Flynn effect, in honor of New Zealand researcher James Flynn (1987, 2012), who first calculated its magnitude. As FIGURE 10.5 indicates, the average person’s intelligence test score in 1920 was—by today’s standard—only a 76! Such rising performance has been observed in 29 countries, from Canada to rural Australia (Ceci & Kanaya, 2010). Although there have been some regional reversals, the historic increase is now widely accepted as an important phenomenon (Lynn, 2009; Teasdale & Owen, 2005, 2008).

Figure 10.5
Getting smarter? In every country studied, intelligence test performance rose during the twentieth century, as shown here with American Wechsler and Stanford-Binet test performance between 1918 and 2007. In Britain, test scores have risen 27 points since 1942. (Data from Horgan, 1995, updated with Flynn, 2012, 2014.)

The Flynn effect’s cause has been a psychological mystery. Did it result from greater test sophistication? But the gains began before testing was widespread and have even been observed among preschoolers. Better nutrition? As the nutrition explanation would predict, people have gotten not only smarter but taller. But in postwar Britain, notes Flynn (2009), the lower-class children gained the most from improved nutrition but the intelligence performance gains were greater among upper-class children. Or did the Flynn effect stem from more education? More stimulating environments? Less childhood disease? Smaller families and more parental investment (Sundet et al., 2008)? Flynn (2012) attributes the performance increase to our need to develop new mental skills to cope with modern environments. But others argue that it may be accounted for by changes in the tests (Kaufman et al., 2013). Regardless of what combination of factors explains the rise in intelligence test scores, the phenomenon counters one concern of some hereditarians—that the higher twentieth-century birthrates among those with lower scores would shove human intelligence scores downward (Lynn & Harvey, 2008).

reliability the extent to which a test yields consistent results, as assessed by the consistency of scores on two halves of the test, on alternative forms of the test, or on retesting.


Reliability

validity the extent to which a test measures or predicts what it is supposed to. (See also content validity and predictive validity.)

Knowing where you stand in comparison to a standardization group still won’t say much about your intelligence unless the test has reliability. A reliable test gives consistent results: a person’s scores remain similar whenever and however the test is administered. To check a test’s reliability, researchers test people many times. They may retest using the same test, or they may split the test in half to see whether odd-question scores and even-question scores agree. If the two scores generally agree, or correlate, the test is reliable. The higher the correlation between the test–retest or the split-half scores, the higher the test’s reliability. The tests we have considered so far—the Stanford-Binet, the WAIS, and the WISC—are very reliable (about +.9). When retested, people’s scores generally match their first scores closely.
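
Here is a minimal sketch of that split-half check, written in Python with hypothetical item responses; a real psychometric analysis would use far larger samples and standard corrections (such as the Spearman-Brown adjustment), which are omitted here.

```python
# A minimal sketch of a split-half reliability check.
# Item responses are hypothetical (1 = correct, 0 = incorrect);
# statistics.correlation requires Python 3.10 or later.
from statistics import correlation

# Each inner list holds one test-taker's answers to a 10-item test.
responses = [
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
]

# Score the odd-numbered items and the even-numbered items separately.
odd_scores = [sum(r[0::2]) for r in responses]
even_scores = [sum(r[1::2]) for r in responses]

# If the two half-scores agree (correlate highly), the test is reliable.
print(f"split-half correlation: {correlation(odd_scores, even_scores):+.2f}")
```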

content validity the extent to which a test samples the behavior that is of interest.

Validity

predictive validity the success with which a test predicts the behavior it is designed to predict; it is assessed by computing the correlation between test scores and the criterion behavior. (Also called criterion-related validity.)

High reliability does not ensure a test’s validity—the extent to which the test actually measures or predicts what it promises. Imagine using a miscalibrated tape measure to measure people’s heights. Your results would be very reliable. No matter how many times you measured, people’s heights would be the same. But your results would not be valid—you would not be giving the information you promised—real height.

Tests that tap the pertinent behavior, or criterion, have content validity. The road test for a driver’s license has content validity because it samples the tasks a driver routinely faces. Course exams have content validity if they assess one’s mastery of a representative sample of course material. But we expect intelligence tests to have predictive validity: They should predict the criterion of future performance, and to some extent they do.

Are general aptitude tests as predictive as they are reliable? As critics are fond of noting, the answer is plainly No. The predictive power of aptitude tests is fairly strong in the early school years, but later it weakens. Academic aptitude test scores are reasonably good predictors of achievement for children ages 6 to 12, where the correlation between intelligence score and school performance is about +.6 (Jensen, 1980). Intelligence scores correlate even more closely with scores on achievement tests: +.81 in one comparison of 70,000 English children’s intelligence scores at age 11 with their academic achievement in national exams at age 16 (Deary et al., 2007, 2009). The SAT, used in the United States as a college entrance exam, has been less successful in predicting first-year college grades. (The correlation, less than +.5, has been, however, a bit higher when adjusting for high scorers electing tougher courses [Berry & Sackett, 2009; Willingham et al., 1990].) By the time we get to the Graduate Record Examination (GRE; an aptitude test similar to the SAT but for those applying to graduate school), the correlation with graduate school performance is an even more modest but still significant +.4 (Kuncel & Hezlett, 2007).


Why does the predictive power of aptitude scores diminish as students move up the educational ladder? Consider a parallel situation: Among all American and Canadian football linemen, body weight correlates with success. A 300-pound player tends to overwhelm a 200-pound opponent. But within the narrow 280- to 320-pound range typically found at the professional level, the correlation between weight and success becomes negligible (FIGURE 10.6). The narrower the range of weights, the lower the predictive power of body weight becomes. If an elite university takes only those students who have very high aptitude scores, and then gives them a restricted range of high grades, those scores cannot possibly predict much. This will be true even if the test has excellent predictive validity with a more diverse sample of students. Likewise, modern grade inflation has produced less diverse high school grades. With their diminished range, high school grades now predict college grades no better than have SAT scores (Sackett et al., 2012). So, when we validate a measure using a wide range of scores but then use it with a restricted range of scores, it loses much of its predictive validity.

Figure 10.6
Diminishing predictive power Let’s imagine a correlation between football linemen’s body weight and their success on the field. Note how insignificant the relationship becomes when we narrow the range of weight to 280 to 320 pounds. As the range of data under consideration narrows, its predictive power diminishes.
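
The restriction-of-range effect is easy to demonstrate with a small simulation. The sketch below, in Python with an invented weight–success relationship, shows the correlation shrinking once only the narrow 280- to 320-pound range is considered.

```python
# A small simulation of the restriction-of-range effect.
# The weight-success relationship here is invented for illustration.
import random
from statistics import correlation

random.seed(42)

# Simulate 1,000 linemen whose success depends partly on body weight.
weights = [random.uniform(200, 340) for _ in range(1000)]
success = [0.5 * w + random.gauss(0, 25) for w in weights]

# Correlation across the full range of weights:
full_r = correlation(weights, success)

# Correlation within the narrow professional range (280-320 pounds):
narrow = [(w, s) for w, s in zip(weights, success) if 280 <= w <= 320]
narrow_r = correlation([w for w, _ in narrow], [s for _, s in narrow])

print(f"full range (200-340 lb):   r = {full_r:+.2f}")    # fairly strong
print(f"narrow range (280-320 lb): r = {narrow_r:+.2f}")  # much weaker
```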

Question

What does it mean to say that an intelligence test is standardized, reliable, and valid?
Possible sample answer: The most common intelligence tests are standardized (pretested on similar groups of people), reliable (people tend to achieve very similar scores on multiple versions), and valid (measuring what they are supposed to measure and, to some extent, predictive of future achievement).

RETRIEVAL PRACTICE

  • What are the three criteria that a psychological test must meet in order to be widely accepted? Explain.

A psychological test must be standardized (pretested on a similar group of people), reliable (yielding consistent results), and valid (measuring what it is supposed to measure).

  • Correlation coefficients were used in this section. Here’s a quick review: Correlations do not indicate cause-effect, but they do tell us whether two things are associated in some way. A correlation of −1.0 represents perfect ______________ (agreement/disagreement) between two sets of scores: As one score goes up, the other score goes ______________ (up/down). A correlation of ______________ represents no association. The highest correlation, +1.0, represents perfect ______________ (agreement/disagreement): As the first score goes up, the other score goes ______________ (up/down).

disagreement; down; zero; agreement; up
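
For a concrete illustration of these three cases, the snippet below (made-up numbers; Python 3.10 or later for statistics.correlation) computes a perfect positive, a perfect negative, and a near-zero correlation.

```python
# Tiny illustration of perfect agreement, perfect disagreement,
# and no association (made-up numbers).
from statistics import correlation

x = [1, 2, 3, 4, 5]

print(correlation(x, [2, 4, 6, 8, 10]))   # +1.0: perfect agreement
print(correlation(x, [10, 8, 6, 4, 2]))   # -1.0: perfect disagreement
print(correlation(x, [2, 5, 1, 4, 3]))    # +0.1: essentially no association
```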
