4.3 Vision II: Recognizing What We Perceive

Our journey into the visual system has already revealed how it accomplishes some pretty astonishing feats. But the system needs to do much more in order for us to be able to interact effectively with our visual worlds. Let’s now consider how the system links together individual visual features into whole objects, allows us to recognize what those objects are, organizes objects into visual scenes, and detects motion and change in those scenes. Along the way, we’ll see that studying visual errors and illusions provides key insights into how these processes work.

Attention: The “Glue” That Binds Individual Features into a Whole

Specialized feature detectors in different parts of the visual system analyze each of the multiple features of a visible object: orientation, color, size, shape, and so forth. But how are different features combined into single, unified objects? What allows us to perceive that the young man in the photo is wearing a gray shirt and the young woman is wearing a red shirt? Why don’t we see free-floating patches of gray and red? These questions refer to what researchers call the binding problem in perception: how features are linked together so that we see unified objects in our visual world rather than free-floating or miscombined features (Treisman, 1998, 2006).

binding problem

How features are linked together so that we see unified objects in our visual world rather than free-floating or miscombined features.

Illusory Conjunctions: Perceptual Mistakes

We correctly combine features into unified objects; so, for example, we see that the young man is wearing a gray shirt and the young woman is wearing a red shirt.
Thomas Barwick/Getty Images

In everyday life, we correctly combine features into unified objects so automatically and effortlessly that it may be difficult to appreciate that binding is ever a problem at all. However, researchers have discovered errors in binding that reveal important clues about how the process works. One such error is known as an illusory conjunction, a perceptual mistake where features from multiple objects are incorrectly combined. In one study, researchers briefly showed participants visual displays in which black digits flanked colored letters, such as a red A and a blue X, then instructed participants first to report the black digits and second to describe the colored letters (Treisman & Schmidt, 1982). Participants frequently reported illusory conjunctions, claiming to have seen, for example, a blue A or a red X (see FIGURE 4.13a and b). These illusory conjunctions were not just the result of guessing; they occurred more frequently than other kinds of errors, such as reporting a letter or color that was not present in the display (see FIGURE 4.13c).

illusory conjunction

A perceptual mistake where features from multiple objects are incorrectly combined.


Figure 4.13: FIGURE 4.13 Illusory Conjunctions Illusory conjunctions occur when features such as color and shape are combined incorrectly. For example, when participants are shown a red A and blue X, they sometimes report seeing a blue A and red X. Other kinds of errors, such as a misreported letter (e.g., reporting T when no T was presented) or misreported color (reporting green when no green was presented) occur rarely, indicating that illusory conjunctions are not the result of guessing. (Information from Robertson, 2003.)

How does the study of illusory conjunctions help us understand the role of attention in feature binding?

Why do illusory conjunctions occur? Psychologist Anne Treisman and her colleagues proposed a feature-integration theory (Treisman, 1998, 2006; Treisman & Gelade, 1980; Treisman & Schmidt, 1982), which holds that focused attention is not required to detect the individual features that comprise a stimulus, such as the color, shape, size, and location of letters, but it is required to bind those individual features together. From this perspective, attention provides the “glue” necessary to bind features together, and illusory conjunctions occur when it is difficult for participants to pay full attention to the features that need to be glued together. For example, in the experiments we just considered, participants were first required to name the black digits, thereby reducing attention to the colored letters and allowing illusory conjunctions to occur. When experimental conditions are changed so that participants can pay full attention to the colored letters and they are able to correctly bind their features together, illusory conjunctions disappear (Treisman, 1998; Treisman & Schmidt, 1982).
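To make the theory’s logic concrete, here is a toy simulation in Python (entirely our construction, not a model from Treisman’s papers; the function and variable names are hypothetical). Individual features are always registered, but without attention they are bound to objects at random, so errors recombine features that were actually present rather than inventing absent ones, just as the data in FIGURE 4.13 show.

```python
import random

def report_display(letters, colors, attention):
    """Toy model of feature binding. Individual features (letters and
    colors) are always detected; focused attention determines whether
    they are bound together correctly. Without attention, colors are
    shuffled across letters, so errors recombine features that were
    actually present -- the signature of illusory conjunctions --
    rather than inventing features absent from the display."""
    if attention:
        return list(zip(letters, colors))
    miscombined = colors[:]
    random.shuffle(miscombined)
    return list(zip(letters, miscombined))

# Display: a red A and a blue X, as in Treisman & Schmidt (1982).
print(report_display(["A", "X"], ["red", "blue"], attention=True))
# -> [('A', 'red'), ('X', 'blue')]  (correct binding)
print(report_display(["A", "X"], ["red", "blue"], attention=False))
# -> sometimes [('A', 'blue'), ('X', 'red')]  (an illusory conjunction)
```

Note that even in the no-attention case the model never reports a green letter or a T, matching the finding that illusory conjunctions are more common than errors involving features absent from the display.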

feature-integration theory

The idea that focused attention is not required to detect the individual features that comprise a stimulus, but is required to bind those individual features together.

The Role of the Parietal Lobe

The binding process makes use of feature information processed by structures within the ventral visual stream, the “what” pathway (Seymour et al., 2010; see FIGURE 4.12). But because binding involves linking together features that appear at a particular spatial location, it also depends critically on the parietal lobe in the dorsal stream, the “where” pathway (Robertson, 1999). For example, Treisman and others studied R.M., who had suffered strokes that destroyed both his left and right parietal lobes. Although many aspects of his visual function were intact, he had severe problems attending to spatially distinct objects. When presented with stimuli such as those in FIGURE 4.13, R.M. perceived an abnormally large number of illusory conjunctions, even when he was given as long as 10 seconds to look at the displays (Friedman-Hill, Robertson, & Treisman, 1995; Robertson, 2003).

Recognizing Objects by Sight

Take a quick look at the letters in the accompanying illustration. Even though they’re quite different from one another, you probably effortlessly recognized them as all being examples of the letter G. Now consider the same kind of demonstration using your best friend’s face. Suppose one day your friend gets a dramatic new haircut—or adds glasses, hair dye, or a nose ring. Even though your friend now looks strikingly different, you still recognize that person with ease. Just as with the variable Gs, you are somehow able to extract the underlying features of the face that allow you to identify your friend accurately.

A quick glance and you recognize all these letters as G, but their varying sizes, shapes, angles, and orientations ought to make this recognition task difficult. What is it about the process of object recognition that allows us to perform this task effortlessly?

This thought exercise may seem trivial, but it’s no small perceptual feat. If the visual system were somehow stumped each time a minor variation occurred in an object, the inefficiency of it all would be overwhelming. We’d have to process information effortfully just to perceive our friend as the same person from one meeting to another, not to mention laboring through the process of knowing when a G is really a G. In general, though, object recognition proceeds fairly smoothly, in large part due to the operation of the feature detectors we discussed earlier.


How do feature detectors help the visual system accurately perceive an object in different circumstances, such as your friend’s face? Some researchers argue for a modular view—namely, that specialized brain areas, or modules, detect and represent faces or houses or even body parts. Using fMRI to examine visual processing in healthy young adults, researchers found a subregion in the temporal lobe that responds more strongly to faces than to just about any other object category, whereas a nearby area responds most strongly to buildings and landscapes (Kanwisher, McDermott, & Chun, 1997). This view suggests we have not only feature detectors to aid in visual perception but also “face detectors,” “building detectors,” and possibly other types of neurons specialized for particular types of object perception (Downing et al., 2006; Kanwisher & Yovel, 2006). Other researchers argue for a more distributed representation of object categories. In this view, it is the pattern of activity across multiple brain regions that identifies any viewed object, including faces (Haxby et al., 2001). Each view explains some of the data better than the other, and researchers continue to debate their relative merits.

How do we recognize our friends, even when they’re hidden behind sunglasses?

Our visual systems allow us to identify people as the same individual even when they change such features as hair style and skin color. Despite the extreme changes in these two photos, you can probably tell that they both portray Johnny Depp.
Jun Sato/WireImage/Getty Images
Warner Bros./The Kobal Collection/Art Resource

Principles of Perceptual Organization

Before object recognition can even kick in, the visual system must perform another important task: grouping the image regions that belong together into a representation of an object. The idea that we tend to perceive a unified, whole object rather than a collection of separate parts is the foundation of Gestalt psychology, which you read about in Chapter 1. Gestalt perceptual grouping rules govern how the features and regions of things fit together (Koffka, 1935). Here’s a sampling (see FIGURE 4.14):

- Simplicity: When confronted with two or more possible interpretations of an object’s shape, the visual system tends to select the simplest interpretation.
- Closure: We tend to fill in missing elements of a visual scene, allowing us to perceive edges separated by gaps as belonging to complete objects.
- Continuity: Edges or contours that have the same orientation have what the Gestaltists called “good continuation,” and we tend to group them together perceptually.
- Similarity: Regions that are similar in color, lightness, shape, or texture are perceived as belonging to the same object.
- Proximity: Objects that are close together tend to be grouped together.
- Common fate: Elements of a visual image that move together are perceived as parts of a single moving object.

Figure 4.14: FIGURE 4.14 Perceptual Grouping Rules As noted by the Gestalt psychologists, the brain is predisposed to impose order on incoming sensations, such as by responding to patterns among stimuli and grouping like patterns together.


Separating Figure from Ground

Figure 4.15: FIGURE 4.15 Rubin’s vase Fixate your eyes on the center of the image and your perception will alternate between a vase and facing silhouettes, even as the sensory stimulation remains constant.

Perceptual grouping involves visually separating an object from its surroundings. In Gestalt terms, this means identifying a figure apart from the (back)ground in which it resides. For example, the words on this page are perceived as figural: They stand out from the ground of the sheet of paper on which they’re printed. Similarly, your instructor is perceived as the figure against the backdrop of all the other elements in your classroom. You certainly can perceive these elements differently, of course: The words and the paper are all part of a thing called a page, and your instructor and the classroom can all be perceived as your learning environment. Typically, though, our perceptual systems focus attention on some objects as distinct from their environments.

Size provides one clue to what’s figure and what’s ground: Smaller regions are likely to be figures, such as tiny letters on a big sheet of paper. Movement also helps: Your instructor is (we hope) a dynamic lecturer, moving around in a static environment. Another critical step toward object recognition is edge assignment. Given an edge, or boundary, between figure and ground, which region does that edge belong to? If the edge belongs to the figure, it helps define the object’s shape, and the background continues behind the edge. Sometimes, though, it’s not easy to tell which is which.

Edgar Rubin (1886–1951), a Danish psychologist, capitalized on this ambiguity and developed a famous illusion called the Rubin vase or, more generally, a reversible figure–ground relationship. You can view this “face–vase” illusion in FIGURE 4.15 in two ways, either as a vase on a black background or as a pair of silhouettes facing each other. Your visual system settles on one or the other interpretation and fluctuates between them every few seconds. This happens because the edge that would normally separate figure from ground is really part of neither: It defines the contours of the vase just as much as it defines the contours of the faces. Evidence from fMRI studies shows, quite nicely, that when people are seeing the Rubin image as faces, there is greater activity in the face-selective region of the temporal lobe we discussed earlier than when they are seeing it as a vase (Hasson et al., 2001).


Perceiving Depth and Size

Objects in the world are arranged in three dimensions—length, width, and depth—but the retinal image contains only two dimensions, length and width. How does the brain process a flat, 2-D retinal image so that we perceive the depth of an object and how far away it is? The answer lies in a collection of depth cues that change as you move through space. Monocular and binocular depth cues both help visual perception (Howard, 2002).

Monocular Depth Cues

Monocular depth cues are aspects of a scene that yield information about depth when viewed with only one eye. These cues rely on the relationship between distance and size. Even when you have one eye closed, the retinal image of an object you’re focused on grows smaller as that object moves farther away and larger as it moves closer. Our brains routinely use these differences in retinal image size, or relative size, to perceive distance. Most adults, for example, fall within a familiar range of heights (perhaps 5-7 feet tall), so retinal image size alone is usually a reliable cue to how far away they are. Our visual system automatically corrects for size differences and attributes them to differences in distance. FIGURE 4.16 demonstrates how strong this effect is.
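To get a feel for how strongly retinal image size depends on distance, here is a minimal sketch (the function name and the assumed 1.7-meter height are ours, for illustration only) that computes the visual angle subtended by a person at several viewing distances.

```python
import math

def visual_angle_deg(object_height_m: float, distance_m: float) -> float:
    """Visual angle (in degrees) subtended by an object of a given
    height at a given distance: theta = 2 * atan(h / (2 * d))."""
    return math.degrees(2 * math.atan(object_height_m / (2 * distance_m)))

# A person of assumed height 1.7 m viewed at increasing distances:
# the retinal image shrinks roughly in proportion to 1/distance.
for d in (2, 4, 8, 16):
    print(f"{d:>2} m away -> {visual_angle_deg(1.7, d):5.2f} degrees of visual angle")
```

Halving the distance roughly doubles the image size, which is why the raw size difference projected on your retinas in FIGURE 4.16 is so much larger than the difference you actually perceive.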

Figure 4.16: FIGURE 4.16 Relative Size When you view images of people, such as the people in the left-hand photo, the object you perceive as smaller appears farther away. With a little image manipulation, you can see in the right-hand photo that the relative size difference projected on your retinas is far greater than you perceive. The image of the person in the blue vest is exactly the same size in both photos.
The Photo Works

monocular depth cues

Aspects of a scene that yield information about depth when viewed with only one eye.

In addition to relative size, there are several more monocular depth cues (see FIGURE 4.17), such as:

- Linear perspective: Parallel lines seem to converge as they recede into the distance.
- Texture gradient: The elements of a uniformly textured surface appear smaller and more densely packed as the surface recedes.
- Interposition: When one object partly blocks another, the blocking object is perceived as closer.
- Relative height: Objects that are closer to you are lower in your visual field, whereas faraway objects are higher.

Figure 4.17: FIGURE 4.17 Pictorial Depth Cues Visual artists rely on a variety of monocular cues to make their work come to life. You can rely on cues such as linear perspective (a), texture gradient (b), interposition (c), and relative height (d) in an image to infer distance, depth, and position, even if you’re wearing an eye patch.
Superstock
Age Fotostock/Superstock
Altrendo/Getty Images
Rob Blakers/Getty Images


Binocular Depth Cues

Figure 4.18: FIGURE 4.18 Binocular Disparity We see the world in three dimensions because the image of an object falls on the retinas of each eye at a slightly different place. In this two-object scene, the images of the square and the circle fall on different points of the retina in each eye. The disparity in the positions of the circle’s retinal images provides a compelling cue to depth.

We can also obtain depth information through binocular disparity, the difference in the retinal images of the two eyes that provides information about depth. Because our eyes are slightly separated, each registers a slightly different view of the world. Your brain computes the disparity between the two retinal images to perceive how far away objects are, as shown in FIGURE 4.18. Viewed from above in the figure, the images of the more distant square and the closer circle each fall at different points on each retina. The View-Master toy and 3-D movies both work by exploiting retinal disparity.
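The underlying geometry can be summarized in one line of arithmetic. In the idealized pinhole-eye model used in stereo vision (a textbook simplification, not a claim about how neurons compute; the numeric values below are assumptions for illustration), an object’s distance is inversely proportional to its disparity.

```python
def depth_from_disparity(eye_separation_m: float,
                         focal_length_m: float,
                         disparity_m: float) -> float:
    """Idealized stereo geometry: an object's distance Z is inversely
    proportional to the disparity d between its two retinal images.
    Z = (eye_separation * focal_length) / d."""
    return eye_separation_m * focal_length_m / disparity_m

# Assumed values: ~6.5 cm between the eyes and a nominal 1.7 cm
# focal distance (roughly the depth of the human eye).
B, f = 0.065, 0.017
for disparity in (0.0005, 0.0002, 0.0001):  # smaller disparity -> farther away
    print(f"disparity {disparity * 1000:.2f} mm "
          f"-> ~{depth_from_disparity(B, f, disparity):.1f} m away")
```

Because doubling an object’s distance halves its disparity, disparity is a highly informative depth cue at near range but becomes less useful for faraway objects.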

The View-Master has been a popular toy for decades. It is based on the principle of binocular disparity: Two images taken from slightly different angles produce a stereoscopic effect.
Masterfile Royalty-Free
© Nmpft/Sspl/The Image Works

binocular disparity

The difference in the retinal images of the two eyes that provides information about depth.

Illusions of Depth and Size

We all are vulnerable to illusions, which, as you’ll remember from the Psychology: Evolution of a Science chapter, are errors of perception, memory, or judgment in which subjective experience differs from objective reality (Wade, 2005). A famous illusion that makes use of a variety of depth cues is the Ames room, which is trapezoidal in shape rather than square (see FIGURE 4.19a). A person standing in one corner of an Ames room is physically twice as far away from the viewer as a person standing in the other corner. But when viewed with one eye through the small peephole placed in one wall, the Ames room looks square because the shapes of the windows and the flooring tiles are carefully crafted to look square from the viewing port (Ittelson, 1952). The visual system perceives the far wall as perpendicular to the line of sight so that people standing at different positions along that wall appear to be at the same distance, and the viewer’s judgments of their sizes are based directly on retinal image size. As a result, a person standing in the right corner appears to be much larger than a person standing in the left corner (see FIGURE 4.19b).

Figure 4.19: FIGURE 4.19 The Amazing Ames Room (a) A diagram showing the actual proportions of the Ames room reveals its secrets. The sides of the room form a trapezoid with a back wall that’s way off square. The uneven floor makes the room’s height in the far back corner shorter than in the near corner. (b) Looking into the Ames room through the viewing port with only one eye, the observer infers a normal size–distance relationship—that both people are the same distance away. But the different image sizes they project on the retina lead the viewer to conclude, based on monocular cues, that one person is very small and the other is very large.
Phil Schermeister/Corbis


Perceiving Motion and Change

You should now have a good sense of how we see what and where objects are, a process made substantially easier when the objects stay in one place. But real life, of course, is full of moving targets. Birds fly, horses gallop, and trees bend in the wind. Understanding how we perceive motion and why we sometimes fail to perceive change can bring us closer to appreciating how visual perception works in everyday life.

Motion Perception

To sense motion, the visual system must encode information about both space and time. The simplest case to consider is an observer who does not move trying to perceive an object that does.

As an object moves across an observer’s stationary visual field, it first stimulates one location on the retina, and then a little later it stimulates another location on the retina. Neural circuits in the brain can detect this change in position over time and respond to specific speeds and directions of motion (Emerson, Bergen, & Adelson, 1992). A region in the middle of the temporal lobe referred to as MT (part of the dorsal stream we discussed earlier) is specialized for the visual perception of motion (Born & Bradley, 2005; Newsome & Paré, 1988), and brain damage in this area leads to a deficit in normal motion perception (Zihl, von Cramon, & Mai, 1983).
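One classic way a circuit can respond to a specific speed and direction, in the spirit of the motion-energy models cited above, is a delay-and-compare scheme. The sketch below is a deliberately minimal illustration (our construction, not the brain’s actual wiring; all names are hypothetical).

```python
# A minimal "delay-and-compare" (Reichardt-style) motion detector:
# correlate the delayed output of one receptor with the current output
# of a neighboring receptor. Motion at the matched speed and direction
# lines the two signals up and produces a strong response.

def rightward_motion_signal(left_sensor, right_sensor, delay):
    """Sum over time of (delayed left response) * (current right
    response). A stimulus sweeping left-to-right at the matched speed
    reaches the right sensor just as the delayed left signal arrives."""
    return sum(left_sensor[t - delay] * right_sensor[t]
               for t in range(delay, len(right_sensor)))

# A bright spot passes the left sensor at t=1, then the right at t=2.
left  = [0, 1, 0, 0]
right = [0, 0, 1, 0]
print(rightward_motion_signal(left, right, delay=1))  # 1: rightward motion detected
print(rightward_motion_signal(right, left, delay=1))  # 0: leftward motion ignored
```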

Of course, in the real world, rarely are you a stationary observer. As you move around, your head and eyes move all the time. The motion-perception system must take into account the position and movement of your eyes—and ultimately of your head and body—in order to perceive the motions of objects correctly. The brain accomplishes this feat by monitoring your eye and head movements and “subtracting” them from the motion in the retinal image.
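This “subtraction” can be captured with simple arithmetic. Here is a minimal sketch, assuming velocities measured in degrees of visual angle per second (the function and scenarios are illustrative, not a quantitative model of the brain’s computation).

```python
# Toy model of compensating for eye movements during motion perception.
# Retinal motion = object motion - eye motion, so the brain can recover
# object motion by adding the (monitored) eye velocity back in.

def perceived_object_motion(retinal_motion: float, eye_velocity: float) -> float:
    """Add the monitored eye velocity back to the retinal motion signal."""
    return retinal_motion + eye_velocity

# Eyes fixed; object drifts right at 5 deg/s -> retinal motion is 5 deg/s.
print(perceived_object_motion(retinal_motion=5.0, eye_velocity=0.0))   # 5.0: object moves

# Eyes smoothly track the object -> the retinal image barely moves,
# yet the object is still (correctly) perceived as moving.
print(perceived_object_motion(retinal_motion=0.0, eye_velocity=5.0))   # 5.0: object moves

# Eyes sweep across a static scene -> retinal motion is entirely
# self-generated, and the computation correctly reports a still world.
print(perceived_object_motion(retinal_motion=-5.0, eye_velocity=5.0))  # 0.0: world is still
```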

How can flashing lights on a casino sign give the impression of movement?

The movement of objects in the world is not the only event that can evoke the perception of motion. The successively flashing lights of a Las Vegas casino sign can evoke a strong sense of motion because people perceive a series of flashing lights as a whole, moving object (see again FIGURE 4.14f). This perception of movement as a result of alternating signals appearing in rapid succession in different locations is called apparent motion. Video technology and animation depend on apparent motion. Motion pictures flash 24 frames per second (fps). A slower rate would produce a much choppier sense of motion; a faster rate would be a waste of resources because we would not perceive the motion as any smoother than it appears at 24 fps.
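A quick back-of-the-envelope calculation shows why frame rate matters here. The sketch below (the 12 deg/s speed is an assumed value, for illustration only) computes how far an on-screen object jumps between successive frames; the larger the per-frame jump, the choppier the apparent motion.

```python
def jump_per_frame_deg(speed_deg_per_s: float, fps: float) -> float:
    """Angular distance an object jumps between successive frames."""
    return speed_deg_per_s / fps

# An object sweeping across the screen at an assumed 12 degrees of
# visual angle per second, shown at several frame rates.
for fps in (8, 12, 24, 48):
    print(f"{fps:>2} fps -> {jump_per_frame_deg(12, fps):.2f} deg per frame")
```

At 8 fps the object leaps 1.5 degrees between frames, which looks jerky; at 24 fps the jumps shrink to 0.5 degrees, small enough that the visual system fills them in as smooth motion.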

apparent motion

The perception of movement as a result of alternating signals appearing in rapid succession in different locations.


Change Blindness and Inattentional Blindness

Motion involves a change in an object’s position over time, but objects in the visual environment can change in ways that do not involve motion (Rensink, 2002). You might walk by the same clothing store window every day and notice when a new suit or dress is on display. Intuitively, we feel that we can easily detect changes to our visual environment. However, our comfortable intuitions have been challenged by experimental demonstrations of change blindness, which occurs when people fail to detect changes to the visual details of a scene (Rensink, 2002; Simons & Rensink, 2005). One study tested this idea by having an experimenter ask a person on a college campus for directions (Simons & Levin, 1998). While they were talking, two men walked between them holding a door that hid a second experimenter (see FIGURE 4.20). Behind the door, the two experimenters traded places so that when the men carrying the door moved on, a different person was asking for directions than the one who had been there just a second or two earlier. Remarkably, only 7 of 15 participants reported noticing this change.

Figure 4.20: FIGURE 4.20 Change Blindness The white-haired man was giving directions to one experimenter (a), who disappeared behind the moving door (b), only to be replaced by another experimenter (c). Like many other people, the man failed to detect a seemingly obvious change.

change blindness

When people fail to detect changes to the visual details of a scene.

Although surprising, these findings once again illustrate the importance of focused attention for visual perception. Just as focused attention is critical for binding together the features of objects, it is also necessary for detecting changes to objects and scenes (Rensink, 2002; Simons & Rensink, 2005). Change blindness is most likely to occur when people fail to focus attention on the changed object (even though the object is registered by the visual system) and is least likely to occur for items that draw attention to themselves (Rensink, O’Regan, & Clark, 1997).

How can a failure of focused attention explain change blindness?

The role of focused attention in conscious visual experience is also dramatically illustrated by the closely related phenomenon of inattentional blindness, a failure to perceive objects that are not the focus of attention. We’ve already seen that using cell phones has negative effects on driving (see The Real World: Multitasking). In another study, researchers asked whether cell phone use contributes to inattentional blindness in everyday life (Hyman et al., 2010). They recruited a clown to ride a unicycle around a large square in the middle of the campus at Western Washington University. On a pleasant afternoon, the researchers asked 151 students who had just walked through the square whether they saw the clown. Seventy-five percent of the students who were using cell phones failed to notice the clown, compared with less than 50% of the students who were not using cell phones. Using cell phones draws on focused attention, resulting in increased inattentional blindness and emphasizing again that our conscious experience of the visual environment is restricted to those features or objects selected by focused attention.

inattentional blindness

A failure to perceive objects that are not the focus of attention.

College students who were using their cell phones while walking through campus failed to notice the unicycling clown more frequently than students who were not using their cell phones.
Hyman et al., 2010


Culture & Community: Does culture influence change blindness?

Does culture influence change blindness? The evidence for change blindness that we’ve considered in this chapter comes from studies using participants from Western cultures, mainly Americans. Would change blindness occur in individuals from other cultures? Think back to the Culture & Community box from the Psychology: Evolution of a Science chapter, where we discussed evidence showing that people from Western cultures rely on an analytic style of processing information (i.e., they tend to focus on an object without paying much attention to the surrounding context), whereas people from Eastern cultures tend to adopt a holistic style (i.e., they tend to focus on the relationship between an object and the surrounding context; Kitayama et al., 2003; Nisbett & Miyamoto, 2005).

With this distinction in mind, Masuda and Nisbett (2006) noted that previous studies of change blindness, using mainly American participants, had shown that participants are more likely to detect changes in the main or focal object in a scene, and less likely to detect changes in surrounding context. The researchers hypothesized that individuals from an Eastern culture would be more focused on—and therefore likely to notice—changes in surrounding context than individuals from a Western culture. To test their prediction, they conducted three experiments examining change detection in American and Japanese college students (Masuda & Nisbett, 2006). In each experiment, they made changes either to the main object in a scene or to the surrounding context (e.g., objects in the background).

The results of the experiments were consistent with predictions: Japanese students detected more changes to contextual information than did American students, whereas American students detected more changes to focal objects than did Japanese students. These findings extend earlier reports that people from Eastern and Western cultures see the world differently, with Easterners focusing more on the context in which an object appears and Westerners focusing more on the object itself.

SUMMARY QUIZ [4.3]

Question 4.7

1. Our ability to visually combine details so that we perceive unified objects is explained by
  a. feature-integration theory.
  b. illusory conjunction.
  c. synesthesia.
  d. ventral and dorsal streaming.

Answer: a

Question 4.8

2. The idea that specialized brain areas represent particular classes of objects is
  a. the modular view.
  b. attentional processing.
  c. distributed representation.
  d. neuron response.

Answer: a

Question 4.9

3. What kind of cues are relative height and linear perspective?
  a. motion-based
  b. binocular
  c. monocular
  d. template

Answer: c
