7.2 Operant Conditioning: Reinforcements from the Environment

The study of classical conditioning is the study of behaviors that are reactive. Most animals don’t voluntarily salivate or feel spasms of anxiety; rather, animals exhibit these responses involuntarily during the conditioning process. But we also engage in voluntary behaviors in order to obtain rewards and avoid punishment. Operant conditioning is a type of learning in which the consequences of an organism’s behavior determine whether it will be repeated in the future. The study of operant conditioning is the exploration of behaviors that are active.

operant conditioning

A type of learning in which the consequences of an organism’s behavior determine whether it will be repeated in the future.

The Development of Operant Conditioning: The Law of Effect

The study of how active behavior affects the environment began at about the same time as classical conditioning. In the 1890s, Edward Thorndike studied instrumental behaviors—that is, behavior that required an organism to do something, solve a problem, or otherwise manipulate elements of its environment (Thorndike, 1898). Some of Thorndike’s experiments used a puzzle box, which was a wooden crate with a door that would open when a concealed lever was moved in the right way (see FIGURE 7.4). A hungry cat placed in a puzzle box would try various behaviors to get out—scratching at the door, meowing loudly, sniffing the inside of the box, putting its paw through the openings—but only one behavior opened the door and led to food: tripping the lever in just the right way. After this happened, Thorndike placed the cat back in the box for another round. Over time, the ineffective behaviors became less and less frequent, and the one instrumental behavior (going right for the latch) became more frequent (see FIGURE 7.5). From these observations, Thorndike developed the law of effect: Behaviors that are followed by a “satisfying state of affairs” tend to be repeated and those that produce an “unpleasant state of affairs” are less likely to be repeated.

Figure 7.4: FIGURE 7.4 Thorndike’s Puzzle Box In Thorndike’s original experiments, food was placed just outside the door of the puzzle box, where the cat could see it. If the cat triggered the appropriate lever, the door would open, letting the cat out.
Figure 7.5: FIGURE 7.5 The Law of Effect Thorndike’s cats displayed trial-and-error behavior when trying to escape from the puzzle box until, over time, they discovered the solution. Once they figured out what behavior was instrumental in opening the latch, they stopped all other ineffective behaviors and escaped from the box faster and faster.
Data from Thorndike, 1898.

law of effect

Behaviors that are followed by a “satisfying state of affairs” tend to be repeated and those that produce an “unpleasant state of affairs” are less likely to be repeated.


What is the relationship between behavior and reward?

Such learning is very different from classical conditioning. Remember that in classical conditioning experiments, the US occurred on every training trial no matter what the animal did. Pavlov delivered food to the dog whether it salivated or not. But in Thorndike’s work, the behavior of the animal determined what happened next. If the behavior was “correct” (i.e., the latch was triggered), the animal was rewarded with food. Incorrect behaviors produced no results, and the animal was stuck in the box until it performed the correct behavior. Although different from classical conditioning, Thorndike’s work resonated with most behaviorists at the time: It was still observable, quantifiable, and free from explanations involving the mind (Galef, 1998).

B. F. Skinner: The Role of Reinforcement and Punishment

Several decades after Thorndike’s work, B. F. Skinner (1904–1990) coined the term operant behavior to refer to behavior that an organism produces that has some impact on the environment. In Skinner’s system, all of these emitted behaviors “operated” on the environment in some manner, and the environment responded by providing events that either strengthened those behaviors (i.e., they reinforced them) or made them less likely to occur (i.e., they punished them). Skinner’s elegantly simple observation was that most organisms do not behave like a dog in a harness, passively waiting to receive food no matter what the circumstances. Rather, most organisms are like cats in a box, actively engaging the environment in which they find themselves to reap rewards (Skinner, 1938, 1953). In order to study operant behavior scientifically, Skinner developed the operant conditioning chamber, or Skinner box, as it is commonly called (shown in FIGURE 7.6), which allows a researcher to study the behavior of small organisms in a controlled environment.

operant behavior

Behavior that an organism produces that has some impact on the environment.

Figure 7.6: FIGURE 7.6 Skinner Box In a typical Skinner box, or operant conditioning chamber, a rat, pigeon, or other suitably sized animal is placed in this environment and observed during learning trials that use operant conditioning principles.
Science Source


Skinner’s approach to the study of learning focused on reinforcement and punishment. These terms, which have commonsense connotations, have particular meanings in psychology that are defined by their effects on behavior: A reinforcer is any stimulus or event that functions to increase the likelihood of the behavior that led to it, whereas a punisher is any stimulus or event that functions to decrease the likelihood of the behavior that led to it.

reinforcer

Any stimulus or event that functions to increase the likelihood of the behavior that led to it.

punisher

Any stimulus or event that functions to decrease the likelihood of the behavior that led to it.

B. F. Skinner with one of his many research participants.
© Sam Falk/Science Source

Whether a particular stimulus acts as a reinforcer or a punisher depends in part on whether it increases or decreases the likelihood of a behavior. Presenting food is usually reinforcing and produces an increase in the behavior that led to it; removing food is often punishing and leads to a decrease in the behavior. Turning on a device that causes an electric shock is typically punishing (and decreases the behavior that led to it); turning it off is rewarding (and increases the behavior that led to it).

To keep these possibilities distinct, Skinner used the term positive for situations in which a stimulus was presented and negative for situations in which it was removed. Consequently, there is positive reinforcement (where a rewarding stimulus is presented) and negative reinforcement (where an unpleasant stimulus is removed), as well as positive punishment (where an unpleasant stimulus is administered) and negative punishment (where a rewarding stimulus is removed). Here the words positive and negative mean, respectively, something that is added or something that is taken away, but the terms do not mean “good” or “bad” as they do in everyday speech. As you can see from TABLE 7.1, positive and negative reinforcement increase the likelihood of the behavior; positive and negative punishment decrease the likelihood of the behavior.

Table 7.1 Reinforcement and Punishment

                          Increases the Likelihood of Behavior    Decreases the Likelihood of Behavior
Stimulus is presented     Positive reinforcement                  Positive punishment
Stimulus is removed       Negative reinforcement                  Negative punishment

These distinctions can be confusing at first; after all, “negative reinforcement” and “punishment” both sound like they should be “bad” and produce the same type of behavior. However, negative reinforcement involves something pleasant; it’s the removal of something unpleasant, like a shock, and the absence of a shock is indeed pleasant.
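If the four terms still feel slippery, the short sketch below restates Table 7.1 as two yes-or-no questions. It is purely illustrative (the function name and labels are not from the text): was a stimulus presented or removed, and did the behavior become more or less likely?

```python
# A minimal sketch, not from the text, of the two questions behind Table 7.1:
# Was a stimulus presented or removed, and did the behavior become more or less
# likely? The function name and labels are illustrative.

def classify_consequence(stimulus_presented: bool, behavior_increases: bool) -> str:
    """Label a consequence using Skinner's terminology."""
    if behavior_increases:
        # Anything that makes the behavior more likely is reinforcement.
        return "positive reinforcement" if stimulus_presented else "negative reinforcement"
    # Anything that makes the behavior less likely is punishment.
    return "positive punishment" if stimulus_presented else "negative punishment"

# Removing an unpleasant stimulus (a shock turns off) and the behavior increases:
print(classify_consequence(stimulus_presented=False, behavior_increases=True))
# -> negative reinforcement
```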

Why is reinforcement more constructive than punishment in learning desired behavior?

Reinforcement is generally more effective than punishment in promoting learning. There are many reasons (Gershoff, 2002), but one reason is this: Punishment signals that an unacceptable behavior has occurred, but it doesn’t specify what should be done instead. Spanking a young child for starting to run into a busy street certainly stops the behavior, but it doesn’t promote any kind of learning about the desired behavior.

Primary and Secondary Reinforcement and Punishment

Reinforcers and punishers often gain their functions from basic biological mechanisms. A pigeon that pecks at a target in a Skinner box is usually reinforced with food pellets, just as an animal that learns to escape a mild electric shock has avoided the punishment of tingly paws. Food, comfort, shelter, or warmth are examples of primary reinforcers because they help satisfy biological needs. However, the vast majority of reinforcers or punishers in our daily lives have little to do with biology: Verbal approval, a bronze trophy, or money all serve powerful reinforcing functions, yet none of them taste very good or help keep you warm at night.


These secondary reinforcers derive their effectiveness from their associations with primary reinforcers through classical conditioning. For example, money starts out as a neutral CS that, through its association with primary USs like acquiring food or shelter, takes on a conditioned emotional element. Flashing lights, originally a neutral CS, acquire powerful negative elements through association with a speeding ticket and a fine.

Negative reinforcement involves the removal of something unpleasant from the environment. When Daddy stops the car, he gets a reward: His little monster stops screaming. However, from the perspective of the child, this is positive reinforcement. The child’s tantrum results in something positive added to the environment—stopping for a snack.
© Michelle Selesnick/Flickr Vision

Immediate versus Delayed Reinforcement and Punishment

A key determinant of the effectiveness of a reinforcer is the amount of time between the occurrence of a behavior and the reinforcer: The more time that elapses, the less effective the reinforcer (Lattal, 2010; Renner, 1964). This was dramatically illustrated in experiments in which food reinforcers were given at varying times after the rat pressed the lever (Dickinson, Watt, & Griffiths, 1992). Delaying reinforcement by even a few seconds led to a reduction in the number of times the rat subsequently pressed the lever, and extending the delay to a minute rendered the food reinforcer completely ineffective (see FIGURE 7.7). The most likely explanation for this effect is that delaying the reinforcer made it difficult for the rats to figure out exactly what behavior they needed to perform in order to obtain it. In the same way, parents who wish to reinforce their children for playing quietly with a piece of candy should provide the candy while the child is still playing quietly; waiting until later when the child may be engaging in other behaviors—perhaps making a racket with pots and pans—will make it more difficult for the child to link the reinforcer with the behavior of playing quietly (Powell et al., 2009). Similar considerations apply to punishment: As a general rule, the longer the delay between a behavior and the administration of punishment, the less effective the punishment will be in suppressing the targeted behavior (Kamin, 1959; Lerman & Vorndran, 2002).

Figure 7.7: FIGURE 7.7 Delay of Reinforcement Rats pressed a lever in order to obtain a food reward. The number of lever presses declined substantially with longer delays between the lever press and the delivery of food reinforcement. (Data from Dickinson, Watt, & Griffiths, 1992.)
Suppose you are the mayor of a suburban town and you want to institute some new policies to decrease the number of drivers who speed on residential streets. How might you use punishment to decrease the behavior you want to discourage (speeding)? How might you use reinforcement to increase the behavior you desire (safe driving)? Based on the principles of operant conditioning you read about in this section, which approach do you think might be most fruitful?
© Eden Breitz/Alamy

How does the concept of delayed reinforcement relate to difficulties with quitting smoking?

The greater potency of immediate versus delayed reinforcers may help us to appreciate why it can be difficult to engage in behaviors that have long-term benefits. The smoker who desperately wants to quit smoking will be reinforced immediately by the feeling of relaxation that results from lighting up, but he or she may have to wait years to be reinforced with better health that results from quitting; the dieter who sincerely wants to lose weight may easily succumb to the temptation of a chocolate sundae that provides reinforcement now, rather than wait for the reinforcement that would come weeks or months later, once the weight was lost.


The Basic Principles of Operant Conditioning

After establishing how reinforcement and punishment produced learned behavior, Skinner and other scientists began to expand the parameters of operant conditioning. Let’s look at some of these basic principles of operant conditioning.

Culture & Community: Are there cultural differences in reinforcers?

Are there cultural differences in reinforcers? Operant approaches that use positive reinforcement have been applied extensively in everyday settings such as behavior therapy (see Treatment of Psychological Disorders, pp. 476–505). Surveys designed to assess what kinds of reinforcers are rewarding to individuals have revealed that there can be wide differences among various groups (Dewhurst & Cautela, 1980; Houlihan et al., 1991).

Recently, 750 high school students from America, Australia, Tanzania, Denmark, Honduras, Korea, and Spain were surveyed in order to evaluate possible cross-cultural differences among reinforcers (Homan et al., 2012). The survey asked students to rate how rewarding they found a range of activities, including listening to music, playing music, taking part in various kinds of sports, shopping, reading, spending time with friends, and so on. The researchers hypothesized that American high school students would differ most strongly from high school students in the third-world countries of Tanzania and Honduras, and that’s what they found. The differences between American and Korean students were nearly as large, and somewhat surprisingly, so were the differences between American and Spanish students. There were much smaller differences between Americans and their Australian or Danish counterparts.

These results should be taken with a grain of salt because the researchers did not control for variables other than culture that could influence the results, such as economic status. Nonetheless, the study suggests that cultural differences should be considered in the design of programs or interventions that rely on the use of reinforcers to influence the behavior of individuals who come from different cultures.

Discrimination and Generalization

Operant conditioning shows both discrimination and generalization effects similar to those we saw with classical conditioning. For example, in one study, researchers presented either an Impressionist painting by Monet or a Cubist painting by Picasso (Watanabe, Sakamoto, & Wakita, 1995). Participants in the experiment were only reinforced if they responded when the appropriate painting was presented. After training, the participants discriminated appropriately; those trained with the Monet painting responded when other paintings by Monet were presented, but not when other Cubist paintings by Picasso were shown; those trained with a Picasso painting showed the opposite behavior. What’s more, the research participants showed that they could generalize: Those trained with Monet responded appropriately when shown other Impressionist paintings, and the Picasso-trained participants responded to other Cubist artwork despite never having seen those paintings before. These results are particularly striking because the research participants were pigeons that were trained to key-peck to these various works of art.


Participants trained with Picasso paintings, such as the one on the left, responded to other paintings by Picasso or even to paintings by other Cubists. Participants trained with Monet paintings, such as the one on the right, responded to other paintings by Monet or by other French Impressionists. Interestingly, the participants in this study were pigeons.
Picasso, Pablo (1881-1973), The Weeping Woman [Femme en pleurs], 1937. Oil on canvas, 60.8 x 50.0 cm. © 2013 Estate of Pablo Picasso/Artists Rights Society (ARS), New York. Tate Gallery, London/Art Resource, NY

Extinction

As in classical conditioning, operant behavior undergoes extinction when the reinforcements stop. Pigeons cease pecking at a key if food is no longer presented following the behavior. You wouldn’t put more money into a vending machine if it failed to give you its promised candy bar or soda. On the surface, extinction of operant behavior looks like that of classical conditioning.

However, there is an important difference. As noted, in classical conditioning, the US occurs on every trial no matter what the organism does. In operant conditioning, the reinforcements only occur when the proper response has been made, and they don’t always occur even then. Not every trip into the forest produces nuts for a squirrel, auto salespeople don’t sell to everyone who takes a test drive, and researchers run many experiments that do not work out and never get published. Yet these behaviors don’t weaken and gradually extinguish. Extinction is a bit more complicated in operant conditioning than in classical conditioning because it depends, in part, on how often reinforcement is received. In fact, this principle is an important cornerstone of operant conditioning that we’ll examine next.

How is the concept of extinction different in operant conditioning versus classical conditioning?

Schedules of Reinforcement

One day, Skinner was laboriously hand-rolling food pellets to reinforce the rats in his experiments. It occurred to him that perhaps he could save time and effort by not giving his rats a pellet for every bar press but instead delivering food on some intermittent schedule. The results of this hunch were dramatic. Not only did the rats continue bar pressing, but they also shifted the rate and pattern of bar pressing depending on the timing and frequency of the presentation of the reinforcers (Skinner, 1979). Unlike classical conditioning, where the sheer number of learning trials was important, in operant conditioning, the pattern with which reinforcements appeared was crucial. Skinner explored dozens of what came to be known as schedules of reinforcement (Ferster & Skinner, 1957; see FIGURE 7.8). We’ll consider some of the most important next.

Figure 7.8: FIGURE 7.8 Reinforcement Schedules Different schedules of reinforcement produce different rates of responding. These lines represent the amount of responding that occurs under each type of reinforcement. The black slash marks indicate when reinforcement was administered. Notice that ratio schedules tend to produce higher rates of responding than do interval schedules, as shown by the steeper lines for fixed-ratio and variable-ratio reinforcement.
Students cramming for an exam often show the same kind of behavior as pigeons being reinforced under a fixed-interval schedule.
Brand X Pictures/Jupiter Images

Interval Schedules. Under a fixed-interval schedule (FI), reinforcers are presented at fixed-time periods, provided that the appropriate response is made. For example, on a 2-minute fixed-interval schedule, a response will be reinforced, but only after 2 minutes have expired since the last reinforcement. Rats and pigeons in Skinner boxes produce predictable patterns of behavior under these schedules. They show little responding right after the presentation of the reinforcement, but as the next time interval draws to a close, they show a burst of responding. Many undergraduates behave exactly like this. They do relatively little work until just before the upcoming exam, and then they engage in a burst of reading and studying.

fixed-interval schedule (FI)

An operant conditioning principle in which reinforcers are presented at fixed-time periods, provided that the appropriate response is made.


Under a variable-interval schedule (VI), a behavior is reinforced based on an average time that has expired since the last reinforcement. For example, on a 2-minute variable-interval schedule, responses will be reinforced every 2 minutes on average. Variable-interval schedules typically produce steady, consistent responding because the time until the next reinforcement is less predictable. One example of a VI schedule in real life might be radio promotional giveaways. The reinforcement—say, a ticket to a rock concert—might occur on average once an hour across the span of the broadcasting day, but it might come early in the 10 o’clock hour, later in the 11 o’clock hour, immediately into the 12 o’clock hour, and so on.

variable-interval schedule (VI)

An operant conditioning principle in which behavior is reinforced based on an average time that has expired since the last reinforcement.

How does a radio station use scheduled reinforcements to keep you listening?

Radio station promotions and giveaways often follow a variable-interval schedule of reinforcement.
© Richard Hutchings/Photoedit

Both fixed-interval schedules and variable-interval schedules tend to produce slow, methodical responding because the reinforcements follow a time scale that is independent of how many responses occur. It doesn’t matter if a rat on a fixed-interval schedule presses a bar 1 time during a 2-minute period or 100 times: The reinforcing food pellet won’t drop out of the chute until 2 minutes have elapsed, regardless of the number of responses.

How do ratio schedules work to keep you spending your money?

These pieceworkers in a textile factory get paid following a fixed-ratio schedule: They receive payment after some set number of shirts have been sewn.
Jeff Holt/Bloomberg via Getty Images

Ratio Schedules. Under a fixed-ratio schedule (FR), reinforcement is delivered after a specific number of responses have been made. One schedule might present reinforcement after every fourth response, a different schedule might present reinforcement after every 20 responses; the special case of presenting reinforcement after each response is called continuous reinforcement. There are many situations in which people are reinforced on a fixed-ratio schedule: Book clubs often give you a freebie after a set number of regular purchases; pieceworkers get paid after making a fixed number of products; and some credit card companies return to their customers a percentage of the amount charged. When a fixed-ratio schedule is operating, it is possible, in principle, to know exactly when the next reinforcer is due. A laundry pieceworker on a 10-response, fixed-ratio schedule who has just washed and ironed the ninth shirt knows that payment is coming after the next shirt is done.

fixed-ratio schedule (FR)

An operant conditioning principle in which reinforcement is delivered after a specific number of responses have been made.


Under a variable-ratio schedule (VR), the delivery of reinforcement is based on a particular average number of responses. For example, slot machines in a modern casino pay off on variable-ratio schedules. A casino might advertise that it pays off on “every 100 pulls on average,” but one player might hit a jackpot after 3 pulls on a slot machine, whereas another player might not hit a jackpot until after 80 pulls.

variable-ratio schedule (VR)

An operant conditioning principle in which the delivery of reinforcement is based on a particular average number of responses.

Not surprisingly, variable-ratio schedules produce slightly higher rates of responding than fixed-ratio schedules, primarily because the organism never knows when the next reinforcement is going to appear. What’s more, the higher the ratio, the higher the response rate tends to be; a 20-response variable-ratio schedule will produce considerably more responding than a 2-response variable-ratio schedule.
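The logic of the four schedules can be captured in a few lines of code. The sketch below is only an illustration (the function names, parameter values, and the sampling used for the variable schedules are assumptions, not the procedures Skinner used): ratio schedules count responses, interval schedules count elapsed time, and the "variable" versions simply make the requirement unpredictable from one reinforcer to the next.

```python
import random

# A minimal sketch, not from the text, of how the four schedules decide when the
# next reinforcer becomes available. The sampling used for the variable schedules
# is an illustrative assumption.

def next_requirement(schedule: str, param: int) -> float:
    """Sample what it takes to earn the next reinforcer under each schedule."""
    if schedule == "FR":                      # fixed ratio: exactly `param` responses, every time
        return param
    if schedule == "VR":                      # variable ratio: `param` responses on average
        return random.randint(1, 2 * param - 1)
    if schedule == "FI":                      # fixed interval: exactly `param` seconds must pass
        return param
    if schedule == "VI":                      # variable interval: `param` seconds on average
        return random.uniform(1, 2 * param - 1)
    raise ValueError(f"unknown schedule: {schedule}")

def reinforced(schedule: str, requirement: float,
               responses_since_reward: int, seconds_since_reward: float) -> bool:
    """A response is reinforced once the sampled requirement has been met."""
    if schedule in ("FR", "VR"):              # ratio schedules count responses
        return responses_since_reward >= requirement
    return seconds_since_reward >= requirement   # interval schedules count elapsed time

# Example: a 20-response VR schedule might demand 4 responses this time and 33 the
# next, which is why the organism can never tell how close the next reward is.
requirement = next_requirement("VR", param=20)
print(requirement, reinforced("VR", requirement, responses_since_reward=10, seconds_since_reward=0.0))
```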

When schedules of reinforcement provide intermittent reinforcement, in which only some of the responses made are followed by reinforcement, they produce behavior that is much more resistant to extinction than does a continuous reinforcement schedule. One way to think about this effect is to recognize that the more irregular and intermittent a schedule is, the more difficult it becomes for an organism to detect when the behavior has actually been placed on the road to extinction. For example, if you’ve just put a dollar into a soda machine that, unbeknownst to you, is broken, no soda comes out. Because you’re used to getting your sodas on a continuous reinforcement schedule—one dollar produces one soda—this abrupt change in the environment is easily noticed, and you are unlikely to put additional money into the machine: You’d quickly show extinction. However, if you’ve put your dollar into a slot machine that, unbeknownst to you, is broken, do you stop after one or two plays? Almost certainly not. If you’re a regular slot player, you’re used to going for many plays in a row without winning anything, so it’s difficult to tell that anything is out of the ordinary. The intermittent reinforcement effect refers to the fact that operant behaviors that are maintained under intermittent reinforcement schedules resist extinction better than those maintained under continuous reinforcement.

intermittent reinforcement

An operant conditioning principle in which only some of the responses made are followed by reinforcement.

intermittent reinforcement effect

The fact that operant behaviors that are maintained under intermittent reinforcement schedules resist extinction better than those maintained under continuous reinforcement.

How can operant conditioning produce complex behaviors?

Slot machines in casinos pay out following a variable-ratio schedule. This helps explain why some gamblers feel incredibly lucky, whereas others (like this chap) can’t believe they can play a machine for so long without winning a thing.
© MBI/Alamy

shaping

Learning that results from the reinforcement of successive steps to a final desired behavior.

Imagine you own an insurance company and you want to encourage your salespeople to sell as many policies as possible. You decide to give them bonuses, based on the number of policies sold. How might you set up a system of bonuses using an FR schedule? Using a VR schedule? Which system do you think would encourage your salespeople to work harder, in terms of making more sales?
iStockphoto/Thinkstock

Shaping through Successive Approximations

Have you ever been to AquaLand and wondered how the dolphins learn to jump up in the air, twist around, splash back down, do a somersault, and then jump through a hoop, all in one smooth motion? Well, they don’t. At least not all at once. Rather, elements of their behavior are shaped over time until the final product looks like one smooth motion.

Behavior rarely occurs in fixed frameworks in which a stimulus is presented and then an organism has to engage in some activity or another. Most of our behaviors are the result of shaping, learning that results from the reinforcement of successive steps to a final desired behavior. The outcomes of one set of behaviors shape the next set of behaviors, whose outcomes shape the next set of behaviors, and so on.

Skinner noted that if you put a rat in a Skinner box and wait for it to press the bar, you could end up waiting a very long time: Bar pressing just isn’t very high in a rat’s natural hierarchy of responses. However, it is relatively easy to shape bar pressing. Wait until the rat turns in the direction of the bar, and then deliver a food reward. This will reinforce turning toward the bar, making such a movement more likely. Now wait for the rat to take a step toward the bar before delivering food; this will reinforce moving toward the bar. After the rat walks closer to the bar, wait until it touches the bar before presenting the food. Notice that none of these behaviors is the final desired behavior (reliably pressing the bar). Rather, each behavior is a successive approximation to the final product, or a behavior that gets incrementally closer to the overall desired behavior. In the dolphin example—and indeed, in many instances in which animals perform astoundingly complex behaviors—each smaller behavior is reinforced until the overall sequence of behavior is performed reliably.
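A toy simulation can make the idea of successive approximations concrete. Everything below is an illustrative assumption rather than data or a real training protocol: the "rat" is just a number between 0 and 1 that measures how close its current behavior is to a full bar press, and each reinforced trial nudges that number upward until the next, stricter criterion comes into reach.

```python
import random

# A toy simulation, not from the text, of shaping a bar press by successive
# approximations. The criteria, noise, and increment are illustrative assumptions;
# the point is only that reinforcement is given for meeting the current criterion,
# and the criterion is then raised toward the final desired behavior.

def shape_bar_press(criteria=(0.25, 0.5, 0.75, 1.0), trials_per_step=50):
    tendency = 0.0                                   # how strongly behavior approximates a full bar press
    for criterion in criteria:                       # face the bar -> approach it -> touch it -> press it
        for _ in range(trials_per_step):
            behavior = min(1.0, max(0.0, tendency + random.gauss(0, 0.2)))
            if behavior >= criterion:                # close enough to the current target?
                tendency += 0.05                     # reinforcement strengthens this approximation
        print(f"after shaping to criterion {criterion:.2f}, tendency = {tendency:.2f}")

shape_bar_press()
```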


B. F. Skinner shaping a dog named Agnes. In the span of 20 minutes, Skinner was able to use reinforcement of successive approximations to shape Agnes’s behavior. The result was a pretty neat trick: to wander in, stand on hind legs, and jump.
Library Of Congress/Look Magazine Photographic Collection

Superstitious Behavior

Everything we’ve discussed so far suggests that one of the keys to establishing reliable operant behavior is the correlation between an organism’s response and the occurrence of reinforcement. As you read in the Methods in Psychology chapter, however, just because two things are correlated (i.e., they tend to occur together in time and space) doesn’t imply that there is causality (i.e., the presence of one reliably causes the other to occur).

How would a behaviorist explain superstitions?

People engage in all kinds of superstitious behaviors. When the Detroit Tigers went on a winning streak in the summer of 2011, Tigers’ manager Jim Leyland refused to change his underwear, wearing them to the park every day until the winning streak ended. Skinner thought superstitions resulted from the unintended reinforcement of inconsequential behavior.
AP Photo/Ben Margot

Skinner (1948) designed an experiment that illustrates this distinction. He put several pigeons in Skinner boxes, set the food dispenser to deliver food every 15 seconds, and left the birds to their own devices. Later, he returned and found the birds engaging in odd, idiosyncratic behaviors, such as pecking aimlessly in a corner or turning in circles. He referred to these behaviors as “superstitious” and offered a behaviorist analysis of their occurrence. A pigeon that just happened to have pecked randomly in the corner when the food showed up had connected the delivery of food to that behavior.


Because this pecking behavior was reinforced by the delivery of food, the pigeon was likely to repeat it. Now pecking in the corner was more likely to occur, and it was more likely to be reinforced 15 seconds later when the food appeared again. Skinner’s pigeons acted as though there was a causal relationship between their behaviors and the appearance of food when it was merely an accidental correlation.

Although some researchers questioned Skinner’s characterization of these behaviors as superstitious (Staddon & Simmelhag, 1971), later studies have shown that people, like pigeons, often behave as though there’s a correlation between their responses and reward when in fact the connection is merely accidental (Bloom et al., 2007; Mellon, 2009; Ono, 1987; Wagner & Morris, 1987). For example, baseball players who hit several home runs on a day when they happened not to have showered are likely to continue that tradition, laboring under the belief that the accidental correlation between poor personal hygiene and a good day at bat is somehow causal. This “stench causes home runs” hypothesis is just one of many examples of human superstitions (Gilbert et al., 2000; Radford & Radford, 1949).

A Deeper Understanding of Operant Conditioning

To behaviorists such as Watson and Skinner, an organism behaved in a certain way as a response to stimuli in the environment, not because there was any wanting, wishing, or willing by the animal in question. However, some research on operant conditioning digs deeper into the underlying mechanisms that produce the familiar outcomes of reinforcement. Let’s examine three elements that expand our view of operant conditioning: the cognitive, neural, and evolutionary elements of operant conditioning.

The Cognitive Elements of Operant Conditioning

Edward Chace Tolman (1886–1959) argued that there was more to learning than just knowing the circumstances in the environment (the properties of the stimulus) and being able to observe a particular outcome (the reinforced response). Instead, Tolman proposed that the conditioning experience produced knowledge or a belief that, in this particular situation, a specific reward will appear if a specific response is made.

Tolman’s ideas may remind you of the Rescorla–Wagner model of classical conditioning. In both the Rescorla–Wagner model and Tolman’s theories, the stimulus does not directly evoke a response; rather, it establishes an internal cognitive state, which then produces the behavior.

Edward Chace Tolman advocated a cognitive approach to operant learning and provided evidence that in maze-learning experiments, rats develop a mental picture of the maze, which he called a cognitive map.
Bancroft Library/University Of California, Berkeley

Latent Learning and Cognitive Maps. In latent learning, something is learned, but it is not manifested as a behavioral change until sometime in the future. Latent learning can easily be established in rats and occurs without any obvious reinforcement, a finding that posed a direct challenge to the then-dominant behaviorist position that all learning required some form of reinforcement (Tolman & Honzik, 1930a).

latent learning

Something is learned, but it is not manifested as a behavioral change until sometime in the future.

Tolman gave three groups of rats access to a complex maze every day for over 2 weeks. The rats in the control group never received any reinforcement for navigating the maze. They were simply allowed to run around until they reached the goal box at the end of the maze. In FIGURE 7.9 you can see that over the 2 weeks of the study, the control group rats (in green) got a little better at finding their way through the maze, but not by much. A second group of rats (in blue) received regular reinforcements; when they reached the goal box, they found a small food reward there. Not surprisingly, these rats showed clear learning. A third group (orange) was treated exactly like the control group for the first 10 days and then rewarded for the last 7 days. For the first 10 days, these rats behaved like those in the control group. However, during the final 7 days, they behaved like the rats that had been reinforced every day. Clearly, the rats in this third group had learned a lot about the maze and the location of the goal box during those first 10 days even though they had not received any reinforcements for their behavior. In other words, they showed evidence of latent learning.

Figure 7.9: FIGURE 7.9 Latent Learning Rats in a control group that never received any reinforcement (in green) improved at finding their way through the maze over 17 days but not by much. Rats that received regular reinforcements (in blue) showed fairly clear learning; their error rate decreased steadily over time. Rats in the latent learning group (in orange) were treated exactly like the control group rats for the first 10 days and then like the regularly rewarded group for the last 7 days. Their dramatic improvement on day 12 shows that these rats had learned a lot about the maze and the location of the goal box even though they had never received reinforcements. (Data from Tolman & Honzik, 1930b.)


These results suggested to Tolman that his rats had developed a cognitive map, a mental representation of the physical features of the environment (Tolman & Honzik, 1930b; Tolman, Ritchie, & Kalish, 1946). Support for this idea was obtained in a clever experiment, in which Tolman trained rats in a maze and then changed the maze—while keeping the start and goal locations in the same spot. Behaviorists would predict that the rats, finding the familiar route blocked, would be stymied. However, faced with a blocked path, the rats instead quickly navigated via a new pathway to the food. This behavior suggested that the rats had formed a sophisticated cognitive map of the environment and could use the map after conditions changed. Tolman’s experiments strongly suggest that there is a cognitive component, even in rats, to operant learning.

cognitive map

A mental representation of the physical features of the environment.

What are cognitive maps, and why are they a challenge to behaviorism?

Learning to Trust: For Better or Worse. Cognitive factors also played a key role in an experiment examining learning and brain activity (using fMRI) in people who played a “trust” game with a fictional partner (Delgado, Frank, & Phelps, 2005). On each trial, a participant could either keep a $1 reward or transfer the reward to a partner, who would receive $3. The partner could then either keep the $3 or share half of it with the participant. When playing with a partner who was willing to share the reward, the participant would be better off transferring the money, but when playing with a partner who did not share, the participant would be better off keeping the $1. Participants in such experiments typically learn who is trustworthy on the basis of trial-and-error, and they give more money to partners who reinforce them by sharing.
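The arithmetic behind the game is worth making explicit. The sketch below (the sharing probabilities are illustrative, not values from the study) shows why trial-and-error feedback should matter: transferring the dollar pays off only if the partner is likely enough to share.

```python
# A minimal sketch, not from the study, of the payoffs in the trust game.
# The sharing probabilities below are illustrative assumptions.

def expected_gain_from_transfer(p_partner_shares: float) -> float:
    """Expected payoff of transferring the $1: the partner gets $3 and may return half."""
    return p_partner_shares * 1.50 + (1 - p_partner_shares) * 0.00

keep = 1.00                                     # keeping the dollar is a sure $1
for p in (0.9, 0.5, 0.1):
    transfer = expected_gain_from_transfer(p)
    better = "transfer" if transfer > keep else "keep"
    print(f"partner shares with probability {p:.1f}: transferring is worth ${transfer:.2f} -> {better}")
```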

In the study by Delgado et al., participants were given detailed descriptions of their partners that portrayed the partners as trustworthy, neutral, or suspect. Even though all three partners shared equally often during the game, the participants’ cognitions about their partners had powerful effects. Participants transferred more money to the trustworthy partner than to the others, essentially ignoring the trial-by-trial feedback that would ordinarily shape their playing behavior, thus reducing the amount of reward they received. Highlighting the power of the cognitive effect, signals in a part of the brain that ordinarily distinguishes between positive and negative feedback were evident only when participants played with the neutral partner; these feedback signals were absent when participants played with the trustworthy partner and reduced when participants played with the suspect partner.

The Neural Elements of Operant Conditioning

Figure 7.10: FIGURE 7.10 Pleasure Centers in the Brain The nucleus accumbens, medial forebrain bundle, and hypothalamus are all major pleasure centers in the brain.

The first hint of how specific brain structures might contribute to the process of reinforcement came from James Olds and his associates, who inserted tiny electrodes into different parts of a rat’s brain and allowed the animal to control electric stimulation of its own brain by pressing a bar. They discovered that some brain areas produced what appeared to be intensely positive experiences: The rats would press the bar repeatedly to stimulate these structures, sometimes ignoring food, water, and other life-sustaining necessities for hours on end simply to receive stimulation directly in the brain. Olds and colleagues called these parts of the brain pleasure centers (Olds, 1956; see FIGURE 7.10).

How do specific brain structures contribute to the process of reinforcement?

In the years since these early studies, researchers have identified a number of structures and pathways in the brain that deliver rewards through stimulation (Wise, 1989, 2005). The neurons in the medial forebrain bundle, a pathway that meanders its way from the midbrain through the hypothalamus into the nucleus accumbens, are the most susceptible to stimulation that produces pleasure. This is not surprising because psychologists have identified this bundle of cells as crucial to behaviors that clearly involve pleasure, such as eating, drinking, and engaging in sexual activity. Second, the neurons all along this pathway and especially those in the nucleus accumbens itself are all dopaminergic (i.e., they secrete the neurotransmitter dopamine). Remember from the Neuroscience and Behavior chapter that higher levels of dopamine in the brain are usually associated with positive emotions. In recent years, several competing hypotheses about the precise role of dopamine have emerged, including the idea that dopamine is more closely linked with the expectation of reward than with reward itself (Fiorillo, Newsome, & Schultz, 2008; Schultz, 2006, 2007), or that dopamine is more closely associated with wanting or even craving something rather than simply liking it (Berridge, 2007). Whichever view turns out to be correct, dopamine seems to play a key role in how we process reward. (For more on the relationship between dopamine and Parkinson’s, see the Hot Science box: Dopamine and Reward Learning in Parkinson’s Disease.)


Hot Science: Dopamine and Reward Learning in Parkinson’s Disease

Many of us have relatives or friends who have been affected by Parkinson’s disease, a movement disorder that involves loss of dopamine-producing neurons. As you learned in the Neuroscience and Behavior chapter, the drug L-dopa is often used to treat Parkinson’s disease because it spurs surviving neurons to produce more dopamine. Dopamine also plays a key role in reward-related learning.

Research suggests that dopamine plays an important role in prediction error: the difference between the actual reward received versus the amount of predicted or expected reward. For example, when an animal presses a lever and receives an unexpected food reward, a positive prediction error occurs (a better than expected outcome), and the animal learns to press the lever again. By contrast, when an animal expects to receive a reward by pressing a lever but does not receive it, a negative prediction error occurs (a worse than expected outcome), and the animal will subsequently be less likely to press the lever again. Prediction error can thus serve as a kind of “teaching signal” that helps the animal to learn to behave in a way that maximizes reward. Intriguingly, dopamine neurons in the reward centers of a monkey’s brain show increased activity when the monkey receives unexpected juice rewards and decreased activity when the monkey does not receive expected juice rewards, suggesting that dopamine neurons play an important role in generating the prediction error (Schultz, 2006, 2007; Schultz, Dayan, & Montague, 1997). Neuroimaging studies show that human brain regions involved in reward-related learning also produce prediction error signals, and that dopamine is involved in generating those signals (O’Doherty et al., 2003; Pessiglione et al., 2006).
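In formal models of reward learning, this teaching signal is simply the difference between the reward received and the reward expected, and the expectation is then nudged by some fraction of that difference. The sketch below illustrates that generic update rule (the learning rate and reward values are assumptions, not parameters from the studies cited).

```python
# A minimal sketch, not from the studies cited, of prediction error as a teaching signal.
# The learning rate and reward values are illustrative assumptions.

def update_expectation(expected: float, received: float, learning_rate: float = 0.3) -> float:
    """Nudge the expected reward toward what was actually received."""
    prediction_error = received - expected          # positive: better than expected; negative: worse
    return expected + learning_rate * prediction_error

expected = 0.0                                      # at first, the animal expects nothing from the lever
for trial in range(5):
    expected = update_expectation(expected, received=1.0)   # unexpected food -> positive errors -> learning
    print(f"trial {trial + 1}: expected reward = {expected:.2f}")

expected = update_expectation(expected, received=0.0)       # an expected reward is omitted -> negative error
print(f"after an omitted reward: expected reward = {expected:.2f}")
```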

Thinkstock

So, how do these findings relate to people with Parkinson’s disease? Several studies report that reward-related learning can be impaired in persons with Parkinson’s (Dagher & Robbins, 2009). In one study of trial-and-error learning, participants with Parkinson’s who were being treated with dopaminergic drugs showed a higher learning rate than patients not taking the drugs (Rutledge et al., 2009). However, there was greater learning for the positive prediction error (learning based on positive outcomes) than for the negative prediction error (learning based on negative outcomes). These results may relate to another intriguing feature of Parkinson’s disease: Some individuals develop serious problems with compulsive gambling, shopping, and related impulsive behaviors. Such problems appear only after the individuals develop Parkinson’s disease and receive treatment with certain types of dopaminergic drugs (Ahlskog, 2011; Weintraub, Papay, & Siderowf, 2013), and such problems may reflect an effect of the drug treatment on individuals who are susceptible to compulsive behaviors (Voon et al., 2011).

More studies will be needed to unravel the complex relations among dopamine, reward prediction error, learning, and Parkinson’s disease, but the studies to date suggest that such research should have important practical as well as scientific implications.


The Evolutionary Elements of Operant Conditioning

As you’ll recall, classical conditioning has an adaptive value that has been fine-tuned by evolution. Not surprisingly, operant conditioning does too. Several behaviorists who were using simple T mazes like the one shown in FIGURE 7.11 to study learning in rats discovered that if a rat found food in one arm of the maze on the first trial of the day, it typically ran down the other arm on the very next trial. A staunch behaviorist wouldn’t expect the rats to behave this way. According to operant conditioning, prior reinforcement in one arm should increase the likelihood of turning in that same direction next time, not reduce it. How can we explain this?

Figure 7.11: FIGURE 7.11 A Simple T Maze When rats find food in the right arm of a typical T maze, on the next trial, they will often run to the left arm of the maze. This contradicts basic principles of operant conditioning: If the behavior of running to the right arm is reinforced, it should be more likely to occur again in the future. However, this behavior is perfectly consistent with a rat’s evolutionary preparedness. Like most foraging animals, rats explore their environments in search of food and seldom return to where food has already been found. Quite sensibly, if food has already been found in the right arm of the T maze, the rat will search the left arm next to see if more food is there.

What was puzzling from a behaviorist perspective makes sense when viewed from an evolutionary perspective. Rats are foragers, and like all foraging species, they have evolved a highly adaptive strategy for survival. They move around in their environments looking for food. If they find it somewhere, they eat it (or store it) and then go look somewhere else for more. So, if the rat just found food in the right arm of a T maze, the obvious place to look next time is the left arm. The rat knows that there isn’t any more food in the right arm because it just ate the food it found there! Indeed, given the opportunity to explore a complex environment like the multiarm maze shown in FIGURE 7.12, rats will systematically go from arm to arm collecting food, rarely returning to an arm they have previously visited (Olton & Samuelson, 1976).

Figure 7.12: FIGURE 7.12 A Complex T Maze Like many other foraging species, rats placed in a complex T maze such as this one show evidence of their evolutionary preparedness. These rats will systematically travel from arm to arm in search of food, never returning to arms they have already visited.

What explains a rat’s behavior in a T maze?


By Kaz www.Cartoonstock.com
The misbehavior of organisms: Pigs are biologically predisposed to root out their food, just as raccoons are predisposed to wash their food. Trying to train either species to behave differently can prove to be an exercise in futility.
John Wilkinson; Ecoscience/CORBIS
Millard H. Sharp/Science Source

Two of Skinner’s former students, Keller Breland and Marian Breland, were among the first researchers to discover that it wasn’t just rats in T mazes that presented a problem for behaviorists (Breland & Breland, 1961). The Brelands, who made a career out of training animals for commercials and movies, often used pigs because pigs are surprisingly good at learning all sorts of tricks. However, the Brelands discovered that it was extremely difficult to teach a pig the simple task of dropping coins in a box. Instead of depositing the coins, the pigs persisted in rooting with them as if they were digging them up in soil, tossing them in the air with their snouts, and pushing them around. The Brelands tried to train raccoons at the same task, with different but equally dismal results. The raccoons spent their time rubbing the coins between their paws instead of dropping them in the box. Having learned the association between the coins and food, the animals began to treat the coins as stand-ins for food. Pigs are biologically predisposed to root out their food, and raccoons have evolved to clean their food by rubbing it with their paws. That is exactly what each species of animal did with the coins. The Brelands’ work shows that all species, including humans, are biologically predisposed to learn some things more readily than others and to respond to stimuli in ways that are consistent with their evolutionary histories (Gallistel, 2000).

SUMMARY QUIZ [7.2]

Question 7.5

1. Which of the following is NOT an accurate statement concerning operant conditioning?
  a. Actions and outcomes are critical to operant conditioning.
  b. Operant conditioning involves the reinforcement of behavior.
  c. Complex behaviors cannot be accounted for by operant conditioning.
  d. Operant conditioning has roots in evolutionary behavior.

c.

Question 7.6

2. Which of the following mechanisms have no role in Skinner’s approach to behavior?
  a. cognitive
  b. neural
  c. evolutionary
  d. all of the above


d.

Question 7.7

3. Latent learning provides evidence for a cognitive element in operant conditioning because
  a. it occurs without any obvious reinforcement.
  b. it requires both positive and negative reinforcement.
  c. it points toward the operation of a neural reward center.
  d. it depends on a stimulus–response relationship.

a.