9.2 The Genetic Code

The one-gene–one-polypeptide hypothesis of Beadle and Tatum (see Chapter 6) was the source of the first exciting insight into the functions of genes: genes were somehow responsible for the function of enzymes, and each gene apparently controlled one enzyme. This hypothesis became one of the great unifying concepts in biology because it provided a bridge that brought together the concepts and research techniques of genetics and biochemistry. When the structure of DNA was deduced in 1953, it seemed likely that there must be a linear correspondence between the nucleotide sequence in DNA and the amino acid sequence in a protein. It was soon deduced that the nucleic acid sequence in mRNA going from 5′ to 3′ corresponds to the amino acid sequence going from N-terminus to C-terminus.

If genes are segments of DNA and if a strand of DNA is just a string of nucleotides, then the sequence of nucleotides must somehow dictate the sequence of amino acids in proteins. How does the DNA sequence dictate the protein sequence? The analogy to a code springs to mind at once. Simple logic tells us that, if the nucleotides are the “letters” in a code, then a combination of letters can form “words” representing different amino acids. First, we must ask how the code is read. Is it overlapping or nonoverlapping? Then we must ask how many letters in the mRNA make up a word, or codon, and which codon or codons represent each amino acid. The cracking of the genetic code is the story told in this section.

Overlapping versus nonoverlapping codes

Figure 9-4: Overlapping versus nonoverlapping genetic codes
Figure 9-4: An overlapping and a nonoverlapping genetic code would translate differently into an amino acid sequence. The example uses a codon with three nucleotides in the RNA (a triplet code). In an overlapping code, single nucleotides occupy positions in multiple codons. In this illustration, the third nucleotide in the RNA, U, is found in three codons. In a nonoverlapping code, a protein is translated by reading nucleotides sequentially in sets of three. A nucleotide is found in only one codon. In this example, the third nucleotide, U, is found only in the first codon.

Figure 9-4 shows the difference between an overlapping and a nonoverlapping code. The example shows a three-letter, or triplet, code. For a nonoverlapping code, consecutive amino acids are specified by consecutive code words (codons), as shown at the bottom of Figure 9-4. For an overlapping code, consecutive amino acids are specified by codons that have some consecutive bases in common; for example, the last two bases of one codon may also be the first two bases of the next codon. Overlapping codons are shown in the upper part of Figure 9-4. Thus, for the sequence AUUGCUCAG in a nonoverlapping code, the three triplets AUU, GCU, and CAG encode the first three amino acids, respectively. However, in an overlapping code, the triplets AUU, UUG, and UGC encode the first three amino acids if the overlap is two bases, as shown in Figure 9-4.
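
The two ways of reading the sequence AUUGCUCAG can be sketched in a few lines of Python. This is only an illustration of the reading schemes described above; the function names are invented for the example, and the overlapping code is assumed to advance one base at a time (an overlap of two bases).

```python
def nonoverlapping_codons(rna, size=3):
    """Split rna into consecutive words that share no bases."""
    return [rna[i:i + size] for i in range(0, len(rna) - size + 1, size)]


def overlapping_codons(rna, size=3, step=1):
    """Slide a window of `size` bases along rna, advancing `step` bases each time.

    step=1 with size=3 means neighboring codons share two bases (assumed here).
    """
    return [rna[i:i + size] for i in range(0, len(rna) - size + 1, step)]


rna = "AUUGCUCAG"
print(nonoverlapping_codons(rna))    # ['AUU', 'GCU', 'CAG']
print(overlapping_codons(rna)[:3])   # ['AUU', 'UUG', 'UGC']
```

The first three words of each reading match the triplets named in the text for the nonoverlapping and the overlapping case.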

By 1961, it was already clear that the genetic code was nonoverlapping. Analyses of mutationally altered proteins showed that only a single amino acid changes at one time in one region of the protein. This result is predicted by a nonoverlapping code. As you can see in Figure 9-4, an overlapping code predicts that a single base change will alter as many as three amino acids at adjacent positions in the protein.
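
The prediction can be checked with a small sketch. The substitution used here (C to G at the fifth base of AUUGCUCAG) is a hypothetical choice made only for illustration, and the helper names are not from the text. The sketch counts changed codons; whether each changed codon also changes the amino acid depends on the code itself, which is why the text says "as many as three."

```python
def codons(rna, step, size=3):
    """Read rna in words of `size` bases, advancing `step` bases per word."""
    return [rna[i:i + size] for i in range(0, len(rna) - size + 1, step)]


def changed_codons(wild, mutant, step):
    """Return the positions of codons that differ between two equal-length messages."""
    pairs = zip(codons(wild, step), codons(mutant, step))
    return [i for i, (w, m) in enumerate(pairs) if w != m]


wild = "AUUGCUCAG"
mutant = "AUUGGUCAG"  # hypothetical single-base substitution at the fifth base

print(changed_codons(wild, mutant, step=3))  # nonoverlapping: [1]
print(changed_codons(wild, mutant, step=1))  # overlapping:    [2, 3, 4]
```

One base change touches a single codon in the nonoverlapping reading but three adjacent codons in the fully overlapping reading.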

Number of letters in the codon

If an mRNA molecule is read from one end to the other, only one of four different bases, A, U, G, or C, can be found at each position. Thus, if the words encoding amino acids were one letter long, only four words would be possible. This vocabulary cannot be the genetic code because we must have a word for each of the 20 amino acids commonly found in cellular proteins. If the words were two letters long, then 4 × 4 = 16 words would be possible; for example, AU, CU, or CC. This vocabulary is still not large enough.

If the words are three letters long, then 4 × 4 × 4 = 64 words are possible; for example, AUU, GCG, or UGC. This vocabulary provides more than enough words to describe the amino acids. We can conclude that the code word must consist of at least three nucleotides. However, if all words are “triplets,” then the possible words are in considerable excess of the 20 needed to name the common amino acids. We will come back to these excess codons later in the chapter.
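
The counting argument can be reproduced directly. The sketch below simply enumerates every possible word of one, two, and three letters and asks which vocabulary is large enough for 20 amino acids; the variable and function names are chosen for this example only.

```python
from itertools import product

BASES = "AUGC"
AMINO_ACIDS = 20  # amino acids commonly found in cellular proteins


def words(length):
    """Enumerate every possible code word of the given length."""
    return ["".join(letters) for letters in product(BASES, repeat=length)]


for n in (1, 2, 3):
    vocabulary = words(n)
    print(n, len(vocabulary), len(vocabulary) >= AMINO_ACIDS)

# Smallest word length whose vocabulary covers all 20 amino acids
min_length = next(n for n in range(1, 10) if len(BASES) ** n >= AMINO_ACIDS)
print(min_length)  # 3
```

Word lengths of one and two give 4 and 16 words, both too few; three letters give 64.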

Use of suppressors to demonstrate a triplet code

Convincing proof that a codon is, in fact, three letters long (and no more than three) came from beautiful genetic experiments first reported in 1961 by Francis Crick, Sidney Brenner, and their co-workers. These experiments used mutants in the rII locus of T4 phage. The use of rII mutations in recombination analysis was discussed in Chapter 5. Phage T4 is usually able to grow on two different E. coli strains, called B and K. However, mutations in the rII gene change the host range of the phage: mutant phages can still grow on an E. coli B host, but they cannot grow on an E. coli K host. Mutations causing this rII phenotype were induced by using a chemical called proflavin, which was thought to act by the addition or deletion of single nucleotide pairs in DNA. (This assumption is based on experimental evidence not presented here.) The following examples illustrate the action of proflavin on double-stranded DNA.

Starting with one particular proflavin-induced mutation called FCO, Crick and his colleagues found “reversions” (reversals of the mutation) that were able to grow on E. coli strain K. Genetic analysis of these plaques revealed that the “revertants” were not identical with true wild types. In fact, the reversion was found to be due to the presence of a second mutation at a different site from that of FCO, although in the same gene.

This second mutation “suppressed” mutant expression of the original FCO. Recall from Chapter 6 that a suppressor mutation counteracts or suppresses the effects of another mutation so that the organism is more like wild type.

How can we explain these results? If we assume that the gene is read from one end only, then the original addition or deletion induced by proflavin could result in a mutation because it interrupts a normal reading mechanism that establishes the group of bases to be read as words. For example, if each group of three bases on the resulting mRNA makes a word, then the “reading frame” might be established by taking the first three bases from the end as the first word, the next three as the second word, and so forth. In that case, a proflavin-induced addition or deletion of a single pair on the DNA would shift the reading frame on the mRNA from that corresponding point on, causing all following words to be misread. Such a frameshift mutation could reduce most of the genetic message to gibberish. However, the proper reading frame could be restored by a compensatory insertion or deletion somewhere else, leaving only a short stretch of gibberish between the two. Consider the following example in which three-letter English words are used to represent the codons:

The insertion suppresses the effect of the deletion by restoring most of the sense of the sentence. By itself, however, the insertion also disrupts the sentence:

THE FAT CAT AAT ETH EBI GRA T
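
The sentence analogy is easy to reproduce as a sketch. The undisturbed sentence is inferred by removing the inserted A from the line above; the position of the deletion (the C of CAT) is an assumption made for this illustration, and the function names are hypothetical.

```python
def read_in_triplets(letters):
    """Group a run of letters into three-letter words, starting from the left."""
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


def insert(letters, index, letter):
    """Add one letter before the given index."""
    return letters[:index] + letter + letters[index:]


def delete(letters, index):
    """Remove the letter at the given index."""
    return letters[:index] + letters[index + 1:]


wild = "THEFATCATATETHEBIGRAT"

print(read_in_triplets(wild))                      # THE FAT CAT ATE THE BIG RAT
print(read_in_triplets(delete(wild, 6)))           # THE FAT ATA TET HEB IGR AT
print(read_in_triplets(insert(wild, 9, "A")))      # THE FAT CAT AAT ETH EBI GRA T
print(read_in_triplets(delete(insert(wild, 9, "A"), 6)))
# THE FAT ATA ATE THE BIG RAT
```

Either change alone garbles every word downstream of it. Together they leave a single wrong word between the two sites, and the rest of the sentence reads normally.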

If we assume that the FCO mutant is caused by an addition, then the second (suppressor) mutation would have to be a deletion because, as we have seen, only a deletion would restore the reading frame of the resulting message (a second insertion would not correct the frame). In the following diagrams, we use a hypothetical nucleotide chain to represent RNA for simplicity. We also assume that the code words are three letters long and are read in one direction (from left to right in our diagrams).

  1. Wild-type message

    CAU CAU CAU CAU CAU 

  2. rIIa message: Words after the addition are changed (×) by frameshift mutation (words marked ✓ are unaffected).

  3. rIIarIIb message: Few words are wrong, but reading frame is restored for later words.

The few wrong words in the suppressed genotype could account for the fact that the “revertants” (suppressed phenotypes) that Crick and his associates recovered did not look exactly like the true wild types in phenotype.

We have assumed here that the original frameshift mutation was an addition, but the explanation works just as well if we assume that the original FCO mutation is a deletion and the suppressor is an addition. You might want to verify it on your own. Very interestingly, combinations of three additions or three deletions have been shown to act together to restore a wild-type phenotype. This observation provided the first experimental confirmation that a word in the genetic code consists of three successive nucleotides, or a triplet. The reason is that three additions or three deletions within a gene automatically restore the reading frame in the mRNA if the words are triplets.
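
The effect of one, two, and three additions can be checked on the repeating CAU message used above. In this sketch the inserted base (A) and the insertion positions are hypothetical choices; only the number of additions matters for the outcome.

```python
def triplets(rna):
    """Read complete three-letter words from the left end of the message."""
    return [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]


def add_bases(rna, positions, base="A"):
    """Insert `base` before each listed position of the original message."""
    for pos in sorted(positions, reverse=True):
        rna = rna[:pos] + base + rna[pos:]
    return rna


wild = "CAU" * 8
SITES = [4, 8, 12]  # hypothetical sites of the additions


def downstream_word(n_additions):
    """Last complete word read after the first n_additions sites are mutated."""
    return triplets(add_bases(wild, SITES[:n_additions]))[-1]


for n in range(4):
    print(n, downstream_word(n))
# 0 CAU
# 1 UCA
# 2 AUC
# 3 CAU
```

With one or two additions the words downstream of the mutations are misread; with three, the message returns to the CAU frame, as expected if each word is a triplet.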

Degeneracy of the genetic code

As already stated, with four letters from which to choose at each position, a three-letter codon could make 4 × 4 × 4 = 64 words. With only 20 words needed for the 20 common amino acids, what are the other words used for, if anything? Crick’s work suggested that the genetic code is degenerate, meaning that each of the 64 triplets must have some meaning within the code. For the code to be degenerate, some of the amino acids must be specified by two or more different triplets.

The reasoning goes like this. If only 20 triplets were used, then the other 44 would be nonsense in that they would not encode any amino acid. In that case, most frameshift mutations could be expected to produce nonsense words, which presumably stop the protein-building process, and the suppression of frameshift mutations would rarely, if ever, work. However, if all triplets specified some amino acid, then the changed words would simply result in the insertion of incorrect amino acids into the protein. Thus, Crick reasoned that many or all amino acids must have several different names in the base-pair code; this hypothesis was later confirmed biochemically.
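
A rough calculation shows the force of this argument. The sketch assumes, purely for illustration, that each misread word between the two mutations is equally likely to be any of the 64 triplets; real sequences are not random, so the numbers indicate only the trend. The function name is hypothetical.

```python
TOTAL_TRIPLETS = 64


def chance_no_nonsense(sense_triplets, misread_words):
    """Probability that a run of misread words contains no nonsense word,

    assuming every misread word is a uniformly random triplet.
    """
    return (sense_triplets / TOTAL_TRIPLETS) ** misread_words


for k in (1, 3, 5, 10):
    only_20 = chance_no_nonsense(20, k)   # only 20 triplets have meaning
    all_64 = chance_no_nonsense(64, k)    # every triplet names an amino acid
    print(k, round(only_20, 4), all_64)
```

Under this assumption, if only 20 triplets had meaning, a stretch of just five misread words would escape a nonsense word less than 1 percent of the time, so suppression would almost never be seen. If every triplet has meaning, the stretch is always readable.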

KEY CONCEPT

The discussion so far demonstrates that

1. The linear sequence of nucleotides in a gene determines the linear sequence of amino acids in a protein.

2. The genetic code is nonoverlapping.

3. Three bases encode an amino acid. These triplets are termed codons.

4. The code is read from a fixed starting point and continues to the end of the coding sequence. We know that the code is read sequentially because a single frameshift mutation anywhere in the coding sequence alters the codon alignment for the rest of the sequence.

5. The code is degenerate in that some amino acids are specified by more than one codon.

Cracking the code

The deciphering of the genetic code—determining the amino acid specified by each triplet—was one of the most exciting genetic breakthroughs of the past 50 years. After the necessary experimental techniques became available, the genetic code was cracked in a rush.

One breakthrough was the discovery of how to make synthetic mRNA. If the nucleotides of RNA are mixed with a special enzyme (polynucleotide phosphorylase), a single-stranded RNA is formed in the reaction. Unlike transcription, no DNA template is needed for this synthesis, and so the nucleotides are incorporated at random. The ability to synthesize RNA offered the exciting prospect of creating specific mRNA sequences and then seeing which amino acids they would specify. The first synthetic messenger obtained was made by mixing only uracil nucleotides with the RNA-synthesizing enzyme, producing …UUUU… [poly(U)]. In 1961, Marshall Nirenberg and Heinrich Matthaei mixed poly(U) with the protein-synthesizing machinery of E. coli in vitro and observed the formation of a protein. The main excitement centered on the question of the amino acid sequence of this protein. It proved to be polyphenylalanine, a string of phenylalanine residues joined to form a polypeptide. Thus, the triplet UUU must code for phenylalanine.

For this discovery, Nirenberg was awarded the Nobel Prize.

Figure 9-5: The genetic code
Figure 9-5: The genetic code designates the amino acids specified by each codon.

Next, mRNAs containing two types of nucleotides in repeating groups were synthesized. For instance, synthetic mRNA having the sequence (AGA)n, which is a long sequence of AGAAGAAGAAGAAGA, was used to stimulate polypeptide synthesis in vitro (in a test tube that also contained a cell extract with all the components necessary for translation). The sequence of the resulting polypeptides was observed from a variety of such tests, with the use of different triplets residing in other synthetic RNAs. From such tests, many code words could be verified. (This kind of experiment is detailed in Problem 44 at the end of this chapter. In solving it, you can put yourself in the place of H. Gobind Khorana, who received a Nobel Prize for directing the experiments.)
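
The logic of the repeating-sequence experiments can be sketched as follows. The sketch assumes that the in vitro system may begin reading a synthetic message in any of the three possible frames, and it uses only the four codon assignments needed for the example (taken from the standard genetic code); the names are chosen for illustration.

```python
# Codon assignments from the standard genetic code (only those needed here)
CODON_TO_AA = {"UUU": "Phe", "AGA": "Arg", "GAA": "Glu", "AAG": "Lys"}


def translate(rna, frame=0):
    """Translate complete triplets of rna, starting `frame` bases from the left."""
    codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
    return [CODON_TO_AA[codon] for codon in codons]


poly_u = "U" * 18
poly_aga = "AGA" * 6

for frame in range(3):
    print(frame, set(translate(poly_u, frame)), set(translate(poly_aga, frame)))
# 0 {'Phe'} {'Arg'}
# 1 {'Phe'} {'Glu'}
# 2 {'Phe'} {'Lys'}
```

Poly(U) gives the same word in every frame, so it yields a single product. The (AGA)n message reads as AGA, GAA, or AAG depending on the frame, so under the stated assumption three different homopolymers are expected, each assigning one triplet to one amino acid.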

Additional experimental approaches led to the assignment of each amino acid to one or more codons. Recall that the code was proposed to be degenerate, meaning that some amino acids had more than one codon assignment. This degeneracy can be seen clearly in Figure 9-5, which gives the codons and the amino acids that they specify. Virtually all organisms on Earth use this same genetic code. (There are just a few exceptions in which a small number of the codons have different meanings—for example, in mitochondrial genomes.)
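
Degeneracy can also be counted directly from the code table. The sketch below writes the standard genetic code compactly with one-letter amino acid abbreviations (an encoding chosen for this example, with * marking codons that specify no amino acid) and tallies how many codons name each amino acid.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid symbols for the 64 codons in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

counts = Counter(CODE.values())
amino_acids = set(CODE.values()) - {"*"}

print(len(CODE), len(amino_acids))          # 64 20
print(counts["L"], counts["S"], counts["R"])  # 6 6 6
print(counts["M"], counts["W"])              # 1 1
```

Leucine, serine, and arginine each have six codons, whereas methionine and tryptophan have only one each. All other amino acids fall in between.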

Stop codons

You may have noticed in Figure 9-5 that some codons do not specify an amino acid at all. These codons are stop, or termination, codons. They can be regarded as being similar to periods or commas punctuating the message encoded in the DNA.

One of the first indications of the existence of stop codons came in 1965 from Brenner’s work with the T4 phage. Brenner analyzed certain mutations (m1 to m6) in a single gene that controls the head protein of the phage. He found that the head protein of each mutant was a shorter polypeptide chain than that of the wild type. Brenner examined the ends of the shortened proteins and compared them with the wild-type protein. For each mutant, he recorded the next amino acid that would have been inserted to continue the wild-type chain. The amino acids for the six mutations were glutamine, lysine, glutamic acid, tyrosine, tryptophan, and serine. These results present no immediately obvious pattern, but Brenner deduced that certain codons for each of these amino acids are similar. Specifically, each of these codons can mutate to the codon UAG by a single change in a DNA nucleotide pair. He therefore postulated that UAG is a stop (termination) codon: a signal to the translation mechanism that the protein is now complete.
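
Brenner's deduction can be retraced with the code table. The sketch below lists every codon that differs from UAG at exactly one position and looks up what each one specifies in the standard genetic code. The compact table and the one-letter abbreviations are conveniences of this example, not part of Brenner's analysis.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one-letter symbols, codon order UUU, UUC, UUA, UUG, UCU, ...
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_neighbors(codon):
    """All codons that differ from `codon` at exactly one position."""
    return [
        codon[:i] + base + codon[i + 1:]
        for i in range(3)
        for base in BASES
        if base != codon[i]
    ]


neighbors = {codon: CODE[codon] for codon in one_step_neighbors("UAG")}
print(neighbors)

# Gln, Lys, Glu, Tyr, Trp, Ser: the six amino acids listed in the text
brenner_six = set("QKEYWS")
print(brenner_six <= set(neighbors.values()))  # True
```

Each of the six amino acids has at least one codon (CAG, AAG, GAG, UAU or UAC, UGG, UCG) that lies a single base change away from UAG. The remaining neighbors are UUG (leucine), which is not among the six listed, and UAA, which itself specifies no amino acid.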

UAG was the first stop codon deciphered; it is called the amber codon (amber is the English translation of the last name of the codon’s discoverer, Bernstein). Mutants that are defective owing to the presence of an abnormal amber codon are called amber mutants. Two other stop codons are UGA and UAA. Analogously to the amber codon, and continuing the theme of naming for colors and gems, UGA is called the opal codon and UAA is called the ochre codon. Mutants that are defective because they contain abnormal opal or ochre codons are called opal and ochre mutants, respectively. Stop codons are often called nonsense codons because they designate no amino acid.

In addition to a shorter head protein, Brenner’s phage mutants had another interesting feature in common: the presence of a suppressor mutation (su) in the host chromosome would cause the phage to develop a head protein of normal (wild-type) chain length despite the presence of the m mutation. We will consider stop codons and their suppressors further after we have dealt with the process of protein synthesis.