5.4 Determination of Primary Structure Facilitates an Understanding of Protein Function

Now that we have purified our protein, be it lactate dehydrogenase or the estrogen receptor, what is the next step in learning about the protein? An important means of characterizing a pure protein is to determine its primary structure, which can tell us much about the protein. Recall that the primary structure of a protein is the determinant of its three-dimensional structure, which ultimately determines the protein’s function. Comparison of the sequence of normal proteins with those isolated from patients with pathological conditions allows an understanding of the molecular basis of diseases.

Let us first examine how we can sequence a simple peptide, such as Ala-Gly-Asp-Phe-Arg-Gly.

The first step is to determine the amino acid composition of the peptide. The peptide is hydrolyzed into its constituent amino acids by heating it in strong acid. The individual amino acids can then be separated by ion-exchange chromatography and visualized by treatment with fluorescamine, which reacts with the α-amino group to form a highly fluorescent product (Figure 5.24). The concentration of an amino acid in solution is proportional to the fluorescence of the solution. To identify each amino acid, the amount of buffer required to elute it from the column is compared with the elution pattern of a standard mixture of amino acids (Figure 5.25). The composition of our peptide is (Ala, Arg, Asp, Gly2, Phe).

Figure 5.24: Fluorescent derivatives of amino acids. Fluorescamine reacts with the α-amino group of an amino acid to form a fluorescent derivative.
Figure 5.25: Determination of amino acid composition. Different amino acids in a peptide hydrolysate can be separated by ion-exchange chromatography on a sulfonated polystyrene resin (such as Dowex-50). Buffers (in this case, sodium citrate) of increasing pH are used to elute the amino acids from the column. The amount of each amino acid present is determined from the absorbance. Aspartate, which has an acidic side chain, is the first to emerge, whereas arginine, which has a basic side chain, is the last. The original peptide is revealed to be composed of one aspartate, one alanine, one phenylalanine, one arginine, and two glycine residues.

The parentheses denote that this is the amino acid composition of the peptide, not its sequence.
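
The distinction between composition and sequence can be made concrete with a short sketch. It assumes the six-residue peptide of Figure 5.26 (Ala-Gly-Asp-Phe-Arg-Gly) written in one-letter code; the function name `composition` and the rearranged peptide are illustrative choices, not part of any standard tool.

```python
from collections import Counter


def composition(sequence):
    """Count each residue, ignoring order (what acid hydrolysis reports)."""
    return Counter(sequence)


# Peptide of Figure 5.26 in one-letter code: Ala-Gly-Asp-Phe-Arg-Gly
peptide = "AGDFRG"
# A hypothetical rearrangement of the same six residues
scrambled = "GRFDGA"

print(dict(composition(peptide)))
print("same composition:", composition(peptide) == composition(scrambled))
print("same sequence:   ", peptide == scrambled)
```

Two peptides with identical compositions can therefore have different sequences, which is why a composition is written in parentheses rather than as an ordered chain.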

The sequence of a protein can then be determined by a process called the Edman degradation. The Edman degradation sequentially removes one residue at a time from the amino end of a peptide (Figure 5.26). Phenyl isothiocyanate reacts with the terminal amino group of the peptide, which then cyclizes and breaks off the peptide, yielding an intact peptide shortened by one amino acid. The cyclic compound is a phenylthiohydantoin (PTH)–amino acid, which can be identified by chromatographic procedures. The Edman procedure can then be repeated sequentially to yield the amino acid sequence of the peptide.

Figure 5.26: The Edman degradation. The labeled amino-terminal residue (PTH-alanine in the first round) can be released without hydrolyzing the rest of the peptide. Hence, the amino-terminal residue of the shortened peptide (Gly-Asp-Phe-Arg-Gly) can be determined in the second round. Three more rounds of the Edman degradation reveal the complete sequence of the original peptide.
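
The round-by-round logic of the Edman degradation can be sketched in a few lines. This models only the bookkeeping (one amino-terminal residue released per round, the rest of the chain left intact), not the chemistry; the function name `edman_rounds` is a hypothetical helper, and the peptide is the one shown in Figure 5.26.

```python
def edman_rounds(peptide):
    """Yield (round_number, released_residue, shortened_peptide).

    Each round removes the amino-terminal residue and leaves the
    remainder of the peptide intact for the next round.
    """
    remaining = peptide
    round_number = 0
    while remaining:
        round_number += 1
        released, remaining = remaining[0], remaining[1:]
        yield round_number, released, remaining


# One-letter to three-letter codes for the residues in this peptide only
THREE_LETTER = {"A": "Ala", "G": "Gly", "D": "Asp", "F": "Phe", "R": "Arg"}

for n, released, rest in edman_rounds("AGDFRG"):
    shortened = "-".join(THREE_LETTER[r] for r in rest) or "(nothing)"
    print(f"round {n}: PTH-{THREE_LETTER[released]} released; {shortened} remains")
```

Reading the released residues in order reconstructs the sequence from the amino end.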

In principle, we should be able to sequence an entire protein by using the Edman method. In practice, the peptides cannot be much longer than about 50 residues, because the reactions of the Edman method are not 100% efficient. With each round, a fraction of the peptide molecules fails to react, so the molecules in the sample eventually fall out of step with one another and each round releases a mixture of residues. We can circumvent this obstacle by cleaving the original protein at specific amino acids into smaller peptides that can be sequenced independently. In essence, the strategy is to divide and conquer.
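
A rough calculation shows why imperfect efficiency limits the read length. The sketch below assumes that every round succeeds with the same probability and that failures are independent; the efficiencies tried (99%, 98%, and 95%) are illustrative values, not measurements from the text.

```python
def in_step_fraction(efficiency, cycles):
    """Fraction of peptide molecules still in step after `cycles` rounds,
    assuming each round succeeds independently with probability `efficiency`."""
    return efficiency ** cycles


# Illustrative per-round efficiencies (assumed values)
for efficiency in (0.99, 0.98, 0.95):
    for cycles in (10, 50, 100):
        fraction = in_step_fraction(efficiency, cycles)
        print(f"efficiency {efficiency:.0%}, {cycles:3d} rounds: "
              f"{fraction:.1%} of molecules still in step")
```

Under these assumptions, even a 98% efficient round leaves only about a third of the molecules in step after 50 rounds, so the signal from the correct residue is increasingly buried in background.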

Specific cleavage can be achieved by chemical or enzymatic methods. Table 5.3 gives several ways of specifically cleaving polypeptide chains. The peptides obtained by specific chemical or enzymatic cleavage are separated, and the sequence of each purified peptide is then determined by the Edman method. At this point, the amino acid sequences of segments of the protein are known, but the order of these segments is not yet defined. How can we order the peptides to obtain the primary structure of the original protein? The necessary additional information is obtained from overlap peptides (Figure 5.27). A second cleavage technique is used to split the polypeptide chain at different sites. Some of the peptides from the second cleavage will overlap two or more peptides from the first cleavage, and they can be used to establish the order of the peptides. The entire amino acid sequence of the polypeptide chain is then known.

Table 5.3 Specific cleavage of polypeptides
Figure 5.27: Overlap peptides. The peptide obtained by chymotryptic digestion overlaps two tryptic peptides, establishing their order.
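
The overlap strategy can be illustrated with a small simulation. The sketch assumes simplified cleavage rules (trypsin cuts on the carboxyl side of Lys and Arg; chymotrypsin on the carboxyl side of Phe, Trp, and Tyr) and a made-up 14-residue sequence; the function names are hypothetical, and the brute-force search over orderings is practical only for a handful of peptides.

```python
from itertools import permutations


def cleave(sequence, residues):
    """Cut after every residue found in `residues` (carboxyl-side cleavage)."""
    peptides, current = [], ""
    for residue in sequence:
        current += residue
        if residue in residues:
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def order_by_overlap(first_set, overlap_set):
    """Return every ordering of `first_set` whose concatenation contains
    each peptide of `overlap_set` as a contiguous stretch."""
    candidates = set()
    for ordering in permutations(first_set):
        joined = "".join(ordering)
        if all(peptide in joined for peptide in overlap_set):
            candidates.add(joined)
    return candidates


protein = "AVGFKTLWEKSGYR"                     # hypothetical sequence
tryptic = sorted(cleave(protein, "KR"))        # sorting discards the true order
chymotryptic = sorted(cleave(protein, "FWY"))

print("tryptic peptides:     ", tryptic)
print("chymotryptic peptides:", chymotryptic)
print("consistent orderings: ", order_by_overlap(tryptic, chymotryptic))
```

For this example only one ordering of the tryptic peptides is consistent with the chymotryptic peptides, so the two digests together fix the full sequence.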

Mass Spectrometry Can Be Used to Determine a Protein’s Mass, Identity, and Sequence

Although Edman degradation has provided a wealth of sequence information, it has largely been supplanted by the powerful technique of mass spectrometry. Before we can examine how mass spectrometry can be used to sequence a protein, we will investigate how it can be used to determine a protein’s mass and identity.

Mass spectrometry is a technique for analyzing ionized forms of molecules in the gas phase. It is most readily applied to gases or to volatile liquids that easily release gas-phase ions. Mass measurements are obtained by determining how readily an ion is accelerated in an applied electric field. Consider two ions with the same overall charge but with different masses. In a given electric field, the same force will act on each ion. However, the acceleration of the more massive ion due to this force will be less, according to Newton’s second law, F = ma, where F is the force, m is the mass, and a is the acceleration. Thus, a measurement of the acceleration under a known applied force provides the mass.

Protein Mass. Two widely used methods, matrix-assisted laser desorption–ionization (MALDI) and electrospray ionization (ESI), have been developed to determine a protein’s mass. We will focus on MALDI. In MALDI, the protein or peptide under study is coprecipitated with an organic compound that absorbs laser light of an appropriate wavelength (the “matrix”). The flash of a laser on the preparation expels molecules from the surface. These molecules capture electrons as they exit the matrix and hence leave as negatively charged ions.

After gas-phase ions have been generated, several approaches may be used to determine their mass. In time-of-flight (TOF) analysis, the ions are accelerated in an electric field toward a detector (Figure 5.28). The lighter ions are accelerated more, travel faster, and arrive at the detector first. Tiny amounts of biomolecules, as small as a few picomoles (pmol) to femtomoles (fmol), can be analyzed in this manner. A MALDI-TOF mass spectrum for a mixture of the proteins insulin and β-lactoglobulin is shown in Figure 5.29. The masses determined by MALDI-TOF are 5733.9 Da and 18,364 Da, respectively, compared with calculated values of 5733.5 Da and 18,388 Da. MALDI-TOF is indeed an accurate means of determining protein mass.
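
The claim of accuracy can be checked with simple arithmetic on the measured and calculated values quoted above. The sketch below computes the relative error for each protein; because the error is a ratio, it does not depend on the mass unit, provided both values use the same one. The function name is an illustrative choice.

```python
def relative_error_percent(measured, calculated):
    """Signed relative error of a measurement, in percent."""
    return 100.0 * (measured - calculated) / calculated


# (measured, calculated) masses quoted in the text, in the same unit
masses = {
    "insulin": (5733.9, 5733.5),
    "beta-lactoglobulin": (18364.0, 18388.0),
}

for protein, (measured, calculated) in masses.items():
    error = relative_error_percent(measured, calculated)
    print(f"{protein}: {error:+.4f}% relative error")
```

Both measurements fall within a small fraction of a percent of the calculated values.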

Figure 5.28: MALDI-TOF mass spectrometry. (1) The protein sample, embedded in an appropriate matrix, is ionized by the application of a laser beam. (2) An electric field accelerates the ions through the flight tube toward the detector. (3) The lightest ions arrive first. (4) The ionizing laser pulse also triggers a clock that measures the time of flight (TOF) for the ions.
Figure 5.29: MALDI-TOF mass spectrum of insulin and β-lactoglobulin. A mixture of 5 pmol each of insulin (I) and β-lactoglobulin (L) was ionized by MALDI, which produces predominantly singly charged molecular ions from peptides and proteins: the insulin ion (I + H)+ and the lactoglobulin ion (L + H)+. Molecules with multiple charges, such as those for β-lactoglobulin indicated by the blue arrows, as well as small quantities of a singly charged dimer of insulin (2 I + H)+, also are produced.
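
Why lighter ions arrive first can be seen from an idealized model of the flight tube. An ion of charge ze accelerated through a potential difference V gains kinetic energy zeV = mv²/2, so its drift time over a tube of length L is t = L·sqrt(m / (2zeV)). The sketch below applies this relation; the accelerating voltage (20 kV) and tube length (1 m) are assumed values chosen for illustration, and the model ignores the time spent in the acceleration region.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms


def flight_time(mass_da, charge=1, voltage=20_000.0, tube_length=1.0):
    """Idealized drift time (seconds) of an ion in a field-free flight tube.

    `voltage` (volts) and `tube_length` (meters) are assumed instrument
    settings, not values taken from the text.
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return tube_length / velocity


for name, mass in (("insulin", 5733.5), ("beta-lactoglobulin", 18388.0)):
    print(f"{name}: {flight_time(mass) * 1e6:.1f} microseconds")
```

In this model the flight time scales with the square root of the mass-to-charge ratio, so an ion four times as heavy takes twice as long to reach the detector.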

Protein Identity. Although protein masses serve as convenient name tags for distinguishing proteins, the mass of a given protein is usually not enough to uniquely identify it among all possible proteins within a cell. However, the mass of the parent protein along with the masses of several protein fragments produced by a specific cleavage method can provide unique identification. Suppose we wish to identify proteins within a two-dimensional gel such as that described in "Two-dimensional electrophoresis", within Section 5.2. After gel electrophoresis, the molecules in individual spots can be cleaved, often in the gel matrix itself, by using a protease such as trypsin. The mixture of fragments produced can then be analyzed by MALDI-TOF mass spectrometry. These peptide masses are matched against proteins in a database that have been “electronically cleaved” by a computer simulating the same fragmentation technique used for the experimental sample. In this way, the proteome within a given cell type or other sample can be analyzed in considerable detail.
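
The idea of "electronic cleavage" can be sketched as follows. The example digests each database entry with a simplified trypsin rule (cut after every Lys or Arg), computes monoisotopic peptide masses, and counts how many observed masses each entry explains. The three database sequences, the mass tolerance, and the "observed" masses (simulated from one of the entries) are all hypothetical.

```python
# Monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per free peptide


def tryptic_peptides(sequence):
    """Simplified trypsin rule: cut after every Lys (K) or Arg (R)."""
    peptides, current = [], ""
    for residue in sequence:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def score(observed, sequence, tolerance=0.01):
    """Number of observed masses matched by a predicted peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - p) <= tolerance for p in predicted) for m in observed)


# Hypothetical database entries
database = {
    "protein_A": "MKTAYIAKQRQISFVK",
    "protein_B": "MGSSHHKLVPRGSHMK",
    "protein_C": "AVGFKTLWEKSGYR",
}

# Simulated measurement: the peptide masses of protein_B
observed = [peptide_mass(p) for p in tryptic_peptides(database["protein_B"])]

best = max(database, key=lambda name: score(observed, database[name]))
for name, sequence in database.items():
    print(name, "matches", score(observed, sequence), "of", len(observed))
print("best match:", best)
```

A real search must also handle missed cleavages, chemical modifications, and measurement error, all of which this sketch leaves out.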

Protein Sequence. How can we employ mass spectrometry to sequence a protein? The use of mass spectrometry for protein sequencing takes advantage of the fact that ions of proteins that have been analyzed by a mass spectrometer, the precursor ions, can be broken into smaller peptide chains by bombardment with atoms of an inert gas such as helium or argon. These new fragments, or product ions, can be passed through a second mass analyzer for further mass characterization. The utilization of two mass analyzers arranged in this manner is referred to as tandem mass spectrometry. Importantly, the product-ion fragments are formed in chemically predictable ways that can provide clues to the amino acid sequence of the precursor ion. For peptide analytes, product ions can be formed such that individual amino acid residues are cleaved from the precursor ion (Figure 5.30). Hence, a family of ions is detected; each ion represents a fragment of the original peptide with one or more amino acids removed from one end.

Figure 5.30: Peptide sequencing by tandem mass spectrometry. Within the mass spectrometer, peptides can be fragmented by bombardment with inert gaseous ions to generate a family of product ions in which individual amino acids have been removed from one end. As drawn here, the carboxyl fragment of the cleaved peptide bond is ionized.
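
How a family of product ions encodes the sequence can be illustrated with an idealized ladder. The sketch assumes that every carboxyl-terminal fragment of the peptide Ala-Gly-Asp-Phe-Arg-Gly is detected as a singly protonated ion, so that successive mass differences equal single residue masses. The function names are hypothetical, and the mass table covers only the residues in this peptide.

```python
# Monoisotopic residue masses (daltons) for the residues in this peptide
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "D": 115.02694,
    "F": 147.06841, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def carboxyl_fragment_ions(peptide):
    """m/z of singly charged carboxyl-terminal fragments, shortest first."""
    return [
        sum(RESIDUE_MASS[r] for r in peptide[start:]) + WATER + PROTON
        for start in range(len(peptide) - 1, -1, -1)
    ]


def residue_for(mass, tolerance=0.01):
    """Look up the residue whose mass matches, or '?' if none does."""
    for residue, residue_mass in RESIDUE_MASS.items():
        if abs(residue_mass - mass) <= tolerance:
            return residue
    return "?"


def read_sequence(ladder):
    """Recover the sequence from a complete, sorted fragment ladder."""
    residues = [residue_for(ladder[0] - WATER - PROTON)]
    residues += [residue_for(b - a) for a, b in zip(ladder, ladder[1:])]
    # The ladder grows from the carboxyl end, so reverse to read N to C
    return "".join(reversed(residues))


ladder = carboxyl_fragment_ions("AGDFRG")
for mz in ladder:
    print(f"{mz:.3f}")
print("sequence read from ladder:", read_sequence(ladder))
```

Real spectra are rarely this complete, and residues of identical mass, such as leucine and isoleucine, cannot be told apart by mass differences alone.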

Amino Acid Sequences Are Sources of Many Kinds of Insight

QUICK QUIZ 4

Differentiate between amino acid composition and amino acid sequence.

A protein’s amino acid sequence is a valuable source of insight into the protein’s function, structure, and history.

  1. The sequence of a protein of interest can be compared with all other known sequences to ascertain whether significant similarities exist. Does this protein belong to an established family? A search for kinship between a newly sequenced protein and the millions of previously sequenced ones takes only a few seconds on a computer. If the newly isolated protein is a member of an established family of proteins, we can infer information about the protein’s structure and function. For instance, chymotrypsin and trypsin are members of the serine protease family, a clan of proteolytic enzymes that have a common catalytic mechanism based on a reactive serine residue. If the sequence of the newly isolated protein shows sequence similarity with trypsin or chymotrypsin, the result suggests that it, too, may be a serine protease.

  2. Comparison of sequences of the same protein in different species yields a wealth of information about evolutionary pathways. Genealogical relations between species can be established from sequence differences between their proteins. We can even estimate the time at which two evolutionary lines diverged, thanks to the clocklike nature of random mutations. For example, a comparison of serum albumins found in primates indicates that human beings and African apes diverged 5 million years ago, not 30 million years ago as was once thought. Sequence analyses have opened a new perspective on the fossil record and the pathway of human evolution.

  3. Amino acid sequences can be searched for the presence of internal repeats. Such internal repeats can reveal the history of an individual protein itself. Many proteins apparently have arisen by the duplication of primordial genes. For example, calmodulin, a ubiquitous calcium sensor in eukaryotes (Chapter 13), contains four similar calcium-binding modules that arose by gene duplication (Figure 5.31).

    Figure 5.31: Repeating motifs in a protein chain. Calmodulin, a calcium sensor, contains four similar units (shown in red, yellow, blue, and orange) in a single polypeptide chain. Notice that each unit binds a calcium ion (shown in green).
  4. Many proteins contain amino acid sequences that serve as signals designating their destinations or controlling their processing. For example, a protein destined for export from a cell or for location in a membrane contains a signal sequence, a stretch of about 20 hydrophobic residues near the amino terminus that directs the protein to the appropriate membrane (Chapter 40). Another protein may contain a stretch of amino acids that functions as a nuclear localization signal, directing the protein to the nucleus.

  5. Sequence data allow a molecular understanding of diseases. Many diseases are caused by mutations in DNA that result in alterations in the amino acid sequence of a particular protein. These alterations often compromise the protein’s function. For instance, sickle-cell anemia is caused by a change in a single amino acid in the primary structure of the β chain of hemoglobin (Chapter 9). Approximately 70% of the cases of cystic fibrosis are caused by the deletion of one particular amino acid out of the 1480 amino acids in the protein that controls chloride transport across cell membranes. Indeed, a major goal of biochemistry is to elucidate the molecular basis of disease with the hope that this understanding will lead to effective treatment.

  6. Protein sequence is a guide to nucleic acid information. Knowledge of a protein’s primary structure allows access to genomic information. DNA sequences that correspond to a part of the amino acid sequence can be synthesized on the basis of the genetic code. These DNA sequences can be used as probes to isolate the gene encoding the protein or the DNA corresponding to the mRNA, called the cDNA or complementary DNA (Chapter 41).
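
For the last point in the list above, one practical consideration is that most amino acids are specified by more than one codon, so a stretch of protein sequence corresponds to many possible DNA sequences. The sketch below counts them by using the number of codons per amino acid in the standard genetic code and looks for the stretch with the fewest possibilities. The example sequence and the function names are hypothetical.

```python
from math import prod

# Number of codons for each amino acid in the standard genetic code
CODONS_PER_RESIDUE = {
    "M": 1, "W": 1,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "I": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}


def degeneracy(peptide):
    """Number of distinct DNA sequences that could encode `peptide`."""
    return prod(CODONS_PER_RESIDUE[r] for r in peptide)


def least_degenerate_window(sequence, length):
    """Stretch of `length` residues encoded by the fewest DNA sequences."""
    windows = [sequence[i:i + length] for i in range(len(sequence) - length + 1)]
    return min(windows, key=degeneracy)


protein = "AVGFKTLWEKSGYR"  # hypothetical sequence
window = least_degenerate_window(protein, 5)
print(f"best 5-residue stretch: {window} "
      f"({degeneracy(window)} possible coding sequences)")
```

Stretches rich in methionine and tryptophan, which have a single codon each, are the least ambiguous starting points for designing a probe.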