7.2 WORKING WITH GENES AND THEIR PRODUCTS

The isolation of a gene, or any segment of genomic DNA, generally has one of two purposes. One is to examine the DNA itself, determine its sequence, study its structure and/or function, and compare it with other DNA segments. For example, researchers in physical biochemistry might be interested in the structure of an unusual repeated sequence. Evolutionary biologists and forensic scientists might be interested in comparing the sequence of the DNA segment with the same segment taken from other individuals in a population or with related DNA segments from other species. The other possible purpose is to work with the protein or RNA product of the isolated gene. These gene products are at the heart of every biological process. Genetic engineering provides tools not only for the isolation and study of proteins and RNA but also for their alteration for myriad purposes.

The isolation and examination of DNA segments has been greatly facilitated by PCR technology, and we discuss this first. We then explore modern DNA sequencing methods and a variety of techniques for expressing and altering gene products—primarily proteins—so as to understand their function and harness them for new purposes.

Gene Sequences Can Be Amplified with the Polymerase Chain Reaction

Genome projects continue worldwide, creating rapidly growing online databases containing the complete genome sequences of thousands of organisms. Such programs provide unprecedented access to gene sequence information. In turn, progress on this front is simplifying the process of cloning individual genes for more detailed analysis. If we know the sequence of at least the end portions of a DNA segment we are interested in, we can hugely amplify the number of copies of that DNA segment with the polymerase chain reaction (PCR), a process conceived by Kary Mullis in 1983 (see the How We Know section at the end of this chapter). The amplified DNA can then be cloned by the methods described earlier or can be used in a variety of analytical procedures.

The PCR procedure has an elegant simplicity and relies on enzymes called DNA polymerases. DNA polymerases synthesize DNA strands on a preexisting DNA template using free deoxyribonucleotides. Further, DNA polymerases do not synthesize DNA de novo, but instead must add nucleotides to preexisting strands, referred to as primers (as described in Chapter 11). Two synthetic oligonucleotides are prepared, complementary to sequences on opposite strands of the target DNA at positions defining the ends of the segment to be amplified. The oligonucleotides serve as replication primers that can be extended by a DNA polymerase. The 3′ ends of the hybridized primers are oriented toward each other and positioned to prime DNA synthesis across the targeted DNA segment (Figure 7-9a). Basic PCR requires four components: a DNA sample containing the segment to be amplified, the pair of synthetic oligonucleotide primers, deoxynucleoside triphosphates (dNTPs), and a DNA polymerase. The reaction mixture is heated briefly to denature the DNA, separating the two strands. The mixture is cooled so that the primers can anneal to the DNA. The high concentration of primers increases the likelihood that they will anneal to each strand of the denatured DNA before the two DNA strands (present at a much lower concentration) can reanneal to each other. The primed segment is then replicated selectively by the DNA polymerase, using the pool of dNTPs. The cycle of heating, cooling, and replication is repeated 25 to 30 times over a few hours in an automated process, amplifying the DNA segment between the primers until it can be readily analyzed or cloned. Each cycle doubles the amount of the DNA segment, so the concentration of this DNA grows exponentially. After 20 cycles, the DNA segment has been amplified up to 2²⁰-fold, or about a millionfold, if reaction conditions are ideal. All other DNA in the sample remains unamplified. PCR uses a heat-stable DNA polymerase, such as the Taq polymerase, which remains active after every heating step and does not have to be replenished.
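
The doubling arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function name pcr_copies and the efficiency parameter (the fraction of templates copied in each cycle, 1.0 under ideal conditions) are assumptions introduced here, not part of any standard protocol.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Estimate the number of target copies after some PCR cycles.

    efficiency is a hypothetical per-cycle yield: 1.0 means every
    template is copied in every cycle (perfect doubling).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


print(pcr_copies(1, 20))        # ideal doubling for 20 cycles
print(pcr_copies(1, 20, 0.9))   # a 90% yield per cycle gives fewer copies
```

With ideal doubling, one starting molecule yields 2²⁰ = 1,048,576 copies after 20 cycles. At 90% efficiency the same 20 cycles give fewer than 400,000 copies, which is why the text qualifies "a millionfold" with ideal reaction conditions.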

Figure 7-9: Amplification of a DNA segment by the polymerase chain reaction (PCR). PCR leads to specific amplification of DNA in a segment defined by the two designed DNA primers. If extra DNA sequences are included at the 5′ end of the synthetic primers (e.g., the sequence specifying a restriction site, as shown here), those sequences are incorporated into the final product.

By careful design of the primers used for PCR, the amplified segment can be altered by the inclusion, at each end, of additional DNA not present in the chromosome that is being targeted. For example, restriction endonuclease cleavage sites can be included to facilitate the subsequent cloning of the amplified DNA (Figure 7-9b).

This technology is highly sensitive: PCR can detect and amplify as little as one DNA molecule in almost any type of sample—including some quite ancient ones. The double-helical structure of DNA makes it a highly stable molecule (see Chapter 6), but DNA does degrade slowly over time (through reactions described in Chapter 12). PCR has allowed the successful cloning of rare, undegraded DNA segments from samples more than 40,000 years old. Investigators have used the technique to clone DNA fragments from the mummified remains of humans and extinct animals, such as the woolly mammoth, creating the fields of molecular archaeology and molecular paleontology. DNA from burial sites has been amplified by PCR and used to trace ancient human migrations. Epidemiologists can use PCR-enhanced DNA samples from human remains to trace the evolution of human pathogenic viruses. Thus, in addition to its usefulness for cloning DNA, PCR is a potent tool in forensic medicine (Highlight 7-1). It is also being used for detecting viral infections before they cause symptoms and for the prenatal diagnosis of a wide array of genetic diseases.

Given the extreme sensitivity of PCR methods, contamination of samples is a serious issue. In many applications, including forensic and ancient DNA tests, controls must be run to make sure the amplified DNA is not derived from the researcher or from contaminating bacteria.

Many specialized adaptations of PCR have increased the utility of the method. For example, sequences in RNA can be amplified if reverse transcriptase is used in the first PCR cycle (see Figure 7-8). After the DNA strand is made using the RNA as a template, the remaining cycles can be carried out with a DNA polymerase by normal PCR protocols. This reverse transcriptase PCR (RT-PCR) can be used, for example, to detect sequences derived from living cells (which are transcribing their DNA into RNA) as opposed to dead tissues.

PCR protocols can also be made quantitative for estimating the relative copy numbers of particular sequences in a sample. The approach is called quantitative PCR (qPCR). If a DNA sequence is present in higher than usual amounts in a sample—for example, certain genes may be amplified so that they are present in many copies in the cells that make up a cancerous tumor—quantitative PCR can reveal the increased representation of that sequence. In brief, the PCR is carried out in the presence of a probe that emits a fluorescent signal when the PCR product is present (Figure 7-10). If the sequence of interest is present at higher levels than other sequences in the sample, the PCR signal will reach a predetermined threshold faster. Reverse transcriptase PCR and quantitative PCR can be combined to determine the relative transcription levels of genes in a cell under different environmental conditions or to study the regulation of transcription of one or more genes.
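
The threshold logic can be sketched in Python. The capped exponential curve below is a toy model standing in for real fluorescence data, and the function names, the plateau value, and the copy numbers are all assumptions chosen for illustration. Actual instruments fit measured fluorescence rather than simulating it.

```python
def simulate_signal(initial_copies, cycles, efficiency=1.0, plateau=1e12):
    """Toy amplification curve: exponential growth capped at a plateau."""
    signal = []
    amount = float(initial_copies)
    for _ in range(cycles):
        amount = min(amount * (1.0 + efficiency), plateau)
        signal.append(amount)
    return signal


def threshold_cycle(signal, threshold):
    """Return C_T, the first cycle (1-based) whose signal exceeds threshold."""
    for cycle, value in enumerate(signal, start=1):
        if value > threshold:
            return cycle
    return None  # threshold never surpassed in the cycles that were run


def fold_difference(ct_a, ct_b, efficiency=1.0):
    """How many times more template sample A contained than sample B."""
    return (1.0 + efficiency) ** (ct_b - ct_a)


# Hypothetical samples: the tumor holds 8 times more template than normal tissue.
ct_tumor = threshold_cycle(simulate_signal(8000, 40), threshold=1e9)
ct_normal = threshold_cycle(simulate_signal(1000, 40), threshold=1e9)
print(ct_tumor, ct_normal, fold_difference(ct_tumor, ct_normal))
```

In this toy run the more abundant sample crosses the threshold at cycle 17 and the less abundant one at cycle 20. With perfect doubling, a difference of three cycles corresponds to a 2³ = 8-fold difference in starting template.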

Figure 7-10: Quantitative PCR. PCR can be used quantitatively, by carefully monitoring the progress of a PCR amplification and determining when a DNA segment has been amplified to a specific threshold level. (a) The amount of PCR product present is determined by measuring the level (fluorescence) of a fluorescent probe attached to a reporter oligonucleotide complementary to the DNA segment that is being amplified. Probe fluorescence is not detectable initially due to a fluorescence quencher attached to the same oligonucleotide. When the reporter oligonucleotide pairs with its complement in a copy of the amplified DNA segment, the fluorophore is separated from the quenching molecule and fluorescence results. (b) As the PCR reaction proceeds, the amount of the targeted DNA segment increases exponentially, and the fluorescent signal also increases exponentially as the oligonucleotide probes anneal to the amplified segments. After many PCR cycles, the signal reaches a plateau as one or more reaction components are exhausted. When a segment is present in greater amounts in one sample than another, its amplification reaches a defined threshold level earlier. The “No template” line follows the slow increase in background signal observed in a control that does not include added sample DNA. CT is the cycle number at which the threshold is first surpassed.

HIGHLIGHT 7-1 TECHNOLOGY: A Potent Weapon in Forensic Medicine

One of the most accurate methods for placing an individual at the scene of a crime is a fingerprint. But with the advent of recombinant DNA technology, a much more powerful tool became available: DNA genotyping (also called DNA fingerprinting or DNA profiling). As first described by English geneticist Alec Jeffreys in 1985, the method is based on sequence polymorphisms, slight sequence differences among individuals—1 in every 1,000 bp, on average. Each difference from the prototype human genome sequence (the first one obtained) occurs in some fraction of the human population; every person has some differences from this prototype.

Forensic work focuses on differences in the lengths of short tandem repeat (STR) sequences. An STR locus is a short DNA sequence, repeated many times in tandem at a specific location in a chromosome; usually, the repeated sequence is 4 bp long. The loci most often used in STR genotyping are short—4 to 50 repeats (16 to 200 bp for tetranucleotide repeats)—and have multiple length variants in the human population. More than 20,000 tetranucleotide STR loci have been characterized in the human genome. And more than a million STRs of all types may be present in the human genome, accounting for about 3% of all human DNA.

The length of a particular STR in a given individual can be determined with the aid of the polymerase chain reaction (see Figure 7-9). The use of PCR also makes the procedure sensitive enough to be applied to the very small samples often collected at crime scenes. The DNA sequences flanking STRs are unique to each type of STR and are identical (except for very rare mutations) in all humans. PCR primers are targeted to this flanking DNA and are designed to amplify the DNA across the STR (Figure 1a). The length of the PCR product then reflects the length of the STR in that sample. Because each human inherits one chromosome of each chromosome pair from each parent, the STR lengths on the two chromosomes are often different, generating two different STR lengths from one individual. The PCR products are subjected to electrophoresis on a very thin polyacrylamide gel in a capillary tube. The resulting bands are converted into a set of peaks that accurately reveal the size of each PCR fragment and thus the length of the STR in the corresponding allele. Analysis of multiple STR loci can yield a profile that is unique to an individual (Figure 1b). This is typically done with a commercially available kit that includes PCR primers unique to each locus, linked to colored dyes to help distinguish the different PCR products. PCR amplification enables investigators to obtain STR genotypes from less than 1 ng of partially degraded DNA, an amount that can be obtained from a single hair follicle, a small fraction of a drop of blood, a small semen sample, or samples that might be months or even many years old. When good STR genotypes are obtained, the chance of misidentification is less than 1 in 10¹⁸ (a quintillion).
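
The relationship between PCR product length and repeat number can be written out as a short Python sketch. The flanking length of 120 bp and the peak sizes are made-up numbers for a hypothetical locus, and the function names are introduced here for illustration. The sketch also assumes whole repeats only, whereas real loci can carry partial repeats.

```python
def repeats_from_fragment(fragment_bp, flank_bp, unit_bp=4):
    """Convert a PCR product size into a number of tandem repeats.

    flank_bp is the combined length of everything in the product that is
    not part of the repeat array (primers plus flanking DNA).
    """
    repeat_region = fragment_bp - flank_bp
    if repeat_region < 0 or repeat_region % unit_bp != 0:
        raise ValueError("fragment size does not fit a whole number of repeats")
    return repeat_region // unit_bp


def genotype(peak_sizes_bp, flank_bp, unit_bp=4):
    """Return the pair of alleles (as repeat counts) for one STR locus."""
    alleles = sorted(repeats_from_fragment(size, flank_bp, unit_bp)
                     for size in peak_sizes_bp)
    if len(alleles) == 1:
        alleles = alleles * 2  # one peak: both chromosomes carry the same length
    return tuple(alleles)


# Hypothetical tetranucleotide locus with 120 bp of non-repeat sequence.
print(genotype([152, 164], flank_bp=120))  # two peaks, heterozygous
print(genotype([152], flank_bp=120))       # one peak, homozygous
```

Here peaks at 152 bp and 164 bp correspond to alleles with 8 and 11 repeats, so the individual is heterozygous at this locus.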

FIGURE 1 (a) STR loci can be analyzed by PCR. Suitable PCR primers (with an attached dye to aid in subsequent detection) are targeted to sequences on each side of the STR, and the region between them is amplified. If the STR sequences have different lengths on the two chromosomes of an individual’s chromosome pair, two PCR products of different lengths result. (b) The PCR products from amplification of up to 16 STR loci can be run on a single capillary acrylamide gel (a “16-plex” analysis). Determination of which locus corresponds to which signal depends on the color of the fluorescent dye attached to the primers used in the process and on the size range in which the signal appears (the size range can be controlled by which sequences—those closer to or more distant from the STR—are targeted by the designed PCR primers). RFU = relative fluorescence units, measured against a standard supplied with the kit.

The successful forensic use of STR analysis required standardization, first attempted in the United Kingdom in 1995. The U.S. standard, called the Combined DNA Index System (CODIS), established in 1998, is based on 13 well-studied STR loci, which must be present in any DNA-typing experiment carried out in the United States (Table 1). The amelogenin gene is also used as a marker in the analyses. Present on the human sex chromosomes, this gene has a slightly different length on the X and Y chromosomes. PCR amplification across this gene thus generates different-sized products that can reveal the sex of the DNA donor. By the beginning of 2014, the CODIS database contained nearly 11 million STR genotypes and had assisted more than 220,000 forensic investigations.

DNA genotyping has been used to both convict and acquit suspects, and to establish paternity with an extraordinary degree of certainty. The impact of these procedures on court cases will continue to grow as standards are refined and as international STR genotyping databases grow. Even very old mysteries can be solved. In 1996, STR genotyping helped confirm the identification of the bones of the last Russian czar and his family, who were assassinated in 1918.

The Sanger Method Identifies Nucleotide Sequences in Cloned Genes

In its capacity as a repository of information, a DNA molecule’s most important property is its nucleotide sequence. Until the late 1970s, determining the sequence of a nucleic acid containing even 5 or 10 nucleotides was very laborious. The development of two techniques in 1977 (one by Allan Maxam and Walter Gilbert, the other by Frederick Sanger) made possible the sequencing of larger DNA molecules. The techniques depended on the improved understanding of nucleotide chemistry and DNA metabolism and on improved electrophoretic methods for separating DNA strands that differ in size by only one nucleotide (see Figure 6-32 for a description of gel electrophoresis). In work on short DNA oligonucleotides (up to a few hundred nucleotides), polyacrylamide is often used instead of agarose as the gel matrix, because it enables researchers to detect small size differences between DNA fragments.

Although the two methods are similar in approach, the Sanger method, also known as the dideoxy chain-termination method, has proved to be technically easier and is in more widespread use (Figure 7-11). This method makes use of the mechanism of DNA synthesis by DNA polymerases (see Chapter 11). It requires the enzymatic synthesis of a DNA strand complementary to the strand under analysis, using a radioactively labeled primer and dideoxynucleotides. In the reaction catalyzed by DNA polymerase, the 3′-hydroxyl group of the primer reacts with an incoming deoxynucleoside triphosphate (dNTP) to form a new phosphodiester bond (Figure 7-11a). The identity of the added deoxynucleotide is determined by its complementarity, through base pairing, to a base in the template strand. In the Sanger sequencing reaction, nucleotide analogs called dideoxynucleoside triphosphates (ddNTPs) interrupt DNA synthesis because they lack the 3′-hydroxyl group needed for the next step (Figure 7-11b). For instance, the addition of ddCTP to an otherwise normal reaction system causes some of the synthesized strands to be prematurely terminated at the position where dC would normally be added, opposite a template dG. Given the excess of dCTP over ddCTP, the chance that the analog will be incorporated instead of dC is small. But ddCTP is present in sufficient amounts to ensure that each new strand has a high probability of acquiring at least one ddC at some point during synthesis. The result is a solution containing a mixture of labeled fragments, each ending with a C residue. Each G residue in the template generates C-terminated fragments of a particular length. The different-sized fragments, separated by electrophoresis, reveal the location of C residues in the synthesized DNA strand.

Figure 7-11: The Sanger method for DNA sequencing. (a) DNA synthesis involves a reaction between the 3′-hydroxyl group of the nucleotide at the 3′ end of the primer and the α-phosphate group of an incoming dNTP. (b) The Sanger method uses ddNTPs, which lack the 3′-hydroxyl group, to halt DNA synthesis at a particular nucleotide. (c) In the sequencing reaction, DNA synthesis is carried out with a mixture of dNTPs and a ddNTP to extend a radiolabeled primer. A different ddNTP is used in each reaction. The products are analyzed by autoradiography to determine the nucleotide sequence.

This procedure is repeated separately for each of the four ddNTPs, and the sequence of the DNA strand can be read directly from an autoradiogram of the gel (Figure 7-11c). Because shorter DNA fragments migrate faster, the fragments near the bottom of the gel represent the nucleotide positions closest to the primer (the 5′ end), and the sequence is read (in the 5′→3′ direction) from bottom to top. Note that the sequence obtained is that of the strand complementary to the template strand being analyzed.
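
The logic of reading a dideoxy gel can be mimicked in Python. The sketch below is an idealization: it assumes that, among the many molecules in each reaction, termination occurs at every possible position for the relevant ddNTP, and it uses a hypothetical 20-nucleotide primer. The template and all function names are invented for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PRIMER_LENGTH = 20  # hypothetical primer length, in nucleotides


def new_strand(template_3_to_5):
    """Sequence (5'->3') synthesized opposite a template read 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def terminated_lengths(strand, ddntp):
    """Fragment lengths produced in the reaction containing one ddNTP."""
    return [PRIMER_LENGTH + position
            for position, base in enumerate(strand, start=1)
            if base == ddntp]


def run_four_reactions(template_3_to_5):
    """One lane per ddNTP, each holding the lengths of its fragments."""
    strand = new_strand(template_3_to_5)
    return {base: terminated_lengths(strand, base) for base in "ACGT"}


def read_gel(lanes):
    """Read bands from shortest (bottom of gel) to longest (top)."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)


template = "TACGGATC"  # written 3'->5'
lanes = run_four_reactions(template)
print(lanes)
print(read_gel(lanes))  # sequence of the strand complementary to the template
```

Sorting the bands by length reproduces the bottom-to-top reading of the gel, and the result is the sequence of the newly synthesized strand, not of the template itself.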

DNA sequencing was first automated by a variation of the Sanger method, in which each of the four dideoxynucleotides used for a reaction was labeled with a differently colored fluorescent tag (Figure 7-12). With this technology, researchers could sequence DNA molecules containing thousands of nucleotides in a few hours, and the entire genomes of hundreds of organisms were sequenced in this way. For example, in the Human Genome Project, researchers sequenced all 3.2 × 10⁹ bp of the DNA in a human cell (see Chapter 8).

Figure 7-12: Automation of DNA sequencing reactions. In the Sanger method, each ddNTP can be linked to a fluorescent (dye) molecule that gives the same color to all the fragments terminating in that nucleotide, a different color for each nucleotide. All four labeled ddNTPs are added together. The resulting colored DNA fragments are separated by size in an electrophoretic gel in a capillary tube (a refinement of gel electrophoresis that allows for faster separations). All fragments of a given length migrate through the capillary gel together in a single band, and the color associated with each band is detected with a laser beam. The DNA sequence is read by identifying the color sequences in the bands as they pass the detector. This information is fed directly to a computer, and the sequence is determined. The amount of fluorescence in each band is represented as a peak in the computer output. Here, the nucleotide colors reflect the dyes actually used in the method, and thus deviate from the standard nucleotide colors used in other figures.

Genomic Sequencing Is Aided by New Generations of DNA Sequencing Methods

DNA sequencing technologies continue to evolve. A complete human genome can now be sequenced in a day or two, a bacterial genome in a few hours. The day when a personal genomic sequence might be a routine part of each individual’s medical record is fast approaching. These advances have been made possible by methods sometimes referred to as next-generation, or “next-gen” sequencing. The sequencing strategy is sometimes similar to and sometimes quite different from that used in the Sanger method. Innovations have allowed a miniaturization of the procedure, a massive increase in scale, and a corresponding decrease in cost.

A genomic sequence is generated in several steps. First, the genomic DNA is broken at random locations by shearing to generate fragments that are a few hundred base pairs long. Synthetic oligonucleotides are ligated to the ends of all the fragments, providing a known point of reference on every DNA molecule. The individual fragments are then immobilized on a solid surface, and each is amplified by PCR (see Figure 7-9). The solid surface is part of a channel that allows liquid solutions to flow over the samples. The result is a solid surface just a few centimeters wide, with millions of attached DNA clusters, each cluster containing multiple copies of a single DNA sequence derived from a random genomic DNA fragment. The efficiency comes from sequencing all of these millions of clusters at the same time, with the data from each cluster captured and stored in a computer.

Two widely utilized next-generation sequencers use different strategies to accomplish the sequencing reactions. One of these, known as 454 sequencing (the numbers refer to a code used during development of the technology and have no scientific meaning), uses a strategy called pyrosequencing in which the addition of nucleotides is detected by flashes of light (Figure 7-13). The four dNTPs (unaltered) are pulsed onto the reacting surface one at a time in a repeating sequence. The nucleotide solution is retained on the surface just long enough for DNA polymerase to add that nucleotide to any cluster where it is complementary to the next nucleotide in the template sequence. Excess nucleotide is destroyed quickly by the enzyme apyrase before the next nucleotide pulse. When a specific nucleotide is successfully added to the strands of a cluster, pyrophosphate is released as a byproduct. Another enzyme in the solution bathing the surface is sulfurylase, which converts the pyrophosphate to ATP. The appearance of ATP ultimately provides the signal that a nucleotide has been added to the DNA. Also present in the medium is the enzyme luciferase and its substrate molecule, luciferin (luciferase is the enzyme that generates the flash of light produced by fireflies). When ATP is generated, luciferase catalyzes a reaction with luciferin that results in a tiny flash of light. When many tiny flashes occur in a cluster, the emitted light can be recorded in a captured image. For example, when dCTP is added to the solution, flashes occur only at clusters where G is the next base in the template and C is the next nucleotide to be added to the growing DNA chain. If there is a string of two, three, or four G residues in the template, a similar number of C residues are added to the growing strand in one cycle. This is recorded as a “flash” amplitude at that cluster that is two, three, or four times greater than when only one C residue is added. Similarly, when dGTP is added, flashes occur at a different set of clusters, marking those as clusters where G is the next nucleotide added to the sequence. The length of DNA that can be reliably sequenced in a single cluster by this method (often referred to as the read length, or “read”) is typically 400 to 500 nucleotides, and is constantly increasing.
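
The way a series of nucleotide pulses is turned into a sequence for one cluster can be sketched in Python. The flow order "TACG", the example template, and the function names are assumptions made for illustration. The sketch records exact integer flash amplitudes, whereas a real instrument has to estimate them from light intensity.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def flowgram(template_3_to_5, flow_order="TACG", max_flows=400):
    """Simulate one cluster: list of (dNTP pulsed, flash amplitude) pairs.

    The amplitude is the number of identical nucleotides added during
    that pulse (0 means no flash at this cluster).
    """
    growing = "".join(COMPLEMENT[base] for base in template_3_to_5)
    position = 0
    signals = []
    for flow in range(max_flows):
        if position == len(growing):
            break  # the whole template has been copied
        base = flow_order[flow % len(flow_order)]
        run = 0
        while position < len(growing) and growing[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
    return signals


def call_bases(signals):
    """Rebuild the synthesized sequence from the recorded amplitudes."""
    return "".join(base * run for base, run in signals)


signals = flowgram("TGGGCA")  # template written 3'->5', with a run of three G
print(signals)
print(call_bases(signals))
```

For this template the dCTP pulse gives an amplitude of 3, reflecting the three consecutive G residues in the template, and the called sequence is ACCCGT.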

Figure 7-13: Next-generation pyrosequencing. (a) Pyrosequencing detects the addition of nucleotides on the DNA to be sequenced by flashes of light. (b) An image of a very small part of one cycle of a 454 sequencing run. Each individual segment of template DNA to be sequenced is attached to a tiny DNA capture bead, then amplified on the bead by PCR. Each bead is immersed in an emulsion and placed in a tiny (∼ 29 μm) well on a picotiter plate. The reaction of luciferin and ATP with luciferase produces light flashes when a nucleotide is added to a particular DNA cluster in a particular well. Circles represent the same cluster over multiple cycles. In this case, reading the top (or bottom) circle from left to right across each row gives the sequence for that cluster.

The second widely used method employs a technique known as reversible terminator sequencing (Figure 7-14), which lies at the heart of the Illumina sequencer. A special sequencing primer is added that is complementary to the oligonucleotides of known sequence that were ligated to the ends of the DNA fragments in each cluster (as described above). In addition, fluorescently labeled terminator nucleotides and DNA polymerase are added. The polymerase adds the appropriate nucleotide to the strands in each cluster, each type of nucleotide (A, T, G, or C) carrying a different fluorescent label. These terminator nucleotides have blocking groups attached to the 3′ ends that permit addition of only one nucleotide to each strand. Next, lasers excite all the fluorescent labels, and an image of the entire surface reveals the color (and thus the identity of the base) added to each cluster. The fluorescent label and the blocking groups are then chemically or photolytically removed, in preparation for adding a new nucleotide to each cluster. The sequencing proceeds stepwise. Read lengths are shorter for this method, typically 100 to 200 nucleotides per cluster, although refinements are ongoing.
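
The stepwise character of this method, one nucleotide per cluster per cycle, can be illustrated with a small Python sketch. The dye colors, cluster names, and sequences are hypothetical, and the functions are invented for this example. Each "image" is simply a mapping from cluster to the color seen in that cycle.

```python
# Hypothetical color assignment for the four labeled terminators.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_COLOR = {color: base for base, color in DYE_COLOR.items()}


def image_cycles(cluster_strands, n_cycles):
    """Return one image per cycle: {cluster: color of the base just added}.

    cluster_strands maps a cluster name to the strand (5'->3') that the
    polymerase will synthesize there. The blocking group limits every
    cluster to a single added nucleotide in each cycle.
    """
    images = []
    for cycle in range(n_cycles):
        image = {}
        for cluster, strand in cluster_strands.items():
            if cycle < len(strand):
                image[cluster] = DYE_COLOR[strand[cycle]]
        images.append(image)
    return images


def decode_cluster(images, cluster):
    """Read one cluster's sequence from its colors across all cycles."""
    return "".join(BASE_FOR_COLOR[image[cluster]]
                   for image in images if cluster in image)


clusters = {"spot_1": "GGGTCA", "spot_2": "ATTC"}
images = image_cycles(clusters, 6)
print(images[0])
print(decode_cluster(images, "spot_1"), decode_cluster(images, "spot_2"))
```

Unlike pyrosequencing, a run of identical bases (the GGG in spot_1) is read over three separate cycles here, so its length does not have to be inferred from signal strength.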

Figure 7-14: Next-generation reversible terminator sequencing. (a) The reversible terminator method of sequencing uses fluorescent tags to identify nucleotides. Blocking groups on each fluorescently labeled nucleotide prevent multiple nucleotides from being added per cycle. (b) Six successive cycles from one very small part of an Illumina sequencing run. Each colored spot represents the location of an immobilized DNA oligonucleotide affixed to the surface of the flow cell. The circled clusters represent the same spot on the surface over successive cycles and give the sequences indicated. Data are automatically recorded and analyzed digitally. (c) Typical flow cell used for a next-generation sequencer. Millions of DNA fragments can be sequenced simultaneously in each of the eight channels.

These technologies are modern manifestations of an approach to genomic sequencing that is sometimes called shotgun sequencing. Many copies of the genomic DNA are sheared to generate each set of fragments. Thus, a particular short segment of the genome may be present in dozens or even hundreds of different sequenced clusters. However, there is no landmark on an individual fragment to indicate where in the genome it came from. Assembling the sequences of these millions of fragments into a genomic sequence requires the computerized alignment of overlapping fragment sequences (Figure 7-15). The number of times that a particular nucleotide in the genome is sequenced, on average, is referred to as the sequencing depth, or sequencing coverage. In most cases, a sufficiently large number of random fragments are sequenced so that each nucleotide in the genome is sequenced an average of 30 to 40 times (30–40 × coverage). Although the coverage of particular nucleotides may vary (some will be sequenced 100 times; perhaps a few not at all), this level of coverage ensures that most genomic nucleotides will be sequenced at least 10 times and most sequencing errors will be detected and eliminated. The overlaps allow the computer to trace the sequence through a chromosome, from one fragment to another. This permits the assembly of long contiguous sequences called contigs. In a successful genomic sequencing exercise, many contigs can extend over millions of base pairs. Special strategies are needed to fill in the inevitable gaps and to deal with repetitive sequences.
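
The coverage arithmetic can be made concrete with a short Python sketch. The 150-nucleotide read length is an assumed value, and the Poisson model of read placement is a common idealization introduced here, not something taken from the text. Real coverage is less uniform than the model suggests.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of times each nucleotide is sequenced."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def prob_depth_at_least(k, coverage):
    """P(a nucleotide is covered by >= k reads), assuming Poisson depth."""
    below = sum(math.exp(-coverage) * coverage ** i / math.factorial(i)
                for i in range(k))
    return 1.0 - below


genome = 3_200_000_000  # human genome size in bp, as given earlier in the text
print(reads_needed(30, 150, genome))   # reads of 150 nt needed for 30x coverage
print(prob_depth_at_least(10, 30))     # chance that a base is read >= 10 times
```

Under these assumptions, 30× coverage of a human genome takes 640 million reads of 150 nucleotides, and the model predicts that all but a tiny fraction of positions are read at least 10 times.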

Figure 7-15: Sequence assembly. In a genomic sequence, each base pair of the genome is usually represented in many of the sequenced fragments, referred to as reads. Shown is a small part of the sequence of a new variant species of E. coli, with the reads generated by a 454 sequencer. The numbers at the top represent genomic base-pair positions, relative to an arbitrarily defined “0.” The sequences all come from a particular long contig designated 356. The reads themselves are represented by horizontal arrows, with computer-assigned identifiers listed for each one at the left. DNA strand segments are sequenced at random, with sequences obtained from one strand (5′ to 3′, left to right) represented by solid arrows and sequences obtained from the other strand (5′ to 3′, right to left) represented by dashed arrows. The latter sequences are automatically reported as their complement when they are merged with the overall dataset. The “coverage threshold” at the top is a measure of sequence quality. The wider green bar indicates sequences that have been obtained enough times to generate high confidence in the results. The depth of the coverage line indicates how many times a given base pair appears in a sequenced read. The vertical blue shaded line indicates a part of the sequence that is highlighted by thin blue brackets in the sequence line at the bottom of the page. The “SNP statistics report” (inset) is a listing of positions where single nucleotide polymorphisms (SNPs; see Chapter 8) appear to be present in some of the reads. These putative SNPs are often checked by additional sequencing. They are indicated in the reads by thin, blue vertical slash marks within the horizontal lines for each read.

For some applications, the amounts of genomic DNA to be sequenced are increased so that sequencing depth is increased to 100 × or even 1,000 ×. This approach, sometimes called deep sequencing, can help determine whether a mutation is present in a subset of an organism’s cells or whether other genomic variations are present. Deep sequencing is helpful in the characterization of genomic sequences in cancerous tumors, where the genome is highly unstable and changes frequently as the tumor grows.

DNA sequencing technologies continue to advance rapidly, and a few newer next-generation methods now complement the two described above and may eventually replace them for many applications. A method called Ion Torrent sequencing utilizes fragmented and immobilized DNA segments, much like 454 and Illumina sequencing. Deoxynucleoside triphosphates are introduced, one by one. Addition of a dNTP at a certain spot in the growing chain is detected by measuring the protons released in the reaction. More sensitive light-detection methods have given rise to yet another approach, called single-molecule real-time (SMRT) sequencing. Here, a single molecule of DNA polymerase is immobilized at the bottom of each of millions of precisely engineered pores on the flow cell. The DNA polymerase captures fragmented genomic segments as they diffuse into the pore. The labeled nucleotides then diffuse in, each new one releasing its colored fluorescent group as it is added to the chain. An innovative light-detection system records the color of the resulting light flash at the bottom of the pore, thereby revealing the identity of each added nucleotide. The method is accurate and can generate particularly long read lengths, up to nearly 10,000 base pairs.

Cloned Genes Can Be Expressed to Amplify Protein Production

Frequently, it is the product of a cloned gene, rather than the gene itself, that is of primary interest, particularly when the protein has commercial, therapeutic, or research value. Molecular biologists use purified proteins to elucidate protein function, study reaction mechanisms, generate antibodies, reconstitute complex cellular activities in the test tube with purified components, examine protein binding partners, and pursue many other goals. With an increased understanding of the fundamentals of DNA, RNA, and protein metabolism and their regulation in E. coli, investigators can now manipulate cells to express cloned genes in order to study their protein products. The general goal is to alter the sequences around a cloned gene to trick the host organism into producing the protein product of the gene, often at very high levels. This overexpression of a protein can make its subsequent purification much easier.

Here we use the expression of a eukaryotic protein in a bacterium as an example. Most cloned eukaryotic genes lack the DNA sequence elements required for their controlled expression in bacterial cells—promoters (sequences that instruct RNA polymerase where to bind to initiate mRNA synthesis), ribosome-binding sites (sequences that allow translation of the mRNA to protein), and additional regulatory sequences (see Chapter 15). Therefore, appropriate bacterial regulatory sequences for transcription and translation must be inserted at the correct positions, relative to the eukaryotic gene, in the vector DNA. In some cases, cloned genes are so efficiently expressed that their protein product represents 10% or more of the cellular protein. At these concentrations, some foreign proteins can kill the host cell (usually E. coli), so the cloned gene expression must be limited to the few hours before the planned harvesting of the cells.

Cloning vectors with the transcription and translation signals needed for the regulated expression of a cloned gene are called expression vectors. The rate of expression of the cloned gene is controlled by replacing the gene’s own promoter and regulatory sequences with more efficient and convenient versions supplied by the vector. Generally, a well-characterized promoter and its regulatory elements are positioned near several unique restriction sites for cloning, so that genes inserted at the restriction sites will be expressed from the regulated promoter elements (Figure 7-16). Some of these vectors incorporate other features, such as a bacterial ribosome-binding site to enhance translation of the mRNA derived from the gene (see Chapter 18) or a transcription-termination sequence (Chapter 15).

Figure 7-16: DNA sequences in a typical E. coli expression vector. The gene to be expressed is inserted into one of the restriction sites in the polylinker, near the promoter (P), with the end of the gene encoding the N-terminus of the protein positioned closest to the promoter. The promoter allows efficient transcription of the inserted gene, and the transcription-termination sequence sometimes improves the amount and stability of the mRNA produced. The operator (O) is a sequence bound by a protein called a repressor, which normally blocks gene expression from the adjacent gene. The ribosome-binding site provides sequence signals for the efficient translation of the mRNA derived from the gene. The selectable marker allows the selection of cells containing the recombinant DNA.

Many Different Systems Are Used to Express Recombinant Proteins

Every living organism has the capacity to express genes contained in its genomic DNA; thus, in principle, any organism can serve as a host to express proteins from a different (heterologous) species. Almost every sort of organism has indeed been used for this purpose, and each host type has a particular set of advantages and disadvantages.

Bacteria Bacteria, especially E. coli, remain the most common hosts for protein expression. The regulatory sequences that govern gene expression in E. coli and many other bacteria are well understood and can be harnessed to express cloned proteins at high levels. Bacteria are easy to store and grow in the laboratory, on inexpensive growth media. Efficient methods also exist to get DNA into bacteria and extract DNA from them. Bacteria can be grown in huge amounts in commercial fermentors, providing a rich source of the cloned protein. Problems do exist, however. When expressed in bacteria, some heterologous proteins do not fold correctly, and many do not undergo the postsynthetic modifications (covalent modification, proteolytic cleavage, etc.; see Chapter 18) necessary for their activity. A variety of gene sequence features also can make a particular gene difficult to express in bacteria. For these and many other reasons, some eukaryotic proteins are inactive when purified from bacteria, or they cannot be expressed at all.

There are many specialized systems for expressing proteins in bacteria. The promoter and regulatory sequences associated with the lactose operon (see Chapters 5 and 20) are often fused to the gene of interest to direct transcription. The cloned gene will be transcribed when lactose is added to the growth medium. However, regulation in the lactose system is “leaky”: it is not turned off completely when lactose is absent—a potential problem if the product of the cloned gene is toxic to the host cells. Transcription from the Lac promoter is also not efficient enough for some applications.

An alternative system uses a promoter and RNA polymerase found in a bacterial virus called bacteriophage T7. If the cloned gene is fused to a T7 promoter, it is transcribed not by the E. coli RNA polymerase but by the T7 RNA polymerase. The gene encoding this polymerase is separately cloned into the same cell in a construct that affords tight regulation (allowing controlled production of the T7 RNA polymerase). The polymerase is also very efficient and directs high levels of expression of most genes fused to the T7 promoter. This system has been used to express the RecA protein in bacterial cells (Figure 7-17).

Figure 7-17: Regulated expression of RecA protein in a bacterial cell. The gene encoding the RecA protein, fused to a bacteriophage T7 promoter, is cloned into an expression vector. Under normal growth conditions (uninduced, left lane), no RecA protein appears when cellular proteins are separated on a polyacrylamide gel (see Chapter 6) and stained with Coomassie Blue for visualization. When the T7 RNA polymerase is induced in the cell (right lane), the recA gene is expressed, and the large amounts of RecA protein produced are readily observed.

Yeast The yeast S. cerevisiae is probably the best understood eukaryotic organism and one of the easiest to grow and manipulate in the laboratory. Like bacteria, this yeast can be grown on inexpensive media. Yeast have tough cell walls that are difficult to breach in order to introduce DNA vectors, so bacteria are more convenient for doing much of the genetic engineering and vector maintenance. Several excellent shuttle vectors exist for this purpose.

The principles underlying the expression of a protein in yeast are the same as those in bacteria. Cloned genes must be linked to promoters that can direct high-level expression in yeast. For example, the yeast GAL1 and GAL10 genes are under cellular regulation such that they are expressed when yeast cells are grown in media with galactose but shut down when the cells are grown in media with glucose. Thus, if a heterologous gene is expressed using the same regulatory sequences, the expression of that gene can be controlled simply by choosing an appropriate medium for cell growth.

Some of the same problems that accompany protein expression in bacteria also occur with yeast. Heterologous proteins may not fold properly, yeast may lack the enzymes needed to modify the proteins to their active forms, or the expression of proteins may be made difficult by certain features of the gene sequence. However, because S. cerevisiae is a eukaryote, the expression of eukaryotic genes (especially yeast genes) is sometimes more efficient in this host than in bacteria. Folding and modification of the products may also be more accurate than for proteins expressed in bacteria.

Insects and Insect Viruses Baculoviruses are insect viruses with double-stranded DNA genomes. When they infect their insect larval hosts, they act as parasites, killing the larvae and turning them into factories for virus production. Late in the infection process, the viruses produce large amounts of two proteins (p10 and polyhedrin)—neither of which is needed for virus production in cultured insect cells, and thus both can be replaced with the gene of a heterologous protein. When the resulting recombinant virus is used to infect insect cells or larvae, the heterologous protein is often produced at very high levels—up to 25% of the total protein present at the end of the infection cycle.

Autographa californica multicapsid nucleopolyhedrovirus (AcMNPV) is the baculovirus most often used for protein expression. Its genome (134,000 bp) is too large for direct cloning. Virus purification is also cumbersome. These problems have been solved by the creation of bacmids, large circular DNAs that include the entire baculovirus genome along with sequences that allow replication of the bacmid in E. coli (Figure 7-18). The gene of interest is cloned into a smaller plasmid and combined with the larger plasmid by site-specific recombination in vivo (described in Chapter 14). The recombinant bacmid is then isolated and transfected into insect cells (the term transfection is used when the DNA used for transformation includes viral sequences and the process leads to viral replication), followed by recovery of the protein once the infection cycle is finished. A wide range of bacmid systems are available commercially. Baculovirus systems are not successful with all proteins. However, with these systems, insect cells sometimes successfully replicate the protein-modification patterns of higher eukaryotes and produce active, correctly modified eukaryotic proteins.

Figure 7-18: Cloning with baculoviruses. (a) The construction of a typical vector used for protein expression in baculoviruses. The gene of interest is cloned into a small plasmid between two sites (att) recognized by a site-specific recombinase, then introduced into the baculovirus vector by site-specific recombination. This generates a circular DNA product that is used to infect the cells of an insect larva. The gene of interest is expressed during the infection cycle, downstream of a promoter that normally expresses a baculovirus coat protein at very high levels. (b) Left: an insect larva infected with a recombinant baculovirus vector expressing a protein that produces a red color. Right: an uninfected larva.

Mammalian Cells in Culture The most convenient way to introduce cloned genes into a mammalian cell is with viruses. In this way, a molecular biologist can take advantage of the natural capacity of a virus to insert its DNA or RNA into a cell, and sometimes into the cellular chromosome. A variety of engineered mammalian viruses are available as vectors, including human adenoviruses and retroviruses. The gene of interest is cloned so that its expression is controlled by a virus promoter. The virus uses its natural infection mechanisms to introduce the recombinant genome into cells, where the cloned protein is expressed. These systems have the advantage that proteins can be expressed either transiently (if the viral DNA is maintained separately from the host cell genome and eventually degraded) or permanently (if the viral DNA is integrated into the host cell genome). With the correct choice of host cell, the proper posttranslational modification of the protein to its active form can be assured. However, the growth of mammalian cells in tissue culture is very expensive, and this technology is generally used to test the function of a protein in vivo rather than to produce a protein in large amounts.

Transgenic Animals Even large animals can be used for the commercial, large-scale production of recombinant proteins. The strategies are different from those discussed thus far and are designed to generate protein in a low-cost, renewable way, such as purification of a protein from the milk of transgenic dairy cattle (Figure 7-19). The gene of interest is cloned into a special vector, linked to a promoter that directs tissue-specific gene expression. For example, the gene can be placed under the control of regulatory sequences for a mammary gland–specific protein, such as casein or lactoglobulin, which is normally secreted in milk in large quantities. The recombinant plasmid is injected into fertilized bovine oocytes, and some of them take up the plasmid and incorporate it into their genome. Genetic analysis or direct demonstration of heterologous protein expression then identifies animals in which the gene transfer has been successful, and these animals are bred. Heterologous proteins expressed in place of casein or lactoglobulin can be secreted in the milk at levels above 50% of total milk proteins. Posttranslational protein modifications are not always carried out correctly for proteins expressed in this way, but protein production can be economical once a line of protein-expressing animals is established.

Figure 7-19: Cloning in transgenic animals. These cows, grazing in a field in New Zealand, were engineered to produce high levels of a recombinant protein in their milk.

Alteration of Cloned Genes Produces Altered Proteins

Cloning techniques can be used not only to overproduce proteins but to produce protein products subtly altered from their native forms. Specific amino acids may be replaced individually by site-directed mutagenesis. A variety of methods, based in large measure on techniques pioneered by Michael Smith and his colleagues in the late 1970s, are now used to enhance research on proteins by allowing investigators to make specific changes in the primary structure and examine the effects of these changes on the protein’s folding, three-dimensional structure, and activity. This powerful approach to studying protein structure and function changes the amino acid sequence by altering the DNA sequence of the cloned gene. If appropriate restriction sites flank the sequence to be altered, researchers can simply remove a DNA segment and replace it with a synthetic one identical to the original except for the desired change (Figure 7-20a).

Figure 7-20: Two approaches to site-directed mutagenesis. (a) A synthetic DNA segment replaces a fragment removed by a restriction endonuclease. (b) A pair of synthetic and complementary oligonucleotides with a specific sequence change at one position are hybridized to a circular plasmid with a cloned copy of the gene to be altered. The oligonucleotides act as primers for the synthesis of full-length double-stranded DNA (dsDNA) copies of the plasmid that contain the specified sequence change. These plasmid copies are then used to transform cells. (c) Results from an automated sequencer (see Figure 7-12), showing sequences from the wild-type recA gene (top) and from an altered recA gene (recA K72R, bottom) with the triplet (codon) at position 72 changed from AAA to CGC, specifying an Arg (R) instead of a Lys (K) residue. Here, the nucleotide colors reflect the dyes actually used in the method, and thus deviate from the standard nucleotide colors used in other figures.

When suitably located restriction sites are not present, oligonucleotide-directed mutagenesis, coupled to PCR, can create a specific DNA sequence change (Figure 7-20b). Two short, complementary synthetic DNA strands, each with the desired base change, are annealed to opposite strands of the cloned gene within a suitable circular DNA vector. The mismatch of a single base pair in 30 to 40 bp does not prevent annealing. The two annealed oligonucleotides serve to prime DNA synthesis in both directions around the plasmid vector, creating two complementary strands that contain the mutation. After several cycles of PCR, the mutation-containing DNA predominates in the population and can be used to transform bacteria. Most of the transformed bacteria will have plasmids carrying the mutation. If necessary, the nonmutant template plasmid DNA can be selectively eliminated by cleavage with the restriction enzyme DpnI. The template plasmid, usually isolated from wild-type E. coli, has a methylated A residue in every copy of the four-nucleotide palindrome GATC. The new DNA containing the mutation does not have methylated A residues, because the replication is done in vitro (with no methylating enzymes present). DpnI selectively cleaves DNA at the sequence GATC only if the A residue in one or both strands is methylated—that is, the enzyme breaks down only the template.
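The design of the two complementary mutagenic oligonucleotides can be sketched in Python. This is a minimal illustration under stated assumptions: the template is a made-up 36-nucleotide coding sequence (not the recA gene), the function names are hypothetical, and the flank length of 15 nucleotides is chosen only to give primers in the 30 to 40 nucleotide range mentioned above. Real primer design also weighs melting temperature and secondary structure, which are ignored here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def mutagenic_primers(template, codon_start, new_codon, flank=15):
    """Return a complementary (forward, reverse) primer pair carrying new_codon.

    template    : coding strand, 5'->3'
    codon_start : 0-based index of the first base of the codon to change
    flank       : unchanged template bases kept on each side of the codon
    """
    if codon_start - flank < 0 or codon_start + 3 + flank > len(template):
        raise ValueError("not enough flanking sequence around the codon")
    left = template[codon_start - flank:codon_start]
    right = template[codon_start + 3:codon_start + 3 + flank]
    forward = left + new_codon + right
    return forward, reverse_complement(forward)


def count_dpni_sites(seq):
    """Count GATC sites; DpnI cuts them only when the A is methylated."""
    return seq.count("GATC")


# Hypothetical toy template with an AAA (Lys) codon at index 15.
toy_template = "ATGGCTGGTTCTGGT" + "AAA" + "ACCACTCTGGATCTG" + "CAG"
forward, reverse = mutagenic_primers(toy_template, 15, "CGC")
print(forward)
print(reverse)
print(count_dpni_sites(toy_template))
```

Each primer anneals to one strand of the plasmid through its unchanged flanks, and the altered codon sits in the middle. The GATC count indicates how many sites DpnI could cut in a methylated template; the unmethylated copies synthesized in vitro carry the same sites but are left intact.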

For an example, we can go back to the bacterial recA gene. The product of this gene, the RecA protein, has several activities (see Chapter 13). It binds to and forms a filamentous structure on DNA, aligns two DNAs of similar sequence, and hydrolyzes ATP. A particular amino acid residue in RecA (a 352-residue polypeptide), Lys72, is involved in ATP hydrolysis. By changing this Lys residue to an Arg, a variant of RecA protein is created that will bind, but not hydrolyze, ATP (Figure 7-20c). The engineering and purification of this variant RecA protein has facilitated research into the roles of ATP hydrolysis in the functioning of this protein.
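The effect of such a codon change on the encoded protein can be checked with a short Python sketch. The codon table is the standard genetic code; the five-codon gene, the residue numbering (counting the initiating Met as residue 1), and the function names are hypothetical and serve only to illustrate the AAA to CGC change shown in Figure 7-20c.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence (5'->3') up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def mutate_codon(cds, residue_number, new_codon):
    """Replace the codon for a 1-based residue number with new_codon."""
    start = 3 * (residue_number - 1)
    if residue_number < 1 or len(new_codon) != 3 or start + 3 > len(cds):
        raise ValueError("residue number or codon is out of range")
    return cds[:start] + new_codon + cds[start + 3:]


# Hypothetical toy gene: Met-Ala-Lys-Thr-stop.
toy_gene = "ATGGCTAAAACCTAA"
variant = mutate_codon(toy_gene, 3, "CGC")
print(translate(toy_gene))  # MAKT
print(translate(variant))   # MART
```

Only the targeted residue changes, from Lys (K) to Arg (R). Both side chains are positively charged, which is consistent with a variant that still binds ATP while losing the ability to hydrolyze it.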

Changes can be introduced into a gene that involve far more than one base pair. Large parts of a gene can be deleted by cutting out a segment with restriction endonucleases and ligating the remaining portions to form a smaller gene. Parts of two different genes can be ligated to create new combinations; the product of such a fused gene is called a fusion protein. Researchers have ingenious methods to bring about virtually any genetic alteration in vitro. After reintroducing the altered DNA into the cell, they can investigate the consequences of the alteration.

Terminal Tags Provide Handles for Affinity Purification

Affinity chromatography is one of the most efficient methods for purifying proteins (see Highlight 4-1). Unfortunately, many proteins do not bind a ligand that can be conveniently immobilized on a column matrix. With the use of fusion proteins, almost any protein can be purified by affinity chromatography.

The gene encoding the target protein is fused to a gene encoding a peptide or protein that binds a simple, stable ligand with high affinity and specificity. The peptide or protein used for this purpose is referred to as a tag. Tag sequences can be added to genes such that the resulting proteins have tags at their N- or C-terminus. Table 7-3 lists some of the peptides or proteins commonly used as tags.

The general procedure can be illustrated by focusing on a system that uses the glutathione-S-transferase (GST) tag (Figure 7-21). GST is a small enzyme (Mr 26,000) that binds tightly and specifically to glutathione. When the GST gene sequence is fused to a target gene, the fusion protein acquires the capacity to bind glutathione. The fusion protein is expressed in a host organism such as a bacterium, and a crude extract is prepared. A column is filled with a porous matrix consisting of the ligand (glutathione) immobilized on microscopic beads of a stable polymer such as cross-linked agarose. As the crude extract percolates through this matrix, the fusion protein becomes immobilized by binding to the glutathione. The other proteins in the extract are washed through the column and discarded. The interaction between GST and glutathione is tight but noncovalent, allowing the fusion protein to be gently eluted from the column with a solution containing either a higher concentration of salts or free glutathione to compete with the immobilized ligand for GST binding. The fusion protein is often obtained with good yield and high purity. In some commercially available systems, the tag can be entirely or largely removed from the purified fusion protein by a protease that cleaves a sequence near the junction between the target protein and its tag.

Figure 7-21: Use of tagged proteins in protein purification. (a) Glutathione-S-transferase (GST) is a small enzyme that binds glutathione (a glutamate residue to which a Cys–Gly dipeptide is attached at the carboxyl carbon of the Glu side chain, hence the abbreviation GSH). (b) The GST tag is fused to the C-terminus of the protein by genetic engineering. The tagged protein is expressed in the cell and is present in the crude extract when the cells are lysed. The extract is subjected to chromatography through a matrix with immobilized glutathione. The GST-tagged protein binds to the glutathione, retarding its migration through the column, while the other proteins are washed through rapidly. The tagged protein is subsequently eluted with a solution containing elevated salt concentration or free glutathione.

A shorter tag with widespread application consists of a simple sequence of six or more histidine residues. These histidine tags, or His tags, bind tightly and specifically to nickel ions. A chromatography matrix with immobilized Ni2+ can be used to quickly separate a His-tagged protein from other proteins in an extract. Some of the larger tags, such as maltose-binding protein, provide added stability and solubility, allowing the purification of cloned proteins that are otherwise inactive due to improper folding or insolubility.
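At the DNA level, adding a His tag amounts to inserting a run of histidine codons in frame at one end of the coding sequence. The sketch below assumes a simple open reading frame that runs from ATG to a stop codon; the toy sequence and the function name are hypothetical, and the linker and protease-cleavage sequences that commercial vectors often place between tag and target are left out.

```python
HIS_CODON = "CAC"  # CAC and CAT both encode histidine
STOP_CODONS = {"TAA", "TAG", "TGA"}


def add_his_tag(cds, n_his=6, terminus="C"):
    """Return the coding sequence with n_his histidine codons added in frame.

    terminus="C" places the tag just before the stop codon;
    terminus="N" places it just after the ATG start codon.
    """
    in_frame = len(cds) % 3 == 0
    if not in_frame or not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        raise ValueError("expected an in-frame ORF from ATG to a stop codon")
    tag = HIS_CODON * n_his
    if terminus == "C":
        return cds[:-3] + tag + cds[-3:]
    if terminus == "N":
        return cds[:3] + tag + cds[3:]
    raise ValueError("terminus must be 'N' or 'C'")


# Hypothetical toy ORF: Met-Ala-Lys-stop.
toy_orf = "ATGGCTAAATAA"
print(add_his_tag(toy_orf))                # tag before the stop codon
print(add_his_tag(toy_orf, terminus="N"))  # tag after the start codon
```

Because the tag is a whole number of codons, the reading frame of the target protein is preserved in both cases. The choice of terminus is an experimental decision, since a tag at either end can affect folding or activity.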

This technology is powerful and convenient. The tags have been successfully used in thousands of published studies; in many cases, the protein would be impossible to purify and study without the tag. However, even very small tags can affect the properties of the proteins they are attached to, thereby influencing the study results. Even if the tag is removed by a protease, one or a few extra amino acid residues can remain behind on the target protein, which may or may not affect the protein’s activity. The types of experiments to be carried out, and the results obtained from them, should always be evaluated with the aid of well-designed controls to assess any effect of a tag on protein function.

SECTION 7.2 SUMMARY

  • Genes or other DNA segments can be amplified by the polymerase chain reaction. With specialized adaptations of PCR, investigators can amplify sequences in RNA and quantify the levels of particular RNA molecules in a cell.

  • Modern DNA sequencing methods enable researchers to determine the sequences of entire mammalian genomes in weeks or even days. Many thousands of genomic DNA sequences are now available in public databases.

  • Cloned genes can be expressed to provide large amounts of the gene product. Systems have been developed to express genes in bacteria, yeast, insects, mammalian cells, and even entire mammalian organisms.

  • Cloned genes can be altered. A gene sequence can be changed, sequences deleted, or sequences added. Such changes can alter the protein or RNA product of the gene.

  • Added sequences can produce protein products that include fused peptide segments, called tags. With the aid of these tags, the protein can be rapidly purified.