14.2 Obtaining the Sequence of a Genome

When people encounter new territory, one of their first activities is to create a map. This practice has been true for explorers, geographers, oceanographers, and astronomers, and it is equally true for geneticists. Geneticists use many kinds of maps to explore the terrain of a genome. Examples are linkage maps based on inheritance patterns of gene alleles and cytogenetic maps based on the location of microscopically visible features such as rearrangement break points.

The highest-resolution map is the complete DNA sequence of the genome—that is, the complete sequence of nucleotides A, T, C, and G of each double helix in the genome. Because obtaining the complete sequence of a genome is such a massive undertaking of a sort not seen before in biology, new strategies must be used, all based on automation.

Turning sequence reads into an assembled sequence

You’ve probably seen a magic act in which the magician cuts up a newspaper page into a great many pieces, mixes them in his hat, says a few magic words, and voila! an intact newspaper page reappears. Basically, that’s how genomic sequences are obtained. The approach is to (1) break the DNA molecules of a genome up into thousands to millions of more or less random, overlapping small segments; (2) read the sequence of each small segment; (3) computationally find the overlaps among the small segments where their sequences are identical; and (4) continue overlapping ever larger pieces until all the small segments are linked (Figure 14-2). At that point, the sequence of a genome is assembled.

Figure 14-2: The logic of obtaining a genome sequence
Figure 14-2: To obtain a genome sequence, multiple copies of the genome are cut into small pieces that are sequenced. The resulting sequence reads are overlapped by matching identical sequences in different fragments until a consensus sequence of each DNA double helix in the genome is produced.
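The four-step logic above can be illustrated with a toy program. The following is a minimal sketch, not the method used by any real assembler: it assumes error-free reads, exact matching, a single strand, and no repeats, and the function names and the `min_overlap` parameter are hypothetical choices made for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two sequences that share the longest exact overlap."""
    contigs = sorted(set(reads))  # remove duplicate reads, fix a stable order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to link
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping "reads" taken from the toy sequence ATGCGTACGTTAGC
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))
```

With these three reads the sketch links everything into one sequence. Reads that share no sufficiently long overlap are left as separate pieces, which is the toy equivalent of an unfinished assembly.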

Why does this process require automation? To understand why, let’s consider the human genome, which contains about 3 × 10⁹ bp of DNA, or 3 billion base pairs (3 gigabase pairs = 3 Gbp). Suppose we could purify the DNA intact from each of the 24 human chromosomes (X, Y, and the 22 autosomes), separately put each of these 24 DNA samples into a sequencing machine, and read their sequences directly from one telomere to the other. Obtaining a complete sequence would be utterly straightforward, like reading a book with 24 chapters, albeit a very, very long book with 3 billion characters (about the length of 3000 novels). Unfortunately, such a sequencing machine does not exist.

Rather, automated sequencing is the current state of the art in DNA sequencing technology. Initially based on the pioneering Sanger dideoxy chain-termination method (discussed in Chapter 10), automated sequencing now employs a variety of chemistries and optical-detection methods. The methods now available vary in the length of DNA sequence obtained, the bases determined per second, and raw accuracy. For large-scale sequencing projects that seek to analyze large individual genomes or the genomes of many different individuals or species, choosing a method requires balancing speed, cost, and accuracy.

Individual sequencing reactions (called sequencing reads) provide letter strings that, depending on the sequencing technique employed, range on average from about 100 to 5000 bases long. Such lengths are tiny compared with the DNA of a single chromosome. For example, an individual read of 300 bases is only 0.0001 percent of the longest human chromosome (about 3 × 10⁸ bp of DNA) and only about 0.00001 percent of the entire human genome. Thus, one major challenge facing a genome project is sequence assembly, that is, building up all of the individual reads into a consensus sequence, a sequence for which there is consensus (or agreement) that it is an authentic representation of the sequence for each of the DNA molecules in that genome.

Let’s look at these numbers in a somewhat different way to understand the scale of the problem. As with any experimental observation, automated sequencing machines do not always give perfectly accurate sequence reads. Indeed, newer, higher-throughput sequencing technologies generate a greater frequency of errors than older methods; the error rate may range from less than 1 percent to as much as 10 percent, depending upon the technology. Thus, to ensure accuracy, genome projects conventionally obtain many independent sequence reads of each base pair in a genome. Many-fold coverage ensures that chance errors in the reads do not give a false reconstruction of the consensus sequence.
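To see why many-fold coverage protects against chance errors, consider a simple majority vote at each position. The sketch below assumes the reads have already been aligned to one another so that each column corresponds to one base pair; the function name and the use of "-" for positions a read does not cover are illustrative conventions, not part of any particular sequencing pipeline.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Call the most frequent base in each column of pre-aligned reads."""
    length = max(len(read) for read in aligned_reads)
    called = []
    for column in range(length):
        bases = [
            read[column]
            for read in aligned_reads
            if column < len(read) and read[column] != "-"
        ]
        # "N" marks a position that no read covers.
        called.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(called)


# Five-fold coverage of the same six bases; two reads carry one error each.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "ACGTTC"]
print(consensus(reads))
```

Each erroneous base is outvoted by the four correct reads covering the same position, so the isolated errors do not reach the consensus.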

Given an average sequence read of about 100 bases of DNA and a human genome of 3 billion base pairs, 300 million independent reads are required to give 10-fold average coverage of each base pair. However, not all sequences are represented equally, and so the number of reads required is larger. The amount of information to be tracked is enormous. Thus, genome sequencing has required many advances in automation and information technology.
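The arithmetic behind these figures can be checked with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, and the last function uses an idealized model in which reads land uniformly at random (a Poisson approximation), which real libraries only approximate.

```python
import math


def reads_needed(genome_bp: int, read_len_bp: int, fold_coverage: int) -> int:
    """Reads required so that total sequenced bases = coverage x genome size."""
    return math.ceil(fold_coverage * genome_bp / read_len_bp)


def percent_of_target(read_len_bp: int, target_bp: int) -> float:
    """Share of a chromosome or genome represented by one read, in percent."""
    return 100 * read_len_bp / target_bp


def expected_uncovered_fraction(fold_coverage: float) -> float:
    """Idealized random-sampling estimate of the fraction of bases hit by no read."""
    return math.exp(-fold_coverage)


HUMAN_GENOME_BP = 3_000_000_000

print(reads_needed(HUMAN_GENOME_BP, 100, 10))    # 300 million reads
print(percent_of_target(300, HUMAN_GENOME_BP))   # about 0.00001 percent
print(expected_uncovered_fraction(10) * HUMAN_GENOME_BP)  # bases expected to be missed
```

Even in this idealized model a small fraction of bases receives no read at all at 10-fold average coverage, so the 300 million figure is best read as a lower bound.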

What are the goals of sequencing a genome? First, we strive to produce a consensus sequence that is a true and accurate representation of the genome, starting with one individual organism or standard strain from which the DNA was obtained. This sequence will then serve as a reference sequence for the species. We now know that there are many differences in DNA sequence between different individuals within a species and even between the maternally and paternally contributed genomes within a single diploid individual. Thus, no one genome sequence truly represents the genome of the entire species. Nonetheless, the genome sequence serves as a standard or reference with which other sequences can be compared, and it can be analyzed to determine the information encoded within the DNA, such as the inventory of encoded RNAs and polypeptides.

Like written manuscripts, genome sequences can range from draft quality (the general outline is there, but there are typographical errors, grammatical errors, gaps, sections that need rearranging, and so forth), to finished quality (a very low rate of typographical errors, some missing sections but everything that is currently possible has been done to fill in these sections), to truly complete (no typographical errors, every base pair absolutely correct from telomere to telomere). In the following sections, we will examine the strategy and some methods for producing draft and finished genome-sequence assemblies. We will also encounter some of the features of genomes that challenge genome-sequencing projects.

Whole-genome sequencing

The current general strategy for obtaining and assembling the sequence of a genome is called whole-genome shotgun (WGS) sequencing. This approach is based on determining the sequence of many segments of genomic DNA that have been generated by breaking the long chromosomes of DNA into many short segments. Two approaches to whole-genome shotgun sequencing are responsible for most genome sequences obtained to date. The fundamental differences between them are in how the short segments of DNA are obtained and prepared for sequencing and the sequencing chemistry employed. The first method, used to sequence the first human genome, relied on the cloning of DNA in microbial cells and employed the Sanger dideoxy sequencing technique. We will refer to this approach as “traditional WGS.” Methods in the second group are generally cell-free methods that employ new techniques for sequencing and are designed for very high throughput (referring to the number of reads per machine per unit time). We will refer to this group of methods as “next-generation WGS.”

Traditional WGS

The traditional WGS approach begins with the construction of genomic libraries, which are collections of these short segments of DNA, representing the entire genome. The short DNA segments in such a library have been inserted into one of a number of types of accessory chromosomes (nonessential elements such as plasmids, modified bacterial viruses, or artificial chromosomes) and propagated in microbes, usually bacteria or yeast. These accessory chromosomes carrying DNA inserts are called vectors.

Figure 14-3: End reads from multiple inserts may be overlapped to produce a contig
Figure 14-3: Sequencing reads are taken only of the ends of cloned inserts. The use of two different sequence-priming sites, one at each end of the vector, makes possible the sequencing of as many as 600 base pairs at each end of the genomic insert. If both ends of the same clone are sequenced, the two resulting sequence reads are called paired-end reads.

To generate a genomic library, a researcher first uses restriction enzymes, which cleave DNA at specific sequences, to cut up purified genomic DNA. Some enzymes cut the DNA at many places, whereas others cut it at fewer places; so the researcher can control whether the DNA is cut, on average, into longer or shorter pieces. The resulting fragments have short single strands of DNA at both ends. Each fragment is then joined to the DNA molecule of the accessory chromosome, which also has been cut with a restriction enzyme and which has ends that are complementary to those of the genomic fragments. In order for the entire genome to be represented, multiple copies of the genomic DNA are cut into fragments. By this means, thousands to millions of different fragment-vector recombinant molecules are generated.
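The effect of the recognition sequence on fragment size can be sketched in code. The example below is a simplification: it treats DNA as a single strand, ignores the single-stranded overhangs, and assumes a random sequence with equal base frequencies when estimating the average spacing of sites. The function names are hypothetical. The site GAATTC, cut after the first G, is the EcoRI recognition sequence.

```python
def digest(dna: str, site: str, cut_offset: int) -> list[str]:
    """Cut `dna` at every occurrence of `site`, `cut_offset` bases into the site."""
    fragments = []
    fragment_start = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[fragment_start:cut])
        fragment_start = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[fragment_start:])  # piece after the last cut
    return fragments


def expected_fragment_length(site_length: int) -> int:
    """Average spacing of a site in random DNA with equal base frequencies."""
    return 4 ** site_length


dna = "AAGAATTCGGGGAATTCTT"
print(digest(dna, "GAATTC", 1))     # EcoRI-like cut: G^AATTC
print(expected_fragment_length(4))  # a 4-bp site occurs about every 256 bp
print(expected_fragment_length(6))  # a 6-bp site occurs about every 4096 bp
```

Under these assumptions, an enzyme with a shorter recognition sequence cuts more often and yields shorter fragments on average, which is how the researcher can bias the library toward longer or shorter inserts.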

The resulting pool of recombinant DNA molecules is then propagated, typically by introducing the molecules into bacterial cells. Each cell takes up one recombinant molecule. Then each recombinant molecule is replicated in the normal growth and division of its host so that many identical copies of the inserted fragment are produced for use in analyzing the fragment’s DNA sequence. Because each recombinant molecule is amplified from an individual cell, the descendants of each cell form a distinct clone. (More details about DNA cloning are provided in Chapter 10.) The resulting library of clones is called a shotgun library because sequence reads are obtained from clones randomly selected from the whole-genome library without any information on where these clones map in the genome.

Next, the genome fragments in clones from the shotgun library are partially sequenced. The sequencing reaction must start from a primer of known sequence. Because the sequence of a cloned insert is not known (and is the goal of the exercise), primers are based on the sequence of adjacent vector DNA. These primers are used to guide the sequencing reaction into the insert. Hence, short regions at one or both ends of the genomic inserts can be sequenced (Figure 14-3). After sequencing, the output is a large collection of random short sequences, some of them overlapping. These sequence reads are assembled into a consensus sequence covering the whole genome by matching homologous sequences shared by reads from overlapping clones. The sequences of overlapping reads are assembled into units called sequence contigs (sequences that are contiguous, or touching).

Next-generation whole-genome shotgun sequencing

The goal of next-generation WGS is the same as that of traditional WGS—to obtain a large number of overlapping sequence reads that can be assembled into contigs. However, the methodologies used differ in several substantial ways from traditional WGS. Several different systems have been developed that, while they differ in their sequencing chemistry and machine design, each employ three strategies that have dramatically increased throughput:

  1. DNA molecules are prepared for sequencing in cell-free reactions, without cloning in microbial hosts.

  2. Millions of individual DNA fragments are isolated and sequenced in parallel during each machine run.
  3. Advanced fluid-handling technologies, cameras, and software make it possible to detect the products of sequencing reactions in extremely small reaction volumes.

Since the field of genomic technology is evolving rapidly, we will not describe every next-generation system. However, we will examine one widely used approach that employs all of these features. One of the first next-generation systems was developed by the 454 Life Sciences Corporation. This approach illustrates the gains that have been made in throughput and what such gains enable geneticists to do. The approach can be considered to have three stages:

Stage 1. A DNA template library of single-stranded DNA molecules is constructed.

Stage 2. The DNA molecules in the template library are amplified into many copies, not by growing colonies as for traditional genomic libraries, but by using the polymerase chain reaction (PCR; see Chapter 10). First, single molecules are immobilized on individual beads. The molecules are then amplified by PCR such that single-stranded DNA molecules remain attached to the beads. Thus, each bead contains many identical DNA fragments. Each bead is then deposited individually into wells of a very small volume in a device that hosts the sequencing reactions (Figure 14-4).

Figure 14-4: Pyrosequencing reactions take place on beads in tiny wells
Figure 14-4: (a) In the 454 sequencing system, single strands of DNA are replicated on tiny beads in preparation for sequencing. (b) The sequencing reactions of pyrosequencing take place in tiny wells arranged on plates. The many wells in a plate, and the very small reaction volumes, allow massively parallel sequencing of DNA at modest cost.
[(b) © 2010 The Regents of the University of California, Lawrence Berkeley National Laboratory.]

Stage 3. The sequencing of each bead is performed using a novel “sequencing-by-synthesis” chemistry termed pyrosequencing (Figure 14-5). DNA polymerase and a primer are added to the wells to prime the synthesis of a complementary DNA strand. Each of the four deoxyribonucleotides dATP, dGTP, dTTP, and dCTP is made to flow through all of the wells, one at a time, in a specific order. When a nucleotide is added that is complementary to the next base in the template strand in a given well, it is incorporated and the reaction releases a pyrophosphate molecule. Two enzymes, sulfurylase and luciferase, which are also present, then act to convert the pyrophosphate signal to a visible-light signal (see Figure 14-5). The light is detected by a special camera. Hence, growing DNA strands that have A as the first base after the primer will yield a signal only when dATP is made to flow through the well and not when the other deoxynucleotides are made to flow through. The reaction is repeated for at least 100 cycles, and the signals from each well over all of the cycles are integrated to generate the sequence reads from each well.

Figure 14-5: Pyrosequencing is based on detecting synthesis reactions
Figure 14-5: In the pyrosequencing process, nucleotides are sequentially added to form the complementary strand of the single-stranded template, to which a sequencing primer has been annealed. The reactions are carried out in the presence of the enzymes DNA polymerase, sulfurylase, and luciferase. One molecule of pyrophosphate (PPi) is released for every nucleotide incorporated into the growing strand by the DNA polymerase and is converted to ATP by sulfurylase. Visible light is produced from luciferin in a luciferase-catalyzed reaction that utilizes the ATP produced by sulfurylase.
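The way a series of light signals becomes a sequence read can be imitated with a small simulation of one well. This is a sketch under several assumptions that are not taken from the text: the flow order "TACG" is an arbitrary choice, the template is written in the order in which the polymerase copies it, and a run of identical template bases is assumed to incorporate several nucleotides in one flow, giving a proportionally stronger signal. The function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def pyrosequence(template: str, flow_order: str = "TACG",
                 cycles: int = 100) -> tuple[list[tuple[str, int]], str]:
    """Simulate one well: return (signal per nucleotide flow, resulting read)."""
    position = 0   # next template base to be copied
    flowgram = []  # (nucleotide flowed, number incorporated = signal strength)
    read = []
    for _ in range(cycles):
        for nucleotide in flow_order:
            incorporated = 0
            # Incorporation happens only while the flowed nucleotide pairs
            # with the next template base.
            while (position < len(template)
                   and COMPLEMENT[template[position]] == nucleotide):
                incorporated += 1
                position += 1
            flowgram.append((nucleotide, incorporated))
            read.append(nucleotide * incorporated)
        if position == len(template):
            break  # template fully copied
    return flowgram, "".join(read)


flowgram, read = pyrosequence("TTGCA")
print(flowgram)
print(read)
```

In this run the first flow of T gives no signal because the first template base is T, which pairs with A. The next flow of A gives a double-strength signal because two template T bases are copied in a row. Integrating the signals over all flows yields the read for that well.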

Other widely used platforms such as the Illumina sequencing systems and the Pacific Biosciences systems also detect the synthesis of DNA, but by different means. The Illumina system detects the incorporation of individual, fluorescently labeled dNTPs, while the Pacific Biosciences process detects bases being incorporated into a single, immobilized DNA molecule. The method chosen by investigators depends a great deal on the application. The Illumina system produces a larger number of shorter reads than the 454 system, while the Pacific Biosciences system provides the advantage of much longer individual reads than any other system, but with a higher error rate. The high throughput of each approach is the product of the massively parallel sequencing: several hundred thousand to more than 1 million reactions can be run simultaneously. Earlier sequencing machines were able to achieve just 384 sequencing reactions per run.

Whole-genome-sequence assembly

Whichever method of obtaining raw sequence is used, the challenge remains to assemble the contigs into the entire genome sequence. The difficulty of that process depends strongly on the size and complexity of the genome.

For instance, the genomes of bacterial species are relatively easy to assemble. Bacterial DNA is essentially single-copy DNA, with no repeating sequences. Therefore, any given DNA sequence read from a bacterial genome will come from one unique place in that genome. Owing to these properties, contigs within bacterial genomes can often be assembled into larger contigs representing most or all of the genome sequence in a relatively straightforward manner. In addition, a typical bacterial genome is only a few megabase pairs of DNA in size.

For eukaryotes, genome assembly often presents some difficulties. A big stumbling block is the existence of numerous classes of repeated sequences, some arranged in tandem and others dispersed. Why are they a problem for genome sequencing? In short, because a sequencing read of repetitive DNA fits into many places in the draft of the genome. Not infrequently, a tandem repetitive sequence is in total longer than the length of a maximum sequence read. In that case, there is no way to bridge the gap between adjacent unique sequences. Dispersed repetitive elements can cause reads from different chromosomes or different parts of the same chromosome to be mistakenly aligned together.
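The ambiguity caused by repeats can be made concrete by counting the places where a read fits. The toy sequence and the function name below are invented for illustration, and matching is exact and single-stranded for simplicity.

```python
def placements(genome: str, read: str) -> list[int]:
    """Return every start position at which `read` matches `genome` exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


# A toy "genome" in which the dispersed element GGATCC occurs three times.
repeat = "GGATCC"
genome = "ATTGCA" + repeat + "TTAACG" + repeat + "CATGAT" + repeat + "AAT"

print(placements(genome, "TTAACG"))  # read from single-copy DNA: one placement
print(placements(genome, repeat))    # read from the repeat: three placements
print(placements(genome, "CCTTAA"))  # read spanning a repeat boundary: one placement
```

A read that lies entirely within a repeat fits equally well at every copy, whereas a read that extends from the repeat into the adjacent unique sequence can be placed unambiguously. This is why repeats longer than the read length are the troublesome case.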

KEY CONCEPT

The landscape of eukaryotic chromosomes includes a variety of repetitive DNA segments. These segments are difficult to align as sequence reads.

Whole-genome shotgun sequencing is particularly good at producing draft-quality sequences of complex genomes with many repetitive sequences. As an example, we will consider the genome of the fruit fly D. melanogaster, which was initially sequenced by the traditional WGS method. The project began with the sequencing of libraries of genomic clones of different sizes (2 kb, 10 kb, 150 kb). Sequence reads were obtained from both ends of genomic-clone inserts and aligned by a logic identical to that used for bacterial WGS sequencing. Through this logic, sequence overlaps were identified and clones were placed in order, producing sequence contigs—consensus sequences for these single-copy stretches of the genome. However, unlike the situation in bacteria, the contigs eventually ran into a repetitive DNA segment that prevented unambiguous assembly of the contigs into a whole genome. The sequence contigs had an average size of about 150 kb. The challenge then was how to glue the thousands of such sequence contigs together in their correct order and orientation.

The solution to this problem was to make use of the pairs of sequence reads from opposite ends of the genomic inserts in the same clone—these reads are called paired-end reads. The idea was to find paired-end reads that spanned the gaps between two sequence contigs (Figure 14-6). In other words, if one end of an insert was part of one contig and the other end was part of a second contig, then this insert must span the gap between two contigs, and the two contigs were clearly near each other.

Figure 14-6: Paired-end reads may be used to join two sequence contigs
Figure 14-6: Paired-end reads can be used to join two sequence contigs into a single ordered and oriented scaffold.

Indeed, because the size of each clone was known (that is, it came from a library containing genomic inserts of uniform size, either the 2-kb, 10-kb, or 150-kb library), the distance between the end reads was known. Further, aligning the sequences of the two contigs by using paired-end reads automatically determines the relative orientation of the two contigs. In this manner, single-copy contigs could be joined together, albeit with gaps where the repetitive elements reside. These gapped collections of joined-together sequence contigs are called scaffolds (sometimes also referred to as supercontigs). Because most Drosophila repeats are large (3–8 kb) and widely spaced (one repeat approximately every 150 kb), this technique was extremely effective at producing a correctly assembled draft sequence of the single-copy DNA. A summary of the logic of this approach is shown in Figure 14-7.

Figure 14-7: Strategy for whole-genome shotgun sequencing assembly
Figure 14-7: In whole-genome shotgun sequencing, first, the unique sequence overlaps between sequence reads are used to build contigs. Paired-end reads are then used to span gaps and to order and orient the contigs into larger units, called scaffolds.
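The way one pair of end reads orders, orients, and spaces two contigs can be sketched as follows. The data structure, the field names, and the single-link logic are hypothetical simplifications: real scaffolding software weighs many links and tolerates variation in insert size, whereas this sketch trusts one paired-end link and an exact insert size.

```python
from dataclasses import dataclass


@dataclass
class EndRead:
    contig: str  # name of the contig the read aligns to
    start: int   # leftmost aligned coordinate on the contig (0-based)
    end: int     # rightmost aligned coordinate (exclusive)
    strand: str  # "+" if the read matches the contig as written, "-" if reversed


def link_contigs(upstream: EndRead, downstream: EndRead,
                 contig_lengths: dict[str, int], insert_size: int) -> dict:
    """Place two contigs in one scaffold from a single paired-end link.

    Paired-end reads point toward each other across the insert, so in a
    correctly oriented scaffold the upstream read is "+" and its mate is "-".
    """
    flip_upstream = upstream.strand == "-"
    flip_downstream = downstream.strand == "+"

    # Length of the insert that lies inside each contig, after any flipping.
    if flip_upstream:
        inside_upstream = upstream.end
    else:
        inside_upstream = contig_lengths[upstream.contig] - upstream.start
    if flip_downstream:
        inside_downstream = contig_lengths[downstream.contig] - downstream.start
    else:
        inside_downstream = downstream.end

    return {
        "order": [upstream.contig, downstream.contig],
        "flip": {upstream.contig: flip_upstream,
                 downstream.contig: flip_downstream},
        "gap_bp": insert_size - inside_upstream - inside_downstream,
    }


lengths = {"contigA": 50_000, "contigB": 30_000}
read_1 = EndRead("contigA", 46_000, 46_500, "+")
read_2 = EndRead("contigB", 3_000, 3_500, "-")
print(link_contigs(read_1, read_2, lengths, insert_size=10_000))
```

In this example, 4000 bp of the 10-kb insert lie in the first contig and 3500 bp lie in the second, so the gap between the contigs is estimated at 2500 bp and neither contig needs to be reversed. Had the second read aligned to the "+" strand, the sketch would flag the second contig for reversal before joining.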

Next-generation WGS does not circumvent the problem of repetitive sequences and gaps. Because this approach avoids the construction of clone libraries, which would otherwise facilitate the bridging of gaps between contigs via paired-end reads, researchers using next-generation WGS had to devise a way to bridge these gaps without building genomic libraries in vectors. One solution was to build a library of circularized genomic DNA fragments of desired sizes. The circularization allows short segments of previously distant sequences located at the ends of each fragment to be juxtaposed on either side of a linker sequence. Shearing these circular molecules and then amplifying and sequencing the linker-containing fragments produces paired-end reads equivalent to those obtained from sequencing of traditional genomic-library inserts (Figure 14-8).

Figure 14-8: Paired-end reads can be produced by circularization
Figure 14-8: Paired-end reads for high-throughput sequencing can be produced without genomic-library construction. The figure is based on the paired-end protocol of the Roche GS FLX Titanium Series, Roche Applied Science, Mannheim, Germany.

In both traditional and next-generation whole-genome shotgun sequencing, some gaps usually remain. Specific procedures targeted to individual gaps must be used to fill the missing data in the sequence assemblies. If the gaps are short, missing fragments can be generated by using the known sequences at the ends of the assemblies as primers to amplify and analyze the genomic sequence in between. If the gaps are longer, attempts can be made to isolate the missing sequences as parts of larger inserts that have been cloned into a vector, and then to sequence the inserts.

Whether a genome is sequenced to “draft” or “finished” standards is a cost–benefit judgment. It is relatively straightforward to create a draft but very hard to complete a finished sequence.