Sequencing the Genome

INTRODUCTION

More than a thousand researchers throughout the world contributed to the first project to sequence the human genome. Their task, which they successfully completed in 2003, was to decode the DNA from each of 23 pairs of human chromosomes. Two groups undertook this sequencing task. One large, multinational, publicly funded group adopted an approach called hierarchical sequencing. This group's work is better known as the Human Genome Project. The other group, at a privately funded company, took an approach called shotgun sequencing.

This animated tutorial describes the methods used by the two groups, but keep in mind that technologies have changed a lot since then. The Human Genome Project was relatively slow, expensive, and labor intensive. It took 13 years and $2.7 billion to sequence one genome! In contrast, at the time of this writing, researchers can sequence a human genome in just a few days for several thousand dollars.

ANIMATION SCRIPT

Hierarchical Sequencing

In the hierarchical sequencing method, researchers begin by collecting cells. In humans, each cell contains 23 pairs of chromosomes. Here we specifically track the DNA from just one of the 23 pairs.

Chromosomes have a series of unique DNA sequences, called sequence-tagged sites (STSs), that are already known to researchers. These sequence-tagged sites serve as chromosomal landmarks during later stages of the hierarchical method.

The chromosomal DNA is purified, and the other cellular components are removed. The compacted chromosomes uncoil to form long strings of DNA.

Enzymes are used to cut the DNA into relatively large fragments, each about 250,000 base pairs (bp) in length. We will focus on the subset of these fragments (indicated in blue) that are derived from the chromosome on the right. The fragments contain the same sequence-tagged sites as the original chromosome.

The fragments are cloned into vectors, called bacterial artificial chromosomes, or BACs. The BACs allow the fragments to be replicated in bacteria, which provides enough sample material for the rest of the analysis.

Researchers can start from the ends of the vector to identify nearby sequences in the human DNA fragment. If these sequences are unique in the genome, they serve as additional sequence-tagged sites. These sequence-tagged sites can be mapped on the chromosome and used to determine whether the DNA fragments overlap.
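
To make the overlap test concrete, here is a minimal Python sketch. It assumes that each cloned fragment has already been scored for the sequence-tagged sites it contains; the clone names (BAC_A, BAC_B, and so on) and marker names (STS1, STS2, and so on) are invented for illustration and do not come from the animation. Two fragments are treated as overlapping when they share at least one marker.

```python
from itertools import combinations

# Hypothetical BAC clones and the STS markers detected on each one.
# All names here are made up for illustration.
bac_sts = {
    "BAC_A": {"STS1", "STS2"},
    "BAC_B": {"STS2", "STS3"},
    "BAC_C": {"STS3", "STS4"},
    "BAC_D": {"STS7"},
}

def overlapping_pairs(clones):
    """Return clone pairs that share at least one STS, with the shared markers."""
    pairs = []
    for a, b in combinations(sorted(clones), 2):
        shared = clones[a] & clones[b]  # set intersection of marker names
        if shared:
            pairs.append((a, b, sorted(shared)))
    return pairs

for a, b, shared in overlapping_pairs(bac_sts):
    print(f"{a} and {b} share {shared}")
```

In this toy data set, BAC_A, BAC_B, and BAC_C form a chain of overlapping fragments, while BAC_D shares no marker with the others and would need additional markers to be placed.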

The hierarchical sequencing method is directed at sequencing these large, overlapping fragments. Yet, before a large fragment can be sequenced, many copies of it are first randomly cut into smaller fragments.

After a multistep process of isolating and replicating the small fragments, the DNA of each fragment is sequenced. Although each fragment is sequenced from end to end, we color only small areas to indicate identical sequences among fragments.

Computers store the sequence data, and, using computer algorithms, compare the data for overlapping sequences. These common sequences allow the computer to merge information from more than one fragment into one long DNA sequence.
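
The merging step can be illustrated with a small Python sketch. This is a simple greedy approach chosen for illustration, not necessarily the algorithm used by either sequencing group: it repeatedly joins the two reads whose end and start share the longest identical sequence. The function names, the min_len parameter, and the three short reads are all assumptions made for this example; real reads are far longer and contain errors that this sketch ignores.

```python
def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)  # (overlap size, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap_length(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps remain; the reads stay as separate pieces
        merged = reads[i] + reads[j][size:]  # keep the shared region only once
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Toy reads, invented for illustration.
reads = ["ATGCGTAC", "GTACCTGA", "CTGATTGC"]
print(greedy_assemble(reads))
```

The min_len threshold guards against merging reads on the basis of a very short match that could occur by chance. Repetitive sequences can still mislead a greedy approach like this one.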

Each overlapping large fragment is analyzed until the entire genome sequence is determined.

Shotgun Sequencing

In the shotgun sequencing method, researchers begin by collecting cells. In humans, each cell contains 23 pairs of chromosomes. Here we will track the DNA from just one of the 23 pairs of human chromosomes.

The DNA is purified, and the other cellular components are removed. The compacted chromosomes uncoil to form long strings of DNA.

The DNA is sheared into millions of 500 to 800 base pair (bp) fragments. We will focus on the subset of these fragments that are derived from the chromosome on the right. These fragments are indicated in blue.
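
The following Python sketch is a toy model of this shearing step. It assumes that several copies of the same DNA are cut at different random positions, so that pieces from different copies overlap one another. The function name, the number of copies, the 10,000 bp random "chromosome," and the random seeds are all assumptions made for illustration; only the 500 to 800 bp fragment range comes from the text. Each fragment's start position is recorded here only so the result can be inspected. In a real experiment that position is unknown, which is why the fragments must later be pieced together by computer.

```python
import random

def shear(sequence, copies, min_len=500, max_len=800, seed=0):
    """Cut each copy of `sequence` into consecutive pieces of random length.

    Piece lengths are drawn from min_len..max_len; the last piece of each
    copy may be shorter because it simply runs to the end of the sequence.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(copies):
        pos = 0
        while pos < len(sequence):
            size = rng.randint(min_len, max_len)
            fragments.append((pos, sequence[pos:pos + size]))
            pos += size
    return fragments

# A random 10,000 bp sequence stands in for a chromosome (an assumption).
base_rng = random.Random(1)
toy_chromosome = "".join(base_rng.choice("ACGT") for _ in range(10_000))

fragments = shear(toy_chromosome, copies=5)
total_bp = sum(len(seq) for _, seq in fragments)
print(len(fragments), "fragments; each base is covered",
      total_bp / len(toy_chromosome), "times on average")
```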

After a multistep process of isolating and replicating the fragments, the DNA of each fragment is sequenced.

Computers store the sequence data and, using sophisticated algorithms, compare the data for identical sequences. These common sequences, indicated by small bands of color, allow the computer to merge information from more than one fragment.

As all of the fragments are sequenced and their common regions identified, researchers can piece together the sequences of entire chromosomes.

CONCLUSION

The following are just some of the interesting facts that we have learned about the human genome:

• Of the 3.2 billion bp in the haploid human genome, an estimated 1.2 percent makes up the protein-coding regions of about 21,000 genes. This was a surprise: before sequencing began, humans were estimated to have 80,000–150,000 genes.

• The average gene has 27,000 bp. Gene sizes vary greatly, from about 1,000 bp to 2.4 million bp. Variation in gene size was expected given that human gene products vary in size: polypeptide chains range from about 100 to about 5,000 amino acids, and RNAs vary in length as well.

• Virtually all human genes have many introns.

• About half of the genome is made up of transposons and other highly repetitive sequences.

• When the genomes of two unrelated individuals are compared, most of the sequence—about 99.5 percent—is identical. Despite this apparent homogeneity, there are many differences, and as more genomes are sequenced, more variants are found. Current estimates suggest that each haploid genome contains about 3.3 million single nucleotide polymorphisms (SNPs), so these account for about one-fifth of the variation between two individuals. The remaining four-fifths are due to copy number variation: differences in sequence copy number that have arisen through chromosomal deletions, duplications, or translocations or through duplications caused by transposons.

• Genes are not evenly distributed over the genome. Chromosome 19 is packed densely with genes, whereas chromosome 8 has long stretches without coding regions. The Y chromosome has the fewest genes (about 230), and chromosome 1 has the most (about 3,000).
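
Several of the figures above can be cross-checked with simple arithmetic. The short Python sketch below uses only numbers quoted in this list; the variable names are ours. It makes two simplifying assumptions: that the 99.5 percent identity can be converted directly into a count of differing base pairs, and that each SNP counts as one differing base pair.

```python
genome_bp = 3_200_000_000    # haploid genome size in bp
coding_fraction = 0.012      # 1.2 percent is protein-coding
gene_count = 21_000          # approximate number of genes
identical_fraction = 0.995   # share of sequence identical between two people
snp_count = 3_300_000        # SNPs per haploid genome

# Total protein-coding sequence, and the average amount per gene.
coding_bp = genome_bp * coding_fraction
coding_bp_per_gene = coding_bp / gene_count

# Base pairs that differ between two unrelated individuals.
differing_bp = genome_bp * (1 - identical_fraction)

# Share of that difference attributable to SNPs (simplifying assumption:
# one SNP equals one differing base pair).
snp_share = snp_count / differing_bp

print(f"coding sequence: {coding_bp / 1e6:.1f} million bp")
print(f"coding sequence per gene: {coding_bp_per_gene:.0f} bp")
print(f"differing sequence: {differing_bp / 1e6:.1f} million bp")
print(f"share due to SNPs: {snp_share:.2f}")
```

Under these assumptions, the coding sequence comes to about 38.4 million bp, or roughly 1,800 bp per gene. That is far less than the 27,000 bp average gene size, which fits with the observation that genes contain many introns. The differing sequence comes to about 16 million bp, of which 3.3 million SNPs are about 0.21, in line with the "about one-fifth" figure.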