14.6 Comparative Genomics and Human Medicine

The human species, Homo sapiens, originated in Africa approximately 200,000 years ago. Around 60,000 years ago, populations left Africa and migrated across the world, eventually populating five additional continents. These migrating populations encountered different climates, adopted different diets, and combated different pathogens in different parts of the world. Much of the recent evolutionary history of our species is recorded in our genomes, as are the genetic differences that make individuals or populations more or less susceptible to disease.

Overall, any two unrelated humans’ genomes are 99.9 percent identical. That difference of just 0.1 percent still corresponds to roughly 3 million bases. The challenge today is to decipher which of those base differences are meaningful with respect to physiology, development, or disease.
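The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes a genome of about 3200 megabases (the total quoted later in this section), and the variable names are illustrative.

```python
# Rough estimate of how many bases differ between two unrelated human genomes.
GENOME_SIZE_BP = 3_200_000_000   # ~3200 Mb, an assumed round figure
DIFFERENCES_PER_THOUSAND = 1     # 99.9 percent identity = 1 difference per 1000 bases

# Integer arithmetic avoids floating-point rounding in the estimate.
differing_bases = GENOME_SIZE_BP * DIFFERENCES_PER_THOUSAND // 1000

print(f"{differing_bases:,} differing bases")
```

The result, a little over 3 million bases, matches the "roughly 3 million" quoted above.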

Once the first human genome sequence was completed, that accomplishment opened the door to much more rapid and less costly analysis of other individuals. With a known genome assembly as a reference, it is much easier to align the raw sequence reads of additional individuals and to design approaches for studying and comparing parts of the genome.

One of the first and greatest surprises to emerge from comparing individual human genomes is that humans differ not merely at one base in a thousand, but also in the number of copies of parts of individual genes, entire genes, or sets of genes. These copy number variations (CNVs) include repeats and duplications that increase copy number and deletions that reduce copy number. Between any two unrelated individuals, there may be 1000 or more segments of DNA greater than 500 bp in length that differ in copy number. Some CNVs are quite large and span more than 1 million base pairs.
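To make the definition concrete, the sketch below compares copy number between two individuals, segment by segment. It reports segments that are longer than 500 bp and differ in copy number. The segment names, lengths, and copy numbers are made-up toy values, and `Segment` and `copy_number_variants` are hypothetical names chosen for illustration.

```python
from typing import List, NamedTuple

class Segment(NamedTuple):
    name: str       # hypothetical segment label
    length_bp: int  # segment length in base pairs
    copies_a: int   # copy number in individual A
    copies_b: int   # copy number in individual B

# Made-up toy data, for illustration only.
segments = [
    Segment("seg1", 12_000, 2, 2),       # same copy number in both
    Segment("seg2", 800, 2, 3),          # extra copy in B
    Segment("seg3", 1_500_000, 2, 1),    # large deletion in B
    Segment("seg4", 300, 2, 4),          # differs, but below the size cutoff
]

MIN_LENGTH_BP = 500

def copy_number_variants(segs: List[Segment], min_length: int = MIN_LENGTH_BP) -> List[Segment]:
    """Return segments longer than min_length whose copy number differs."""
    return [s for s in segs if s.length_bp > min_length and s.copies_a != s.copies_b]

cnvs = copy_number_variants(segments)
for s in cnvs:
    kind = "gain" if s.copies_b > s.copies_a else "loss"
    print(f"{s.name}: {s.length_bp:,} bp, {s.copies_a} -> {s.copies_b} copies ({kind} in B)")
```

In practice, copy number has to be inferred from data such as sequencing read depth. This sketch starts from copy numbers that are already known.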

How such copy number variations may play a role in human evolution and disease is of intense interest. One case where increased copy number appears to have been adaptive concerns diet. People with high-starch diets have, on average, more copies of the gene encoding salivary amylase (an enzyme that breaks down starch) than people with traditionally low-starch diets. In other cases, copy number variations have been associated with conditions such as autism.

The exome and personalized genomics

WHAT GENETICISTS ARE DOING TODAY

Advances in sequencing technologies have reduced the cost of sequencing individual genomes from about $300 million in 2000, to $1 million in 2008, to about $5000 in 2013. But for many large-scale studies, that figure is still prohibitive. For some applications, it is more practical and cost-effective, and can be just as informative, to sequence only part of the genome. For example, since many disease-causing mutations occur in coding sequences, strategies have been designed to sequence all of the exons, or the “exome,” of individuals, as was done in the case of Nicholas Volker.

The strategy for exome sequencing involves generating a library of genomic DNA that is enriched for exon sequences (Figure 14-18). The DNA is prepared by (1) shearing genomic DNA into short pieces and denaturing them into single strands, (2) hybridizing the single-stranded pieces to biotin-labeled probes complementary to exonic regions and purifying the biotin-labeled duplexes, (3) amplifying the exon-rich duplexes, and (4) sequencing the exon-rich duplexes. In this manner, 30 to 60 megabases of the human genome are targeted for sequencing, as opposed to the 3200 megabases of total sequence.

Figure 14-18: Exome sequencing
Figure 14-18: In order to sequence just the exon fraction of the genome, genomic DNA is fragmented and denatured, and exon-containing fragments are hybridized with biotin-labeled probes. Duplexes containing annealed probes are then purified and prepared for sequencing.
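The saving from targeting only exons follows from the sizes quoted above. As a minimal sketch, using only those figures (the variable names are illustrative), the targeted fraction of the genome works out as follows:

```python
# Fraction of the genome targeted by exome sequencing, using the figures in the text.
GENOME_MB = 3200               # total sequence, in megabases
EXOME_TARGETS_MB = (30, 60)    # lower and upper bounds of the exome target

fractions = [target / GENOME_MB for target in EXOME_TARGETS_MB]

for target, fraction in zip(EXOME_TARGETS_MB, fractions):
    print(f"{target} Mb target = {fraction:.1%} of the genome")
```

By these figures, the exome is only about 1 to 2 percent of the genome. This helps explain why an exome can be sequenced far more cheaply than a whole genome.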

As of late 2013, the exomes of more than 100,000 individuals had been sequenced, at a cost of only ~$500 per exome. One particularly important use of exome sequencing is to identify de novo mutations (mutations that are present in an individual but in neither parent). Such mutations are responsible for many spontaneously appearing genetic diseases whose origins would not be revealed by traditional pedigree-based studies. As such, whole-exome sequencing is now a rapidly spreading clinical diagnostic tool.
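The core logic of a de novo search is a set difference: keep the child's variants that appear in neither parent. The sketch below shows only that step, on made-up variants. The function name `de_novo_candidates` and the tuple representation of a variant are assumptions for illustration. Real analyses must also account for sequencing errors and uneven coverage, which this sketch ignores.

```python
from typing import Set, Tuple

# A variant is (chromosome, position, reference base, alternate base).
Variant = Tuple[str, int, str, str]

def de_novo_candidates(child: Set[Variant],
                       mother: Set[Variant],
                       father: Set[Variant]) -> Set[Variant]:
    """Return variants found in the child but in neither parent."""
    return child - mother - father

# Made-up toy variant calls, for illustration only.
mother = {("chr1", 1000, "A", "G"), ("chr3", 3000, "T", "C")}
father = {("chr1", 1000, "A", "G"), ("chr4", 4000, "G", "A")}
child = {
    ("chr1", 1000, "A", "G"),   # inherited; present in both parents
    ("chr4", 4000, "G", "A"),   # inherited from the father
    ("chr2", 2000, "C", "T"),   # in neither parent: de novo candidate
}

candidates = de_novo_candidates(child, mother, father)
print(sorted(candidates))
```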

And just as exome sequencing can be used to identify genetic differences between individuals, it can also be used to identify differences between normal and abnormal cells, such as cancer cells. Cancer is a suite of genetic diseases in which combinations of gene mutations typically contribute to the loss of growth control and metastasis. Understanding what genetic changes are common to particular cancers, or to subsets of cancers, will not only further our understanding of cancer, but also promises to impact diagnosis and treatment in powerful ways. Researchers across the world are collaborating to create an “atlas” of cancer genomes that compiles our expanding knowledge of the genetic mutations associated with many cancers. (See http://cancergenome.nih.gov/ for further information.)

The ability to rapidly analyze organisms’ genomes is also impacting other dimensions of medicine. We will look at one such case next.

Comparative genomics of nonpathogenic and pathogenic E. coli

Escherichia coli are found in our mouths and intestinal tracts in vast numbers, and this species is generally a benign symbiont. Because of the species' central role in genetics research, its genome was one of the first bacterial genomes sequenced. The E. coli genome is about 4.6 Mb in size and contains 4405 genes. However, calling it “the E. coli genome” is really not accurate. The first genome sequenced was derived from the common laboratory E. coli strain K-12. Many other E. coli strains exist, including several important to human health.

In 1982, a multistate outbreak of human disease was traced to the consumption of undercooked ground beef. The E. coli strain O157:H7 was identified as the culprit, and it has since been associated with a number of large-scale outbreaks of infection. In fact, there are an estimated 75,000 cases of E. coli infection annually in the United States. Although most people recover from the infection, a fraction develop hemolytic uremic syndrome, a potentially life-threatening kidney disease.

To understand the genetic bases of pathogenicity, the genome of an E. coli O157:H7 strain has been sequenced. The O157 and K-12 strains have a backbone of 3574 protein-coding genes in common, and the average nucleotide identity among orthologous genes is 98.4 percent, comparable to that of human and chimpanzee orthologs. About 25 percent of the E. coli orthologs encode identical proteins, similar to the 29 percent for human and chimpanzee orthologs.
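Nucleotide identity between two orthologs is computed from an alignment of their sequences. As a minimal sketch, the hypothetical function `percent_identity` below counts matching bases over the aligned columns and skips any column that contains a gap. Other conventions for handling gaps exist, and the sequences are made up for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned, ungapped columns in which the two bases are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip alignment gaps (one possible convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

# Made-up aligned fragments that differ at one of ten positions.
identity = percent_identity("ATGGCCAAGT", "ATGGCTAAGT")
print(f"{identity:.1f} percent identical")
```

A genome-wide figure such as 98.4 percent is an average of values like this one over many pairs of orthologous genes.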

Despite the similarities in many proteins, the genomes and proteomes differ enormously in content. The E. coli O157 genome encodes 5416 genes, whereas the E. coli K-12 genome encodes 4405 genes. The E. coli O157 genome contains 1387 genes that are not found in the K-12 genome, and the K-12 genome contains 528 genes not found in the O157 genome. Comparison of the genome maps reveals that the backbones common to the two strains are interspersed with islands of genes specific to either K-12 or O157 (Figure 14-19).

Figure 14-19: Two E. coli strains contain islands of genes specific to each strain
Figure 14-19: The circular genome maps of E. coli strains K-12 and O157:H7. The circle depicts the distribution of sequences specific to each strain. The colinear backbone common to both strains is shown in blue. The positions of O157:H7-specific sequences are shown in red. The positions of K-12-specific sequences are shown in green. The positions of O157:H7- and K-12-specific sequences at the same location are shown in tan. Hypervariable sequences are shown in purple.
[Data from N. T. Perna et al., “Genome Sequence of Enterohaemorrhagic Escherichia coli O157:H7,” Nature 409, 2001, 529-533. Courtesy of Guy Plunkett III and Frederick Blattner.]
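Comparing gene content between two strains amounts to set operations on their gene lists. The sketch below uses made-up gene labels (`g1`, `g2`, and so on) to show the backbone and the strain-specific sets. It then uses the gene counts quoted in the text to express the strain-specific genes as a fraction of each genome. In real comparisons, shared genes are identified by sequence similarity, not by matching names.

```python
# Hypothetical gene labels, for illustration only.
k12_genes = {"g1", "g2", "g3", "g4", "g5"}
o157_genes = {"g1", "g2", "g3", "g6", "g7", "g8"}

backbone = k12_genes & o157_genes    # genes shared by both strains
k12_only = k12_genes - o157_genes    # genes found only in K-12
o157_only = o157_genes - k12_genes   # genes found only in O157

print("backbone:", sorted(backbone))
print("K-12 only:", sorted(k12_only))
print("O157 only:", sorted(o157_only))

# Strain-specific fractions, using the gene counts reported in the text.
O157_TOTAL, O157_SPECIFIC = 5416, 1387
K12_TOTAL, K12_SPECIFIC = 4405, 528

o157_fraction = O157_SPECIFIC / O157_TOTAL
k12_fraction = K12_SPECIFIC / K12_TOTAL
print(f"O157-specific: {o157_fraction:.1%} of O157 genes")
print(f"K-12-specific: {k12_fraction:.1%} of K-12 genes")
```

By these counts, about a quarter of the O157 genes have no counterpart in K-12, and about one in eight K-12 genes has no counterpart in O157.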

Among the 1387 genes specific to E. coli O157 are many genes that are suspected to encode virulence factors, including toxins, cell-invasion proteins, adherence proteins, and secretion systems for toxins, as well as possible metabolic genes that may be required for nutrient transport, antibiotic resistance, and other activities that may confer the ability to survive in different hosts. Most of these genes were not known before sequencing and would not be known today had researchers relied solely on E. coli K-12 as a guide to all E. coli.

The surprising level of diversity between two members of the same species shows how dynamic genome evolution can be. Most new genes in E. coli strains are thought to have been introduced by horizontal transfer from the genomes of viruses and other bacteria (see Chapter 5). Differences can also evolve owing to gene deletion. Other pathogenic E. coli and bacterial species also exhibit many differences in gene content from their nonpathogenic cousins. The identification of genes that may contribute directly to pathogenicity opens new avenues to the understanding, prevention, and treatment of infectious disease.