An annotated genome summarizes knowledge, guides research, and reveals evolutionary relationships among organisms.

Genome annotation, which aims to identify all the functional and repeat sequences present in the genome, is an imperfect science. Even in a well-annotated genome, some protein-coding sequences or other important features may be overlooked, and occasionally the annotation of a sequence motif is incorrect.

Because researchers often have to rely on sequence motifs alone rather than experimental data, their descriptions may be vague. For example, a common annotation in large genomes is “hypothetical protein.” In some cases, such as the genome of the malaria parasite, this type of annotation accounts for about 50% of the possible protein-coding genes. Such an annotation gives no hint of what the protein may do or even whether it is actually produced, because the annotation rests solely on the presence of a putative open reading frame (ORF) in the genomic sequence, not on the detection of actual mRNA or protein. Other annotations might be “DNA-binding protein,” “possible hairpin RNA,” or “tyrosine kinase,” with no additional detail about these motifs’ functions or relationships to other sequences. In short, although some genome annotations summarize experimentally verified facts, many others are hypotheses and guides to future research.
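To see how little evidence can stand behind a “hypothetical protein” label, consider a minimal sketch of an ORF scan. It is an illustration only, not the method used by any particular annotation pipeline: the function name `find_orfs`, the `min_codons` cutoff, and the toy sequence are assumptions made for this example, and the scan covers only the forward strand and ignores introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG-to-stop run on the forward strand.

    Coordinates are 0-based, and `end` is exclusive and includes the stop codon.
    `min_codons` (hypothetical cutoff) counts every codon from ATG through the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the candidate and keep scanning
    return orfs

# Toy sequence invented for illustration.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

A hit from a scan like this shows only that the DNA could encode a protein. Nothing in the output indicates that the sequence is transcribed or translated, which is why such an annotation remains a hypothesis until mRNA or protein is detected.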

Genome sequences contain information about ancestry and evolution, and so comparisons among genomes can reveal how different species are related. For example, the sequence of the human genome is significantly more similar to that of the chimpanzee than to that of the gorilla, indicating a more recent common ancestry of humans and chimpanzees (Chapter 1).
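The idea of one sequence being “more similar” to a second than to a third can be made concrete with a percent-identity calculation. The sketch below assumes the sequences have already been aligned to equal length, with `-` marking gaps. The function name `percent_identity` and the three short sequences are invented for illustration and are not real genomic data.

```python
def percent_identity(a, b):
    """Percentage of ungapped aligned columns in which two sequences match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

# Hypothetical aligned fragments from three species.
reference = "ATGCCGTACG"
close_relative = "ATGCCGTACA"
distant_relative = "ATGACGTTCA"

print(percent_identity(reference, close_relative))    # 90.0
print(percent_identity(reference, distant_relative))  # 70.0
```

In this toy case the higher identity of the first pair is what would suggest a more recent common ancestor. Real comparisons use far longer alignments and statistical models of sequence change rather than a raw identity count.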

Analysis of the similarities and differences in protein-coding genes and other types of sequence in the genomes of different species is an area of study called comparative genomics. Such studies help us understand how genes and genomes evolve. They can also guide genome annotation because the sequences of important functional elements are often very similar among genomes of different organisms. Sequences that are similar in different organisms are said to be conserved. A sequence motif that is conserved is likely to be important, even if its function is unknown, since it has changed very little over evolutionary time.
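Conservation can be scored position by position across several aligned genomes. The following is a minimal sketch under stated assumptions: the sequences are pre-aligned and of equal length, a gap character is treated like any other symbol, and the function names, the `threshold` parameter, and the toy alignment are all invented for this example.

```python
def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences
    that carry the most common symbol in that column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        top_count = max(column.count(symbol) for symbol in set(column))
        scores.append(top_count / len(column))
    return scores

def conserved_positions(aligned, threshold=1.0):
    """0-based column indices whose conservation score meets the threshold."""
    scores = column_conservation(aligned)
    return [i for i, score in enumerate(scores) if score >= threshold]

# Hypothetical alignment of the same region from three species.
alignment = ["ATGCCA", "ATGACA", "ATGTCA"]
print(conserved_positions(alignment))  # [0, 1, 2, 4, 5]
```

Here every column except the fourth is identical in all three sequences. In a real analysis, a long run of such columns in distantly related species would mark the region as a candidate functional element worth annotating, even before its function is known.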