Genome annotation includes searching for sequence motifs.

FIG. 13.4 Some common sequence motifs useful in genome annotation. (a) An open reading frame; (b) a noncoding RNA molecule; (c) transcription factor binding sites.

Because genome annotation is essentially pattern recognition, it begins with the identification of sequence motifs: telltale sequences of nucleotides that indicate what type of function (or absence of function) may be encoded in a particular region of the genome (Fig. 13.4). Sequence motifs can be found in the DNA itself or in the RNA sequence inferred from the DNA sequence. Once identified, sequence motifs are typically confirmed by experimental methods. One sequence motif already encountered in Chapter 3 is the promoter, a sequence where RNA polymerase and associated proteins bind to the DNA to initiate transcription.

Another example of a sequence motif is an open reading frame (ORF) (Fig. 13.4a). The motif for an open reading frame is a long string of nucleotides that, if transcribed and processed into messenger RNA, would yield a run of consecutive codons for amino acids uninterrupted by a stop codon. The presence of an ORF motif by itself is enough to annotate the DNA segment as potentially protein coding. The qualifier “potentially” is necessary because ORFs identified in a DNA sequence do not necessarily code for protein. For this reason they are often called putative ORFs. A putative ORF may exist merely by chance: because 3 of the 64 possible codons are stop codons, even a random sequence of nucleotides (with the four nucleotides equally frequent) will contain ORFs averaging about 21 codons in length. Alternatively, a putative ORF may not be transcribed at all, or, if it is transcribed, it may lie within a noncoding RNA or within an intron of a protein-coding RNA. In the next section, we discuss how the analysis of messenger RNA sequences can determine whether a putative ORF is an actual ORF in DNA coding for protein.
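
The search for putative ORFs can be illustrated with a short Python sketch. It follows the definition used above (a run of codons with no stop codon) and does not require a start codon. The function name `find_putative_orfs` and the `min_codons` threshold are hypothetical choices made for illustration, and the sketch scans only the strand it is given, whereas an actual annotation program would also examine the complementary strand.

```python
# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# With equal base frequencies, 3 of 64 codons are stops, so a stop
# codon is expected about once every 64/3 (roughly 21) codons.
EXPECTED_CODONS_BETWEEN_STOPS = 64 / 3


def find_putative_orfs(dna, min_codons=30):
    """Return (start, end, frame) for each stop-free run of codons.

    start is inclusive and end is exclusive (0-based positions).
    Only runs of at least min_codons codons are reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((run_start, pos, frame))
                run_start = pos + 3  # next run begins after the stop
        # Handle the run that reaches the end of the sequence.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            orfs.append((run_start, end, frame))
    return orfs


if __name__ == "__main__":
    example = "ATG" + "GCT" * 5 + "TAA" + "GCT" * 2
    print(find_putative_orfs(example, min_codons=5))
```

Because the threshold decides how many chance ORFs are reported, setting `min_codons` well above 21 is one simple way to reduce the number of putative ORFs that arise by chance alone.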

Fig. 13.4b shows another type of sequence motif, this one present in a hypothetical RNA transcript inferred from the DNA sequence. The nucleotide sequence at one end of the RNA is complementary, when read in the opposite direction, to that at the other end, so the single-stranded molecule is able to fold back on itself and undergo base pairing to form a hairpin-shaped structure. Such hairpin structures are characteristic of certain types of RNA that function in gene regulation (Chapter 19). The DNA from which this RNA is transcribed has complementary sequences at its two ends as well.
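
The complementarity that allows a hairpin to form can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `hairpin_stem_length` and the `min_loop` parameter are hypothetical, only A-U and G-C pairs are counted, and the pairing must begin at the very ends of the molecule. Programs that predict RNA structure use considerably more elaborate rules.

```python
# Base-pairing partners in RNA (A-U and G-C pairs only).
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def hairpin_stem_length(rna, min_loop=3):
    """Count consecutive base pairs formed between the two ends of rna.

    Pairing proceeds inward from both ends and stops at the first
    mismatch, or when fewer than min_loop unpaired bases would be
    left to form the loop.
    """
    rna = rna.upper()
    i, j = 0, len(rna) - 1
    stem = 0
    while j - i - 1 >= min_loop and RNA_PAIRS.get(rna[i]) == rna[j]:
        stem += 1
        i += 1
        j -= 1
    return stem


if __name__ == "__main__":
    # GGGC at one end can pair with GCCC at the other end.
    print(hairpin_stem_length("GGGCAAAAGCCC"))
```

A result of zero means the two ends cannot pair at all, so the sequence would not be flagged as a candidate hairpin by this simple test.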

Some sequence motifs are detected directly in the double-stranded DNA. Fig. 13.4c shows two copies of a short sequence that is a known binding site for DNA-binding proteins called transcription factors (Chapter 3), which help initiate transcription. Transcription factor binding sites are often present in multiple copies and on either strand of the DNA. Sometimes they are located near the region of a gene where transcription is initiated because the transcription factor helps determine when the gene will be transcribed. However, they can also be located far upstream of the gene, downstream of the gene, or in introns, which makes them difficult to identify.
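
Because a binding site can lie on either strand, a search must look for both the motif and its reverse complement. The Python sketch below shows the idea; the function name `find_binding_sites` and the motif used in the example are placeholders chosen for illustration, not sequences taken from Fig. 13.4c. Exact matching is also a simplification, since real binding sites usually tolerate some variation in sequence.

```python
# Translation table mapping each DNA base to its complement.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]


def find_binding_sites(dna, motif):
    """Return (position, strand) for each exact match of motif.

    Matches to the motif itself are reported as "+", and matches to
    its reverse complement as "-". A motif that equals its own
    reverse complement is reported on the "+" strand only.
    """
    dna = dna.upper()
    motif = motif.upper()
    rc_motif = reverse_complement(motif)
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc_motif:
            hits.append((i, "-"))
    return hits


if __name__ == "__main__":
    # One copy of the placeholder motif on each strand.
    print(find_binding_sites("AATGACTCAAGAGTCATT", "TGACTC"))
```

Note that the search reports every match wherever it occurs, which reflects the difficulty described above: a site far from any gene looks the same in the sequence as one next to a promoter.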