Genes Can Be Identified Within Genomic DNA Sequences
The complete genomic sequence of an organism contains within it the information needed to deduce the sequence of every protein made by the cells of that organism. For organisms such as bacteria and yeast, whose genomes have few introns and short intergenic regions, most protein-coding sequences can be found simply by scanning the genomic sequence for open reading frames (ORFs) of significant length. An ORF is usually defined as a stretch of DNA containing at least 100 codons that begins with a start codon and ends with a stop codon. Because 3 of the 64 possible codons are stop codons, the probability that a random DNA sequence will contain no stop codons for 100 codons in a row is very small (roughly 0.8 percent if all codons are equally likely), so most ORFs of this length encode proteins.
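The ORF scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function names find_orfs and prob_no_stop are hypothetical, only the three forward reading frames of one strand are scanned (a real scan would also cover the reverse complement), the standard genetic code is assumed, and the probability estimate assumes all four bases are equally frequent and independent.

```python
# Minimal sketch of ORF scanning on one DNA strand (assumptions noted above).
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for each ORF with at least min_codons codons.

    start is the index of the A in ATG; end is the index just past the stop
    codon. The codon count includes the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def prob_no_stop(n_codons):
    """Chance that n random codons include no stop codon (61 of 64 are sense codons)."""
    return (61 / 64) ** n_codons
```

Calling prob_no_stop(100) gives a value a little above 0.008, which is why a 100-codon threshold filters out most chance ORFs. Lowering min_codons recovers shorter genes at the cost of more false positives.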
ORF analysis correctly identifies more than 90 percent of the genes in yeast and bacteria. Some of the very shortest genes, however, are missed by this method because they fall below the minimum length, and occasionally long open reading frames that are not actually genes arise by chance. Both types of misassignment can be corrected by more sophisticated analysis of the sequence and by genetic tests for gene function. Of the Saccharomyces genes identified in this manner, about half were already known by some functional criterion such as mutant phenotype. The functions of some of the proteins encoded by the remaining putative (suspected) genes identified by ORF analysis have been assigned based on their sequence similarity to known proteins in other organisms.
Identification of genes in organisms with a more complex genome structure requires more sophisticated algorithms than searching for open reading frames. Because most genes in higher eukaryotes are composed of multiple, relatively short exons separated by often quite long noncoding introns, scanning for ORFs is a poor method for finding genes in these organisms. The best gene-finding algorithms combine all the available data that might suggest the presence of a gene at a particular genomic site. Relevant data include alignment of the query sequence to a full-length cDNA sequence; alignment to a partial cDNA sequence, generally 200–400 bp in length, known as an expressed sequence tag (EST); fitting to models for exon, intron, and splice-site sequences; and sequence similarity to genes from other organisms. Using these computer-based bioinformatic methods, computational biologists have identified approximately 21,000 protein-coding genes in the human genome.
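To see why a plain ORF scan breaks down when exons are interrupted by introns, consider the crudest possible splice-site model: nearly all introns begin with GT and end with AG on the coding strand. The sketch below enumerates every span that satisfies that rule. It is only a stand-in for the statistical exon, intron, and splice-site models mentioned above, and the names candidate_introns and splice, as well as the min_len cutoff, are assumptions made for illustration.

```python
# Minimal sketch: enumerate candidate introns using only the GT...AG rule.
def candidate_introns(seq, min_len=20):
    """Return (start, end) spans that begin with GT, end with AG, and are
    at least min_len bases long; end is the index just past the AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


def splice(seq, intron):
    """Remove one candidate intron, joining the flanking sequence."""
    start, end = intron
    return seq[:start] + seq[end:]
```

Even a short sequence yields several candidates, because GT and AG occur frequently by chance. This is the reason practical gene finders weigh splice-site scores together with cDNA and EST alignments and cross-species similarity rather than relying on any single signal.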
A particularly powerful method for identifying human genes is to compare the human genomic sequence with that of the mouse. Humans and mice are sufficiently closely related to have most genes in common, whereas largely nonfunctional DNA sequences, such as intergenic regions and introns, tend to be very different because these sequences are not under strong selective pressure. Thus corresponding segments of the human and mouse genomes that exhibit high sequence similarity are likely to be functionally important: exons, transcription-control regions, or sequences with other functions that are not yet understood.
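The comparison described above can be illustrated with a sliding-window identity scan. This is a rough sketch under stated assumptions: the two sequences are taken to be already aligned (equal length, with "-" marking gaps), and the function name conserved_windows, the window size, and the identity threshold are all hypothetical choices rather than values from the text.

```python
# Minimal sketch: flag windows of high identity in a pre-aligned sequence pair.
def conserved_windows(human, mouse, window=50, min_identity=0.8):
    """Return (start, identity) for each window meeting the identity threshold.

    Both inputs must be rows of the same alignment: equal length, '-' for gaps.
    A column counts as a match only if both bases agree and neither is a gap.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    human, mouse = human.upper(), mouse.upper()
    hits = []
    for start in range(len(human) - window + 1):
        pairs = zip(human[start:start + window], mouse[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits
```

Runs of adjacent high-identity windows mark candidate functional segments. Deciding whether such a segment is an exon, a transcription-control region, or something else still requires the other lines of evidence.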