By using automated DNA sequencing techniques and computer algorithms to piece together sequence data, researchers have determined vast amounts of DNA sequence, including nearly the entire genomic sequence of humans and of many key experimental organisms. This enormous volume of data, which is growing at a rapid pace, is stored and organized by the National Center for Biotechnology Information (NCBI) at the US National Institutes of Health, the European Bioinformatics Institute of the European Molecular Biology Laboratory (headquartered in Heidelberg, Germany), and the DNA Data Bank of Japan. These databases continuously exchange newly reported sequences and make them available to scientists throughout the world on the Internet. By now, the genomic sequences have been completely, or nearly completely, determined for hundreds of viruses and bacteria; scores of archaea; yeasts (eukaryotes); plants, including rice and maize; important model multicellular eukaryotes such as the roundworm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and mice; humans; and representatives of all of the 35 or so metazoan phyla. The cost of sequencing a megabase of DNA has fallen so low that the entire genomes of cancer cells have been sequenced and compared with the genomes of normal cells from the same patients in order to determine all the mutations that have accumulated in each patient’s tumor cells. This approach is revealing genes that are commonly mutated in all cancers, as well as genes that are commonly mutated in tumors from different patients with the same type of cancer (e.g., breast or colon cancer). It may eventually lead to highly individualized cancer treatments tailored to the specific mutations in the tumor cells of a particular patient.
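The tumor-versus-normal comparison described above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned to the same length and contain only substitutions; the function name `somatic_mutations` and the short example sequences are hypothetical, chosen only for illustration. Real cancer-genome studies work from millions of short sequencing reads and use statistical methods to tell true mutations from sequencing errors, which this toy example does not attempt.

```python
from typing import List, Tuple


def somatic_mutations(normal: str, tumor: str) -> List[Tuple[int, str, str]]:
    """List the positions at which a tumor sequence differs from the
    matched normal sequence of the same patient.

    Assumption: both sequences are already aligned and of equal length,
    so only single-base substitutions can be detected.
    Returns (1-based position, normal base, tumor base) tuples.
    """
    if len(normal) != len(tumor):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, n, t)
        for i, (n, t) in enumerate(zip(normal, tumor))
        if n != t
    ]


# Hypothetical 15-base stretch from a patient's normal and tumor DNA.
normal_seq = "ATGGCCAAGTTGACC"
tumor_seq = "ATGGCCAAGTAGACC"

for position, normal_base, tumor_base in somatic_mutations(normal_seq, tumor_seq):
    print(f"position {position}: {normal_base} -> {tumor_base}")
```

Running the same comparison over tumors from many patients and counting how often each gene carries a mutation is, in outline, how commonly mutated genes would be flagged.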
The latest automated DNA sequencing techniques are so powerful that a project known as the “1000 Genomes Project” is currently under way, with the goal of sequencing most of the genomes of 2500 randomly chosen individuals from 25 populations around the world in order to determine the extent of human genetic variation as a basis for investigating the relationship between genotype and phenotype in humans. Moreover, privately owned companies have been founded that will sequence much of an individual’s genome for about $100 in order to search for sequence variations that may influence that individual’s probability of developing specific diseases.
In this section, we examine some of the ways in which researchers are mining this treasure trove of data to provide insights about gene function and evolutionary relationships, to identify new genes whose encoded proteins have never been isolated, and to determine when and where genes are expressed. This use of computers to analyze sequence data has led to the emergence of a new field of biology: bioinformatics.