Appendix B

Bioinformatic Resources for Genetics and Genomics

“You certainly usually find something, if you look, but it is not always quite the something you were after.” — The Hobbit, J. R. R. Tolkien

The field of bioinformatics encompasses the use of computational tools to distill complex data sets. Genetic and genomic data are so diverse that it has become a considerable challenge to identify the authoritative site(s) for a specific type of information. Furthermore, the landscape of Web-accessible software for analyzing this information is constantly changing as new and more powerful tools are developed. This appendix is intended to provide some valuable starting points for exploring the rapidly expanding universe of online resources for genetics and genomics.

1. Finding Genetic and Genomic Web Sites

Several central resources maintain large lists of relevant Web sites:

2. General Databases

Nucleic Acid and Protein Sequence Databases

By international agreement, three groups collaborate to house the primary DNA and mRNA sequences of all species: the National Center for Biotechnology Information (NCBI) houses GenBank, the European Bioinformatics Institute (EBI) houses the European Molecular Biology Laboratory (EMBL) Data Library, and the National Institute of Genetics in Japan houses the DNA Data Bank of Japan (DDBJ).

Primary DNA sequence records, called accessions, are submitted by individual research groups. In addition to providing access to these DNA sequence records, these sites provide many other data sets. For example, NCBI also houses RefSeq, a summary synthesis of information on the DNA sequences of fully sequenced genomes and the gene products that are encoded by these sequences.

Many other important features can be found at the NCBI, EBI, and DDBJ sites. Home pages and some other key Web sites are

The outstanding UCSC Genome Bioinformatics site contains the reference sequence and working draft assemblies for a large collection of genomes, along with a number of tools for exploring those genomes. The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide. The Gene Sorter shows expression, homology, and other information on groups of genes that can be related in many ways. Blat is an alignment tool that quickly maps sequences to the genome. The Table Browser provides access to the underlying database.

The harsh reality is that, with so much biological information available, the goal of making these online resources “transparent” to the user has not been fully achieved. Exploring these sites therefore means familiarizing yourself with the contents of each one and learning how it helps you to focus your queries so that you get the right answer(s).

As one example of the power of these sites, consider a search for a nucleotide sequence at NCBI. Databases typically store information in separate bins called “fields.” A query that limits the search to the appropriate field asks a more directed question. With the “Limits” option, a query phrase can be restricted to a specific species, type of sequence (genomic or mRNA), gene symbol, or any of several other data fields. Query engines usually allow multiple query statements to be joined together. For example: retrieve all DNA sequence records that are from the species Caenorhabditis elegans AND that were published after January 1, 2000. With the “History” option, the results of multiple queries can be joined together, so that only those hits common to multiple queries are retrieved. Proper use of the query options available on a site can computationally eliminate a great many false positives without discarding any of the relevant hits.
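The fielded query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field tags `[Organism]` and `[PDAT]`, the date-range syntax, and the use of NCBI's E-utilities `esearch` endpoint are assumptions about the current NCBI interface and are not taken from this appendix. The accession lists are hypothetical placeholders. The code only builds the query. It does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint (not named in this appendix).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_query(value, field):
    """Restrict a query phrase to one database field, e.g. a species name."""
    phrase = f'"{value}"' if " " in value else value
    return f"{phrase}[{field}]"


def combine_and(*clauses):
    """Join several query statements so that a record must satisfy all of them."""
    return " AND ".join(clauses)


# Species limited to the organism field; publication date limited to a range
# that starts the day after January 1, 2000 (assumed range syntax).
term = combine_and(
    field_query("Caenorhabditis elegans", "Organism"),
    "2000/01/02:3000[PDAT]",
)

# Full request URL; usehistory=y asks the server to remember the result set.
url = ESEARCH_URL + "?" + urlencode(
    {"db": "nucleotide", "term": term, "usehistory": "y"}
)

# The effect of joining two earlier result sets, shown with hypothetical IDs:
hits_query_1 = ["ACC0001", "ACC0002", "ACC0003"]
hits_query_2 = ["ACC0002", "ACC0003", "ACC0004"]
common_hits = sorted(set(hits_query_1) & set(hits_query_2))
```

Printing `term` shows the combined statement, and `common_hits` holds only the identifiers returned by both queries, which is the behavior attributed to the “History” option.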

Because protein sequence predictions are a natural part of the analysis of DNA and mRNA sequences, these same sites provide access to a variety of protein databases. One important protein database is SwissProt/TrEMBL. TrEMBL sequences are automatically predicted from DNA and/or mRNA sequences. SwissProt sequences are curated, meaning that an expert scientist reviews the output of computational analysis and makes expert decisions about which results to accept or reject. In addition to the primary protein sequence records, SwissProt also offers databases of protein domains and protein signatures (amino acid sequence strings that are characteristic of proteins of a particular type). The SwissProt home page is http://www.ebi.ac.uk/swissprot.
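To make the idea of a protein signature concrete, the following sketch scans a sequence for a short amino acid pattern. It assumes a simplified subset of PROSITE-style pattern notation (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, and `(n)` or `(n,m)` for repeats). The converter and the test sequence are illustrative only and are not how any particular database implements its search.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simplified PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def find_signature(sequence, pattern):
    """Return 0-based start positions of every (possibly overlapping) match."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping matches be reported as well.
    return [m.start() for m in re.finditer("(?=" + regex + ")", sequence)]


# Pattern of the N-glycosylation site signature: N, not P, S or T, not P.
positions = find_signature("MKNGSAANPTQ", "N-{P}-[ST]-{P}")
```

Here `positions` contains only the first asparagine (index 2), because the second one is followed by a proline, which the pattern excludes.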

Protein Domain Databases

The functional units within proteins are thought to be local folding regions called domains. Prediction of domains within newly discovered proteins is one way to guess at their function. Numerous protein domain databases have emerged that predict domains in somewhat different ways. Some of the individual domain databases are Pfam, PROSITE, PRINTS, SMART, ProDom, TIGRFAMs, BLOCKS, and CDD. InterPro allows querying multiple protein domain databases simultaneously and presents the combined results. Web sites for some domain databases are

Protein Structure Databases

The representation of three-dimensional protein structures has become an important aspect of global molecular analysis. Three-dimensional structure databases are available from the major DNA/protein sequence database sites and from independent protein structure databases, notably the Protein Data Bank (PDB). NCBI has an application called Cn3D that helps in viewing PDB data.

3. Specialized Databases

Organism-Specific Genetic Databases

To amass some classes of genetic and genomic information, especially phenotypic information, expert knowledge of a particular species is required. Thus, MODs (model organism databases) have emerged to fulfill this role for the major genetic systems. These include databases for Saccharomyces cerevisiae (SGD), Caenorhabditis elegans (WormBase), Drosophila melanogaster (FlyBase), the zebrafish Danio rerio (ZFIN), the mouse Mus musculus (MGI), the rat Rattus norvegicus (RGD), Zea mays (MaizeGDB), and Arabidopsis thaliana (TAIR). Home pages for these MODs can be found at

Human Genetics and Genomics Databases

Because of the importance of human genetics in clinical as well as basic research, a diverse set of human genetic databases has emerged. This set includes a human genetic disease database called Online Mendelian Inheritance in Man (OMIM), a database of brief descriptions of human genes called GeneCards, a compilation of all known mutations in human genes called Human Gene Mutation Database (HGMD), a database of the current sequence map of the human genome called the Golden Path, and some links to human genetic disease databases:

Genome Project Databases

The individual genome projects also have Web sites, where they display their results, often including information that doesn’t appear on any other Web site in the world. The largest of the publicly funded genome centers include

4. Relationships of Genes Within and Between Databases

Gene products may be related by virtue of sharing a common evolutionary origin, sharing a common function, or participating in the same pathway.

BLAST: Identification of Sequence Similarities

Evidence for a common evolutionary origin comes from the identification of sequence similarities between two or more sequences. One of the most important tools for identifying such similarities is BLAST (Basic Local Alignment Search Tool), which was developed by NCBI. BLAST is really a suite of related programs and databases in which local matches between long stretches of sequence can be identified and ranked. A query for similar DNA or protein sequences through BLAST is one of the first things that a researcher does with a newly sequenced gene. Different sequence databases can be accessed and organized by type of sequence (reference genome, recent updates, nonredundant, ESTs, etc.), and a particular species or taxonomic group can be specified. One BLAST routine matches a query nucleotide sequence translated in all six frames to a protein sequence database. Another matches a protein query sequence to the six-frame translation of a nucleotide sequence database. Other BLAST routines are customized to identify short sequence pattern matches or pair-wise alignments, to screen genome-sized DNA segments, and so forth, and can be accessed through the same top-level page:
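The six-frame translation used by the translated BLAST routines in this section can be illustrated with a short Python sketch. It assumes the standard genetic code and a DNA sequence written with the letters A, C, G, and T. It shows only the translation step, not BLAST's search or scoring, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three reading frames on each strand of a DNA sequence."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

For the nine-base example, frame `+1` reads ATG GCC TAA and gives `MA*`, while the other five frames give unrelated short peptides. This is why a translated search must consider all six possibilities when the true reading frame is unknown.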

Function Ontology Databases

Another approach to developing relationships among gene products is to assign these products to functional roles based on experimental evidence or prediction. Having a common way of describing these roles, regardless of the experimental system, is then of great importance. A group of scientists from different databases is working together to develop a common set of hierarchically arranged terms (an ontology) for function (biochemical event), process (the cellular event to which a protein contributes), and subcellular location (where a product is located in a cell) as a way of describing the activities of a gene product. This particular ontology is called the Gene Ontology (GO), and many different databases of gene products now incorporate GO terms. A full description can be found at
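The value of hierarchically arranged terms can be seen in a small sketch: a gene product annotated with a specific term is implicitly annotated with every broader term above it. The term names, their arrangement, and the gene names below are a simplified, hypothetical hierarchy written for illustration. They are not copied from a GO release.

```python
# Each term maps to its broader (parent) terms; a toy hierarchy, not real GO data.
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "ATP binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}

# Hypothetical gene products, each annotated with its most specific term.
ANNOTATIONS = {
    "geneA": "protein kinase activity",
    "geneB": "transferase activity",
    "geneC": "ATP binding",
}


def ancestors(term, parents=PARENTS):
    """Collect every broader term reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def genes_with_role(role, annotations=ANNOTATIONS):
    """Genes annotated with the role itself or with any more specific term."""
    return sorted(
        gene
        for gene, term in annotations.items()
        if term == role or role in ancestors(term)
    )


catalytic_genes = genes_with_role("catalytic activity")
```

A query for the broad role "catalytic activity" returns both `geneA` and `geneB`, even though neither was annotated with that exact term, while `geneC` is found only under "binding" or the root term.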

Pathway Databases

Still another way to relate products to one another is by assigning them to steps in biochemical or cellular pathways. Pathway diagrams can be used as organized ways of presenting relationships of these products to one another. Some of the more advanced attempts at producing such pathway databases include Kyoto Encyclopedia of Genes and Genomes (KEGG) and Signal Transduction Database (TRANSPATH):
