Introduction to Protein Databases
Protein databases are specialized biological databases that store comprehensive information about proteins. They include data such as amino acid sequences, domain structures, biological functions, three-dimensional structures, and protein interactions.
Protein Sequence Databases
Protein sequence databases are crucial for storing and analyzing amino acid sequences, which determine a protein's structure and function. These databases help in identifying and studying proteins, their functions, and their roles in various biological processes. Here are some prominent protein sequence databases:
PIR (Protein Information Resource)
- PIR is a significant protein sequence database offering functionally annotated protein sequences.
- It maintains three key databases:
  - Protein Sequence Database (PSD): Contains annotated protein sequences.
  - Non-redundant Reference (NREF): Provides unique reference sequences.
  - Integrated Protein Classification (iProClass): Includes information on protein families, functions, and structures.
Examples: Pseudomonas aeruginosa (common name: Pseudomonas), Escherichia coli (common name: E. coli), Saccharomyces cerevisiae (common name: Yeast).
SWISS-PROT
- SWISS-PROT is known for its detailed annotations on protein functions, domain structures, post-translational modifications, and variants.
- It is maintained jointly by the SIB (Swiss Institute of Bioinformatics) and the EBI (European Bioinformatics Institute), and today forms the manually reviewed section of UniProtKB.
- Notable for its:
  - Annotations: Extensive functional and structural information.
  - Minimal Redundancy: All reports for a given protein are merged into a single entry, so each protein appears only once.
  - Integration: Links to related databases for comprehensive information.
Examples: Homo sapiens (common name: Human), Mus musculus (common name: Mouse), Rattus norvegicus (common name: Rat).
TrEMBL
- TrEMBL is a computer-annotated extension of Swiss-Prot.
- It contains translations of the coding sequences in the EMBL (European Molecular Biology Laboratory) nucleotide sequence database that have not yet been integrated into Swiss-Prot.
- It follows the Swiss-Prot entry format for consistency.
Examples: Arabidopsis thaliana (common name: Thale cress), Drosophila melanogaster (common name: Fruit fly), Danio rerio (common name: Zebrafish).
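The split between curated Swiss-Prot entries and computer-annotated TrEMBL entries is visible in sequence files. The sketch below assumes the UniProtKB FASTA header layout, in which the prefix `sp` marks a Swiss-Prot entry and `tr` marks a TrEMBL entry. The function name `parse_uniprot_header` and the returned field names are illustrative choices, not part of any database tool.

```python
import re

# Assumed header layout: >db|Accession|EntryName Description OS=Organism OX=... GN=... PE=... SV=...
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)

def parse_uniprot_header(line):
    """Split a UniProtKB-style FASTA header into its main fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    fields = match.groupdict()
    rest = fields.pop("rest")
    # Everything before the first " OS=" tag is the protein description.
    description, _, tail = rest.partition(" OS=")
    fields["description"] = description
    # The organism name runs until the next two-letter KEY= tag (OX=, GN=, ...).
    fields["organism"] = re.split(r"\s+[A-Z]{2}=", tail)[0] if tail else None
    # sp = Swiss-Prot (manually reviewed), tr = TrEMBL (computer-annotated).
    fields["reviewed"] = fields["db"] == "sp"
    return fields

header = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
info = parse_uniprot_header(header)
print(info["accession"], info["organism"], info["reviewed"])
```

Filtering on the `reviewed` flag is one simple way to restrict an analysis to manually curated sequences.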
Protein Structure Databases
Protein Data Bank (PDB)
The Protein Data Bank (PDB) is a global repository of 3D structural data for large biological molecules such as proteins and nucleic acids. Its structural models are obtained through three main experimental methods: X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and electron microscopy (3DEM).
Examples:
- Hemoglobin (Homo sapiens)
- ATP Synthase (Escherichia coli)
- Ribosome (Saccharomyces cerevisiae)
Structural Classification of Proteins (SCOP)
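SCOP (Structural Classification of Proteins) organizes protein domains according to their structural and evolutionary relationships. Its main levels are family, superfamily, and fold. Proteins with a clear evolutionary relationship, usually evident from sequence similarity, are grouped into families. Families whose structures and functions suggest a probable common evolutionary origin are grouped into superfamilies. Superfamilies that share the same major secondary structure elements in the same arrangement and topology are grouped into folds.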
Examples:
- Enolase (Saccharomyces cerevisiae)
- Cytochrome c (Bos taurus)
- Globin (Homo sapiens)
CATH Database
CATH categorizes protein domains into a hierarchical system based on folding patterns. The hierarchy includes Class, Architecture, Topology, and Homologous Superfamily. Domains with similar folding patterns are grouped together, providing insights into protein structural classification.
Examples:
- Immunoglobulin domain (Mus musculus)
- Protease (Staphylococcus aureus)
- SH3 domain (Homo sapiens)
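The four-level hierarchy can be read directly from a CATH classification code, which is written as four dot-separated numbers (Class.Architecture.Topology.Homologous Superfamily). The sketch below is a minimal illustration: the helper names `parse_cath_code` and `lineage` are hypothetical, the class labels cover only classes 1 to 4, and the code `2.60.40.10` is used purely as sample input.

```python
from typing import NamedTuple

# Labels for the top-level classes 1-4 only (an assumption of this sketch).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

class CathNode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

def parse_cath_code(code):
    """Split a code such as '2.60.40.10' into its four hierarchy levels."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers: {code!r}")
    return CathNode(*map(int, parts))

def lineage(code):
    """Return the code of every ancestor level, from Class down to the full code."""
    parse_cath_code(code)  # validate first
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

node = parse_cath_code("2.60.40.10")
print(node, CATH_CLASSES.get(node.class_, "unlisted class"))
print(lineage("2.60.40.10"))
```

Two domains share a level of the hierarchy when their lineages agree up to that position, so comparing prefixes is enough to tell whether they share a class, an architecture, a topology, or a homologous superfamily.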
Protein-Protein Interaction Databases
BIND (Biomolecular Interaction Network Database)
BIND is a comprehensive database that documents interactions, molecular complexes, and pathways involving biomolecules, including proteins, nucleic acids, and small molecules. It supports data mining and network mapping across species, and its records can also be used as input for kinetic simulations.
Examples: Escherichia coli (E. coli), Saccharomyces cerevisiae (Baker’s yeast), Homo sapiens (Humans)
DIP (Database of Interacting Proteins)
DIP contains protein-protein interaction data compiled through manual curation and computational methods. It aids in understanding protein functions, their relationships, and network properties. DIP is valuable for evaluating interaction predictions and exploring evolutionary aspects of interactions.
Examples: Mus musculus (House mouse), Drosophila melanogaster (Fruit fly), Arabidopsis thaliana (Thale cress)
MINT (Molecular Interaction Database)
MINT stores data on functional interactions among biomolecules such as proteins, RNA, and DNA. It focuses on experimentally verified protein-protein interactions, considering both direct and indirect relationships. The database also covers enzymatic modifications of partner molecules.
Examples: Caenorhabditis elegans (Nematode worm), Danio rerio (Zebrafish), Gallus gallus (Chicken)
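The network view that these interaction databases support can be illustrated with a small data structure. The sketch below assumes interactions are available as simple binary pairs. The protein names are made-up placeholders, not records from BIND, DIP, or MINT, and the helper names are hypothetical.

```python
from collections import defaultdict

def build_network(pairs):
    """Build an undirected interaction network from (protein, protein) pairs."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return network

def degrees(network):
    """Count the number of distinct interaction partners per protein."""
    return {protein: len(partners) for protein, partners in network.items()}

# Placeholder interaction list; real data would come from a database export.
pairs = [
    ("ProtA", "ProtB"),
    ("ProtA", "ProtC"),
    ("ProtB", "ProtC"),
    ("ProtC", "ProtD"),
]

network = build_network(pairs)
for protein, count in sorted(degrees(network).items()):
    print(protein, count)
```

The partner count (degree) is one of the simplest network properties. Proteins with many partners are often examined first when exploring an interaction map.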
Protein Pattern and Profile Databases
InterPro
InterPro integrates data from major protein signature databases, including PROSITE, Pfam, PRINTS, ProDom, and SMART. It provides comprehensive information on protein families, domains, and functional sites. InterPro helps identify structural or functional features in proteins.
Examples: Bacillus subtilis (B. subtilis), Corynebacterium diphtheriae (C. diphtheriae), Mycobacterium tuberculosis (M. tuberculosis)
PROSITE
PROSITE offers a collection of protein signatures that identify patterns or profiles in proteins. These signatures help determine biological functions and are linked to annotation documents detailing protein families, domains, and 3D structures.
Examples: Helicobacter pylori (H. pylori), Pseudomonas aeruginosa (P. aeruginosa), Streptococcus pneumoniae (S. pneumoniae)
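A PROSITE pattern is essentially a compact regular expression over amino acid letters. The sketch below translates the basic pattern syntax into a Python regular expression: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for the sequence ends. It is a simplified illustration that ignores rarer syntax, and the function name `prosite_to_regex` and the test sequence are assumptions made for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # any residue
            token = "."
        elif core.startswith("{"):       # excluded residues
            token = "[^" + core[1:-1] + "]"
        else:                            # single residue or [ABC] set
            token = core
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        regex.append(prefix + token + suffix)
    return "".join(regex)

# N-glycosylation site pattern (PROSITE PS00001).
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)

# Made-up sequence used only to demonstrate scanning.
sequence = "MKNGSAANPTL"
print([(m.start(), m.group()) for m in re.finditer(regex, sequence)])
```

Note that `re.finditer` reports non-overlapping matches only, so a production scanner would need a lookahead or a sliding start position to catch overlapping sites.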
Metabolic Pathway Databases
Metabolic pathway databases provide critical information on enzymes, biochemical reactions, and metabolic pathways. These databases are essential for understanding how genes and molecules interact within cellular processes.
ENZYME Database
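The ENZYME database focuses on enzyme nomenclature. It follows the recommendations of the Nomenclature Committee of the IUBMB and lists each characterized enzyme under its EC (Enzyme Commission) number, together with its names and the reaction it catalyzes. Most metabolic and biomolecular databases refer to ENZYME for standard enzyme names and classifications.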
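Enzyme classifications of this kind are expressed as EC numbers, four dot-separated fields that narrow from the broad enzyme class down to a serial number. The sketch below shows one way to decode them. The function name `parse_ec_number` and the returned keys are illustrative, and the parser assumes plain integer fields, with a dash standing for an unspecified level.

```python
# The seven top-level EC classes.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}

def parse_ec_number(ec):
    """Split an EC number such as 'EC 1.1.1.1' into its four levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec!r}")
    # A dash marks a level that is left unspecified, e.g. 3.4.21.-
    levels = [None if p == "-" else int(p) for p in parts]
    if levels[0] not in EC_CLASSES:
        raise ValueError(f"unknown enzyme class in {ec!r}")
    return {
        "class": levels[0],
        "class_name": EC_CLASSES[levels[0]],
        "subclass": levels[1],
        "sub_subclass": levels[2],
        "serial": levels[3],
    }

# EC 1.1.1.1 is alcohol dehydrogenase.
print(parse_ec_number("EC 1.1.1.1"))
```

Because the levels are hierarchical, two enzymes that share the first three fields catalyze the same general type of reaction and differ mainly in substrate.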
KEGG Database
KEGG (Kyoto Encyclopedia of Genes and Genomes) offers a comprehensive view of molecular and cellular pathways. It includes pathway maps, molecule tables, gene tables, and genome maps, aiding in the construction of functional maps for metabolic and regulatory pathways.
Applications of Protein Databases
Protein databases serve various functions in scientific research:
- Sequence Analysis: Identify homologous sequences and predict protein functions based on similarity.
- Protein Structure Prediction: Compare amino acid sequences with known structures to predict protein structure.
- Protein-Protein Interactions: Retrieve the known interaction partners of a protein and map interaction networks.
- Protein Family Identification: Identify conserved motifs and classify protein families.
- Drug Discovery and Disease Research: Explore metabolic pathways involved in diseases to assist in drug development.