In the post-genomic era, nucleotide and protein sequences from many different organisms are available. This has paved the way for determining the secondary and 3-D structures of proteins as well. This vast amount of information is processed and arranged systematically in different biological databases. The information in these databases can be used to derive the common features of a sequence class and to classify an unknown sequence.
Primary Database
This is a collection of data obtained directly from experiments, such as DNA or protein sequences and the 3-D structures of proteins.
Database of nucleic acid sequences
GenBank
- This is a public sequence database that can be accessed at http://www.ncbi.nlm.nih.gov/genbank/.
- A new entry is submitted to GenBank by logging into the database. Submission is tied to publication: most scientific journals require a new sequence to be deposited and assigned an accession number before the associated paper is published.
- Each entry in the database has a unique accession number that remains unchanged. A sample GenBank entry can be viewed at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html.
- A typical GenBank entry contains the locus name, the length of the sequence, the type of molecule (DNA/RNA) and the nucleotide sequence itself.
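As a rough illustration of the fields listed above, the following Python sketch extracts the locus name, sequence length and molecule type from the first (LOCUS) line of a GenBank flat-file record. The sample line, the entry name EXAMPLE01 and the helper names are hypothetical, and the parser assumes whitespace-separated fields on a nucleotide record rather than fixed column positions.

```python
from typing import NamedTuple


class LocusInfo(NamedTuple):
    name: str           # locus name of the entry
    length: int         # sequence length in base pairs
    molecule_type: str  # e.g. "DNA" or "RNA"


def parse_locus_line(line: str) -> LocusInfo:
    """Parse the LOCUS line of a nucleotide record (assumed layout)."""
    tokens = line.split()
    # Expected token order: LOCUS <name> <length> bp <molecule type> ...
    if len(tokens) < 5 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a nucleotide LOCUS line: %r" % line)
    return LocusInfo(name=tokens[1], length=int(tokens[2]), molecule_type=tokens[4])


# Hypothetical LOCUS line, for illustration only.
sample = "LOCUS       EXAMPLE01    5028 bp    DNA     linear   PLN 21-JUN-1999"
info = parse_locus_line(sample)
print(info.name, info.length, info.molecule_type)
```

For real records, an established parser such as the one in Biopython is a safer choice than hand-written splitting.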
Entrez
- The Entrez system is used to search all NCBI-associated databases.
- It is a powerful tool for performing simple or complicated searches by combining keywords with logical operators (AND, OR, NOT).
- For example, human protein kinase sequences can be searched for with the following syntax: Homo sapiens [ORGN] AND protein kinase.
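The search syntax above can also be assembled programmatically. The sketch below builds the same query string and wraps it in a URL for the ESearch endpoint of the NCBI E-utilities, using the `db`, `term` and `retmax` parameters. The function names and the optional `exclude` argument are illustrative assumptions; the code only constructs the URL and sends no request.

```python
from typing import Optional
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism: str, keywords: str, exclude: Optional[str] = None) -> str:
    """Combine an organism filter and keywords with logical operators."""
    term = "%s [ORGN] AND %s" % (organism, keywords)
    if exclude:
        # NOT removes records matching the excluded keyword.
        term += " NOT %s" % exclude
    return term


def build_esearch_url(term: str, db: str = "protein", retmax: int = 20) -> str:
    """Return an ESearch URL for the given query (nothing is fetched)."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_term("Homo sapiens", "protein kinase")
print(term)
print(build_esearch_url(term))
```

Keeping the query construction separate from the URL construction makes it easy to reuse the same term against a different database, for example `db="nucleotide"`.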
EMBL and DDBJ
- EMBL is the nucleotide sequence database maintained at the European Bioinformatics Institute (EBI), whereas DDBJ is the DNA sequence database maintained at the Center for Information Biology, Japan.
- EMBL can be accessed at http://www.embl.de/, whereas DDBJ can be accessed at http://www.ddbj.nig.ac.jp/.
- GenBank, EMBL and DDBJ synchronize their nucleotide sequences every day, so searching for a nucleotide sequence in any one of the three databases is sufficient.
Database of protein sequences
SWISSPROT
- It is the collection of annotated protein sequences maintained by the Swiss Institute of Bioinformatics (SIB).
- SWISSPROT can be accessed at http://web.expasy.org/groups/swissprot/.
- Each protein sequence entry in SWISSPROT is manually curated and, where required, checked against the available literature.
- SWISSPROT is part of the UniProt database; together with TrEMBL it forms the UniProt Knowledgebase.
- The ‘niceprot’ view presents a SWISSPROT entry graphically for better readability and provides hyperlinks to other databases.
NCBI protein database
- It is a compilation of the protein sequences present in other databases.
- The NCBI database contains entries from SWISSPROT, PIR, PDB and other known databases.
UniProt
- EBI, SIB and Georgetown University together compiled protein information into a centralized catalogue known as the Universal Protein Resource (UniProt).
- It contains information about the 3-D structure, expression profile, secondary structure and biochemical function of each protein.
- UniProt consists of 3 parts: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc).
- UniProtKB combines entries from the SWISSPROT and TrEMBL databases.
- UniRef is a non-redundant sequence database that allows searching for similar sequences.
- UniRef100, UniRef90 and UniRef50 are the three versions of the database. They group sequences that are 100%, at least 90% and at least 50% identical, respectively.
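To make the identity thresholds concrete, the sketch below computes the percent identity of two sequences that are assumed to be already aligned to equal length, and reports which UniRef levels that identity would satisfy. This is only an illustration of the 100/90/50 cut-offs, treated here as inclusive; it is not the clustering procedure UniProt uses, and the function names and example sequences are hypothetical.

```python
from typing import List


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    # A column counts as a match only if both residues agree and are not gaps.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def uniref_levels(identity: float) -> List[str]:
    """List the UniRef levels whose (assumed inclusive) threshold is met."""
    thresholds = [("UniRef100", 100.0), ("UniRef90", 90.0), ("UniRef50", 50.0)]
    return [name for name, cutoff in thresholds if identity >= cutoff]


# Hypothetical 10-residue sequences differing at the last position.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
print(identity, uniref_levels(identity))
```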
Secondary Database
Secondary databases are developed from the analysis of primary data. Secondary structures, hydrophobicity plots and domains are found in the various secondary databases.
Prosite
- Prosite is a secondary biological database that contains motifs used to classify an unknown sequence into a protein family or enzyme class.
- It can be accessed with the web address http://prosite.expasy.org/.
- The database contains motifs derived from multiple sequence alignments.
- The query sequence is aligned against the multiple sequence alignment to determine the presence or absence of the motif.
- A Prosite expression lists the residues allowed at each position of a motif; the example here spans seven amino acid positions.
- For example, [EFTNA]-[HFDAS]-[HYT]-{ADS}-x(2)-P.
- This expression can be read as follows:
- The 1st position can be E, F, T, N or A.
- The 2nd position can be H, F, D, A or S.
- The 3rd position can be H, Y or T.
- The 4th position can be any amino acid except A, D or S.
- The 5th and 6th positions can be any amino acid, and the 7th position must be proline (P).
A query sequence can be analyzed with the ScanProsite tool. The tool can also search for sequences with a similar pattern in the SWISSPROT, TrEMBL and PDB databases.
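The reading of the example expression can be checked with a short Python sketch that translates a Prosite-style pattern into a regular expression and scans a sequence with it. This is not ScanProsite itself: it covers only the elements used in the example (bracketed sets, braced exclusions, x with a repeat count, and single residues), and the function name and test sequence are hypothetical.

```python
import re

# One pattern element, optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"^(?P<core>[^()]+)(?:\((?P<min>\d+)(?:,(?P<max>\d+))?\))?$")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite-style pattern into a Python regex string."""
    pieces = []
    for raw in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.match(raw.replace(" ", ""))
        if match is None:
            raise ValueError("unsupported element: %r" % raw)
        core = match.group("core")
        if core.lower() == "x":
            piece = "."                        # any amino acid
        elif core.startswith("[") and core.endswith("]"):
            piece = core                       # one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"    # any residue except those listed
        elif len(core) == 1 and core.isalpha():
            piece = core                       # exactly this residue
        else:
            raise ValueError("unsupported element: %r" % raw)
        if match.group("min"):
            if match.group("max"):
                piece += "{%s,%s}" % (match.group("min"), match.group("max"))
            else:
                piece += "{%s}" % match.group("min")
        pieces.append(piece)
    return "".join(pieces)


regex = prosite_to_regex("[EFTNA]-[HFDAS]-[HYT]-{ADS}-x(2)-P")
print(regex)

# Hypothetical query sequence containing the motif from its third residue onward.
hit = re.search(regex, "MKEHHGLLPQ")
print(hit.group(), hit.start())
```

Replacing the G in the test sequence with A, D or S removes the match, which mirrors the exclusion rule at the 4th position.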