Biological Databases

In the post-genomic era, nucleotide and protein sequences from many different organisms are available. This has also paved the way for determining the secondary and 3-D structures of proteins. This vast amount of information is processed and arranged systematically in different biological databases. The information in these databases can be used to derive the common features of a sequence class and to classify an unknown sequence.

Primary Database

This is the collection of data obtained directly from experiments, such as DNA or protein sequences and the 3-D structures of proteins.

Database of nucleic acid sequences

GenBank

GenBank is a public sequence database that can be accessed at http://www.ncbi.nlm.nih.gov/genbank/.
New sequences are submitted by logging in to the database, and submission is typically tied to publication of the new sequence in a scientific journal.
Each entry in the database has a unique accession number, which remains unchanged. A sample GenBank entry can be viewed at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html.
A typical GenBank entry records the locus name, the length of the sequence, the type of molecule (DNA/RNA) and the nucleotide sequence itself.
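To make the fields listed above concrete, here is a minimal Python sketch that pulls the locus name, length, molecule type, accession number and sequence out of a GenBank-style flat-file record. It is only an illustration: the record `TOY_RECORD` and the function name `parse_genbank_minimal` are hypothetical, the layout is simplified, and real entries carry many more fields than this handles.

```python
# Toy record in a simplified GenBank flat-file layout (made up for illustration).
TOY_RECORD = """\
LOCUS       TOY0001        20 bp    DNA     linear   SYN 01-JAN-2000
ACCESSION   TOY0001
ORIGIN
        1 atgcatgcat gcatgcatgc
//
"""


def parse_genbank_minimal(text):
    """Extract a few fields from one GenBank-style record (simplified sketch)."""
    record = {
        "locus_name": None,
        "length": None,
        "molecule_type": None,
        "accession": None,
        "sequence": "",
    }
    in_origin = False
    seq_parts = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Assumed field order: LOCUS <name> <length> bp <molecule type> ...
            fields = line.split()
            record["locus_name"] = fields[1]
            record["length"] = int(fields[2])
            record["molecule_type"] = fields[4]
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence lines follow
        elif line.startswith("//"):
            in_origin = False  # end-of-record marker
        elif in_origin:
            # Drop the leading position number, keep the sequence blocks.
            seq_parts.extend(line.split()[1:])
    record["sequence"] = "".join(seq_parts)
    return record


record = parse_genbank_minimal(TOY_RECORD)
print(record)
```

For real work, an established parser is a safer choice than a hand-written one like this.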

Entrez

The Entrez system is used to search all NCBI-associated databases.
It is a powerful tool for performing simple or complex searches by combining keywords with logical operators (AND, OR, NOT).
For example, a search for protein kinase sequences in human can be written as: Homo sapiens[ORGN] AND protein kinase.
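A query like the one above can also be assembled programmatically. The Python sketch below builds the search term and a URL for NCBI's E-utilities `esearch` endpoint without sending any request. The helper names `build_entrez_term` and `build_esearch_url` are hypothetical, and the choice of the `protein` database and the `retmax` value are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(organism, keyword):
    """Combine an organism restriction and a keyword with the AND operator."""
    return f"{organism}[ORGN] AND {keyword}"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for the given database and search term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


term = build_entrez_term("Homo sapiens", "protein kinase")
url = build_esearch_url("protein", term)
print(term)
print(url)
```

Fetching the URL would return matching record identifiers; that step is left out here so the sketch runs without network access.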

EMBL and DDBJ

EMBL is the nucleotide sequence database maintained at the European Bioinformatics Institute (EBI), whereas DDBJ is the DNA sequence database maintained at the Center for Information Biology, Japan.
EMBL can be accessed at http://www.embl.de/ and DDBJ at http://www.ddbj.nig.ac.jp/.
GenBank, EMBL and DDBJ synchronize their nucleotide sequences every day, so searching for a nucleotide sequence in any one of the three databases is sufficient.

Example genome data entry from the EMBL DNA Sequence Database (Müller & Naumann, 2003).

Database of protein sequences

SWISSPROT

SwissProt is the collection of annotated protein sequences maintained by the Swiss Institute of Bioinformatics (SIB).
It can be accessed at http://web.expasy.org/groups/swissprot/.
Each protein sequence entry in SwissProt is manually curated and, where required, checked against the available literature.
SwissProt is part of the UniProt database; together with TrEMBL it makes up the UniProt Knowledgebase.
The ‘niceprot’ view presents a SwissProt entry graphically for better readability and provides hyperlinks to other databases.

NCBI protein database

It is a compilation of the protein sequences present in other databases.
The NCBI protein database contains entries from SwissProt, the PIR database, the PDB database and other known databases.

UniProt

EBI, SIB and Georgetown University have together compiled protein information into a centralized catalogue known as the Universal Protein Resource (UniProt).
It contains information about the 3-D structure, expression profile, secondary structure and biochemical function of each protein.
UniProt consists of three parts: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc).
UniProtKB combines the SwissProt and TrEMBL databases.
UniRef is a non-redundant sequence database that can be used to search for similar sequences.
It is available in three versions, UniRef100, UniRef90 and UniRef50, which group sequences that are 100%, at least 90% and at least 50% identical, respectively.
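As a rough illustration of what these identity levels mean, the Python sketch below computes the percent identity of two equal-length, already aligned sequences and reports which levels that value would satisfy. This is a toy calculation under stated assumptions, not the clustering procedure UniRef itself uses: the function names and the peptide strings are made up, and the cut-offs are treated as "at least 90%" and "at least 50%".

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with the same residue in two aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def identity_levels(identity):
    """List the UniRef-style levels an identity value would satisfy (assumed cut-offs)."""
    levels = []
    if identity == 100.0:
        levels.append("UniRef100")
    if identity >= 90.0:
        levels.append("UniRef90")
    if identity >= 50.0:
        levels.append("UniRef50")
    return levels


# Made-up peptides differing at the last of ten positions.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
print(identity, identity_levels(identity))
```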

Secondary Database

Analysis of the primary data gives rise to secondary databases. Derived information such as secondary structures, hydrophobicity plots and domains is stored in the various secondary databases.

Prosite

Prosite is a secondary biological database that contains motifs used to assign an unknown sequence to a protein family or enzyme class.
It can be accessed at http://prosite.expasy.org/.
The motifs in the database are derived from multiple sequence alignments.
The query sequence is compared against these motifs to determine whether each motif is present or absent.
A Prosite expression describes the motif position by position, with positions separated by hyphens; the example here covers seven amino acid positions.
For example, [EFTNA]-[HFDAS]-[HYT]-{ADS}-X(2)-P.
This expression can be read as follows:
- 1st position can be E, F, T, N or A
- 2nd position can be H, F, D, A or S
- 3rd position can be H, Y or T
- 4th position can be any amino acid except A, D or S
- 5th and 6th positions can be any amino acid
- 7th position must be proline (P)
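The position-by-position reading above maps naturally onto a regular expression. The Python sketch below translates a Prosite-style pattern into a regex and scans a sequence with it. It covers only the notation used in this example (bracketed choices, braces for exclusions, X for any residue and repeat counts in parentheses); the function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regex string."""
    pattern = pattern.replace(" ", "").rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core.upper() == "X":
            regex = "."  # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # any amino acid except those listed
        else:
            regex = core  # bracketed choice such as [HYT], or a single residue
        if low and high:
            regex += "{%s,%s}" % (low, high)
        elif low:
            regex += "{%s}" % low
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[EFTNA]-[HFDAS]-[HYT]-{ADS}-X(2)-P")
match = re.search(regex, "MKEHHGAAPLL")  # made-up query sequence
print(regex)
print(match.group(), match.start())
```

Changing the fourth matched residue from G to A in the made-up sequence removes the hit, because A is one of the residues excluded at that position.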

A query sequence can be analyzed with the ScanProsite tool. It also allows searching for sequences with a similar pattern in the SwissProt, TrEMBL and PDB databases.

References

Manzoor, Shahid. (2014). Computational and comparative investigations of syntrophic acetate-oxidising bacteria (SAOB), genome-guided analysis of metabolic capacities and energy conserving systems.
Müller, Heiko & Naumann, Felix. (2003). Data Quality in Genome Databases. 269-284.
