SeqRecord objects from GenBank files

As in the section on SeqRecord objects from FASTA files, we’re going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file. Again, this file is included with the Biopython unit tests under the GenBank folder, or is available online as NC_005816.gb from biopython.org.

This file contains a single record (i.e. only one LOCUS line) and starts:

LOCUS       NC_005816               9609 bp    DNA     circular BCT 21-JUL-2008
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
            sequence.
ACCESSION   NC_005816
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638
...

We’ll use Bio.SeqIO to read this file in; the code is almost identical to that used above
for the FASTA file:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'),
id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str.
91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')

The name comes from the LOCUS line, while the id includes the version suffix. The description comes
from the DEFINITION line:

>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
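The mapping from header lines to these three attributes can be sketched in plain Python. This is only an illustration under stated assumptions, not how Bio.SeqIO parses GenBank files internally: `header_fields` is a hypothetical helper, the id is assumed to be the first token of the VERSION line, and the description is assumed to be the DEFINITION text with its continuation line joined and the trailing period dropped.

```python
# Illustration only: a hand-rolled reading of the header lines shown above.
# header_fields is a hypothetical helper, not part of Biopython.
header = """\
LOCUS       NC_005816               9609 bp    DNA     circular BCT 21-JUL-2008
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
            sequence.
ACCESSION   NC_005816
VERSION     NC_005816.1  GI:45478711
"""


def header_fields(text):
    """Return name, id and description guessed from GenBank header text."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented lines continue the previous keyword's value.
            if current is not None:
                fields[current] += " " + line.strip()
        else:
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
    return {
        "name": fields["LOCUS"].split()[0],        # first token of LOCUS
        "id": fields["VERSION"].split()[0],        # accession with version suffix
        "description": fields["DEFINITION"].rstrip("."),
    }


fields = header_fields(header)
print(fields["name"])
print(fields["id"])
print(fields["description"])
```

For real files, rely on `SeqIO.read` as shown above; the sketch only makes the correspondence between header lines and attributes explicit.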

GenBank files don’t have any per-letter annotations:
>>> record.letter_annotations
{}

Most of the annotation information is recorded in the annotations dictionary, for example:
>>> len(record.annotations)
13
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'

All the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.
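As a minimal sketch of working with the features list, the following builds a small stand-in record by hand so that it runs without NC_005816.gb. It assumes Biopython is installed; the sequence, coordinates, and the gene name `geneA` are made up for illustration. The same loop and filter should apply to the `record` loaded from the GenBank file.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hand-built stand-in record (hypothetical data, not taken from NC_005816.gb).
demo = SeqRecord(Seq("ATGAAATAGCCCATGGGGTGA"), id="demo.1", name="demo")
demo.features.append(
    SeqFeature(FeatureLocation(0, 9), type="gene", qualifiers={"gene": ["geneA"]})
)
demo.features.append(
    SeqFeature(FeatureLocation(0, 9), type="CDS", qualifiers={"gene": ["geneA"]})
)
demo.features.append(SeqFeature(FeatureLocation(12, 21), type="gene"))

# Each entry in the features list is a SeqFeature with a type and a location.
summary = [
    (feature.type, int(feature.location.start), int(feature.location.end))
    for feature in demo.features
]
for feature_type, start, end in summary:
    print(feature_type, start, end)

# Keep only the CDS features, as one might for the genes on a plasmid.
cds_only = [feature for feature in demo.features if feature.type == "CDS"]
print(len(cds_only))
```

With the GenBank record from this section, replacing `demo` with `record` would list the features parsed from the file's feature table.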