Topic Name – Nucleic Acid Sequence
Databases
BASIC NUCLEIC ACID/NUCLEOTIDE SEQUENCE DATABASES
• GenBank, EMBL and DDBJ are the three primary nucleotide sequence databases.
• They include nucleic acid sequences submitted directly by scientists and genome sequencing groups, and sequences taken from literature and patents.
• The entries in the GenBank, EMBL and DDBJ databases are synchronized on a daily basis, and the accession numbers are managed in a consistent manner between these three centers.
GENBANK
• An annotated collection of all publicly available nucleotide sequences.
• The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), a federal agency of the US government.
• Maintained since 1992 by NCBI (Bethesda).
• www.ncbi.nlm.nih.gov/Genbank/
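As a rough illustration of how a GenBank entry can be requested programmatically, the sketch below builds a request URL for NCBI's E-utilities EFetch service. The slides do not mention E-utilities; the helper name genbank_record_url and the example accession are assumptions made for illustration, and the snippet only constructs the URL without contacting the server.

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str, rettype: str = "gb") -> str:
    """Build an EFetch URL for one nucleotide record (hypothetical helper).

    rettype "gb" requests the annotated GenBank flat file,
    "fasta" requests only the sequence.
    """
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Accession chosen only as an example.
url = genbank_record_url("U49845")
print(url)
```

Opening the printed URL in a browser, or passing it to an HTTP client, should return the flat-file record if the accession exists.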
EMBL NUCLEOTIDE SEQUENCE DATABASE
• An annotated collection of all publicly available nucleotide and protein sequences.
• Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.
• Maintained since 1994 by EBI, Cambridge.
• http://www.ebi.ac.uk/embl.html
DDBJ – DNA DATA BANK OF JAPAN
• An annotated collection of all publicly available nucleotide and protein sequences.
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by Takashi Gojobori.
• http://www.ddbj.nig.ac.jp
OTHER NUCLEOTIDE DATABASES
• UniGene www.ncbi.nlm.nih.gov/UniGene/
• The UniGene system attempts to process the GenBank sequence data into a non-redundant set of gene-oriented clusters.
• SGD genome-www.stanford.edu/Saccharomyces/
• The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.
• EBI Genomes www.ebi.ac.uk/genomes/
• This web site provides access and statistics for the completed genomes, and information about ongoing projects.
• Genome Biology www.ncbi.nlm.nih.gov/Genomes/
• The Genome Biology site at NCBI contains information about the available complete genomes.
• Ensembl www.ensembl.org
• Ensembl is a joint project between EMBL-EBI and the Sanger Centre to develop a software system which produces and maintains automatic annotation of eukaryotic genomes.
PROTEIN/PROTEOMICS DATABASES
• Protein Information Resource (PIR)
• UniProt - Protein Knowledge Database
• Pfam - Protein Family and Domain
• PROSITE - Protein Family and Domain
UNIPROT DATABASE
• The Swiss-Prot, TrEMBL, and PIR protein database activities have united to form the Universal Protein Resource (UniProt).
– UniProt Knowledgebase (UniProtKB): curated sequence information, annotations, linked to other databases.
– UniProt Reference Clusters (UniRef): removes sequence redundancy by merging sequences that are 100%, 90% and 50% identical; no annotations; linked to Knowledgebase and UniParc records.
– UniProt Archive (UniParc): history of sequences, no annotation, linked to source records.
UNIPROT SEQUENCE DATABASES
UniProt Archive (UniParc)
Stable, comprehensive, non-redundant collection of all protein sequences ever published.
Merged from PIR, SwissProt, TrEMBL, DDBJ/EMBL/GenBank proteins and proteomes, PDB, International Protein Index, RefSeq translations and other organism proteomes not yet in DDBJ/EMBL/GenBank.
UniProt Reference Clusters (UniRef)
Three non-redundant collections based on sequence similarity clusters.
• UniRef100 merges all identical sequences and identical overlapping subsequences into one entry.
• UniRef90 merges all protein sequence clusters with 90% sequence identity into a single entry.
• UniRef50 merges all protein sequence clusters with 50% sequence identity into a single entry.
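The idea of merging sequences at 100%, 90% and 50% identity can be sketched with a toy greedy clustering. This is a simplified illustration under stated assumptions, not the UniRef pipeline: identity is measured position by position without any alignment, and the function names and toy sequences are invented for the example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions, measured over the longer sequence.

    Toy measure: compares position by position, with no alignment.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy clustering keyed by representative sequence name.

    Sequences are visited longest first; each one joins the first
    representative it matches at or above the threshold, otherwise it
    becomes the representative of a new cluster.
    """
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Invented toy sequences, all ten residues long.
toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQR",  # identical to A
    "C": "MKTAYIAKQK",  # 9 of 10 positions match A
    "D": "MKTAYLLLLL",  # 5 of 10 positions match A
    "E": "GGGGGGGGGG",  # unrelated
}

for level in (1.0, 0.9, 0.5):
    print(level, cluster(toy, level))
```

With these toy sequences the number of clusters drops from four at 100% to three at 90% and two at 50%, which mirrors how UniRef50 is far more compact than UniRef100.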
UniProt Sequence Databases (cont.)
• UniProt Knowledgebase (UniProtKB)
• UniProt/SwissProt
– Manually curated, highly annotated sequences from SwissProt & PIRSF, including descriptions, taxonomy, citations, GO terms, motifs, functional and structural classifications, and residue-specific annotations including variations.
– Some automatic rule-based annotations including InterPro domains and motifs, PROSITE, PRINTS, ProDom, SMART, Pfam, PIRSF, Superfamily and TIGRFAMs classifications.
• UniProt/TrEMBL
– Automatically translated from genomes, including predicted as well as RefSeq genes.
– Automated rule-based annotations.
PROTEIN INFORMATION RESOURCE
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
• The Protein Information Resource (PIR) is an integrated public bioinformatics resource to support genomic, proteomic and systems biology research and scientific studies.
PFAM
Pfam is a database of curated protein families, each of which is defined by two alignments and a profile hidden Markov model (HMM).
In Pfam, the profile HMM is searched against a large sequence collection, based on UniProt Knowledgebase (UniProtKB), to find all instances of the family.
PROSITE DATABASE
PROSITE is a database of protein families and domains. It is based
on the observation that, while there is a huge number of different
proteins, most of them can be grouped, on the basis of similarities
in their sequences, into a limited number of families.
Proteins or protein domains belonging to a particular family
generally share functional attributes and are derived from a
common ancestor.
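One way PROSITE captures the shared sequence features of a family is with short patterns written in its own notation, a detail not covered in the text above. The sketch below translates such a pattern into a Python regular expression. It is a minimal illustration that handles only the common elements (x, [..], {..}, repeat counts and the < > anchors); the function name, the example pattern (written in the style of a C2H2 zinc-finger signature) and the test peptide are assumptions made for the example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (simplified)."""
    pattern = pattern.rstrip(".")            # patterns may end with a period
    parts = []
    for element in pattern.split("-"):       # elements are separated by "-"
        anchor_start = element.startswith("<")   # "<" anchors to the N-terminus
        anchor_end = element.endswith(">")       # ">" anchors to the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                      # x means any residue
            core = "."
        elif core.startswith("{"):           # {..} lists forbidden residues
            core = "[^" + core[1:-1] + "]"
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(parts)


# Example pattern, used here only for illustration.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)

# Invented peptide assembled so that it satisfies each pattern element in turn.
peptide = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(bool(re.search(regex, peptide)))
```

Scanning a sequence with the resulting expression reports whether the signature occurs, which is the kind of family membership test the description above refers to.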