All genomes sequenced at TIGR are initially assigned
annotation through an automated process. Names and functional
annotation are then manually
curated. As direct experimental evidence rarely exists for each
gene in a sequenced organism, name assignments are usually based on
sequence similarity. Therefore, all TIGR name assignments
should be regarded as provisional. We strive to annotate each
gene with as much information as we can confidently impart, but are
also wary of inferring too much from sequence similarity. We
prefer to err on the side of caution and we have devised a nomenclature
scheme that reflects our degree of confidence in a particular
assignment.
We encourage feedback from the community to help identify errors or to
provide suggestions to improve the annotation of our genes.
Information used during manual curation
- Pairwise search results
Protein translations of all genes are searched against a non-redundant
amino acid database to generate a file of pairwise
alignments. Matches to experimentally characterized
proteins are given special consideration.
- HMM matches
Protein translations of all genes are searched against Hidden Markov
Models (HMMs) built at TIGR (TIGRFAMs) and at Sanger
(Pfams). HMMs are
statistical models built from multiple alignments of proteins which
share sequence similarity. TIGR classifies HMMs into more than a
dozen 'isology' types, each of which represents a different degree of
confidence about function.
- Paralogous families
Each predicted protein translation is searched against the complete
protein set for a genome to identify protein families found within the
organism.
- Biologically significant motifs and sites
Protein translations of all genes are searched against PROSITE for biologically
significant patterns. Potential transmembrane domains are
predicted by TMHMM.
For enzymes, curators review active site information from matches in SwissProt, MEROPS, and other databases.
- Gene context
Physical location of a gene within a gene cluster or putative operon
can be significant in some assignments, particularly for transporters or
enzymes involved in biosynthetic or metabolic pathways.
- Genome Properties
Each of TIGR's Genome Properties comprises a suite of genes that function in a
known metabolic pathway, cellular activity, or cellular
structure. Genes are evaluated by HMM matches and context-based
rules, and assigned to the appropriate Genome Property.
Levels of Database Match
Descriptors for annotation include at minimum a common name, role
category, and Gene
Ontology (GO) 'function' and 'process' terms for each gene, and may
also include a gene
symbol, an Enzyme Commission number, and public comments.
Each gene is assigned as many descriptors as are
relevant. In
the course of reviewing data we have developed the following criteria
regarding assignments.
- Specific function: indicated by a specific
name and gene symbol
- The protein translation has a good database match to a
protein whose function and process have been experimentally
characterized. Both pairwise and multiple protein sequence alignments
reveal a high degree of identity/similarity (typically >35%
identity) along the entire
length of the protein. There may be an essentially full-length
match to a highly specific (i.e.,
'equivalog' isology type)
HMM. Active sites, substrate or cofactor binding
sites, or motifs that are characteristic of a protein should be
conserved. Strong conservation of gene context (e.g. operon structure) is also
taken as evidence for certain function. For genes with a certain
function we use the most widely-recognized name and gene symbol. Highly
specific GO function and process terms are used if available. Enzymes
of certain function are annotated with their full IUBMB number; the
IUBMB enzyme name may also be used for clarity
if it seems more informative than its common name.
- Likely (or unlikely) function: indicated by
"putative" or "homolog" in the name
- If one or more lines of evidence are weak, but most of
the data agree, we conclude the gene is likely performing the function
the name implies, and the name is preceded with "putative". In such
cases, the percent identity (e.g.,
30-35% identity)
or HMM score (score is between the trusted and noise cutoffs) is not
quite high enough
to impart certainty. GO terms may be more general than for gene models
of
certain function, and with few exceptions, gene symbols are not used.
For an enzyme with a putative specific function partial IUBMB numbers
may be used.
- When there are strong lines of conflicting evidence, we
consider the function indicated by the common name to be unlikely, and add 'homolog' to
the common name. Such an assignment
can arise in two situations. In the first situation, sequence
homology is very
strong, but unlike a ‘putative’ match, we do NOT
believe the query protein has the same function as the match. This
might be because some critical piece of evidence is absent (e.g.,
non-conservation of catalytic residues in an enzyme), or because the
function is not predicted to exist in this particular organism (e.g.,
photosynthetic enzyme matches in a non-photosynthetic organism).
In the second situation, there is an essentially full-length
match to a set of genes whose names are the same or similar, at least
one of which has some
experimental characterization, but because the sequence conservation (e.g., 25-30% identity) falls below
even the 'putative' range, we consider functional conservation to be
unlikely. Furthermore, there are no family or domain names
available. In this case we use the matching proteins' name but
add 'homolog' to it, and apply descriptors appropriate for a protein of
unknown function.
- Note that while using 'homolog' to denote non-conserved
function in high-quality matches has been a long-time practice of TIGR
annotators, using
'homolog' to retain the names of lower-quality matches that might
otherwise be called 'conserved
hypothetical proteins' is a relatively recent practice. Also, the
criteria for 'putative' annotation
have been made more rigorous. Therefore, it is likely that some
older
gene models that were called 'putative' or 'conserved hypothetical'
would be called 'homolog' by
the newer naming criteria.
- Generic function: indicated by protein family name or
domain name.
- When the best (or only) annotation evidence indicates
membership in a defined family,
but does not justify more specific naming, we use family names defined
by a TIGR or Pfam HMM(s), curated databases such as SwissProt, or in
the literature, e.g.,
"carbohydrate kinase, FGGY family".
- When the extent of sequence homology is limited to a
defined protein domain (usually modelled as an HMM), rather than a
defined family or full-length characterized protein, we may use the
domain name, e.g., "ABC1
domain protein". Since domains are themselves often used to define a
family in the
literature, the distinction between family and domain based names is
not rigid.
- Note that the cellular function or process associated
with a protein family or domain may be experimentally defined to some
degree; alternatively, the family or domain may be functionally
uncharacterized, in which case the family
or domain name connotes nothing more than sequence-based
homology. The
degree of functional characterization of the family or domain will be
reflected in the degree of specificity of the role categories, enzyme
number, and GO term descriptors assigned to the gene model. Thus, when
the family or domain
has no defined function, the family/domain name is used but the GO
terms and role category are the same as for a protein of unknown
function (see below).
- Unknown function: conserved hypothetical proteins,
conserved domain proteins, proteins of unknown function, and
hypothetical proteins
- When the best evidence associated with a protein
translation consists only of full-length matches to conceptual
translations in other species -- i.e., there
are no experimentally characterized matches above ~25% identity, no
trustworthy HMM matches, no family or domain or gene context based
names that can be reasonably derived from the evidence -- the gene
model is annotated as a "conserved hypothetical protein". These are
assigned the most general GO terms. (An exception to this naming
practice is made when there is a match to a lipoprotein motif or
detection of substantial hydrophobic regions; these are called
'putative lipoprotein' and 'putative membrane protein', respectively.)
When the conditions for "conserved hypothetical" apply, but the
conserved region is considerably less than the full length of the
matches, the gene model is named a "conserved domain protein" instead,
but again given the most general GO terms. Note that when matches are
only to genes from different strains of the same species, the gene
model is considered a "hypothetical protein" (see below).
- Where good matches exist to proteins that are likely to
be real and have been assigned a gene symbol, but for which there are
no clues about function -- e.g.,
the match has been sequenced and shown to be translated in vitro or in vivo, and may even be part of an
operon, but no function or genetic effect has yet been determined for
it -- the gene model is considered likely to be a real protein of
unknown function. We assign the current common name and gene symbol to
our gene model, but still use the most general GO terms. As indicated
above under "Generic function", proteins of unknown function may have
functionally uninformative HMMs associated with them, which may also
serve as a basis for names, gene symbols, and general GO terms.
- When a gene model is identified by the gene-finding
algorithm but has no significant sequence similarity to any
characterized or uncharacterized genes from another species or to any
defined HMMs, nor an informative gene context, we consider it dubious
and name it a "hypothetical protein". Such gene models do not get GO
terms or any other descriptors.
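The tiers above can be summarized in a short sketch. It is only an illustration under stated assumptions, not TIGR's actual pipeline: the function names are hypothetical, the identity boundaries are the example figures quoted in the text (>35%, 30-35%, 25-30%, below ~25%), and placing "homolog" after the name is an assumption. Curators weigh many lines of evidence together (HMM isology type, active sites, gene context), so no single number decides a name in practice.

```python
from typing import Optional


def identity_tier(percent_identity: float) -> str:
    """Illustrative tier for a full-length match to a characterized protein.

    Boundaries are the example values from the text, not hard rules.
    """
    if percent_identity > 35:
        return "specific"
    if percent_identity >= 30:
        return "putative"
    if percent_identity >= 25:
        return "homolog"
    return "conserved hypothetical"


def hmm_tier(score: float, trusted_cutoff: float,
             noise_cutoff: float) -> Optional[str]:
    """Illustrative reading of an HMM score against its two cutoffs."""
    if score >= trusted_cutoff:
        return "trusted"
    if score > noise_cutoff:
        # Between the noise and trusted cutoffs: the 'putative' range.
        return "putative"
    # At or below the noise cutoff: not treated as evidence in this sketch.
    return None


def provisional_name(base_name: str, percent_identity: float) -> str:
    """Build a common name from the identity tier alone (a simplification)."""
    tier = identity_tier(percent_identity)
    if tier == "specific":
        return base_name
    if tier == "putative":
        return f"putative {base_name}"
    if tier == "homolog":
        # Assumption: 'homolog' is appended after the matching protein's name.
        return f"{base_name} homolog"
    return "conserved hypothetical protein"
```

For example, `provisional_name("thymidylate synthase", 32)` yields a "putative" name, while the same match at 26% identity yields a "homolog" name. The sketch ignores the first 'homolog' situation (strong similarity with conflicting evidence), which cannot be derived from percent identity.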
Alternative gene structures (functional)
- programmed frameshifts
- When a gene model contains an in-frame termination
codon and a naturally-occurring frameshift prior to the
termination codon regulates translation of the gene model, we add
"programmed frameshift" to the common name.
- selenocysteine-containing proteins
- In certain organisms the 'stop' codon TGA encodes the
amino acid selenocysteine. The genome must contain a
selenocysteine-tRNA and the enzyme selenide, water dikinase. Proteins
that meet these criteria have 'selenocysteine-containing' added to
their common names.
- intein- or intron-containing proteins
- When a gene model contains an intein -- a segment of a
protein that is able to excise itself and rejoin the remaining portions
(the exteins) with a peptide bond -- we add "intein-containing" to the
common name. Inteins are also known as "protein introns".
Disrupted genes (nonfunctional)
- authentic frameshifts/authentic point mutations
- When a gene model is disrupted by either a single
frameshift or point mutation, confirmed by manual review of the
genome assembly, we simply add "authentic frameshift" or "authentic
point mutation" to the common name.
- multiple/mixed frameshifts and point mutations:
indicated by 'degenerate'
- When a gene model is disrupted by multiple frameshifts
or a mixture of frameshifts and point mutations we assume that the gene
model is not functionally expressed and we denote this with the term
"degenerate" after the common name.
- interruptions
- Interruptions are cases in which conserved N- and
C-terminal portions of a gene model are separated by some
other untranslated sequence, such as a transposon. Such gene models are
split into two parts with the same common name, with "interruption-N"
and "interruption-C" added for the part N-terminal and C-terminal to
the interruption, respectively.
- truncations, fragments, and internal deletions
- When a significant segment of the gene model is missing
from the N- or C-terminal end -- enough so that we believe that it is
no longer
functionally expressed -- we add 'truncation' to the common name.
- In contrast to truncations, where either the N- or
C-terminal region is present, fragments are coterminous only with an
internal region of a conserved gene; they lack both the N- and the
C-terminal regions of the match. For such gene models we add 'fragment'
to the common name.
- An internal deletion is the opposite of a fragment: the
N- and C-terminal regions are conserved, but the region between them
has been deleted. Internal
deletions tend to be shorter than interruptions, but are long enough
that we expect the deletion to impair function. We denote
them by adding 'internal deletion' to the common name.
- fusions
- Two proteins which have been fused into one reading
frame by an event which deleted a C-terminal portion of protein 1 and
an N-terminal portion of protein 2 are denoted by 'fusion' in the
common name.
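The distinction between truncations, fragments, and internal deletions depends on which regions of the matched protein are still present in the gene model. The following minimal sketch makes that logic explicit. The function names and boolean inputs are hypothetical, and the comma used to join a label to the common name is an assumption, since the text does not specify a separator. Deciding whether a missing segment is "significant" remains a curator's judgment and is not modeled here.

```python
from typing import Optional, Tuple


def disruption_label(has_n_term: bool, has_internal: bool,
                     has_c_term: bool) -> Optional[str]:
    """Pick a label from which regions of the matched protein are conserved."""
    if has_n_term and has_c_term:
        # Both ends conserved: intact unless the middle has been lost.
        return None if has_internal else "internal deletion"
    if has_n_term or has_c_term:
        # Exactly one end is present, the other is missing.
        return "truncation"
    # Neither end is present: only an internal region matches, if anything.
    return "fragment" if has_internal else None


def interruption_names(common_name: str) -> Tuple[str, str]:
    """Names for the two parts of a gene model split by an interruption."""
    # Assumption: the label is joined to the common name with a comma.
    return (f"{common_name}, interruption-N",
            f"{common_name}, interruption-C")
```

For example, a gene model that matches only the middle of a conserved protein gets `disruption_label(False, True, False)`, which returns `"fragment"`.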