UniProt-GOA README
------------------

1. Contents
------------

1.  Contents
2.  Introduction
3.  Differences in the UniProtKB gene association file from GO and GOA ftp sites
4.  List of files and file formats
5.  Assignment of GO terms to UniProtKB data
6.  Additional information on Manual Annotation in UniProt-GOA
7.  Addition of GO assignments from other data sources
8.  Further information on the PDB association file
9.  Contacts
10. Copyright Notice

2. Introduction
----------------

UniProt-GOA (UniProt GO Annotation) is a project run by the European Bioinformatics Institute that aims to provide assignments of proteins to the Gene Ontology (GO) resource. The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms, even while knowledge of gene product roles in cells is still accumulating and changing. In the UniProt-GOA project, this vocabulary is applied to all proteins described in the UniProt (Swiss-Prot and TrEMBL) Knowledgebase.

UniProt-GOA also provides species-specific annotation sets using the UniProtKB Complete Proteome sets, which have undergone additional filtering of electronic annotations to remove redundancy. Currently Arabidopsis thaliana, Gallus gallus, Bos taurus, Dictyostelium discoideum, Drosophila melanogaster, Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Canis familiaris, Caenorhabditis elegans, Saccharomyces cerevisiae and Sus scrofa datasets are available.

Additional, non-filtered species-specific sets are available from the proteomes sets, which include separate annotation files for all species whose genome has been fully sequenced, where the sequence is publicly available, and where the proteome contains >25% GO annotation.

UniProtKB manual GO annotations are created by UniProtKB curators from the EBI and the Swiss Institute of Bioinformatics.
The dataset is supplemented with manual GO annotation from external model organism databases: AgBase, BHF-UCL, CGD, DictyBase, Ensembl, FlyBase, GDB, GeneDB (S. pombe, P. falciparum), Gramene, HGNC, MGI, MTBbase, PAMGO, Reactome, RefGenome, RGD, Roslin, SGD, TAIR, TIGR, WormBase, ZFIN, the IntAct protein-protein interaction database, LIFEdb and the Proteome Inc. dataset (see section 7). The original source of an annotation is always indicated in column 15 ('assigned by') of an association file.

For manual annotation, curators aim to capture the most recent data from curated papers that provide experimental evidence for the unique features of a given protein. Our approach is protein-centric rather than paper-centric, as we do not read all papers that might be used to assign the same GO term. However, when a curator reads a paper whose experimental evidence further verifies a function, a redundant annotation to the same term is created using the additional reference, as this can give greater confidence in a GO annotation.

For further information please refer to our web site at:
    http://www.ebi.ac.uk/GOA

External Contributors to the UniProt-GOA Gene Association Files:

AgBase  http://www.agbase.msstate.edu
BHF-UCL  http://www.cardiovasculargeneontology.com
CGD  http://www.candidagenome.org
DictyBase  http://dictybase.org
EcoCyc  http://www.ecocyc.org
EcoWiki  http://ecowiki.org
Ensembl  http://www.ensembl.org
FlyBase  http://www.flybase.org
GDB (Human Genome Database)  http://www.gdb.org
GeneDB (S. pombe)  http://www.genedb.org/genedb/pombe
GeneDB (P. falciparum)  http://www.genedb.org/genedb/malaria
GOC (inferred annotations from GO OBO v1.2)  http://www.geneontology.org
Gramene  http://www.gramene.org
HGNC (HUGO Gene Nomenclature Committee)  http://www.gene.ucl.ac.uk/nomenclature
Human Protein Atlas  http://www.proteinatlas.org
IntAct  http://www.ebi.ac.uk/intact (see also section 7)
InterPro  http://www.ebi.ac.uk/interpro
LifeDB  http://www.lifedb.de
MGI (Mouse Genome Informatics)  http://www.informatics.jax.org
MTBbase  http://www.ark.in-berlin.de/Site/MTBbase.html
PAMGO project; Agrobacterium Genome Consortium  http://www.agrobacterium.org
Proteome Inc.  (see section 7)
Reactome  http://www.reactome.org
RefGenome (GO Consortium Reference Genomes project)  http://www.geneontology.org/GO.refgenome.shtml
RGD (Rat Genome Database)  http://rgd.mcw.edu
Roslin Institute  http://www.ri.bbsrc.ac.uk
SGD (Saccharomyces Genome Database)  http://www.yeastgenome.org
TAIR (The Arabidopsis Information Resource)  http://www.arabidopsis.org
TIGR (The Institute for Genomic Research)  http://www.tigr.org
WormBase  http://www.wormbase.org
ZFIN (Zebrafish Information Network)  http://zfin.org

3. Differences in the UniProtKB gene association file from GO and GOA ftp sites.
------------------------------------------------------------------------------

Please note that both the filtered and unfiltered versions of the GOA UniProtKB gene association file are available from the GO Consortium ftp site (ftp.geneontology.org). The filtered version does not contain annotations for those species where a different Consortium group is primarily responsible for providing GO annotations.

If you would like to download an unfiltered GOA UniProtKB gene association file, please use either the GOA ftp site:
    ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
or the submissions folder in the GO Consortium ftp site:
    ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_uniprot.gz

Species which are not present in the filtered version of the gene_association.goa_uniprot.gz file on the GO Consortium site include: Danio rerio, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, all rice species, Bacillus anthracis str.
Ames, Campylobacter jejuni RM1221, Candida albicans, Caenorhabditis elegans, Coxiella burnetii RSA 493, Dehalococcoides ethenogenes 195, Dictyostelium sp., Dictyostelium discoideum, Geobacter sulfurreducens PCA, Glossina morsitans morsitans, Leishmania major, Listeria monocytogenes str. 4b F2365, Methylococcus capsulatus str. Bath, Pseudomonas syringae pv. tomato str. DC3000, Plasmodium falciparum, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Shewanella oneidensis MR-1, Silicibacter pomeroyi DSS-3, Trypanosoma brucei and Vibrio cholerae O1 biovar eltor.

Further information on this filtering script can be found at:
    http://www.geneontology.org/GO.annotation.shtml#taxon

4. List of files and file formats
----------------------------------

The UniProt-GOA project produces the following gene association files:

i) gene_association.goa_uniprot

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_uniprot.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz

   This file contains all GO annotations for proteins in the UniProt KnowledgeBase (UniProtKB).

ii) gene_association.goa_human

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_human.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz

   * Please note that the human file on the ftp.geneontology.org site may contain newer annotations (up to two weeks different) than the human file on the ftp.ebi.ac.uk site listed above, due to a more regular release schedule of the human and chicken files for the GO Consortium's Reference Genomes Project.

   This file contains the GO assignments for the proteins of the human UniProtKB Complete Proteome.
iii) gene_association.goa_mouse

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_mouse.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/gene_association.goa_mouse.gz

   This file contains the GO assignments for the proteins of the mouse UniProtKB Complete Proteome.

iv) gene_association.goa_rat

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_rat.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/gene_association.goa_rat.gz

   This file contains the GO assignments for the proteins of the rat UniProtKB Complete Proteome.

v) gene_association.goa_arabidopsis

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_arabidopsis.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ARABIDOPSIS/gene_association.goa_arabidopsis.gz

   This file contains the GO assignments for the proteins of the Arabidopsis UniProtKB Complete Proteome.

vi) gene_association.goa_chicken

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_chicken.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/gene_association.goa_chicken.gz

   * Please note that the chicken file on the ftp.geneontology.org site may contain newer annotations (up to two weeks different) than the chicken file on the ftp.ebi.ac.uk site listed above, due to a more regular release schedule of the human and chicken files for the GO Consortium's Reference Genomes Project.

   This file contains the GO assignments for the proteins of the chicken UniProtKB Complete Proteome.

vii) gene_association.goa_cow

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_cow.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/COW/gene_association.goa_cow.gz

   This file contains the GO assignments for the proteins of the cow UniProtKB Complete Proteome.
viii) gene_association.goa_zebrafish

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_zebrafish.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ZEBRAFISH/gene_association.goa_zebrafish.gz

   This file contains the GO assignments for the proteins of the zebrafish UniProtKB Complete Proteome.

ix) gene_association.goa_dicty.gz

   Location:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/DICTY/gene_association.goa_dicty.gz

   This file contains the GO assignments for the proteins of the Dictyostelium discoideum UniProtKB Complete Proteome.

x) gene_association.goa_dog.gz

   Locations:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/DOG/gene_association.goa_dog.gz
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_dog.gz

   This file contains the GO assignments for the proteins of the Canis familiaris UniProtKB Complete Proteome.

xi) gene_association.goa_fly.gz

   Location:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/FLY/gene_association.goa_fly.gz

   This file contains the GO assignments for the proteins of the Drosophila melanogaster UniProtKB Complete Proteome.

xii) gene_association.goa_pig.gz

   Locations:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/PIG/gene_association.goa_pig.gz
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_pig.gz

   This file contains the GO assignments for the proteins of the Sus scrofa UniProtKB Complete Proteome.

xiii) gene_association.goa_worm.gz

   Location:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/gene_association.goa_worm.gz

   This file contains the GO assignments for the proteins of the Caenorhabditis elegans UniProtKB Complete Proteome.

xiv) gene_association.goa_yeast.gz

   Location:
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/YEAST/gene_association.goa_yeast.gz

   This file contains the GO assignments for the proteins of the Saccharomyces cerevisiae UniProtKB Complete Proteome.
xv) gene_association.goa_pdb

   Locations:
   ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_pdb.gz
   ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/PDB/gene_association.goa_pdb.gz

   This file contains the GO assignments for the PDB entries in the PDB database.

We comply with the annotation file format described by the Gene Ontology Consortium.
GO Consortium documentation: http://www.geneontology.org/GO.format.annotation.shtml

N.B. This readme describes the GAF 2.0 file format.

Since we deal with proteins rather than genes, the semantics of some fields in our files may be slightly different to other gene association files.

1. DB
   Database from which the annotated entry has been taken.
   For the UniProtKB and UniProtKB Complete Proteomes gene association files: UniProtKB
   For the PDB association file: PDB
   Example: UniProtKB

2. DB_Object_ID
   A unique identifier in the database for the item being annotated.
   Here: an accession number or identifier of the annotated protein (or PDB entry for the gene_association.goa_pdb file).
   For the UniProtKB and UniProtKB Complete Proteomes gene association files: a UniProtKB Accession.
   Example: O00165

3. DB_Object_Symbol
   A (unique and valid) symbol (gene name) that corresponds to the DB_Object_ID.
   An officially approved gene symbol will be used in this field when available. Alternatively, other gene symbols or locus names are applied. If no symbols are available, the identifier applied in column 2 will be used.
   Examples:
   G6PC
   CYB561
   MGCQ309F3

4. Qualifier
   This column is used for flags that modify the interpretation of an annotation.
   If not null, then values in this field can equal: NOT, colocalizes_with, contributes_to, NOT | contributes_to, NOT | colocalizes_with
   Example: NOT

5. GO ID
   The GO identifier for the term attributed to the DB_Object_ID.
   Example: GO:0005634

6. DB:Reference
   A single reference cited to support an annotation.
   Where an annotation cannot reference a paper, this field will contain a GO_REF identifier. See section 5 and http://www.geneontology.org/doc/GO.references for an explanation of the reference types used.
   Examples:
   PMID:9058808
   DOI:10.1046/j.1469-8137.2001.00150.x
   GO_REF:0000002
   GO_REF:0000020
   GO_REF:0000004
   GO_REF:0000003
   GO_REF:0000019
   GO_REF:0000023
   GO_REF:0000024
   GO_REF:0000033

7. Evidence
   One of EXP, IMP, IC, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, NR, ND or RCA.
   Example: TAS

8. With
   An additional identifier to support annotations using certain evidence codes (including IEA, IPI, IGI, IMP, IC and ISS evidences).
   Examples:
   UniProtKB:O00341
   InterPro:IPR001878
   RGD:123456
   CHEBI:12345
   Ensembl:ENSG00000136141
   GO:0000001
   EC:3.1.22.1

9. Aspect
   One of the three ontologies, corresponding to the GO identifier applied: P (biological process), F (molecular function) or C (cellular component).
   Example: P

10. DB_Object_Name
   Name of protein.
   The full UniProt protein name will be present here, if available from UniProtKB. If a name cannot be added, this field will be left empty.
   Examples:
   Glucose-6-phosphatase
   Cellular tumor antigen p53
   Coatomer subunit beta

11. Synonym
   Gene_symbol [or other text].
   Alternative gene symbol(s), IPI identifier(s) and UniProtKB/Swiss-Prot identifiers are provided pipe-separated, if available from UniProtKB. If none of these identifiers have been supplied, the field will be left empty.
   Examples:
   RNF20|BRE1A|IPI00690596|BRE1A_BOVIN
   IPI00706050
   MMP-16|IPI00689864

12. DB_Object_Type
   What kind of entity is being annotated.
   Here: protein (or protein_structure for the gene_association.goa_pdb file).
   Example: protein

13. Taxon_ID
   Identifier for the species being annotated.
   Example: taxon:9606

14. Date
   The date of last annotation update in the format 'YYYYMMDD'.
   Example: 20050101

15. Assigned_By
   Attribute describing the source of the annotation.
   One of UniProtKB, AgBase, BHF-UCL, CGD, DictyBase, EcoCyc, EcoWiki, Ensembl, FlyBase, GDB, GeneDB_Spombe, GeneDB_Pfal, GOC, GR (Gramene), HGNC, Human Protein Atlas, JCVI, IntAct, InterPro, LIFEdb, PAMGO_GAT, MGI, Reactome, RGD, Roslin Institute, SGD, TAIR, TIGR, ZFIN, PINC (Proteome Inc.) or WormBase.
   Example: UniProtKB

16. Annotation_Extension
   Contains cross-references to other ontologies/databases that can be used to qualify or enhance the GO term applied in the annotation.
   The cross-reference is prefaced by an appropriate GO relationship; references to multiple ontologies can be entered.
   Examples:
   part_of(CL:0000084)
   occurs_in(GO:0009536)
   has_input(CHEBI:15422)
   has_output(CHEBI:16761)
   has_participant(UniProtKB:Q08722)
   part_of(CL:0000017)|part_of(MA:0000415)

17. Gene_Product_Form_ID
   The unique identifier of a specific spliceform of the protein described in column 2 (DB_Object_ID).
   Example: O43526-2

5. Assignment of GO terms to UniProtKB data
------------------------------------------------------------

In this release, we have used the following twelve data source types (A to L) to assign GO terms to proteins.

A) PMID:nnnnnnnn
   All such annotations are manually curated and can use any of the available evidence codes except 'IEA' (see section 4). Curators have read the abstract or full paper with the PubMed identifier nnnnnnnn and assigned the GO terms manually.

B) Digital Object Identifiers (DOI:10.nnnn/*)
   All such annotations are manually curated and can use any of the available evidence codes except 'IEA' (see section 4). Curators have read the abstract or full paper with the DOI identifier and assigned the GO terms manually.

C) Reactome:REACT_nnnn
   All such annotations are manually curated by the Reactome team and apply the TAS evidence code. Reactome entries are curated (from published papers and expert knowledge), then peer reviewed by domain experts.

D) GO_REF:0000002
   Transitive assignment of GO terms based on InterPro classification.
   For any protein that has been annotated with one or more InterPro domains, the corresponding GO terms are obtained from a translation table of InterPro entries to GO terms (interpro2go) generated manually by the InterPro team at EBI.
   The mapping file is available at: http://www.geneontology.org/external2go/interpro2go

E) GO_REF:0000020
   GO terms are manually assigned to each HAMAP family rule. HAMAP family rules are a collection of orthologous microbial protein families, from bacteria, archaea and plastids, generated manually by expert curators. The assigned GO terms are then transferred to all the proteins that belong to each HAMAP family. Only GO terms from the molecular function and biological process ontologies are assigned. GO annotations using this technique will receive the evidence code Inferred from Electronic Annotation (IEA). These annotations are updated monthly by HAMAP and are available for download on both GO and GOA EBI ftp sites.
   HAMAP (High-quality Automated and Manual Annotation of Microbial proteins) is a project based at the Swiss Institute of Bioinformatics (Gattiker et al. 2003, Comp. Biol and Chem. 27: 49-58).
   For further information, please see: http://www.expasy.org/sprot/hamap

F) GO_REF:0000004
   Transitive assignment using Swiss-Prot keywords. This method is used for any database record that has one or more Swiss-Prot keywords assigned. Each keyword is mapped to the corresponding GO term in the spkw2go file, which was originally constructed manually by MGI curators and is now maintained by the GOA team at EBI.
   The mapping file is available at: http://www.geneontology.org/external2go/spkw2go

G) GO_REF:0000003
   Transitive assignment using Enzyme Commission identifiers. This method is used for any database entry, such as a protein record in Swiss-Prot or TrEMBL, that has had an Enzyme Commission number assigned. The corresponding GO term is determined using the EC cross-references in the GO molecular function ontology.
   Also see Hill et al., Genomics (2001) 74:121-128.
   The mapping file is available at: http://www.geneontology.org/external2go/ec2go

H) GO_REF:0000019
   GO terms from a source species are projected onto one or more target species based on gene orthology obtained from the Ensembl Compara system. Only one-to-one and apparent one-to-one orthologies are used, and only GO annotations with an evidence type of IDA, IEP, IGI, IMP or IPI are projected. Projected GO annotations using this technique receive the evidence code 'IEA' (Inferred from Electronic Annotation). The UniProtKB protein accession of the annotation source is indicated in the 'With' column of the GOA association file.

I) GO_REF:0000023
   Transitive assignment of GO terms based on Swiss-Prot Subcellular Location vocabulary annotation. The UniProt Consortium has developed a Subcellular Location vocabulary (SPSL) to annotate UniProt Knowledgebase entries (in CC_SUBC LOCATION lines). The UniProt-GOA curators at EBI have manually mapped this vocabulary to the GO cellular component ontology. This mapping file, spsl2go, is used to obtain corresponding GO terms for any UniProtKB entry that has SPSL annotation.
   The mapping file is available from: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/spsl2go

J) GO_REF:0000024
   Method for transferring manual annotations to an entry based on a curator's judgment of its similarity to a putative ortholog which has annotations with experimental evidence. Annotations are created when a curator judges that the sequence of a protein shows high similarity to another protein that has annotation(s) supported by experimental evidence (IDA, IGI, IMP, IPI or IEP). Annotations resulting from the transfer of GO terms display the 'ISS' evidence code and include an accession for the protein from which the annotation was projected in the 'with' field (column 8). This field can contain either a UniProtKB Accession or an IPI (International Protein Index) identifier.
   Further information on this method can be found at: http://www.ebi.ac.uk/GOA/ISS_method.html

K) GO_REF:0000033
   GO terms based on experimental data from the scientific literature are used to annotate ancestral genes in phylogenetic trees from the PANTHER database by sequence similarity (ISS), and unannotated descendants of these ancestral genes are inferred to have inherited these same GO annotations by descent. The annotations are done using a tool called PAINT (Phylogenetic Annotation and INference Tool).
   Further information on this method can be found at: http://gocwiki.geneontology.org/index.php/PAINT

L) Source='GOC'
   Annotations automatically generated using the Molecular Function->Biological Process inter-ontology relationships present in the GO OBO v1.2 format. As many GO users do not currently reason over these relationships, a set of inferred annotations is being generated. Such GO annotations are produced when an annotation has been made (either manually or electronically) to a Molecular Function term that, either directly or via one of its parent terms, has a relationship to a Biological Process term, and where this Process term (or one of its children) has not already been used in the annotation set for the same gene product identifier. Each inferred annotation applies the same gene product identifier, reference and evidence code as the asserted function annotation. Inferred annotations are generated from all sources of GO annotations; only 'NOT'-qualified annotations are excluded.

6. Additional information on Manual Annotation in UniProt-GOA
-----------------------------------------------------

For information on manual annotation guidelines and the usage of manual evidence codes please see:
    http://www.geneontology.org/GO.annotation.html
    http://www.geneontology.org/GO.evidence.html

Usage of the ISS code within UniProt-GOA

There are three ways in which a curator can use the ISS evidence code:

1.
If a curator reads a paper that provides functional information for a protein and also states an orthology between it and another protein, then manual annotation can be transferred to the ortholog. The ortholog's annotation will contain the evidence code 'ISS', and the original literature identifier is displayed in the DB:reference field (column 6). In the ortholog's annotation, any information previously held in the 'with' column (column 8) of the original protein's annotation is replaced by the UniProtKB accession of the original protein. This allows the source of the 'ISS' annotation to be traced.

2. If a curator is confident that a protein shows high similarity to another protein (e.g. from using BLAST) and it seems reasonable to infer that the two proteins have a common function, then manual annotation can be transferred to an ortholog. The ortholog's annotation will contain the evidence code 'ISS', an accession for the protein from which the annotation was projected will be present in the 'with' field (column 8), and the reference field (column 6) will contain GO_REF:0000024.
   Further information on this method can be found at: http://www.ebi.ac.uk/GOA/ISS_method.html

3. If sequence similarity and functional information are reported in two different papers, then the primary annotation can be transferred to an ortholog. The ortholog's annotation will contain the evidence code 'ISS', the identifier of the paper which describes the sequence similarity is displayed in the DB:reference field (column 6), and the 'with' column (column 8) of the ortholog's annotation contains the original entry's accession number in place of whatever the original entry's 'with' column held. This allows the source of the annotation to be traced.

N.B. For all of the methods described above, only annotations that have an experimental evidence code (IDA, IEP, IGI, IMP or IPI) can be further transferred to other proteins.
In addition, annotations having the 'NOT' qualifier cannot be transferred by ISS.

7. Addition of GO assignments from other data sources
-------------------------------------------------------

The UniProt-GOA dataset has also been supplemented with the last (2001) public release of manual annotation from Proteome Incorporated. The replacement of this subset with more up-to-date and detailed GO annotation is one of UniProt-GOA's priorities.

UniProt-GOA has integrated annotations from the EBI's IntAct protein-protein interaction database. Only those interactions which are of high enough quality to be integrated into the UniProt database have been included (quality is judged by experimental method type). All GO terms in these annotations are children of the protein binding term (GO:0005515); the annotations use the 'IPI' evidence code, with the sequence identifier of the protein's binding partner in column 8 ('with').

8. Further information on the PDB association file
----------------------------------------------------

The 'gene_association.goa_pdb' gene association file provided by the UniProt-GOA group contains GO assignments to PDB entries. In this file PDB entries are only assigned GO terms based on matching InterPro domains.

9. Contacts
-----------

Please direct any questions to goa@ebi.ac.uk
We welcome any feedback.

10. Copyright Notice
--------------------

UniProt-GOA - UniProt GO Annotation
Copyright 2012 (C) The European Bioinformatics Institute.
This README and the accompanying databases may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

$Date: 2012/01/06 10:31:37 $
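
Appendix: illustrative parsing sketch

As an illustration of the 17 GAF 2.0 columns listed in section 4 and the ISS transfer restrictions given in section 6, the Python sketch below splits one annotation line into named fields and checks whether it could be a source for ISS transfer. It is a minimal sketch, not a UniProt-GOA tool. Assumptions not stated in this README: fields are tab-delimited and header/comment lines start with '!' (both per the GO Consortium format documentation); the function and field names (parse_gaf_line, read_gaf, is_iss_transferable) are hypothetical; the sample record is assembled from the example values in section 4 and is not a real annotation.

```python
# Illustrative sketch only: these names are hypothetical, not part of any
# UniProt-GOA or GO Consortium software.

# Column names follow the order of columns 1-17 described in section 4.
GAF_COLUMNS = (
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon_id", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
)

# Experimental evidence codes that section 6 allows to be transferred by ISS.
EXPERIMENTAL_CODES = frozenset({"IDA", "IEP", "IGI", "IMP", "IPI"})


def parse_gaf_line(line):
    """Return a dict mapping the 17 column names to the fields of one line.

    Assumes tab-delimited fields; empty columns are kept as empty strings.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(GAF_COLUMNS), len(fields))
        )
    return dict(zip(GAF_COLUMNS, fields))


def read_gaf(lines):
    """Yield parsed records, skipping blank lines and '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        yield parse_gaf_line(line)


def is_iss_transferable(record):
    """True if the record has experimental evidence and no NOT qualifier."""
    qualifiers = {q.strip() for q in record["qualifier"].split("|")}
    return record["evidence"] in EXPERIMENTAL_CODES and "NOT" not in qualifiers


# Made-up record built from the example values in section 4 (not real data).
SAMPLE = "\t".join([
    "UniProtKB", "O00165", "G6PC", "", "GO:0005634", "PMID:9058808", "IDA",
    "", "C", "Glucose-6-phosphatase", "", "protein", "taxon:9606",
    "20050101", "UniProtKB", "", "",
])
```

To read a downloaded file, one could pass an open text handle to read_gaf, for example one obtained from gzip.open(path, "rt") for the .gz files listed in section 4, and filter the resulting records on fields such as "assigned_by" or "evidence".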