UniProt-GOA multi-species files README
--------------------------------------

1. Contents
-----------

1. Contents
2. Introduction
3. Data types
4. List of files and their included data types
5. Contacts
6. Copyright Notice

2. Introduction
---------------

The UniProt GO annotation project at the European Bioinformatics Institute
aims to provide assignments of gene products to the Gene Ontology (GO)
resource. The goal of the Gene Ontology Consortium is to produce a dynamic
controlled vocabulary that can be applied to all organisms, even while the
knowledge of the gene product roles in cells is still accumulating and
changing. In the UniProt-GOA project, this vocabulary is applied to all
proteins described in the UniProt (Swiss-Prot and TrEMBL) Knowledgebase.

For full information on the UniProt-GOA project, please go to:
http://www.ebi.ac.uk/GOA

This readme describes the formats and contents of the UniProt multi-species
files.

3. Data types
-------------

a) DB

    Database from which the annotated entity has been taken.

    Examples:
        UniProtKB
        PDB

b) DB_Object_ID

    A unique identifier in the database for the item being annotated.

    Examples:
        O00165
        10GS_B

c) DB_Object_Symbol

    A unique and valid symbol (gene name) that corresponds to the
    DB_Object_ID. An officially approved gene symbol will be used in this
    field when available. Otherwise, other gene symbols or locus names are
    applied. If no symbols are available, the DB_Object_ID will be used.

    Examples:
        G6PC
        CYB561
        MGCQ309F3
        10GS_B

d) Qualifier

    In the gene_association file format, this column is used for flags that
    modify the interpretation of an annotation. The values that may be
    present in this field are: NOT, colocalizes_with, contributes_to,
    NOT|contributes_to, NOT|colocalizes_with.

    In the gp_association file format, this column is used for explicit
    relations between the protein and the GO term. An entry in this column
    is required in this file format.
    The default relations are part_of (for Cellular Component), involved_in
    (for Biological Process) or enables (for Molecular Function). Other
    values that may be present in this field are: colocalizes_with and
    contributes_to. Any of these relations can be additionally qualified
    with 'NOT'.

    Example:
        NOT|involved_in

e) GO ID

    The GO identifier for the term attributed to the DB_Object_ID.

    Example:
        GO:0005634

f) DB:Reference

    A single reference cited to support an annotation. Where an annotation
    cannot reference a paper, this field will contain a GO_REF identifier.
    See http://www.geneontology.org/doc/GO.references for an explanation of
    the reference types used.

    Examples:
        PMID:9058808
        DOI:10.1046/j.1469-8137.2001.00150.x
        GO_REF:0000002
        GO_REF:0000020

g) Evidence Code

    In the gene_association file format, this column is used for one of the
    evidence codes supplied by the GO Consortium
    (http://www.geneontology.org/GO.evidence.shtml).

    Example:
        IDA

    In the gp_association file format, this column is used for identifiers
    from the Evidence Code Ontology
    (http://evidenceontology.googlecode.com/svn/trunk/eco.obo).

    Example:
        ECO:0000320

h) With (or) From

    Additional identifier(s) to support annotations using certain evidence
    codes (including IEA, IPI, IGI, IMP, IC and ISS).

    Examples:
        UniProtKB:O00341
        InterPro:IPR001878
        RGD:123456
        CHEBI:12345
        Ensembl:ENSG00000136141
        GO:0000001
        EC:3.1.22.1

i) Aspect

    One of the three ontologies, corresponding to the GO identifier applied:
    P (biological process), F (molecular function) or C (cellular
    component).

    Example:
        P

j) DB_Object_Name

    The full UniProt protein name will be present here, if available from
    UniProtKB. If a name cannot be added, this field will be left empty.

    Examples:
        Glucose-6-phosphatase
        Cellular tumor antigen p53
        Coatomer subunit beta

k) DB_Object_Synonym

    Alternative gene symbol(s) or UniProtKB identifiers are provided
    pipe-separated, if available from UniProtKB.
    If none of these identifiers have been supplied, the field will be left
    empty.

    Examples:
        RNF20|BRE1A|BRE1A_BOVIN
        MMP-16

l) DB_Object_Type

    The kind of entity being annotated.

    Examples:
        protein
        protein_structure

m) Taxon

    Identifier for the species being annotated or the gene product being
    defined. In the gene_association file format, an interacting taxon ID
    (see n) below) may be included in this column, using a pipe to separate
    it from the primary taxon ID.

    Example:
        taxon:9606

n) Interacting_Taxon_ID

    This field is only supplied by the gp_association.goa_uniprot file and
    has been separated from the dual taxon ID format allowed in the
    gene_association.goa_uniprot file. It identifies the other organism
    involved in a multi-species interaction. An interacting taxon identifier
    can only be used in conjunction with terms that have the biological
    process term 'GO:0051704; multi-organism process' or the cellular
    component term 'GO:0043657; host cell' as an ancestor.

    For further information please see:
    http://www.geneontology.org/GO.annotation.conventions.shtml#interactions

    Example:
        taxon:9606

o) Date

    The date of last annotation update, in the format 'YYYYMMDD'.

    Example:
        20050101

p) Assigned_By

    Attribution for the source of the annotation.

    Examples:
        UniProtKB
        AgBase

q) Annotation_Extension

    Contains cross-references to other ontologies/databases that can be used
    to qualify or enhance the GO term applied in the annotation. Each
    cross-reference is prefaced by an appropriate GO relationship;
    references to multiple ontologies can be entered as linked (comma
    separated) or independent (pipe separated) statements.

    Examples:
        part_of(CL:0000084)
        occurs_in(GO:0009536)
        has_input(CHEBI:15422)
        has_output(CHEBI:16761)
        has_regulation_target(UniProtKB:P12345)|has_regulation_target(UniProtKB:P54321)
        part_of(CL:0000017),part_of(MA:0000415)

r) Gene_Product_Form_ID

    The unique identifier of a specific splice form of the DB_Object_ID.

    Example:
        O43526-2

s) Annotation_Properties

    This column is reserved for internal use; it will not be populated in
    public files.

t) Parent_Object_ID

    This field supplies the relationship between the DB_Object_ID and the
    canonical UniProtKB accession number, where the DB_Object_ID is an
    isoform identifier.

    Example:
        UniProtKB:P21678

u) DB_Xref(s)

    This field supplies alternative identifiers (cross-references) for the
    DB_Object_ID. This field will not be populated in the UniProt-GOA files.

v) Gene_Product_Properties

    This field can be populated with information concerning the
    DB_Object_ID. The field is a pipe-separated list of
    "property_name=property_value" pairs. There is a controlled vocabulary
    for the property names. The UniProt-GOA files will use this field to
    indicate:

    i) DB_Subset

        The database subset from which the entity being described has been
        taken. This information will only be supplied for UniProtKB, where
        this field will be one of Swiss-Prot or TrEMBL.

        Examples:
            db_subset=Swiss-Prot
            db_subset=TrEMBL

    ii) Annotation_Target_Set

        A description of the list in which the protein has been included for
        prioritized annotation.

        Examples:
            target_set=BHF-UCL
            target_set=KRUK
            target_set=ReferenceGenome

    iii) GO_Annotation_Complete

        The date when a curator has indicated that the protein's GO
        annotation record was comprehensively curated.

        Example:
            go_annotation_complete=20080131

4. List of files and their included data types
----------------------------------------------

All files for the UniProt dataset are gzipped to reduce their size and are
located at ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT.
The files currently produced are:

i) gene_association.goa_uniprot

    This file contains all GO annotations and protein information for
    proteins in the UniProt KnowledgeBase (UniProtKB). If a particular
    protein accession is not annotated with GO, then it will not appear in
    this file.

    The file is provided in GAF2.0 format, which contains the following
    columns:

    Column No.   Contents
     1.          DB
     2.          DB_Object_ID
     3.          DB_Object_Symbol
     4.          Qualifier
     5.          GO_ID
     6.          DB:Reference
     7.          Evidence Code
     8.          With (or) From
     9.          Aspect
    10.          DB_Object_Name
    11.          DB_Object_Synonym
    12.          DB_Object_Type
    13.          Taxon
    14.          Date
    15.          Assigned_By
    16.          Annotation_Extension
    17.          Gene_Product_Form_ID

ii) gene_association.goa_ref_uniprot (based on UniProtKB reference
    proteomes)

    This file contains all GO annotations and protein information for
    species' subsets of proteins in the UniProt KnowledgeBase (UniProtKB).
    If a particular protein accession is not annotated with GO, then it will
    not appear in this file.

    The file is provided in GAF2.0 format exactly as described in i) above.

iii) gp_association.goa_uniprot

    This file provides the minimal information required to describe a GO
    annotation. If a particular protein accession is not annotated with GO,
    then it will not appear in this file.

    The file is provided in GPAD1.1 format, which contains the following
    columns:

    Column No.   Contents
     1.          DB
     2.          DB_Object_ID
     3.          Qualifier
     4.          GO_ID
     5.          DB:Reference
     6.          Evidence Code
     7.          With (or) From
     8.          Interacting_Taxon_ID
     9.          Date
    10.          Assigned_By
    11.          Annotation_Extension
    12.          Annotation_Properties

iv) gp_information.goa_uniprot

    This file supplies additional information on the proteins
    (DB_Object_ID) that are provided with GO annotations in the
    gp_association.goa_uniprot file. Protein accessions are represented in
    this file even if there is no associated GO annotation.

    The file is provided in GPI1.1 format, which contains the following
    columns:

    Column No.   Contents
     1.          DB_Object_ID
     2.          DB_Object_Symbol
     3.          DB_Object_Name
     4.          DB_Object_Synonym
     5.          DB_Object_Type
     6.          Taxon
     7.          Parent_Object_ID
     8.          DB_Xref(s)
     9.          Gene_Product_Properties

v) gp_association.goa_ref_uniprot (based on UniProtKB reference proteomes)

    This file provides the minimal information required to describe a GO
    annotation. If a particular protein accession is not annotated with GO,
    then it will not appear in this file.

    The file is provided in GPAD1.1 format exactly as described in iii)
    above.

vi) gp_information.goa_ref_uniprot (based on UniProtKB reference proteomes)

    This file supplies additional information on the proteins
    (DB_Object_ID) that are provided with GO annotations in the
    gp_association.goa_ref_uniprot file. Protein accessions are represented
    in this file even if there is no associated GO annotation.

    The file is provided in GPI1.1 format exactly as described in iv) above.

5. Contacts
-----------

Please direct any questions to goa@ebi.ac.uk
We welcome any feedback.

6. Copyright Notice
-------------------

UniProt-GOA - GO Annotation@EBI

Copyright 2013 (C) The European Bioinformatics Institute.
This README and the accompanying databases may be copied and redistributed
freely, without advance permission, provided that this copyright statement
is reproduced with each copy.
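
Illustrative example (not part of the official distribution): the Python
sketch below shows one way the 17 GAF2.0 columns listed in section 4 i)
could be read into named fields, and how a Gene_Product_Properties value
could be split into its "property_name=property_value" pairs. It assumes
that columns are tab-separated and that header/comment lines start with
"!"; neither is stated in this README. The function names parse_gaf_line
and parse_properties are hypothetical, and the sample row is assembled from
unrelated example values in section 3, so it is not a real annotation.

```python
# Column names in the order given for gene_association.goa_uniprot (GAF2.0).
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB:Reference", "Evidence Code", "With (or) From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_By", "Annotation_Extension", "Gene_Product_Form_ID",
]

# Columns that this README describes as holding pipe-separated values.
PIPE_SEPARATED = {
    "Qualifier", "DB_Object_Synonym", "Taxon", "Annotation_Extension",
}


def parse_gaf_line(line):
    """Return a dict for one data row, or None for a blank or '!' line.

    Assumption: fields are tab-separated and comment lines start with '!'.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    # Pad short rows so that every column name gets a value.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    record = {}
    for name, value in zip(GAF_COLUMNS, fields):
        if name in PIPE_SEPARATED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record


def parse_properties(value):
    """Split 'name=value|name=value' (Gene_Product_Properties) into a dict."""
    return dict(item.split("=", 1) for item in value.split("|") if item)


# Made-up row built from example values in section 3 (not a real annotation).
sample = "\t".join([
    "UniProtKB", "O00165", "G6PC", "NOT|colocalizes_with", "GO:0005634",
    "PMID:9058808", "IDA", "", "C", "Glucose-6-phosphatase",
    "RNF20|BRE1A", "protein", "taxon:9606", "20050101", "UniProtKB",
    "part_of(CL:0000084)", "",
])
record = parse_gaf_line(sample)
print(record["DB_Object_ID"], record["GO_ID"], record["Qualifier"])
print(parse_properties("db_subset=Swiss-Prot|target_set=BHF-UCL"))
```

Because the distributed files are gzipped, a real file could be read line
by line with Python's gzip.open(path, "rt") and each line passed to
parse_gaf_line; rows for which it returns None can be skipped.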