Samifier
========

Tools to enable a nexus between proteomic and genomic analysis.

See https://github.com/IntersectAustralia/ap11_samifier/wiki for the building, deployment and user guides.

For scientific background, see http://intersectaustralia.github.com/ap11/.

The code is licensed under the GNU GPL v3 license; see LICENSE.txt.

The documentation (contained in the GitHub wiki and this README) is licensed under [Creative Commons Attribution-Share Alike](http://creativecommons.org/licenses/by-sa/2.5/au/).

Building
========

    $ ant dist

This builds 4 command line tools and 2 helpers (undocumented). The following sections briefly describe each tool's parameters. We encourage users to download the user guide and read it.

Samifier
========

Converts a search result from the Mascot protein search engine (or compatible) into SAM format, so it can be displayed in a genomics viewer.

    $ java -jar dist/samifier.jar
    usage: samifier -r -c -g -m -o [-l ] [-b ] [-s ]
     -r   Mascot search results file in txt format
     -c   Directory containing the chromosome files in FASTA format for the given genome
     -m   File mapping protein identifier to ordered locus name
     -g   Genome file in gff format
     -o   Filename to write the SAM format file to
     -l   Filename to write the log into
     -b   Filename to write IGV regions of interest (BED) file to
     -s   Minimum confidence score for peptides to be included

E.g.

    $ java -jar samifier.jar -r results.txt -c saccharomyces_cerevisiae -g saccharomyces_cerevisiae_R64-1-1_20110208.gff -m accession.txt -o test.sam

Results analyser
================

Similar to *samifier*, but instead of generating a SAM file it generates a table of the peptides found. This table can be queried using SQL to extract a number of reports.

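
To illustrate the kind of SQL report meant here, the following is a minimal sketch using Python's built-in `sqlite3` module. The table name `results`, its columns (`protein`, `peptide`, `score`), the helper `peptides_per_protein` and the sample rows are all hypothetical; they are not the schema used by the results analyser, which is described in the user guide.

```python
import sqlite3


def peptides_per_protein(rows, min_score):
    """Count peptides per protein at or above min_score.

    The table layout is a hypothetical stand-in, not the tool's real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (protein TEXT, peptide TEXT, score REAL)")
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    query = (
        "SELECT protein, COUNT(*) FROM results "
        "WHERE score >= ? GROUP BY protein ORDER BY protein"
    )
    report = conn.execute(query, (min_score,)).fetchall()
    conn.close()
    return report


# Made-up sample rows, for illustration only.
sample_rows = [
    ("YAL001C", "PEPTIDEA", 42.0),
    ("YAL001C", "PEPTIDEB", 12.5),
    ("YBR002W", "PEPTIDEC", 55.1),
]
print(peptides_per_protein(sample_rows, 20.0))
```
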

    $ java -jar dist/results_analyser.jar
    usage: result_analyser -c -g -m -o -r [-rep ] [-replist ] [-sql ]
     -c         Directory containing the chromosome files in FASTA format for the given genome
     -g         Genome file in gff format
     -m         File mapping protein identifier to ordered locus name
     -o         Filename to write the SAM format file to
     -r         Mascot search results file in txt format
     -rep       Access a built in report query
     -replist   A file containing all the pre-built SQL queries
     -sql       Filters the result through the use of a SQL statement to the output file

Protein generator
=================

Given a genome as input, it generates a FASTA file with "proteins" suitable for use as a database in Mascot. It operates in two modes: using Glimmer-predicted genes, or simply splitting the genome into overlapping regions of a given length.

Both the _Predicted Protein Generator_ (Glimmer gene prediction) and the _Virtual Protein Generator_ (six-frame translation) are implemented in the command line tool ‘protein_generator.jar’, as both tools share similar input files. The Predicted Protein Generator is accessed using the command line parameter ‘-g ’, which identifies the input Glimmer prediction file. The Virtual Protein Generator is accessed using the command line parameter ‘-i ’, which indicates the length of the overlapping virtual proteins.

    $ java -jar dist/protein_generator.jar
    usage: protein_generator -d -f [-g ] [-i ] -o [-p ] [-q ] [-t ]
     -d   Database name
     -f   Genome file in FASTA format
     -g   Glimmer txt file. Can't be used with the -i option.
     -i   Size of the intervals (number of codons) into which the genome will be split. Can't be used with the -g option.
     -o   Filename to write the FASTA format file to
     -p   Filename to write the GFF file to
     -q   Filename to write the accession file to
     -t   File containing a mapping of codons to amino acids, in the format used by NCBI.

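
As a rough illustration of the six-frame translation mentioned above, here is a minimal Python sketch. It is not the Protein Generator's implementation: the function names are hypothetical, it assumes the standard genetic code (NCBI translation table 1) rather than a table supplied with `-t`, and it does not split the result into overlapping intervals.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map (strand, offset) to the translation of that reading frame."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Stop codons are emitted as `*`, so each frame can be cut at those positions or into fixed-length windows afterwards.
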

Virtual protein merger
======================

When using the database generated by the Protein Generator in interval mode, the generated "proteins" are most likely wrong. However, Mascot still uses them and can report peptides found in such sequences. The Virtual Protein Merger takes such a search result against a "virtual protein" database and tries to rebuild the intervals by searching for stop and start codons in the sequence.

    $ java -jar dist/virtual_protein_merger.jar
    usage: virtual_protein_merger -c -g -o -r [-t ]
     -c   Directory containing the chromosome files in FASTA format for the given genome
     -g   Genome file in gff format
     -o   Filename to write the gff file to
     -r   Mascot search results file in txt format
     -t   File containing a mapping of codons to amino acids, in the format used by NCBI.
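
The idea of rebuilding an interval from start and stop codons can be sketched as follows. This is only one plausible reading of the description above, not the Virtual Protein Merger's actual algorithm. It assumes a forward-strand hit given as an in-frame, half-open range, the standard stop codons, and `ATG` as the only start codon; the function name `extend_hit` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
START_CODON = "ATG"                  # assumption: only ATG starts a protein


def extend_hit(seq, hit_start, hit_end):
    """Extend an in-frame peptide hit [hit_start, hit_end) to a candidate ORF.

    Downstream, the range grows codon by codon up to and including the first
    stop codon. Upstream, it grows to the most upstream ATG found before
    reaching a stop codon; if no ATG is found, hit_start is kept.
    """
    if (hit_end - hit_start) % 3 != 0:
        raise ValueError("hit must cover whole codons")

    # Walk downstream until a stop codon (inclusive) or the sequence ends.
    end = hit_end
    while end + 3 <= len(seq):
        codon = seq[end:end + 3]
        end += 3
        if codon in STOP_CODONS:
            break

    # Walk upstream, remembering the most upstream in-frame start codon.
    start = hit_start
    pos = hit_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon == START_CODON:
            start = pos
        pos -= 3

    return start, end


# Made-up sequence: stop, start, two codons around the hit "AAA", stop, tail.
example = "TAA" + "ATG" + "GCC" + "AAA" + "CCC" + "TGA" + "GGG"
start, end = extend_hit(example, 9, 12)
print(start, end, example[start:end])
```
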