- Addition of general input file (for non-MASCOT/MaxQuant users)
- User-defined number of CPUs is now possible
- Batch search remembers isoBLASTED peptides (decrease in computing time throughout batch searches)
- Protein distance calculations are now faster; a precalculated distance file for Pecora can be found in the MISC folder
- Addition of sunburst plot, containing species not present in the ClassiCOL database
- Addition of easy to navigate csv output file, including rescored values
- Addition of summary output file for batch searches
- 'Bos javanicus','Bubalus kerabau','Capricornis sumarensis', 'Daubentonia madagascariensis', 'Eulemur rufifrons', 'Macaca thibetana thibetana', 'Mustela lutreola', 'Mustela nigripes', 'Ovis canadensis', 'Ovibos moschatus', 'Petaurus breviceps papuanus', and 'Tachyglossus aculeatus' were added to the ClassiCOL collagen database. Homo sapiens COL1A1 and COL1A2 Uniprot reference sequences were exchanged.
Welcome to the ClassiCOL user guide. It explains how to use the algorithm and how to interpret the results. If you have any additional questions, please contact maarten.dhaenens@ugent.be
When using ClassiCOL please cite: Engels, I. et al. ClassiCOL: LC-MS/MS analysis for ancient species Classification via Collagen peptide ambiguation. bioRxiv 2024.10.01.616034 (2024) doi:10.1101/2024.10.01.616034.
- Download the code in this repository. This includes:
  - a) The ClassiCOL Python script
  - b) The Demo folder (if you want to run the demo)
  - c) The MISC folder (contains the distance csv and the unimod database)
  - d) The BoneDB folder, which contains the curated ClassiCOL collagen fasta files
  - e) The requirements.txt file, used to install all additional packages

  Put all these folders in the ClassiCOL_version_x_x_x folder downloaded from GitHub.
- Open the Anaconda Prompt and navigate to the folder where you downloaded the ClassiCOL folders.
- Install the required packages using:
$ pip install -r requirements.txt
Use the following command to start the algorithm (to run the demo data, pass -l Demo):
$ python ClassiCOL.py -d path_to_the_script -l path_to_folder_containing_your_search_results -s MASCOT -t Mammalia
You can use the arguments as follows:
- -l: folder location containing your personal MASCOT *.csv, MaxQuant *.txt, or Manual *.csv output files. To test the algorithm, a MASCOT output file is provided in the Demo folder, accessible by using -l Demo:
$ python ClassiCOL.py -d path_to_the_script -l Demo -s MASCOT
- -s MASCOT, MaxQuant, or Manual: specify the search engine used
- -t (optional): restrict the taxonomy by specifying it, e.g., Pecora, or for species: Bos_taurus, or both: Homo_sapiens/Canis
- -m: specify the fixed modification used during protein extraction, e.g., C,45.987721, or multiple with C,45.98/M,...
- -f (optional): location of the folder containing a custom database in fasta format
- -d: the directory where the ClassiCOL algorithm is located on your computer
- -c (optional): the number of CPUs you want to use; default = 3 fewer than available on your computer
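If you launch ClassiCOL from another Python script, the argument list can be assembled programmatically. The following is a minimal sketch, not part of ClassiCOL itself: the helper name `build_classicol_command` and its parameters are hypothetical, and it only uses the flags described above. It assumes that multiple taxa and multiple fixed modifications are joined with `/`, as in the examples for -t and -m.

```python
import shlex

# Search engines accepted by the -s flag, as listed above.
ENGINES = ("MASCOT", "MaxQuant", "Manual")


def build_classicol_command(script_dir, input_dir, engine, taxonomy=None,
                            fixed_mods=None, custom_db=None, cpus=None):
    """Return the ClassiCOL command line as a list of arguments.

    Hypothetical helper: it only strings together the documented flags.
    taxonomy   -- list of taxa, e.g. ["Homo_sapiens", "Canis"]
    fixed_mods -- list of (residue, mass) pairs, e.g. [("C", 45.987721)]
    """
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {ENGINES}, got {engine!r}")
    cmd = ["python", "ClassiCOL.py",
           "-d", str(script_dir), "-l", str(input_dir), "-s", engine]
    if taxonomy:
        cmd += ["-t", "/".join(taxonomy)]
    if fixed_mods:
        cmd += ["-m", "/".join(f"{res},{mass}" for res, mass in fixed_mods)]
    if custom_db:
        cmd += ["-f", str(custom_db)]
    if cpus is not None:
        cmd += ["-c", str(cpus)]
    return cmd


if __name__ == "__main__":
    demo = build_classicol_command("path_to_the_script", "Demo", "MASCOT")
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(demo))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the folder that contains ClassiCOL.py.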
- Input files:
- MASCOT.csv: Download your results directly from MASCOT in csv format
- MaxQuant.txt: Use the output data file containing peptides and locational data from MaxQuant in txt format
- Manual.csv: A manual csv can be made and used as input. This file should include a sequence and, if present, the modifications with their locations. The N-terminus has location 0, the first amino acid has location 1, and the C-terminus uses -1 as its location number, e.g.:
seq,modifications
GAAGLPGPK,6|Oxidation
GFSGLDGAK,
AGPPGPPGPAGK,3|Oxidation|9|Oxidation
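To check a Manual csv before running ClassiCOL, the modifications column can be parsed and each location mapped back to its residue. This is a minimal sketch and not ClassiCOL's own reader: the function names are hypothetical, and the `location|name|location|name` layout is inferred from the example rows above.

```python
import csv
import io


def parse_modifications(field):
    """Split 'loc|name|loc|name' into a list of (location, name) pairs."""
    field = field.strip()
    if not field:
        return []
    parts = field.split("|")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected location|name pairs, got {field!r}")
    return [(int(parts[i]), parts[i + 1]) for i in range(0, len(parts), 2)]


def describe_site(sequence, location):
    """Map a location to 'N-term', 'C-term', or the residue at that position."""
    if location == 0:
        return "N-term"
    if location == -1:
        return "C-term"
    if not 1 <= location <= len(sequence):
        raise ValueError(f"location {location} is outside {sequence!r}")
    return sequence[location - 1]  # locations are 1-based


def read_manual_csv(text):
    """Return a list of (sequence, [(location, name, site), ...]) tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        seq = row["seq"].strip()
        mods = parse_modifications(row["modifications"] or "")
        rows.append((seq, [(loc, name, describe_site(seq, loc))
                           for loc, name in mods]))
    return rows


# The example rows from this guide.
EXAMPLE = """seq,modifications
GAAGLPGPK,6|Oxidation
GFSGLDGAK,
AGPPGPPGPAGK,3|Oxidation|9|Oxidation
"""

if __name__ == "__main__":
    for seq, mods in read_manual_csv(EXAMPLE):
        print(seq, mods)
```

A location that falls outside the sequence raises a `ValueError`, which is a quick way to catch off-by-one mistakes in a hand-made file.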
- Batch searches:
- MASCOT: Place all MASCOT csv files in the same folder. The algorithm will automatically analyse all files in this folder.
- MaxQuant: As with MASCOT, you can place all files in the same folder. Additionally, if one output file contains multiple experiments, the algorithm will automatically recognise this and analyse each experiment individually.
- Manual: Same as MASCOT.
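Before starting a batch search, it can help to preview which files in the folder match the expected input type. The sketch below is only a convenience check and does not reproduce how ClassiCOL discovers files; the helper name `list_batch_inputs` is hypothetical, and the extensions follow the input formats listed above (csv for MASCOT and Manual, txt for MaxQuant).

```python
from pathlib import Path

# File extension expected for each search engine, per the input file list.
INPUT_SUFFIX = {"MASCOT": ".csv", "MaxQuant": ".txt", "Manual": ".csv"}


def list_batch_inputs(folder, engine):
    """Return the files in `folder` whose extension matches `engine`, sorted."""
    suffix = INPUT_SUFFIX[engine]
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


if __name__ == "__main__":
    demo = Path("Demo")
    if demo.is_dir():
        for path in list_batch_inputs(demo, "MASCOT"):
            print(path.name)
```

Any unrelated csv or txt file in the folder will also show up in this list, so it is worth keeping batch folders free of other files.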
- The ClassiCOL output: ClassiCOL puts all results in the folder 'ClassiCOL_outputs', where each experiment gets its own folder for easy access. This folder contains the heatmap, sunburst plot, sunburst plot with species missingness, rescored_barplot, rescored_lineplot, temporary csv output files, and the final csv output file. For batch searches, a summary output file is written to the ClassiCOL_output folder.
- Interpretation of the results: ClassiCOL provides an estimate of taxonomy based on the sequences available in the ClassiCOL database and the peptides from your search engine. It is always up to the user to interpret what these results mean!
- The heatmap: The heatmap shows the path the algorithm takes through the NCBI taxonomy (y axis) and how the proteins relate to each other (x axis). The colors show the abundance of peptides assigned to each protein after isoBLAST.
- The sunburst: This figure shows an interactive overview of the output of your ClassiCOL search. A color scheme is used to highlight the most likely classification (the more yellow, the more likely). By hovering over the sunburst plot you can see the number of attributed peptides and the number of isoBLASTed peptides. You can zoom in by clicking on the sunburst plot, and zoom out by clicking on the center node (or by refreshing).
- The sunburst with missingness: This plot shows exactly the same results as the sunburst plot, but it also includes all species known to NCBI that were not present during the ClassiCOL analysis. Only branches neighboring the main branch are shown, up to the Order level. For example, attached to the Family node, all missing genera (those with no representative in the database used) will be shown.
- The temporary output csv: This csv is generated after the initial classification. The species/taxa are ranked by likelihood, and the peptides and proteins that were used during the classification are shown.
- Rescored barplot: For each classification, the top result is taken and rescored. This rescoring is based on uniqueness within the top scoring group of species, meaning that all peptides shared amongst these species are ignored. The remaining overlap that has uniqueness is shown in this barplot.
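The idea behind the rescoring can be illustrated with a toy example. This is a simplified sketch under stated assumptions and not ClassiCOL's actual scoring code: the function name is hypothetical, the species labels and peptide identifiers are placeholders, and "shared" is assumed here to mean "matched by every species in the top scoring group".

```python
def unique_peptide_counts(peptides_by_species):
    """Count, per species, the peptides left after dropping shared ones.

    Assumption (not confirmed by the ClassiCOL source): a peptide counts
    as shared when every species in the top scoring group matches it.
    """
    groups = [set(peps) for peps in peptides_by_species.values()]
    # With fewer than two candidates there is nothing to share.
    shared = set.intersection(*groups) if len(groups) >= 2 else set()
    return {species: len(set(peps) - shared)
            for species, peps in peptides_by_species.items()}


if __name__ == "__main__":
    # Placeholder data: labels and peptide identifiers are made up.
    top_group = {
        "species_A": {"pep1", "pep2", "pep3"},
        "species_B": {"pep1", "pep2"},
        "species_C": {"pep1", "pep4"},
    }
    print(unique_peptide_counts(top_group))
```

In this toy data only pep1 is matched by all three candidates, so it is dropped and the candidates are compared on the peptides that remain.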
- Rescored lineplot: This lineplot shows how the scoring changes amongst the top scoring candidates. When a drop-off is seen after rescoring, the candidates after the drop-off can be discarded. When no drop-off is noticeable, the sample may consist of a physical and/or genetic mixture.
- The final output csv: This is an easy-to-navigate output generated after rescoring. It includes peptide-protein information and classification information.
- The batch summary csv: This is a minimal information file that gives an overview of the top results alongside some metadata from the batch search.
WARNING_1: Depending on the number of unique peptides in your sample and the number of species you want to consider, the isoBLAST calculations could take a while (about 2 min for +/- 1000 unique peptides per species). An overnight search is recommended. Batch searches will go much quicker towards the end.
WARNING_2: The algorithm can use a substantial amount of the available CPU and memory. When not enough is free, there is a chance the algorithm will fail with an error.
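For planning purposes, the timing figure in WARNING_1 can be turned into a rough back-of-the-envelope estimate. The sketch below is an assumption-laden illustration, not a benchmark: it assumes the run time scales linearly with both the number of unique peptides and the number of species, and it ignores the number of CPUs used and the speed-up that batch searches gain from remembering isoBLASTed peptides. The function name is hypothetical.

```python
def estimate_isoblast_minutes(n_unique_peptides, n_species,
                              minutes_per_1000_peptides=2.0):
    """Rough isoBLAST time estimate in minutes.

    Assumes linear scaling from the guide's figure of about 2 minutes
    per 1000 unique peptides per species; real run times will differ.
    """
    if n_unique_peptides < 0 or n_species < 0:
        raise ValueError("counts must be non-negative")
    return minutes_per_1000_peptides * (n_unique_peptides / 1000) * n_species


if __name__ == "__main__":
    minutes = estimate_isoblast_minutes(n_unique_peptides=1000, n_species=30)
    print(f"roughly {minutes / 60:.1f} hours")
```

Restricting the taxonomy with -t lowers the number of species considered, which is the simplest way to bring this estimate down.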