Pepfeature is a Python package providing routines for calculating peptide features on a given amino acid sequence.
One use for this package is epitope prediction, where it would serve as the feature extraction stage of a machine learning classification pipeline.
The package uses 'multiprocessing' to parallelise calculations across multiple cores; you only need to pass the number of cores to use as a parameter.
The features it can calculate for a given amino acid sequence are:
No. | Feature | Section of Dissertation.pdf (GitHub repo) with explanation and references | Python module in the Pepfeature package |
---|---|---|---|
1 | Proportion of Individual Amino Acids in sequence | 2.2.1 | aa_proportion.py |
2 | k-mer Composition | 2.2.2 | aa_kmer_composition.py |
3 | Conjoint Triad Frequencies | 2.2.3 | aa_CT.py |
4 | Sequence Entropy | 2.2.4 | aa_seq_entropy.py |
5 | Frequency of Amino Acid types | 2.2.5 | aa_composition.py |
6 | Number of atoms | 2.2.6 | aa_num_of_atoms.py |
7 | Molecular Weight | 2.2.7 | aa_molecular_weight.py |
8 | Amino Acid Descriptors | 2.2.8 | aa_descriptors.py |
Additionally, a module named aa_all_feat.py contains functions to calculate all eight features in one go.
Required Software/Tools:
- Tested on Python 3.8 (other Python 3 versions probably work too)
Required Package Dependencies: (Pepfeature has been tested on these versions of the dependencies. More recent versions may also be compatible with the package.)
- et-xmlfile v1.1.0
- setuptools v56.0.0
- numpy v1.20.2
- openpyxl v3.0.7
- pandas v1.2.4
- python-dateutil v2.8.1
- pytz v2021.1
- six v1.15.0
pip install Pepfeature
(All dependencies are expected to be installed automatically by this 'pip install pepfeature' command.) The source code is currently hosted on GitHub at: https://github.com/essakh/pepfeature
NOTE: The GitHub repo contains an 'examples.py' in the root folder with many example use cases.
Always make sure that any code that uses this package is executed within the code block:
if __name__ == '__main__':
Example:
import pepfeature as pep
import pandas as pd
df = pd.read_csv('pepfeature/data/Sample_Data.csv')
# Use of pepfeature
if __name__ == '__main__':
    # Calculate all features on df
    df_feat = pep.aa_all_feat.calc_df(dataframe=df, aa_column='Info_window_seq', Ncores=4, k=2)
    print(df_feat)  # print the data frame to console
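The guard matters because of how Python's standard 'multiprocessing' module starts workers: with the spawn start method (the default on Windows and macOS), each worker imports the main script, so unguarded top-level code would run again in every worker. The sketch below illustrates this with the standard library only. The names residue_counts and run_parallel are hypothetical and are not part of Pepfeature.

```python
from collections import Counter
from multiprocessing import Pool


def residue_counts(sequence):
    """Toy per-sequence calculation standing in for a real feature."""
    return dict(Counter(sequence))


def run_parallel(sequences, ncores):
    """Distribute the per-sequence calculation over `ncores` worker processes."""
    with Pool(processes=ncores) as pool:
        return pool.map(residue_counts, sequences)


if __name__ == "__main__":
    # Only the process that was launched as the main script reaches this
    # point; workers that import this file skip it and just pick up the
    # function definitions above.
    print(run_parallel(["ACDA", "KKLM"], ncores=2))
```

Moving the last line out of the guard would make every spawned worker try to create its own pool, which Python rejects with a RuntimeError.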
The API consists of two functions that can be called from each of 9 possible modules; an overview of the modules and their two callable functions is illustrated in the figure below:
Thus, if your Python script contains:
import pepfeature
Then you will have the following possible API interfacing options, as illustrated in the image below:
Please see pepfeature/examples.py on the GitHub repo for example use cases.
The interface functions are calc_csv and calc_df. They are detailed for each module in the following section, "Functions documentation".
Both interface functions, calc_csv and calc_df, always take the arguments 'dataframe' and 'aa_column'.
The 'dataframe' parameter of both calc_csv() and calc_df() requires a pandas DataFrame* with at least one column that consists of amino acid sequences; you must also pass this column's name as the 'aa_column' parameter to calc_csv() and calc_df(). Note: The amino acid sequences to calculate features on can be of varying lengths.
*In the example code shown in the 'Example Use' section of this documentation, the line
df = pd.read_csv('pepfeature/data/Sample_Data.csv')
converts Sample_Data.csv into a pandas DataFrame to then feed into calc_csv() and calc_df(). Sample_Data.csv is located at pepfeature/data/Sample_Data.csv relative to the root of the GitHub repo. This CSV can be used as sample data to try out the package and to gauge what is meant by "A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).".
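Before passing your own data, it can help to check that the sequence column really contains only amino acid letters. The following is a minimal sketch under the assumption that a valid sequence uses only the 20 standard one-letter codes in upper case; the source does not state how Pepfeature treats lowercase or ambiguous letters, and is_valid_peptide and invalid_rows are hypothetical helper names.

```python
# Assumption: only the 20 standard one-letter amino acid codes are valid.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_peptide(sequence):
    """Return True for a non-empty string made up only of standard codes."""
    return (
        isinstance(sequence, str)
        and len(sequence) > 0
        and set(sequence) <= VALID_AA
    )


def invalid_rows(sequences):
    """Return the positions of entries that fail the check above."""
    return [i for i, seq in enumerate(sequences) if not is_valid_peptide(seq)]


if __name__ == "__main__":
    # With pandas, the values of the column named by 'aa_column' could be
    # passed in, e.g. list(df['Info_window_seq']).
    print(invalid_rows(["ACD", "AXZ", "", "KLMN"]))
```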
This module contains functions to calculate, in one go, all the features that this package is capable of calculating; the callable functions either return the results as a pandas DataFrame or export them as CSV.
The features calculated by the functions are:
- Proportion of Individual Amino Acids in sequence
- k-mer Composition
- Conjoint Triad Frequencies
- Sequence Entropy
- Frequency of Amino Acid types
- Number of atoms
- Molecular Weight
- Amino Acid descriptors
Calculates all 8 features that this package supports in one go, chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
Results are appended as new columns to the input dataframe.
pepfeature.aa_all_feat.calc_csv(dataframe, k, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
k
:int
- Length of subsequences (this is used to calculate k-mer composition features)
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
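The chunk-by-chunk behaviour described for calc_csv can be pictured with a minimal standard-library sketch. It is not Pepfeature's implementation: process_in_chunks, the chunk_N.csv file names and the feat_length column are hypothetical, chosen only to show how one chunk at a time is processed, written out and released.

```python
import csv
import os
import tempfile


def process_in_chunks(rows, chunksize, save_folder, feature_fn):
    """Apply feature_fn to each sequence, writing one CSV file per chunk."""
    if chunksize is None:
        # No chunking requested: treat the whole input as a single chunk.
        chunksize = max(len(rows), 1)
    written = []
    for chunk_no, start in enumerate(range(0, len(rows), chunksize), start=1):
        chunk = rows[start:start + chunksize]
        # Hypothetical file naming scheme, for illustration only.
        path = os.path.join(save_folder, f"chunk_{chunk_no}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["Info_window_seq", "feat_length"])
            for seq in chunk:
                writer.writerow([seq, feature_fn(seq)])
        written.append(path)
        del chunk  # drop the reference so the chunk's memory can be reclaimed
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        paths = process_in_chunks(["ACD", "KLMN", "GG"], 2, folder, len)
        print([os.path.basename(p) for p in paths])
```

Only one chunk's results are held in memory at a time, which is the trade-off the chunksize parameter controls: smaller chunks use less RAM but produce more files.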
Calculates all 8 features that this package supports in one go. Results are appended as new columns to the input dataframe.
pepfeature.aa_all_feat.calc_df(dataframe, k, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
k
:int
- Length of subsequences (this is used to calculate k-mer composition features)
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
This module contains functions to calculate Frequency of Amino Acid types for given amino acid sequences.
Calculates Frequency of Amino Acid types for given amino acid sequences, chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
Results are appended as new columns named feat_Prop_{group-value}, e.g. feat_Prop_Tiny, feat_Prop_Small, etc.
pepfeature.aa_composition.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates Frequency of Amino Acid types for given amino acid sequences. For each sequence, calculates nine features corresponding to the proportion (out of 1) of each Amino Acid type in the sequence.
Results are appended as new columns named feat_Prop_{group-value}, e.g. feat_Prop_Tiny, feat_Prop_Small, etc.
pepfeature.aa_composition.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
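To make the nine amino acid type proportions concrete, here is a rough sketch for a single sequence. The group memberships below follow a commonly used nine-class scheme and are an assumption; the definitions Pepfeature actually uses are given in section 2.2.5 of Dissertation.pdf. The function name aa_type_proportions is hypothetical.

```python
# Assumed group definitions (not confirmed by the Pepfeature documentation).
GROUPS = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "NonPolar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def aa_type_proportions(sequence):
    """Return the proportion (out of 1) of residues falling in each group."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    return {
        f"feat_Prop_{name}": sum(aa in members for aa in sequence) / n
        for name, members in GROUPS.items()
    }


if __name__ == "__main__":
    for column, value in aa_type_proportions("AKDG").items():
        print(column, value)
```

Because the groups overlap (alanine is Tiny, Small, Aliphatic and NonPolar at once), the nine proportions do not sum to 1.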
This module contains functions to calculate conjoint triads features for given amino acid sequences.
Calculates conjoint triads features, chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
Results are appended as new columns named feat_CT_{subsequence}, e.g. feat_CT_305.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_CT.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates conjoint triads features.
Results are appended as new columns named feat_CT_{subsequence}, e.g. feat_CT_305.
pepfeature.aa_CT.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
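As an illustration of the conjoint triad idea, the sketch below maps each residue to one of the seven classes commonly used for this method, then counts every run of three consecutive class labels. Several details are assumptions rather than Pepfeature's documented behaviour: the class labels 0 to 6 (suggested by, but not confirmed by, the example column name feat_CT_305), the normalisation as a simple proportion of triads, and the function name conjoint_triads.

```python
from collections import Counter
from itertools import product

# Seven residue classes commonly used for conjoint triads; labelling them
# 0..6 is an assumption made for this sketch.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(CLASSES) for aa in group}


def conjoint_triads(sequence):
    """Return the proportion of each of the 7**3 = 343 possible class triads."""
    encoded = "".join(AA_TO_CLASS[aa] for aa in sequence)
    n_triads = len(encoded) - 2
    if n_triads < 1:
        raise ValueError("sequence must contain at least three residues")
    counts = Counter(encoded[i:i + 3] for i in range(n_triads))
    features = {}
    for labels in product("0123456", repeat=3):
        triad = "".join(labels)
        features[f"feat_CT_{triad}"] = counts[triad] / n_triads
    return features


if __name__ == "__main__":
    feats = conjoint_triads("AGVC")
    print({name: value for name, value in feats.items() if value > 0})
```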
This module contains functions to calculate Amino Acid descriptors features for given amino acid sequences.
Calculates Amino Acid descriptors features for given amino acid sequences, chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
Results are appended as new columns named feat_{property}, e.g. feat_BLOSUM9.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_descriptors.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates Amino Acid descriptors features.
Results are appended as new columns named feat_{property}, e.g. feat_BLOSUM9.
pepfeature.aa_descriptors.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
This module contains functions to calculate the frequency of each contiguous k-length subsequence (k-mer) of amino acid letters in the sequence.
Calculates the frequency of each contiguous k-length subsequence of amino acid letters in the sequence, chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
Since there are 20 valid Amino Acid letters, there can be 400 (20x20) possible 2-letter combinations, 8000 (20x20x20) possible 3-letter combinations, etc.
Results are appended as new columns named feat_Prop_{subsequence}, e.g. feat_Prop_AB, feat_Prop_BC, etc.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_kmer_composition.calc_csv(k, dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
k
:int
- Length of subsequences
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates the frequency of each contiguous k-length subsequence of amino acid letters in the sequence. (The k-mers of a sequence are all its subsequences of length k.)
Since there are 20 valid Amino Acid letters, there can be 400 (20x20) possible 2-letter combinations, 8000 (20x20x20) possible 3-letter combinations, etc.
Results are appended as new columns named feat_Prop_{subsequence}, e.g. feat_Prop_AB, feat_Prop_BC, etc.
pepfeature.aa_kmer_composition.calc_df(k, dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
k
:int
- Length of subsequences
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
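The k-mer composition of a single sequence can be sketched with a sliding window of width k. This is an illustration only: kmer_composition is a hypothetical name, dividing by the number of windows is an assumed normalisation, and only the k-mers that actually occur are returned, whereas a full feature table would hold one column for each of the 20**k possible k-mers.

```python
from collections import Counter


def kmer_composition(sequence, k):
    """Return the proportion of each k-mer observed in the sequence."""
    n_windows = len(sequence) - k + 1
    if k < 1 or n_windows < 1:
        raise ValueError("k must be between 1 and the sequence length")
    # Slide a window of width k along the sequence, one residue at a time.
    counts = Counter(sequence[i:i + k] for i in range(n_windows))
    return {f"feat_Prop_{kmer}": c / n_windows for kmer, c in counts.items()}


if __name__ == "__main__":
    # "ACAC" has three 2-mers: AC, CA, AC.
    print(kmer_composition("ACAC", 2))
    # Number of possible k-mers over 20 letters for k = 2 and k = 3.
    print(20 ** 2, 20 ** 3)
```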
This module contains functions to calculate total molecular weight for given amino acid sequences.
Calculates the total molecular weight of the amino acid sequence, chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
Results are appended as a new column named feat_molecular_weight.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_molecular_weight.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates the total molecular weight of the sequence.
Calculated as a simple weighted sum of amino acid counts, using amino acid weight data. Results are appended as a new column named feat_molecular_weight.
pepfeature.aa_molecular_weight.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
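The "weighted sum of amino acid counts" can be sketched as follows. The weights are approximate average molecular weights of the free amino acids in g/mol and are an assumption: the source does not say which weight table Pepfeature uses, nor whether water lost in peptide bond formation is subtracted, so values from this sketch may differ from the package's output. The name molecular_weight is hypothetical.

```python
from collections import Counter

# Approximate average molecular weights of free amino acids (g/mol).
# Assumed values for illustration; not taken from Pepfeature.
AA_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "E": 147.13, "Q": 146.15, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}


def molecular_weight(sequence):
    """Sum count * weight over the residues present in the sequence."""
    counts = Counter(sequence)
    return sum(AA_WEIGHT[aa] * n for aa, n in counts.items())


if __name__ == "__main__":
    # Two glycines and one alanine: 2 * 75.07 + 89.09
    print(round(molecular_weight("GGA"), 2))
```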
This module contains functions to calculate, for each given sequence, the total number of atoms of each type in that sequence (essentially a weighted sum of the amino acid counts).
Calculates, for each given sequence, the total number of atoms of each type in that sequence (essentially a weighted sum of the amino acid counts), chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
Results are appended as new columns named feat_C_atoms, feat_H_atoms, feat_N_atoms, feat_O_atoms, feat_S_atoms.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_num_of_atoms.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates, for each given sequence, the total number of atoms of each type in that sequence (essentially a weighted sum of the amino acid counts).
Results are appended as new columns named feat_C_atoms, feat_H_atoms, feat_N_atoms, feat_O_atoms, feat_S_atoms.
pepfeature.aa_num_of_atoms.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
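A small sketch of the atom-count calculation, using the molecular formulas of the free amino acids for a handful of residues only. The table is deliberately partial (glycine, alanine, serine, cysteine, methionine), so the function rejects other letters. Whether Pepfeature counts atoms of free amino acids or of residues within a chain is not stated in the source, so that choice is an assumption, and atom_counts is a hypothetical name.

```python
ATOMS = ("C", "H", "N", "O", "S")

# (C, H, N, O, S) counts from the molecular formulas of the free amino
# acids, e.g. glycine is C2H5NO2. Partial table, for illustration only.
AA_ATOMS = {
    "G": (2, 5, 1, 2, 0),
    "A": (3, 7, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),
    "C": (3, 7, 1, 2, 1),
    "M": (5, 11, 1, 2, 1),
}


def atom_counts(sequence):
    """Return the total number of each atom type over the whole sequence."""
    totals = [0] * len(ATOMS)
    for aa in sequence:
        if aa not in AA_ATOMS:
            raise ValueError(f"no atom data for residue {aa!r} in this sketch")
        for i, n in enumerate(AA_ATOMS[aa]):
            totals[i] += n
    return {f"feat_{atom}_atoms": total for atom, total in zip(ATOMS, totals)}


if __name__ == "__main__":
    print(atom_counts("GAC"))
```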
This module contains functions to calculate the proportion (out of 1) of each Amino Acid in the peptide.
Calculates the proportion (out of 1) of each Amino Acid in the peptides (Amino Acid sequences), chunk by chunk, from the input 'dataframe'. Each processed chunk is saved as a CSV file.
This results in 20 new features per chunk, appended as new columns named feat_Prop_{Amino-Acid letter}, e.g. feat_Prop_A, feat_Prop_C, ..., feat_Prop_Y.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_proportion.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates the proportion (out of 1) of each amino acid in the peptides (Amino Acid sequences).
Results are appended as new columns named feat_Prop_{aa letter}, e.g. feat_Prop_A, feat_Prop_C, ..., feat_Prop_Y.
pepfeature.aa_proportion.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
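For a single sequence, the proportion of each amino acid can be sketched as a count divided by the sequence length. The column names follow the feat_Prop_{letter} pattern described above; the function name aa_proportions and the alphabetical ordering of the 20 letters are assumptions for this illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def aa_proportions(sequence):
    """Return the proportion (out of 1) of each of the 20 amino acids."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Letters absent from the sequence get a proportion of 0.0.
    return {f"feat_Prop_{aa}": counts[aa] / n for aa in AMINO_ACIDS}


if __name__ == "__main__":
    props = aa_proportions("AACY")
    print({name: value for name, value in props.items() if value > 0})
```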
This module contains functions to calculate the entropy of given amino acid sequences.
Calculates the entropy of given amino acid sequences, chunk by chunk, from the input 'dataframe'.
Each processed chunk is saved as a CSV file.
Results are appended as a new column named feat_seq_entropy.
This is a RAM-efficient way of calculating the features: they are calculated on a single chunk of the dataframe (of chunksize rows) at a time, and once a chunk has been processed and saved as a CSV, it is deleted, freeing up RAM.
pepfeature.aa_seq_entropy.calc_csv(dataframe, save_folder, aa_column = 'Info_window_seq', Ncores = 1, chunksize = None)
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
save_folder
:str
- Path to folder for saving the output as CSV
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
chunksize
:int
,Default=None
- Number of rows to be processed at a time. (Where a 'None' object denotes no chunks but the entire dataframe to be processed)
Calculates the entropy of given amino acid sequences.
Results are appended as a new column named feat_seq_entropy.
pepfeature.aa_seq_entropy.calc_df(dataframe, Ncores = 1, aa_column= 'Info_window_seq')
Parameters:
dataframe
:Pandas DataFrame object
- A pandas DataFrame that contains a column/feature that is composed of purely Amino-Acid sequences (peptides).
Ncores
:int
,Default=1
- Number of cores to use for executing function (multiprocessing).
aa_column
:str
,Default='Info_window_seq'
- Name of column in dataframe input consisting of the Amino-Acid sequences to process.
Returns:
Pandas DataFrame object
- A Pandas DataFrame containing the calculated features appended as new columns.
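One common way to define the entropy of a sequence is the Shannon entropy of its residue frequencies, H = -sum(p * log2(p)). The sketch below uses that definition; the choice of base-2 logarithm and the function name sequence_entropy are assumptions, and the exact definition used by Pepfeature is described in section 2.2.4 of Dissertation.pdf.

```python
from collections import Counter
from math import log2


def sequence_entropy(sequence):
    """Shannon entropy (in bits) of the residue frequencies of a sequence."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    # p = count / n for every residue that occurs at least once.
    return -sum((c / n) * log2(c / n) for c in Counter(sequence).values())


if __name__ == "__main__":
    # A sequence of one repeated residue carries no uncertainty; a sequence
    # with two equally frequent residues carries one bit.
    print(sequence_entropy("AACC"))
    print(sequence_entropy("ACDEFGHIKL"))
```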
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.