Datasets and Algorithm

Peptidoglycan hydrolase databases and machine learning models
Identifying and characterizing novel peptidoglycan hydrolases in completely sequenced genomes is difficult because these hydrolases often lack homology with previously well-characterized peptidoglycan hydrolases. The present work uses an integrative random forest approach to identify novel peptidoglycan hydrolases in genomic and metagenomic data. After optimization, libSVM and Random Forest modules were generated for binary and multiclass classification. Using these models, three different tools were developed and evaluated on 250 known peptidoglycan hydrolases. The strategy for selecting the tool used in the final HyPe pipeline is provided below:

Support Vector Machine and Random Forest Implementation
The Support Vector Machine was implemented using the libSVM package, and Random Forest (RF) was implemented using an R package (http://cran.r-project.org/). RF uses an ensemble of classification trees, since an ensemble model generally gives more satisfactory results than an individual model. The algorithm allows the number of variables tried at each node (mtry) and the number of trees grown in the forest (ntree) to be adjusted. Each classification tree in the forest is grown from a bootstrap sample of the training dataset. At each split node, RF finds the best split into two child nodes by recursive binary partitioning, as in ordinary classification trees. The amino acid and dipeptide frequencies of the protein sequences were calculated using the formula below.
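The frequency calculation can be illustrated with a minimal Python sketch. It assumes the commonly used composition definition (the count of each residue or residue pair divided by the total count, expressed as a percentage); the exact scaling used by HyPe is not stated here, and the function names amino_acid_composition and dipeptide_composition are hypothetical.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return 20 percentages, one per standard amino acid (assumed definition)."""
    seq = seq.upper()
    # Non-standard symbols (X, B, Z, *, ...) are ignored in this sketch.
    counts = Counter(aa for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 percentages, one per ordered pair of adjacent residues."""
    seq = seq.upper()
    # Overlapping windows of length 2; pairs with a non-standard symbol are skipped.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no valid dipeptides")
    counts = Counter(valid)
    return [100.0 * counts[a + b] / len(valid)
            for a, b in product(AMINO_ACIDS, repeat=2)]


if __name__ == "__main__":
    example = "MKKLLAAG"  # made-up sequence, for illustration only
    features = amino_acid_composition(example) + dipeptide_composition(example)
    print(len(features))  # 420 features per sequence
```

Concatenating the two vectors gives a fixed-length numeric representation (420 values in this sketch) regardless of protein length, which is what a classifier such as RF or SVM needs as input.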
The performance of the SVM and Random Forest modules was compared using the parameters discussed below.
Parameters used for Performance Evaluation
The performance of RF was assessed using the following threshold-dependent parameters:

Sensitivity (Sn): Sensitivity measures the proportion of actual positives that are correctly predicted as positive.

Specificity (Sp): Specificity measures the proportion of actual negatives that are correctly predicted as negative.
Accuracy (Acc): Accuracy measures the proportion of all predictions, positive and negative, that agree with the actual (experimental) class.
Matthews Correlation Coefficient (MCC): In machine learning, MCC measures the quality of a binary classification as the correlation between predicted and actual classes. It ranges from -1 (total disagreement) through 0 (no better than random) to +1 (perfect prediction).
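Symbols and their meanings, taking class A as an example:
TP: Correctly identified (a class A sequence predicted as class A)
FP: Incorrectly identified (a sequence from another class predicted as class A)
TN: Correctly rejected (a sequence from another class not predicted as class A)
FN: Incorrectly rejected (a class A sequence not predicted as class A)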
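The four parameters can be computed from these counts as in the following minimal Python sketch. It uses the standard textbook definitions, Sn = TP/(TP+FN), Sp = TN/(TN+FP), Acc = (TP+TN)/(TP+FP+TN+FN) and MCC = (TP*TN - FP*FN)/sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and returns fractions rather than percentages; whether HyPe reports percentages is an assumption left open. The function names one_vs_rest_counts and evaluate are hypothetical.

```python
import math


def one_vs_rest_counts(actual, predicted, target):
    """Count TP, FP, TN, FN for one class (e.g. class A) against all other classes."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1  # correctly identified
        elif p == target:
            fp += 1  # incorrectly identified
        elif a == target:
            fn += 1  # incorrectly rejected
        else:
            tn += 1  # correctly rejected
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Return Sn, Sp, Acc and MCC as fractions; undefined ratios are reported as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)
    sp = ratio(tn, tn + fp)
    acc = ratio(tp + tn, tp + fp + tn + fn)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    actual = ["A", "A", "B", "C"]
    predicted = ["A", "B", "A", "C"]
    print(evaluate(*one_vs_rest_counts(actual, predicted, "A")))
```

For multiclass classification, the same calculation is repeated with each class in turn as the target, so that every class gets its own Sn, Sp, Acc and MCC.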

Workflow of the HyPe pipeline
The HyPe web server was developed from the standalone HyPe application. It can be used to identify peptidoglycan hydrolases from complete genomic or metagenomic ORFs. Each query sequence passes through the RF module (the random forest module for multiclass classification, selected above as tool3), which predicts positive hits (peptidoglycan hydrolases) and assigns them to their respective classes. A schematic representation of the HyPe pipeline is provided below:
The performance of HyPe was evaluated using twenty-four recently sequenced genomes from the EMBL-EBI site and twenty-four human gut metagenomic samples. Performance was satisfactory for both genomic and metagenomic sequences.