Datasets and Algorithm

The protein sequences and associated information of orthologous groups of genes were retrieved from eggNOG v3.0 (Powell, 2012). eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes, constructed by combining complete proteomes from RefSeq, Ensembl, UniProt, GiardiaDB, JGI and TAIR. The training dataset was constructed by manual curation and consists of protein sequences from bacterial genomes belonging to 22 main functional classes.

RandomForest Implementation
Random Forest (RF) was implemented via an R package. RF uses an ensemble of classification trees, since an ensemble model generally gives more satisfactory results than an individual model. The algorithm allows the number of variables tried at each node (mtry) and the number of trees grown in the forest (ntree) to be adjusted. Each classification tree in the forest is grown from a bootstrap sample of the training dataset. At each split node, RF finds the best split into two child nodes via recursive binary partitioning, as in standard classification trees. The dipeptide frequencies of the protein sequences were used as the input for training RF.
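The dipeptide frequency input can be illustrated with a minimal Python sketch. The source does not specify how the frequencies were normalized or how non-standard residues were handled, so the choices here (dividing by the number of counted pairs, skipping pairs that contain symbols outside the 20 standard amino acids) are assumptions, and the function name `dipeptide_frequencies` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order so every protein
# maps to a feature vector with the same layout.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(sequence):
    """Return a 400-element list of dipeptide frequencies for one protein.

    Overlapping residue pairs are counted. Pairs containing a non-standard
    symbol (e.g. X) are skipped, which is an assumption of this sketch.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Toy sequence with five overlapping pairs: MK, KK, KL, LL, LA
features = dipeptide_frequencies("MKKLLA")
```

Each protein is thus represented by a fixed-length vector of 400 values regardless of its length, which is what allows sequences of different sizes to be used as input to the same classifier.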
The performance of RF, in terms of the out-of-bag (OOB) error rate, was evaluated by an internal cross-validation procedure.
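The relationship between bootstrapping and the OOB error rate can be sketched as follows. This is a simplified Python illustration of the general principle, not the R implementation used in this work; the function names and the toy vote data are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample and return (in_bag, out_of_bag) index lists."""
    # Sampling with replacement: some indices repeat, others are never drawn.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def oob_error_rate(labels, oob_votes):
    """Fraction of samples whose majority OOB vote differs from the true label.

    oob_votes[i] holds the class votes cast for sample i by trees that did
    not see it during training. Samples with no OOB vote are ignored, and
    ties are broken arbitrarily in this sketch.
    """
    wrong = scored = 0
    for label, votes in zip(labels, oob_votes):
        if not votes:
            continue
        majority = max(set(votes), key=votes.count)
        scored += 1
        wrong += majority != label
    return wrong / scored if scored else 0.0


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(1000, rng)
```

For a large training set, roughly 1/e (about 37%) of the samples are left out of each bootstrap sample, so every tree has a held-out set on which it can be tested without a separate validation split.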
RAPSearch2 Implementation
RAPSearch2 is a fast protein similarity search tool and is therefore used in the similarity-based module of Woods.
Hybrid Implementation
To achieve faster and more accurate functional assessment than either a prediction-based or a similarity-based approach alone, Woods was developed by integrating RF and RAPSearch2. For genomic proteins, the query sequence is first analyzed using the RF module and classified into one of the 22 functional classes. The second, confirmatory step involves a similarity-based search using RAPSearch2. For metagenomic datasets, RF provides satisfactory (79.97%) predictions for complete proteins and for protein fragments of at least 500 amino acids in length. Therefore, complete ORFs are identified among the query sequences using the output of MetaGeneMark and are analyzed with Woods following the same methodology as for genomic proteins. The criteria used to carry out the assignments are shown below.
Parameters used for Performance Evaluation
The performance of RF was assessed using the following threshold-dependent parameters:

Sensitivity (Sn): Sensitivity measures the ability of the process to correctly identify positive cases, i.e. the fraction of sequences of a class that are assigned to that class.
Specificity (Sp): Specificity measures the ability of the process to correctly identify negative cases, i.e. the fraction of sequences of other classes that are not assigned to the class.
Accuracy (Acc): Accuracy measures the degree of agreement between the predicted results and the actual or experimental values.
Matthews Correlation Coefficient (MCC): In machine learning, MCC measures the quality of a binary classification, taking true and false positives and negatives into account.
Symbols and their meanings for an example class A:
tp: A sequence of class A classified in class A
fp: A sequence of any other class classified in class A
tn: A sequence of any other class not classified in class A
fn: A sequence of class A classified in any other class
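The four parameters can be computed from these counts as in the Python sketch below. It uses the standard textbook definitions, since the formulas themselves are not given in the text. Whether the values were reported as fractions or percentages is not stated, so fractions are assumed; the function name `class_metrics` and the example counts are hypothetical.

```python
import math


def class_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc and MCC for one class treated one-vs-rest.

    Standard definitions:
      Sn  = tp / (tp + fn)
      Sp  = tn / (tn + fp)
      Acc = (tp + tn) / (tp + fp + tn + fn)
      MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    Undefined ratios (zero denominator) are returned as 0.0 in this sketch.
    """
    total = tp + fp + tn + fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Illustrative counts only, not results from this study
metrics = class_metrics(tp=80, fp=10, tn=890, fn=20)
```

With 22 functional classes, the counts are obtained separately for each class by treating that class as positive and all other classes as negative.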

Performance of Woods on different datasets

Genomic Datasets
The RF model and Woods were tested on an independent dataset consisting of 50 selected bacterial genomes. The performance of Woods was almost 7% higher than that of the RF model alone.

Simulated Metagenomic Datasets
The performance of the RF model and Woods was also evaluated on simulated metagenomic datasets. These datasets, with different fragment lengths, were constructed from the protein sequences of 50 bacterial genomes using in-house Perl scripts.
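As an illustration of how fixed-length fragments might be drawn from protein sequences, a minimal Python sketch is given below. The in-house Perl scripts are not described in the text, so the sampling strategy shown here (uniformly random start positions, a fixed number of fragments per protein) is purely an assumption, and the function name and parameters are hypothetical.

```python
import random


def simulate_fragments(sequence, fragment_length, n_fragments, rng):
    """Sample n_fragments contiguous substrings of fragment_length residues.

    Start positions are chosen uniformly at random, so fragments may overlap.
    Proteins shorter than fragment_length yield no fragments.
    """
    if len(sequence) < fragment_length:
        return []
    max_start = len(sequence) - fragment_length
    starts = [rng.randint(0, max_start) for _ in range(n_fragments)]
    return [sequence[s:s + fragment_length] for s in starts]


rng = random.Random(1)
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence for illustration
fragments = simulate_fragments(protein, fragment_length=10, n_fragments=5, rng=rng)
```

Repeating the sampling with several values of `fragment_length` would give one simulated dataset per fragment length.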

Real Metagenomic Datasets
The performance of Woods was also examined using real metagenomic datasets. The human gut microbiome datasets for two European individuals (MH0006 and MH0012) were obtained, and a total of 308,223 and 324,939 ORFs, respectively, were predicted in these datasets using MetaGeneMark. Woods was run on the predicted ORFs of the two datasets and the results were compared with BLAST results. Woods took 7.75 and 8.81 CPU hours, whereas BLAST took 679.53 and 779.65 CPU hours, respectively, to complete the same task on an Intel Xeon 2.4 GHz CPU, which is a roughly 88-fold difference in both cases. These results validate the efficiency and capability of Woods in predicting the functional class of proteins in metagenomic datasets.