Complete 16S rRNA sequences along with their taxonomy information were obtained from the Greengenes database (DeSantis, et al., 2006). The database consisted of a total of 1,262,986 sequences. Different hypervariable regions were extracted from the 16S sequences using specific primers. Clustering was performed with CD-HIT to remove redundancy and representational bias among the sequences. The resulting datasets for each individual hypervariable region were then used to build the Random Forest (RF) models.
The "randomForest" package available in R (http://cran.r-project.org//) was used because of its robustness, ease of use, speed and high prediction accuracy. Nucleotide compositions were evaluated as input features for training the RF models. k-mers of 2 to 6 bp were evaluated separately. The frequency of each possible k-mer in a given sequence was calculated by dividing the number of occurrences of that k-mer in the sequence by the total number of k-mers present in the sequence. The 4-mer composition was selected as the input for RF.
RF is an ensemble learning method for classification and regression. Multiple decision trees are built and their outputs are combined, which gives higher accuracy and greater confidence in the prediction.
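How the outputs of many trees can be combined for classification is sketched below in Python, using a simple majority vote. The function name majority_vote is hypothetical, and reporting the vote fraction as a confidence value is an assumption for illustration; the text does not state how the confidence level was derived.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class labels into one label and a vote fraction.

    tree_predictions: non-empty list with one predicted label per tree.
    Returns (winning_label, fraction_of_trees_that_voted_for_it).
    """
    votes = Counter(tree_predictions)
    # most_common(1) returns the label with the highest vote count.
    label, count = votes.most_common(1)[0]
    return label, count / len(tree_predictions)


# Example: three hypothetical trees voting on the genus of one sequence.
label, support = majority_vote(["Bacteroides", "Bacteroides", "Prevotella"])
```

Here two of the three trees agree, so the combined call is "Bacteroides" with a vote fraction of 2/3.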
The following figure shows the various statistical measures derived for all HVR models during the development of 16S Classifier.
For each hypervariable region (HVR), clustering was performed and the training dataset was constructed from the unique representatives of each cluster. The test dataset was derived by taking at least 10% of the total sequences from each cluster. To simulate sequencing errors, at least 1% mutation was introduced into the HVR of every sequence in the test dataset. Test datasets were constructed using this approach for all HVR models. The performance of each RF model was assessed separately on the test dataset for that HVR. The performance of 16S Classifier on different datasets is shown below.
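The error-simulation step described above (introducing at least 1% mutation into each test sequence) could look roughly like the following Python sketch. The function name mutate, the restriction to base substitutions, and the uniform choice of positions are assumptions for illustration; the text does not specify the type of mutation or how positions were chosen.

```python
import math
import random


def mutate(sequence, rate=0.01, seed=None, alphabet="ACGT"):
    """Return a copy of sequence with at least `rate` of its bases substituted."""
    if not sequence:
        return sequence
    rng = random.Random(seed)
    # Round up so the mutated fraction is never below the requested rate.
    n_mut = max(1, math.ceil(rate * len(sequence)))
    # Pick distinct positions so each mutation changes a different base.
    positions = rng.sample(range(len(sequence)), n_mut)
    bases = list(sequence)
    for pos in positions:
        # Substitute with a base that differs from the current one.
        choices = [b for b in alphabet if b != bases[pos]]
        bases[pos] = rng.choice(choices)
    return "".join(bases)
```

Passing a fixed seed makes the simulated test set reproducible between runs.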
Publicly available sequence datasets for different HVRs were obtained from the SRA database of NCBI for as many regions as possible, either directly or by extracting the requisite regions from larger sequences. Primers were removed from all individual datasets before testing. BLAST and RDP Classifier were used to evaluate the performance of each RF model. The results of the BLAST program were taken as the reference for determining the correct taxonomic lineage of the test sequences obtained from the SRA database. The performance of 16S Classifier and RDP Classifier was then compared on the different test datasets using the BLAST results as the reference. For all test datasets, the results of 16S Classifier were found to be more accurate than those of RDP Classifier. The detailed comparison is shown below.
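The comparison against a reference assignment described above can be expressed as a small Python sketch. The function name accuracy_against_reference, the dictionary layout (sequence ID mapped to a taxon label) and the rule that a missing prediction counts as incorrect are all assumptions for illustration; the text does not state the exact metric or the taxonomic rank that was compared.

```python
def accuracy_against_reference(predicted, reference):
    """Fraction of reference sequences whose predicted taxon matches the reference.

    predicted, reference: dicts mapping sequence ID -> taxon label.
    Assumption: a sequence with no prediction is counted as incorrect.
    """
    if not reference:
        return 0.0
    correct = sum(
        1 for seq_id, taxon in reference.items()
        if predicted.get(seq_id) == taxon
    )
    return correct / len(reference)


# Hypothetical labels for four reads; not results from the study.
reference = {"r1": "Bacteroides", "r2": "Prevotella", "r3": "Escherichia", "r4": "Bacillus"}
classifier_a = {"r1": "Bacteroides", "r2": "Prevotella", "r3": "Escherichia", "r4": "Clostridium"}
classifier_b = {"r1": "Bacteroides", "r2": "Prevotella", "r3": "Shigella"}

acc_a = accuracy_against_reference(classifier_a, reference)
acc_b = accuracy_against_reference(classifier_b, reference)
```

Running the same function on the output of each classifier, with the same reference, gives directly comparable accuracy values.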