Datasets and Algorithm

Datasets
Protein sequences of 2,420 bacterial genomes were retrieved from NCBI, and their complete taxonomic information was retrieved from the Greengenes database. BLAST was performed for all proteins of the selected 2,406 genomes. EggNOG ids were extracted for all proteins, and the set of unique ids was stored for each genome.

Model Construction
The EggNOG ids of all the species in a particular phylum were pooled and deduplicated. The pooled id sets of all phyla were then compared with each other, and the ids occurring in only one phylum were extracted. This gave, for each phylum, a list of EggNOG ids unique to that phylum. Similarly, the unique EggNOG ids for each lower taxonomic level (class, order, family and genus) were extracted and stored.
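The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `unique_ids_per_taxon` and the two input dictionaries are hypothetical, and the sketch assumes that "unique" means an id occurs in exactly one taxon at the rank being processed.

```python
from collections import defaultdict


def unique_ids_per_taxon(genome_ids, genome_taxon):
    """Return {taxon: set of EggNOG ids found in that taxon only}.

    genome_ids   -- hypothetical input: {genome: set of EggNOG ids}
    genome_taxon -- hypothetical input: {genome: taxon name at one rank}
    """
    # Pool the ids of all genomes that belong to the same taxon.
    pooled = defaultdict(set)
    for genome, ids in genome_ids.items():
        pooled[genome_taxon[genome]] |= set(ids)

    # Count in how many taxa each id occurs.
    taxa_per_id = defaultdict(int)
    for ids in pooled.values():
        for egg_id in ids:
            taxa_per_id[egg_id] += 1

    # Keep only the ids seen in exactly one taxon (assumed meaning of "unique").
    return {
        taxon: {egg_id for egg_id in ids if taxa_per_id[egg_id] == 1}
        for taxon, ids in pooled.items()
    }
```

Calling the function once per rank, with `genome_taxon` mapping each genome to its phylum, class, order, family or genus, would yield one set of unique ids per taxon at that rank.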

Prediction Strategy
Microtaxi carries out taxonomic assignment of a genome using the sequences of the proteins encoded by the genome. The EggNOG ids of each protein present in a query genome are extracted by performing BLAST against the eggNOG 4.0 database. The EggNOG ids of the query genome are then searched against the unique EggNOG ids of each phylum, and the phylum showing the maximum number of matches with the query genome is selected. For the selected phylum, the EggNOG ids of each class present in that phylum are compared with the EggNOG ids of the query genome, and the class showing the maximum number of matches is selected.
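The top-down selection described above can be illustrated with a short Python sketch. It is not the tool's implementation: `best_match`, `assign_lineage`, the `markers` layout and the `children` mapping are hypothetical names chosen for the example. The sketch also makes three assumptions the text does not state: the per-rank unique id sets are used at every step, ties go to the first candidate encountered, and the descent stops when no candidate shares an id with the query.

```python
def best_match(query_ids, candidates):
    """Return (taxon, hits) for the candidate sharing most ids with the query.

    candidates -- {taxon: set of EggNOG ids}; ties keep the first taxon seen
    (an assumption, the tie-breaking rule is not given in the text).
    """
    best_taxon, best_hits = None, 0
    for taxon, ids in candidates.items():
        hits = len(query_ids & ids)
        if hits > best_hits:
            best_taxon, best_hits = taxon, hits
    return best_taxon, best_hits


def assign_lineage(query_ids, markers, children, ranks=("phylum", "class")):
    """Pick the best taxon at each rank, restricted to the previous choice.

    markers  -- hypothetical layout: {rank: {taxon: set of unique EggNOG ids}}
    children -- hypothetical layout: {taxon: list of child taxa at next rank}
    """
    lineage = []
    candidates = markers[ranks[0]]
    for depth, rank in enumerate(ranks):
        taxon, _hits = best_match(query_ids, candidates)
        if taxon is None:  # no shared ids: stop descending (assumption)
            break
        lineage.append((rank, taxon))
        if depth + 1 < len(ranks):
            next_rank = ranks[depth + 1]
            # Only taxa inside the selected parent are considered next.
            candidates = {
                child: markers[next_rank][child]
                for child in children.get(taxon, [])
                if child in markers[next_rank]
            }
    return lineage
```

Passing a longer `ranks` tuple (for example adding order, family and genus) would repeat the same step at the lower levels, provided `markers` and `children` cover them.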

Performance of Microtaxi on Different Datasets
Self-Testing
Because only a fraction of the total EggNOG ids (the unique EggNOG ids) was used to train Microtaxi, all 2,406 genomes were examined to evaluate its prediction accuracy. Microtaxi showed 100% accuracy at all taxonomic levels.

Validation on Left-Out Dataset
In this set, 56 genomes were randomly selected from genera that contained ≥ 9 species. These 56 species were used for testing and the remaining 2,350 species were used for training. The results are given in the table below.
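A split of this kind could be produced as in the sketch below. It is only an illustration under stated assumptions: the function name `split_left_out`, the `genome_genus` mapping and the fixed `seed` are hypothetical, and each genome is treated as one species when counting genus size.

```python
import random
from collections import defaultdict


def split_left_out(genome_genus, n_test, min_species=9, seed=0):
    """Return (train, test) sets of genomes.

    genome_genus -- hypothetical input: {genome: genus name}
    Test genomes are drawn only from genera with at least `min_species`
    members; each genome is counted as one species (an assumption).
    """
    by_genus = defaultdict(list)
    for genome, genus in genome_genus.items():
        by_genus[genus].append(genome)

    # Genomes eligible for the test set, sorted for a reproducible draw.
    eligible = sorted(
        genome
        for members in by_genus.values()
        if len(members) >= min_species
        for genome in members
    )

    rng = random.Random(seed)  # the seed is illustrative only
    test = set(rng.sample(eligible, n_test))
    train = set(genome_genus) - test
    return train, test
```

With the numbers given in the text, `n_test=56` on the 2,406 genomes would leave 2,350 genomes for training.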

Validation on Newly Published Genomes
In this set, 20 recently published complete genomes which were not included in the training dataset were taken. The results are shown in the table below:

p: phylum, c: class, o: order, f: family, g: genus, s: species

Validation on Genomes with Incomplete Taxonomy
In this set, 17 bacterial genomes for which the complete taxonomy is not known were taken. The results are shown in the table below:

p: phylum, c: class, o: order, f: family, g: genus, s: species, CND: CanNot Determine