This tutorial describes how to load data from a large tree efficiently using TopiaryExplorer and QIIME. The process involves filtering the tree down to a smaller set of tips that still represents the full diversity of the tree. The figures in this tutorial were produced by applying this process to the full Greengenes tree (408,135 tips).
To apply this process you’ll need your tree of interest in Newick format, as well as the unaligned sequences represented by the tips of the tree in FASTA format. Each tip name must match, in full, the first space-separated field of the corresponding sequence identifier in the FASTA file. For example, if a record in your FASTA file looks like:
>sequence1 some comment about the sequence
AAACCCCCCCCCCCCCCCCCAAAAAAAAAAATTTTTTTTT
The tip representing this sequence in the tree must be sequence1.
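One way to check this requirement before clustering is to list the identifiers as they would be read from the FASTA file. The following is a minimal sketch, not a QIIME command: the function name fasta_ids is hypothetical, and it assumes that every header line starts with > and that the identifier ends at the first space.

```shell
# Hypothetical helper (not part of QIIME): print the first
# space-separated field of every FASTA header line, without the leading ">".
fasta_ids() {
    grep '^>' "$1" | sed 's/^>//' | cut -d ' ' -f 1
}

# Example usage, assuming the file name used in this tutorial:
#   fasta_ids inseqs_full.fna | sort | uniq -d
# Any line printed by this pipeline is an identifier that occurs more than once.
```

Each identifier printed by fasta_ids should appear as a tip name in the tree.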
Assuming your input sequence collection is called inseqs_full.fna, run the following command to cluster the sequences into 99% OTUs:
pick_otus.py -m uclust -D -i inseqs_full.fna -s 0.99 -o otus
Next, select the centroid of each OTU as the tip that will remain in the tree to represent the corresponding cluster of 99% identical sequences. The centroid is taken from the second tab-separated field of each line in the OTU map:
awk 'BEGIN {FS="\t"};{print $2}' otus/inseqs_full_otus.txt > tips_to_keep.txt
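To make the awk step concrete, here is a toy run. The file names toy_otus.txt and toy_tips_to_keep.txt and the sequence identifiers are made up for illustration, and the layout is assumed to be tab-separated with the OTU identifier in the first field and the centroid in the second.

```shell
# Toy OTU map in the assumed layout: OTU id, then member sequence ids,
# all tab-separated, with the centroid listed in the second field.
printf '0\tseqA\tseqB\tseqC\n1\tseqD\n' > toy_otus.txt

# Same awk program as in the tutorial: keep only the second field of each row.
awk 'BEGIN {FS="\t"};{print $2}' toy_otus.txt > toy_tips_to_keep.txt

# Show the result: one representative identifier per OTU (seqA, then seqD).
cat toy_tips_to_keep.txt
```

The output file has one line per OTU, so counting the lines of tips_to_keep.txt with wc -l gives the number of tips the filtered tree is expected to have.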
Finally, filter the full tree (called full.tre here) so that it contains only the representative tip for each of the 99% OTUs:
filter_tree.py -i full.tre -o filtered_99.tre -t tips_to_keep.txt
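A quick sanity check is to compare the number of tips in the filtered tree with the number of identifiers in tips_to_keep.txt. The sketch below is a rough heuristic rather than a Newick parser: the function name count_newick_tips is hypothetical, and it assumes that no tip name or comment contains a comma, in which case the number of tips equals the number of commas plus one.

```shell
# Hypothetical helper (not part of QIIME): estimate the number of tips in a
# Newick file as (number of commas + 1). This only holds when commas occur
# solely as separators between sibling subtrees.
count_newick_tips() {
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}

# Example usage, assuming the file names used in this tutorial:
#   count_newick_tips filtered_99.tre
#   wc -l < tips_to_keep.txt
```

If the two numbers differ, some identifiers in tips_to_keep.txt may not match any tip name in full.tre, which is worth checking before opening the tree in TopiaryExplorer.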
Open TopiaryExplorer and create a new project using the new project dialog. Open the tree and any metadata you have about the tips.
Color the branches by a metadata field of interest. In this example we’re coloring by CATEGORY1.
Note
Using the interpolate colors function with large trees can take a while (this example took a few minutes to color).