Metagenomics of Acid Soil: a study of Nanopore long-reads and Acidobacteria

tl;dr: check out a summarised version of the dissertation.

I developed a tool, acidoseq, to classify unclassified Acidobacteria; it is described in more detail on its project page.

Introduction

Since I enjoyed my Summer Bioinformatics research position (2017), for my Masters dissertation I hoped to work with Amanda Clare again.

Amanda spoke to the same research group, who had new sequences for me to look at. These reads were derived using Nanopore sequencing from Aberystwyth soil. The output comprised ~2 million sequences, each made up of the bases A, C, G and T.

Week 1

Acidobacteria is a phylum of bacteria belonging to the bacteria kingdom. It was only recognised in 2012 despite being among the most abundant and diverse phyla in Earth’s soils. It has been observed in mines, soils, and metal-contaminated soils; coincidentally, there are metal mines near Aberystwyth that have contaminated the Rheidol streams, possibly resulting in metal-contaminated soils.

Week 2

Amanda and I discussed looking into a variety of tools for summarising the data: read count, quality, and time-yield plots. We also discussed using BLAST to identify which species were present.

I used Kaiju and found that Acidobacteria was present, with a major subset classified as “unclassified” (not yet placed in a class group/subdivision). Furthermore, the literature indicated that the GC content of the genomes is consistent within their subdivisions. For example, species in subdivision 3 tend to share a similar GC content. Which subdivisions appear also depends on pH, e.g. at a pH of 4, subdivisions 1, 2, 3, and 13 are more likely to appear.

Week 3

Amanda and I discussed finding a way to extract the Acidobacteria sequences which Kaiju classified. Although Kaiju provided an output with annotated sequence IDs, we couldn’t determine which were Acidobacteria due to the absence of taxon IDs.

This is where the idea of acidoseq emerged: taking a Kaiju output file and a list of Acidobacteria taxonomy IDs, and merging the two. I started development in Python.
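The merge idea can be sketched roughly as follows. This is not the acidoseq code itself, just a minimal illustration. It assumes the Kaiju output is tab-separated with a C/U flag in the first column, the read ID in the second and a taxonomy ID in the third. The function name `acidobacteria_reads` and the IDs in the example are made-up placeholders.

```python
import csv
import io


def acidobacteria_reads(kaiju_lines, acido_taxon_ids):
    """Yield (read_id, taxon_id) for classified reads whose taxon ID is in the given set.

    Assumes tab-separated rows of the form: flag, read ID, taxon ID, ...
    where the flag is "C" (classified) or "U" (unclassified).
    """
    for row in csv.reader(kaiju_lines, delimiter="\t"):
        # Skip malformed rows and reads Kaiju could not classify.
        if len(row) < 3 or row[0] != "C":
            continue
        read_id, taxon_id = row[1], row[2]
        if taxon_id in acido_taxon_ids:
            yield read_id, taxon_id


# Placeholder data: "111" and "222" are NOT real taxonomy IDs.
example_output = io.StringIO("C\tread_1\t111\nU\tread_2\t0\nC\tread_3\t222\n")
hits = list(acidobacteria_reads(example_output, {"111"}))
print(hits)
```

Keeping the taxonomy IDs in a set means each read is checked in constant time, which matters with a couple of million reads.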

Week 4

As mentioned, the literature highlighted that GC content in some subdivisions was consistent. I downloaded the full and partial genomes of Acidobacteria and found that the GC content was somewhat consistent within every subdivision.

To expand acidoseq, we discussed adding a way to investigate GC content in Acidobacteria sequences for patterns, and to plot it by subdivision.
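For anyone unfamiliar with the term, GC content is simply the percentage of bases in a sequence that are G or C. A minimal sketch of the calculation is below. The function name `gc_content` is my own for illustration, and ignoring ambiguous bases such as N is an assumption rather than something acidoseq necessarily does.

```python
def gc_content(sequence):
    """Return the GC content of a DNA sequence as a percentage (0-100).

    Only A, C, G and T are counted; anything else (e.g. N) is ignored.
    """
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        # Avoid dividing by zero for empty or fully ambiguous sequences.
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


print(gc_content("ACGT"))
print(gc_content("ggccNN"))
```

The AT content follows directly as 100 minus the GC content.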

Week 5

The BLAST job on the 2 million reads took a month to process only 400,000 of them. Amanda recommended Blast2Go, which runs locally and looks at the genes in further detail.

Week 6

Using Blast2Go meant we were able to create a database of Acidobacteria genomes and run a local BLAST to find the sequences identified as Acidobacteria. In acidoseq, I added the ability to compare AT content against GC content (high AT suggests unstable DNA).

Week 7

I added a feature to acidoseq that takes a particular GC content and outputs the subdivisions whose sequences have that GC content.
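One way such a lookup could work is sketched below. It assumes each subdivision is summarised by a minimum and maximum GC percentage, which may not be how acidoseq stores it. The names `EXAMPLE_GC_RANGES` and `subdivisions_for_gc` are hypothetical, and the numbers are placeholders rather than measured values.

```python
# Placeholder GC ranges in percent: invented for illustration, NOT real data.
EXAMPLE_GC_RANGES = {
    "subdivision_x": (57.0, 61.0),
    "subdivision_y": (58.0, 62.0),
    "subdivision_z": (50.0, 55.0),
}


def subdivisions_for_gc(gc_percent, ranges=EXAMPLE_GC_RANGES):
    """Return the sorted names of subdivisions whose GC range contains gc_percent."""
    return sorted(
        name
        for name, (low, high) in ranges.items()
        if low <= gc_percent <= high
    )


print(subdivisions_for_gc(59.0))
print(subdivisions_for_gc(70.0))
```

Because ranges can overlap, a single GC value may match more than one subdivision, so the function returns a list rather than a single name.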

Week 8

Amanda and I discussed assembly: building up the sequences into larger ones. Amanda suggested the tool Miniasm.

Week 9

The assembly job with Miniasm was unsuccessful: because soil is so diverse, the output didn’t build up into larger sequences, the largest being only 16,000 base pairs long.

Week 10

I started to expand acidoseq into a usable tool for the scientific community, and looked into command line options with Click and how I would package it for installation with pip.

Week 11

I filtered the data down to reads with a quality score of at least 12 and a read length of at least 2,500, which left 89 reads. We decided to use Blast2Go to do a final run and look into the genes. The output for Acidobacteria was annotated with the Gene Ontology; however, I ran out of time to explore the results. The final two weeks were mostly spent writing up my dissertation.
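The filtering step can be sketched like this. It is not the command I actually ran, only an illustration of the thresholds. It assumes reads are already parsed into (name, sequence, quality string) tuples with Phred+33 encoded qualities, and that "quality score" means the mean Phred score averaged via error probabilities. The function names are my own.

```python
import math


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read, averaged over per-base error probabilities."""
    if not quality_string:
        return 0.0
    # Convert each Phred score back to an error probability before averaging.
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(records, min_quality=12, min_length=2500):
    """Yield (name, seq, qual) records that meet both thresholds."""
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_quality(qual) >= min_quality:
            yield name, seq, qual


# Toy records: "5" encodes Phred 20 and "#" encodes Phred 2.
records = [
    ("long_good", "A" * 3000, "5" * 3000),
    ("long_poor", "A" * 3000, "#" * 3000),
    ("short_good", "A" * 100, "5" * 100),
]
kept = [name for name, _, _ in filter_reads(records)]
print(kept)
```

Averaging the error probabilities rather than the raw scores stops a few high-quality bases from masking a mostly poor read.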

Week 12

During my final meeting, Amanda and I discussed corrections, and she provided great feedback. Three days later, I submitted!

Wrapping-up

The tool I developed, acidoseq, was packaged up and made available on GitHub.

My Masters is complete and it feels great! The dissertation is available to read and includes more information! A summarised version is also available (the aim was to publish, which unfortunately we didn’t find the time to complete).

I had such fun with this project that I made a Twitter bot, acidobot, that dispensed facts about Acidobacteria once a day! (It has since been discontinued.)

I would like to thank my supervisor Amanda for providing such a fun research experience and Arwyn Edwards and team for the intellectual engagement and access to data.

After submission, I only had 4 days until the start of my PhD. Almost a year after submitting my Masters dissertation…I had my Masters graduation!