Blobology
---------

The steps for generating taxon-annotated 'blob' plots are as follows:

-  Subsample your contigs to a reasonable number (to speed the process for very big assemblies)
-  Perform rough taxonomic assignment (e.g. using BLASTN against NCBI's
   ``nt`` database)
-  Map reads back to contigs (e.g. using Bowtie2 or BWA)
-  Generate a file with GC, taxon and coverage data for plotting
-  Plot graph in R using ggplot2

Full instructions for running Blobology and more advanced usage can be
found at:

https://github.com/sujaikumar/assemblage

Also see the paper:

    Sujai Kumar, Martin Jones, Georgios Koutsovoulos, Michael Clarke and Mark Blaxter Blobology: exploring raw genome data for contaminants, symbionts, and parasites using taxon-annotated GC-coverage plots  doi: 10.3389/fgene.2013.00237  http://www.frontiersin.org/Journal/10.3389/fgene.2013.00237/abstract

Although this tutorial uses the assemblage github repository, the repository for this paper is at https://github.com/blaxterlab/blobology
The scripts at blaxterlab/blobology can be easier to use if you already have bam or coverage files.

https://github.com/sujaikumar/assemblage also has other scripts/workflows relevant to genome assembly, annotation, finding conserved non-coding elements, etc.

Also please see Sujai Kumar's recent presentation at Beatles and
Bioinformatics:

http://www.youtube.com/watch?v=tSul\_qDwvN4&t=3h10m50s

Installing dependencies
~~~~~~~~~~~~~~~~~~~~~~~

BLAST+ for taxonomic assignment

::

    sudo apt-get install ncbi-blast+

Install R and required packages (requires R >=2.14).

::

    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-core-dbg_3.0.2-1_amd64.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-core_3.0.2-1_amd64.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-dev_3.0.2-1_all.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base-html_3.0.2-1_all.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-base_3.0.2-1_all.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-doc-html_3.0.2-1_all.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-doc-info_3.0.2-1_all.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-doc-pdf_3.0.2-1_all.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-mathlib_3.0.2-1_amd64.deb
    wget http://athyra.ged.msu.edu/~mcrusoe/debs/oneiric/r-recommended_3.0.2-1_all.deb
    dpkg -i r-base* r-base-core_* r-recommended* r-mathlib*
    R
    install.packages(c("codetools", "MASS", "ggplot2"))

After ths point, the following dependencies have already been installed on AMI public snapshot (in us-east): ``snap-78cf1764`` so if you mount this snapshot as a volume you can skip the following steps, but you do need to add the programs to your $PATH:

::

    cd /mynewmount
    source env.sh

We will use seqtk for subsampling contigs (can also be used for reads)

::

    git clone https://github.com/lh3/seqtk.git
    cd seqtk; make; cd ..

BWA for read mapping

::

    sudo apt-get install bwa

Sujai Kumar's Assemblage for the scripts we need:

::

    git clone https://github.com/sujaikumar/assemblage

The NCBI nt files, plus taxonomy information

::

    wget ftp://ftp.ncbi.nih.gov/blast/db/nt.??.tar.gz
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Running blobology
~~~~~~~~~~~~~~~~~

Assume you have a contig file from your assembly called 'contigs.fasta'

First, subsample 10,000 contigs from the file:

::

    seqtk sample contigs.fasta 10000 > contigs_10000.fasta

BLAST the reads

::

    blastn -task megablast -query contigs_10000.fasta -db blast/nt -max_target_seqs 1 -outfmt 6 > contigs_10000.megablast.nt

Produce a taxonomy file

::

    blast_taxonomy_report.pl \
    -b contigs_10000.megablast.nt \
    -nodes tax/nodes.dmp \
    -names tax/names.dmp \
    -gi_taxid_file tax/gi_taxid_nucl.dmp.gz \
    -t genus=1 -t order=1 -t family=1 -t superfamily=1 -t kingdom=1 \
    >contigs_10000.megablast.nt.taxon

Create a database of your contigs for BWA

::

    bwa index contigs.fasta

Align the reads you used to make the assembly to the contigs database (for paired-end reads)

::

    bwa mem contigs.fasta pair1.fasta.gz pair2.fasta.gz > samfile

If you have interleaved reads, or single-end, just omit the second FASTQ file

::

    bwa mem contigs.fasta reads.fasta.gz > samfile

Create a file with coverage and GC information, this will be named
according to your samfile name with .lencovgc.txt as the suffix.

Ensure you change the name of the files below before running the
command.

::

    sam_len_cov_gc_insert.pl -s samfile -f contigs.fasta

    cat contigs_10000.megablast.nt.taxon samfile.lencovgc.txt | 
    perl -anF"\t" -e '
        chomp;
        /^(\S+).*\torder\t([^\t]+)/ and $o{$1} = $2 and next;
        if ($F[2] =~ /^\d+$/ ) { print "$_\t" . (exists $o{$F[1]} ? $o{$F[1]} : "Not annotated") . "\n" }
    ' >> lencovgc.taxon

If you want a different taxonomic level, e.g. species, download this script:

::

    wget http://static.xbase.ac.uk/files/results/nick/make_blobology_file.py
    python make_blobology_file.py contigs_10000.megablast.nt.taxon samfile.lencovgc.txt order > lencovgc.taxon

Just replace 'order' with another taxonomic level when you run make_blobology_file.py, e.g.: species, genus, family, superfamily, kingdom

Create the blobology plot

::

    Rscript --vanilla ../bin/assemblage/blobology.R lencovgc.taxon 0.005 1 2

The final file is called lencovgc.taxon.png - you will need to download
this to view it (e.g. with `sftp`, `scp` run on your local machine).

::

    scp -i identity.pem root@server:/path/to/lencovgc.taxon.png .