You’ll need khmer and its dependencies installed:
cd /root
pip install -e git+https://github.com/ged-lab/khmer.git@kmer_error_profile#egg=khmer
The khmer src (including sandbox scripts and data) will then be placed in /root/src/khmer/.
If installing elsewhere you may need to modify install [...] to install –user [...]
We’ll be using the khmer command line scripts.
Download two data sets:
cd /mnt
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
(The sequencing data set is the first 5m reads from Chitsaz et al., 2011.)
There are a bunch of graphs in an IPython Notebook; you can view the rendered notebook in your Web viewer or you can download it to work with on your Amazon machine (https:// + machine name, password ‘beacon’, and on the command line:
cd /usr/local/notebooks
curl -O https://raw.github.com/ngs-docs/2013-davis-assembly/master/2013-davis-assembly-khmer.ipynb
See Eel Pond mRNAseq assembly protocol and Kalamazoo metagenome assembly protocols.
Also see NGS course materials and other workshop materials.
Note ‘Using screen’ and ‘Installing Dropbox’.
If you want to estimate the non-repetitive sequence content of your sample, you can use the three-pass assembly protocol (see Kalamazoo)
normalize-by-median.py -k 20 -C 20 -x 2e9 -N 4 --savehash C20.kh <FASTA/FASTQ/GZ/BZ2 files>
filter-abund.py C20.kh *.keep
normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 *.keep.abundfilt
Finally, get the complete stats on the remaining reads:
python /path/to/khmer/sandbox/readstats.py *.keep.abundfilt.keep
Divide the total number of bases remaining by 5 and you’ll have an estimate of your total genome size. (Technically, the total number of nodes in your De Bruijn graph. :)
Note that the ‘-x’ and ‘-N’ parameters, above, should multiple to be under the total amount of memory on your computer. Scripts will complain if this is set too low.
Run:
cd /mnt
mkdir richness
cd richness
normalize-by-median.py -k 20 -C 20 -x 5e8 -N 4 --savehash C20.kh ../ecoli_ref-5m.fastq.gz
filter-abund.py C20.kh *.keep
normalize-by-median.py -k 20 -C 5 -x 5e8 -N 4 *.keep.abundfilt
python /root/src/khmer/sandbox/readstats.py *.keep.abundfilt.keep
For this data, you should get:
33151414 bp / 385233 seqs; 86.1 average length -- total
which works out to:
385233 * 86.1 / 5 = 6633712.26
so the predicted genome size of E. coli is 6.6 Mbp. Not too far off... the true answer is 4.5 Mbp.
First, we need to install the BWA aligner:
cd /root
wget -O bwa-0.7.5.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2/download
tar xvfj bwa-0.7.5.tar.bz2
cd bwa-0.7.5a
make
cp bwa /usr/local/bin
We also need a new version of samtools:
cd /root
curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar xvfj samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
cp samtools /usr/local/bin
cp bcftools/bcftools /usr/local/bin
cd misc/
cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/
Next, map the raw reads:
cd /mnt/richness
gunzip ../ecoliMG1655.fa.gz
bwa index ../ecoliMG1655.fa
bwa aln ../ecoliMG1655.fa ../ecoli_ref-5m.fastq.gz > raw.sai
bwa samse ../ecoliMG1655.fa raw.sai ../ecoli_ref-5m.fastq.gz > raw.sam
samtools faidx ../ecoliMG1655.fa
samtools import ../ecoliMG1655.fa.fai raw.sam raw.bam
samtools sort raw.bam raw.sorted
samtools index raw.sorted.bam
and map the diginormed reads:
bwa aln ../ecoliMG1655.fa ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sai
bwa samse ../ecoliMG1655.fa dn.sai ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sam
samtools import ../ecoliMG1655.fa.fai dn.sam dn.bam
samtools sort dn.bam dn.sorted
samtools index dn.sorted.bam
Now, view with ‘samtools tview’:
samtools tview raw.sorted.bam ../ecoliMG1655.fa
or:
samtools tview dn.sorted.bam ../ecoliMG1655.fa
You can use normalize-by-median.py to assess information saturation in your shotgun data:
normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 -R report.txt <FASTA/FASTQ/GZ/BZ2 files>
‘report.txt’ column 1 will then contain the number of reads examined, and column 2 will contain the number of reads kept. When this levels off, you’ve saturated to a coverage of 5.
Run:
cd /mnt
mkdir saturate
cd saturate
normalize-by-median.py -k 20 -C 5 -x 5e8 -N 4 -R ecoli_5m-report.txt ../ecoli_ref-5m.fastq.gz
The results are in ‘/mnt/saturate/ecoli_5m-report.txt’.
First, load all the k-mers in the data set into a counting file:
load-into-counting.py -k 20 -x 2e9 -N 4 counts.kh reads.fq
Then calculate the spectrum of read coverages:
python /path/to/khmer/sandbox/calc-median-distribution.py counts.kh reads.fq reads.hist
First, let’s do it for a small data set with interesting coverage:
cd /mnt
mkdir cover1
cd cover1
load-into-counting.py -k 20 -x 1e8 -N 4 counts.kh /root/src/khmer/data/stamps-reads.fa.gz
python /root/src/khmer/sandbox/calc-median-distribution.py counts.kh /root/src/khmer/data/stamps-reads.fa.gz reads.hist
The results are in ‘/mnt/cover1/reads.hist’.
Now try E. coli:
cd /mnt
mkdir cover2
cd cover2
load-into-counting.py -k 20 -x 1e9 -N 4 counts.kh ../ecoli_ref-5m.fastq.gz
python /root/src/khmer/sandbox/calc-median-distribution.py counts.kh ../ecoli_ref-5m.fastq.gz reads.hist
The results are in ‘/mnt/cover2/reads.hist’.
calc-error-profile.py reads.fq
For example,
cd /mnt
mkdir error
cd error
calc-error-profile.py ../ecoli_ref-5m.fastq.gz
The results will be in ‘/mnt/error/ecoli_ref-5m.fastq.gz.errhist’.
Do:
cd /mnt
mkdir kmercov
cd kmercov
load-into-counting.py -k 20 -x 1e7 -N 4 counts.kh /root/src/khmer/data/stamps-reads.fa.gz
abundance-dist.py counts.kh /root/src/khmer/data/stamps-genomes.fa counts.out
The results will be in ‘/mnt/kmercov/counts.out’.
Do:
cd /mnt
mkdir part
cd part
do-partition.py -k 32 -x 1e7 -N 4 stamps /root/src/khmer/data/stamps-reads.fa.gz
extract-partitions.py -X 1000 stamps *.part
You will now have two files, stamps.group0000.fa and stamps.group0001.fa, that represent the two “species” in the data set. To examine them, let’s graph their abundances:
abundance-dist-single.py -k 20 -x 1e7 -N 4 stamps.group0000.fa group0.hist
abundance-dist-single.py -k 20 -x 1e7 -N 4 stamps.group0001.fa group1.hist
The hist files will be in ‘/mnt/part/group0.hist’ and ‘/mnt/part/group1.hist’.
To get started on the mRNAseq protocol with a small data set, do:
mkdir /data
cd /data
curl -O http://athyra.idyll.org/~t/mrnaseq-subset.tar
tar xvf mrnaseq-subset.tar
and then start at “Link your data into a working directory”. (You’ll need to install the software at the top, too.)
To get started on the metagenomics protocol with a small data set, do:
mkdir /data
cd /data
curl -O http://athyra.idyll.org/~t/metag-subset.tar
tar xvf metag-subset.tar
and then start at “Link your data into a working directory”. (You’ll need to install the software at the top, too.)
Development and posting of this material, and the associated workshop, were supported by Grant Number R25HG006243 from the National Human Genome Research Institute and an NSF OCI supplement to NSF DBI-0939454.
This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.
For an introduction to the documentation format please see the reST primer.