Notes on using khmer to do stuff¶

Software¶

You’ll need khmer and its dependencies installed:

cd /root
pip install -e git+https://github.com/ged-lab/khmer.git@kmer_error_profile#egg=khmer

The khmer src (including sandbox scripts and data) will then be placed in /root/src/khmer/.

If installing elsewhere you may need to modify install [...] to install –user [...]

We’ll be using the khmer command line scripts.

Data¶

Download two data sets:

cd /mnt
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz

(The sequencing data set is the first 5m reads from Chitsaz et al., 2011.)

Graphs¶

There are a bunch of graphs in an IPython Notebook; you can view the rendered notebook in your Web viewer or you can download it to work with on your Amazon machine (https:// + machine name, password ‘beacon’, and on the command line:

cd /usr/local/notebooks
curl -O https://raw.github.com/ngs-docs/2013-davis-assembly/master/2013-davis-assembly-khmer.ipynb

0. Assembling mRNAseq and metagenome data in the cloud¶

See Eel Pond mRNAseq assembly protocol and Kalamazoo metagenome assembly protocols.

Also see NGS course materials and other workshop materials.

Note ‘Using screen’ and ‘Installing Dropbox’.

1. Estimating genomic richness/non-repetitive genome content¶

If you want to estimate the non-repetitive sequence content of your sample, you can use the three-pass assembly protocol (see Kalamazoo)

normalize-by-median.py -k 20 -C 20 -x 2e9 -N 4 --savehash C20.kh <FASTA/FASTQ/GZ/BZ2 files>
filter-abund.py C20.kh *.keep
normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 *.keep.abundfilt

Finally, get the complete stats on the remaining reads:

python /path/to/khmer/sandbox/readstats.py *.keep.abundfilt.keep

Divide the total number of bases remaining by 5 and you’ll have an estimate of your total genome size. (Technically, the total number of nodes in your De Bruijn graph. :)

Note that the ‘-x’ and ‘-N’ parameters, above, should multiple to be under the total amount of memory on your computer. Scripts will complain if this is set too low.

Example¶

Run:

cd /mnt
mkdir richness
cd richness
normalize-by-median.py -k 20 -C 20 -x 5e8 -N 4 --savehash C20.kh ../ecoli_ref-5m.fastq.gz
filter-abund.py C20.kh *.keep
normalize-by-median.py -k 20 -C 5 -x 5e8 -N 4 *.keep.abundfilt

python /root/src/khmer/sandbox/readstats.py *.keep.abundfilt.keep

For this data, you should get:

33151414 bp / 385233 seqs; 86.1 average length -- total

which works out to:

385233 * 86.1 / 5 = 6633712.26

so the predicted genome size of E. coli is 6.6 Mbp. Not too far off... the true answer is 4.5 Mbp.

Looking at diginorm coverage¶

First, we need to install the BWA aligner:

cd /root
wget -O bwa-0.7.5.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2/download

tar xvfj bwa-0.7.5.tar.bz2
cd bwa-0.7.5a
make

cp bwa /usr/local/bin

We also need a new version of samtools:

cd /root
curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar xvfj samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
cp samtools /usr/local/bin
cp bcftools/bcftools /usr/local/bin
cd misc/
cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/

Next, map the raw reads:

cd /mnt/richness
gunzip ../ecoliMG1655.fa.gz
bwa index ../ecoliMG1655.fa

bwa aln ../ecoliMG1655.fa ../ecoli_ref-5m.fastq.gz > raw.sai
bwa samse ../ecoliMG1655.fa raw.sai ../ecoli_ref-5m.fastq.gz > raw.sam

samtools faidx ../ecoliMG1655.fa
samtools import ../ecoliMG1655.fa.fai raw.sam raw.bam
samtools sort raw.bam raw.sorted
samtools index raw.sorted.bam

and map the diginormed reads:

bwa aln ../ecoliMG1655.fa ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sai
bwa samse ../ecoliMG1655.fa dn.sai ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sam

samtools import ../ecoliMG1655.fa.fai dn.sam dn.bam
samtools sort dn.bam dn.sorted
samtools index dn.sorted.bam

Now, view with ‘samtools tview’:

samtools tview raw.sorted.bam ../ecoliMG1655.fa

or:

samtools tview dn.sorted.bam ../ecoliMG1655.fa

2. Generating a saturation curve for shotgun data¶

You can use normalize-by-median.py to assess information saturation in your shotgun data:

normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 -R report.txt <FASTA/FASTQ/GZ/BZ2 files>

‘report.txt’ column 1 will then contain the number of reads examined, and column 2 will contain the number of reads kept. When this levels off, you’ve saturated to a coverage of 5.

Example¶

Run:

cd /mnt
mkdir saturate
cd saturate
normalize-by-median.py -k 20 -C 5 -x 5e8 -N 4 -R ecoli_5m-report.txt ../ecoli_ref-5m.fastq.gz

The results are in ‘/mnt/saturate/ecoli_5m-report.txt’.

3. Generating a coverage plot/coverage spectrum without a reference¶

First, load all the k-mers in the data set into a counting file:

load-into-counting.py -k 20 -x 2e9 -N 4 counts.kh reads.fq

Then calculate the spectrum of read coverages:

python /path/to/khmer/sandbox/calc-median-distribution.py counts.kh reads.fq reads.hist

Example 1¶

First, let’s do it for a small data set with interesting coverage:

cd /mnt
mkdir cover1
cd cover1
load-into-counting.py -k 20 -x 1e8 -N 4 counts.kh /root/src/khmer/data/stamps-reads.fa.gz
python /root/src/khmer/sandbox/calc-median-distribution.py counts.kh /root/src/khmer/data/stamps-reads.fa.gz reads.hist

The results are in ‘/mnt/cover1/reads.hist’.

Example 2¶

Now try E. coli:

cd /mnt
mkdir cover2
cd cover2
load-into-counting.py -k 20 -x 1e9 -N 4 counts.kh ../ecoli_ref-5m.fastq.gz
python /root/src/khmer/sandbox/calc-median-distribution.py counts.kh ../ecoli_ref-5m.fastq.gz reads.hist

The results are in ‘/mnt/cover2/reads.hist’.

4. Generating an error profile for individual shotgun reads¶

calc-error-profile.py reads.fq

For example,

cd /mnt
mkdir error
cd error

calc-error-profile.py ../ecoli_ref-5m.fastq.gz

The results will be in ‘/mnt/error/ecoli_ref-5m.fastq.gz.errhist’.

5. Assembly-filtered k-mer spectrum¶

Do:

cd /mnt
mkdir kmercov
cd kmercov

load-into-counting.py -k 20 -x 1e7 -N 4 counts.kh /root/src/khmer/data/stamps-reads.fa.gz
abundance-dist.py counts.kh /root/src/khmer/data/stamps-genomes.fa counts.out

The results will be in ‘/mnt/kmercov/counts.out’.

6. Partitioning¶

Do:

cd /mnt
mkdir part
cd part

do-partition.py -k 32 -x 1e7 -N 4 stamps /root/src/khmer/data/stamps-reads.fa.gz

extract-partitions.py -X 1000 stamps *.part

You will now have two files, stamps.group0000.fa and stamps.group0001.fa, that represent the two “species” in the data set. To examine them, let’s graph their abundances:

abundance-dist-single.py -k 20 -x 1e7 -N 4 stamps.group0000.fa group0.hist
abundance-dist-single.py -k 20 -x 1e7 -N 4 stamps.group0001.fa group1.hist

The hist files will be in ‘/mnt/part/group0.hist’ and ‘/mnt/part/group1.hist’.

7. A small demo set for khmer protocols/mRNAseq¶

To get started on the mRNAseq protocol with a small data set, do:

mkdir /data
cd /data
curl -O http://athyra.idyll.org/~t/mrnaseq-subset.tar
tar xvf mrnaseq-subset.tar

and then start at “Link your data into a working directory”. (You’ll need to install the software at the top, too.)

8. A small demo set for khmer protocols/metagenomics¶

To get started on the metagenomics protocol with a small data set, do:

mkdir /data
cd /data
curl -O http://athyra.idyll.org/~t/metag-subset.tar
tar xvf metag-subset.tar

and then start at “Link your data into a working directory”. (You’ll need to install the software at the top, too.)

Notes on using khmer to do stuff¶

Software¶

Data¶

Graphs¶

0. Assembling mRNAseq and metagenome data in the cloud¶

1. Estimating genomic richness/non-repetitive genome content¶

Example¶

Looking at diginorm coverage¶

2. Generating a saturation curve for shotgun data¶

Example¶

3. Generating a coverage plot/coverage spectrum without a reference¶

Example 1¶

Example 2¶

4. Generating an error profile for individual shotgun reads¶

5. Assembly-filtered k-mer spectrum¶

6. Partitioning¶

7. A small demo set for khmer protocols/mRNAseq¶

8. A small demo set for khmer protocols/metagenomics¶

Table Of Contents

Previous topic

Next topic

This Page

Edit this document!

Navigation

Notes on using khmer to do stuff¶

Software¶

Data¶

Graphs¶

0. Assembling mRNAseq and metagenome data in the cloud¶

1. Estimating genomic richness/non-repetitive genome content¶

Example¶

Looking at diginorm coverage¶

2. Generating a saturation curve for shotgun data¶

Example¶

3. Generating a coverage plot/coverage spectrum without a reference¶

Example 1¶

Example 2¶

4. Generating an error profile for individual shotgun reads¶

5. Assembly-filtered k-mer spectrum¶

6. Partitioning¶

7. A small demo set for khmer protocols/mRNAseq¶

8. A small demo set for khmer protocols/metagenomics¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation

Edit this document!