Notes on using khmer to do stuff


You’ll need khmer and its dependencies installed:

cd /root
pip install -e git+

The khmer src (including sandbox scripts and data) will then be placed in /root/src/khmer/.

If installing elsewhere you may need to modify install [...] to install –user [...]

We’ll be using the khmer command line scripts.


Download two data sets:

cd /mnt
curl -O
curl -O

(The sequencing data set is the first 5m reads from Chitsaz et al., 2011.)


There are a bunch of graphs in an IPython Notebook; you can view the rendered notebook in your Web viewer or you can download it to work with on your Amazon machine (https:// + machine name, password ‘beacon’, and on the command line:

cd /usr/local/notebooks
curl -O

1. Estimating genomic richness/non-repetitive genome content

If you want to estimate the non-repetitive sequence content of your sample, you can use the three-pass assembly protocol (see Kalamazoo) -k 20 -C 20 -x 2e9 -N 4 --savehash <FASTA/FASTQ/GZ/BZ2 files> *.keep -k 20 -C 5 -x 2e9 -N 4 *.keep.abundfilt

Finally, get the complete stats on the remaining reads:

python /path/to/khmer/sandbox/ *.keep.abundfilt.keep

Divide the total number of bases remaining by 5 and you’ll have an estimate of your total genome size. (Technically, the total number of nodes in your De Bruijn graph. :)

Note that the ‘-x’ and ‘-N’ parameters, above, should multiple to be under the total amount of memory on your computer. Scripts will complain if this is set too low.



cd /mnt
mkdir richness
cd richness -k 20 -C 20 -x 5e8 -N 4 --savehash ../ecoli_ref-5m.fastq.gz *.keep -k 20 -C 5 -x 5e8 -N 4 *.keep.abundfilt

python /root/src/khmer/sandbox/ *.keep.abundfilt.keep

For this data, you should get:

33151414 bp / 385233 seqs; 86.1 average length -- total

which works out to:

385233 * 86.1 / 5 = 6633712.26

so the predicted genome size of E. coli is 6.6 Mbp. Not too far off... the true answer is 4.5 Mbp.

Looking at diginorm coverage

First, we need to install the BWA aligner:

cd /root
wget -O bwa-0.7.5.tar.bz2

tar xvfj bwa-0.7.5.tar.bz2
cd bwa-0.7.5a

cp bwa /usr/local/bin

We also need a new version of samtools:

cd /root
curl -O -L
tar xvfj samtools-0.1.19.tar.bz2
cd samtools-0.1.19
cp samtools /usr/local/bin
cp bcftools/bcftools /usr/local/bin
cd misc/
cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/

Next, map the raw reads:

cd /mnt/richness
gunzip ../ecoliMG1655.fa.gz
bwa index ../ecoliMG1655.fa

bwa aln ../ecoliMG1655.fa ../ecoli_ref-5m.fastq.gz > raw.sai
bwa samse ../ecoliMG1655.fa raw.sai ../ecoli_ref-5m.fastq.gz > raw.sam

samtools faidx ../ecoliMG1655.fa
samtools import ../ecoliMG1655.fa.fai raw.sam raw.bam
samtools sort raw.bam raw.sorted
samtools index raw.sorted.bam

and map the diginormed reads:

bwa aln ../ecoliMG1655.fa ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sai
bwa samse ../ecoliMG1655.fa dn.sai ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sam

samtools import ../ecoliMG1655.fa.fai dn.sam dn.bam
samtools sort dn.bam dn.sorted
samtools index dn.sorted.bam

Now, view with ‘samtools tview’:

samtools tview raw.sorted.bam ../ecoliMG1655.fa


samtools tview dn.sorted.bam ../ecoliMG1655.fa

2. Generating a saturation curve for shotgun data

You can use to assess information saturation in your shotgun data: -k 20 -C 5 -x 2e9 -N 4 -R report.txt <FASTA/FASTQ/GZ/BZ2 files>

‘report.txt’ column 1 will then contain the number of reads examined, and column 2 will contain the number of reads kept. When this levels off, you’ve saturated to a coverage of 5.



cd /mnt
mkdir saturate
cd saturate -k 20 -C 5 -x 5e8 -N 4 -R ecoli_5m-report.txt ../ecoli_ref-5m.fastq.gz

The results are in ‘/mnt/saturate/ecoli_5m-report.txt’.

3. Generating a coverage plot/coverage spectrum without a reference

First, load all the k-mers in the data set into a counting file: -k 20 -x 2e9 -N 4 reads.fq

Then calculate the spectrum of read coverages:

python /path/to/khmer/sandbox/ reads.fq reads.hist

Example 1

First, let’s do it for a small data set with interesting coverage:

cd /mnt
mkdir cover1
cd cover1 -k 20 -x 1e8 -N 4 /root/src/khmer/data/stamps-reads.fa.gz
python /root/src/khmer/sandbox/ /root/src/khmer/data/stamps-reads.fa.gz reads.hist

The results are in ‘/mnt/cover1/reads.hist’.

Example 2

Now try E. coli:

cd /mnt
mkdir cover2
cd cover2 -k 20 -x 1e9 -N 4 ../ecoli_ref-5m.fastq.gz
python /root/src/khmer/sandbox/ ../ecoli_ref-5m.fastq.gz reads.hist

The results are in ‘/mnt/cover2/reads.hist’.

4. Generating an error profile for individual shotgun reads reads.fq

For example,

cd /mnt
mkdir error
cd error ../ecoli_ref-5m.fastq.gz

The results will be in ‘/mnt/error/ecoli_ref-5m.fastq.gz.errhist’.

5. Assembly-filtered k-mer spectrum


cd /mnt
mkdir kmercov
cd kmercov -k 20 -x 1e7 -N 4 /root/src/khmer/data/stamps-reads.fa.gz /root/src/khmer/data/stamps-genomes.fa counts.out

The results will be in ‘/mnt/kmercov/counts.out’.

6. Partitioning


cd /mnt
mkdir part
cd part -k 32 -x 1e7 -N 4 stamps /root/src/khmer/data/stamps-reads.fa.gz -X 1000 stamps *.part

You will now have two files, stamps.group0000.fa and stamps.group0001.fa, that represent the two “species” in the data set. To examine them, let’s graph their abundances: -k 20 -x 1e7 -N 4 stamps.group0000.fa group0.hist -k 20 -x 1e7 -N 4 stamps.group0001.fa group1.hist

The hist files will be in ‘/mnt/part/group0.hist’ and ‘/mnt/part/group1.hist’.

7. A small demo set for khmer protocols/mRNAseq

To get started on the mRNAseq protocol with a small data set, do:

mkdir /data
cd /data
curl -O
tar xvf mrnaseq-subset.tar

and then start at “Link your data into a working directory”. (You’ll need to install the software at the top, too.)

8. A small demo set for khmer protocols/metagenomics

To get started on the metagenomics protocol with a small data set, do:

mkdir /data
cd /data
curl -O
tar xvf metag-subset.tar

and then start at “Link your data into a working directory”. (You’ll need to install the software at the top, too.)

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.

Development and posting of this material, and the associated workshop, were supported by Grant Number R25HG006243 from the National Human Genome Research Institute and an NSF OCI supplement to NSF DBI-0939454.

Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to Notes on using khmer to do stuff on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.