================================
Notes on using khmer to do stuff
================================

Software
========

You'll need khmer and its dependencies installed::

   cd /root
   pip install -e git+https://github.com/ged-lab/khmer.git@kmer_error_profile#egg=khmer

The khmer source (including sandbox scripts and data) will then be placed
in `/root/src/khmer/`.  If you are installing elsewhere, you may need to
modify `install [...]` to `install --user [...]`.

We'll be using the `khmer command line scripts `__.

Data
====

Download two data sets::

   cd /mnt
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz
   curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz

(The sequencing data set is the first 5m reads from `Chitsaz et al.,
2011 `__.)

Graphs
======

There are a bunch of graphs in an IPython Notebook; you can `view the
rendered notebook `__ in your Web viewer, *or* you can download it to
work with on your Amazon machine (https:// + machine name, password
'beacon') and, on the command line::

   cd /usr/local/notebooks
   curl -O https://raw.github.com/ngs-docs/2013-davis-assembly/master/2013-davis-assembly-khmer.ipynb

0. Assembling mRNAseq and metagenome data in the cloud
=======================================================

See `Eel Pond mRNAseq assembly protocol `__ and `Kalamazoo metagenome
assembly protocols `__.

Also see `NGS course materials `__ and `other workshop materials `__.

Note `'Using screen' `__ and `'Installing Dropbox' `__.

1. Estimating genomic richness/non-repetitive genome content
==============================================================

If you want to estimate the non-repetitive sequence content of your
sample, you can use the three-pass assembly protocol (see `Kalamazoo
`__)::

   normalize-by-median.py -k 20 -C 20 -x 2e9 -N 4 --savehash C20.kh
   filter-abund.py C20.kh *.keep
   normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 *.keep.abundfilt

Finally, get the complete stats on the remaining reads::

   python /path/to/khmer/sandbox/readstats.py *.keep.abundfilt.keep

Divide the total number of bases remaining by 5 and you'll have an
estimate of your total genome size.  (Technically, the total number of
nodes in your De Bruijn graph. :)

Note that the '-x' and '-N' parameters, above, should multiply to a
value under the total amount of memory on your computer.  The scripts
will complain if this is set too low.

Example
~~~~~~~

Run::

   cd /mnt
   mkdir richness
   cd richness

   normalize-by-median.py -k 20 -C 20 -x 5e8 -N 4 --savehash C20.kh ../ecoli_ref-5m.fastq.gz
   filter-abund.py C20.kh *.keep
   normalize-by-median.py -k 20 -C 5 -x 5e8 -N 4 *.keep.abundfilt

   python /root/src/khmer/sandbox/readstats.py *.keep.abundfilt.keep

For this data, you should get::

   33151414 bp / 385233 seqs; 86.1 average length -- total

which works out to::

   385233 * 86.1 / 5 = 6633712.26

so the predicted genome size of E. coli is 6.6 Mbp.  Not too far off...
the true answer is 4.5 Mbp.

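If you'd rather script that arithmetic than do it by hand, here is a
minimal sketch (not part of khmer; the script and file names are just
placeholders) that pulls the total base count out of readstats.py's
output, saved to a file, and divides by the coverage cutoff of 5.  It
assumes the summary line has exactly the format shown above::

   #!/usr/bin/env python
   # estimate-genome-size.py (hypothetical name): rough genome size from
   # the readstats.py summary line, e.g.
   #    33151414 bp / 385233 seqs; 86.1 average length -- total
   # Assumes the final normalize-by-median pass used -C 5.
   import sys

   COVERAGE = 5.0   # the -C value of the last diginorm pass

   for line in open(sys.argv[1]):
       if line.strip().endswith('-- total'):
           total_bp = float(line.split()[0])
           print('total bp remaining: %d' % total_bp)
           print('estimated genome size: %.1f Mbp' % (total_bp / COVERAGE / 1e6))
           break

Run it as 'python estimate-genome-size.py readstats.out', where
'readstats.out' is whatever file you redirected readstats.py's output to.
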
Looking at diginorm coverage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, we need to install the `BWA aligner `__::

   cd /root
   wget -O bwa-0.7.5.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2/download
   tar xvfj bwa-0.7.5.tar.bz2
   cd bwa-0.7.5a
   make
   cp bwa /usr/local/bin

We also need a newer version of `samtools `__::

   cd /root
   curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2
   tar xvfj samtools-0.1.19.tar.bz2
   cd samtools-0.1.19
   make
   cp samtools /usr/local/bin
   cp bcftools/bcftools /usr/local/bin
   cd misc/
   cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/

Next, map the raw reads::

   cd /mnt/richness
   gunzip ../ecoliMG1655.fa.gz
   bwa index ../ecoliMG1655.fa

   bwa aln ../ecoliMG1655.fa ../ecoli_ref-5m.fastq.gz > raw.sai
   bwa samse ../ecoliMG1655.fa raw.sai ../ecoli_ref-5m.fastq.gz > raw.sam

   samtools faidx ../ecoliMG1655.fa
   samtools import ../ecoliMG1655.fa.fai raw.sam raw.bam
   samtools sort raw.bam raw.sorted
   samtools index raw.sorted.bam

and map the diginormed reads::

   bwa aln ../ecoliMG1655.fa ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sai
   bwa samse ../ecoliMG1655.fa dn.sai ecoli_ref-5m.fastq.gz.keep.abundfilt.keep > dn.sam

   samtools import ../ecoliMG1655.fa.fai dn.sam dn.bam
   samtools sort dn.bam dn.sorted
   samtools index dn.sorted.bam

Now, view with 'samtools tview'::

   samtools tview raw.sorted.bam ../ecoliMG1655.fa

or::

   samtools tview dn.sorted.bam ../ecoliMG1655.fa

2. Generating a saturation curve for shotgun data
===================================================

You can use normalize-by-median.py to assess information saturation in
your shotgun data::

   normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 -R report.txt

Column 1 of 'report.txt' will then contain the number of reads examined,
and column 2 will contain the number of reads kept.  When the number
kept levels off, you've saturated to a coverage of 5.  (See the plotting
sketch after section 4 for one way to graph this.)

Example
~~~~~~~

Run::

   cd /mnt
   mkdir saturate
   cd saturate

   normalize-by-median.py -k 20 -C 5 -x 5e8 -N 4 -R ecoli_5m-report.txt ../ecoli_ref-5m.fastq.gz

The results are in '/mnt/saturate/ecoli_5m-report.txt'.

3. Generating a coverage plot/coverage spectrum without a reference
=====================================================================

First, load all the k-mers in the data set into a counting file::

   load-into-counting.py -k 20 -x 2e9 -N 4 counts.kh reads.fq

Then calculate the spectrum of read coverages::

   python /path/to/khmer/sandbox/calc-median-distribution.py counts.kh reads.fq reads.hist

Example 1
~~~~~~~~~

First, let's do it for a small data set with interesting coverage::

   cd /mnt
   mkdir cover1
   cd cover1

   load-into-counting.py -k 20 -x 1e8 -N 4 counts.kh /root/src/khmer/data/stamps-reads.fa.gz
   python /root/src/khmer/sandbox/calc-median-distribution.py counts.kh /root/src/khmer/data/stamps-reads.fa.gz reads.hist

The results are in '/mnt/cover1/reads.hist'.

Example 2
~~~~~~~~~

Now try E. coli::

   cd /mnt
   mkdir cover2
   cd cover2

   load-into-counting.py -k 20 -x 1e9 -N 4 counts.kh ../ecoli_ref-5m.fastq.gz
   python /root/src/khmer/sandbox/calc-median-distribution.py counts.kh ../ecoli_ref-5m.fastq.gz reads.hist

The results are in '/mnt/cover2/reads.hist'.

4. Generating an error profile for individual shotgun reads
=============================================================

Do::

   calc-error-profile.py reads.fq

For example::

   cd /mnt
   mkdir error
   cd error

   calc-error-profile.py ../ecoli_ref-5m.fastq.gz

The results will be in '/mnt/error/ecoli_ref-5m.fastq.gz.errhist'.

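The saturation report from section 2, the coverage spectra from section
3, and the error profile above are all plain-text files with (at least)
two numeric columns per line.  If you'd rather plot them directly than
use the IPython Notebook, here is a minimal matplotlib sketch (not part
of khmer; save it as, say, plot-two-columns.py).  It assumes
whitespace-separated columns and skips anything non-numeric, so check
what each column actually means for the file you feed it::

   #!/usr/bin/env python
   # Quick plot of the first two numeric columns of a histogram/report
   # file, e.g. reads.hist, ecoli_5m-report.txt, or
   # ecoli_ref-5m.fastq.gz.errhist.  Column layout is an assumption;
   # adjust as needed.
   import sys
   import matplotlib
   matplotlib.use('Agg')            # write a PNG; no X11 needed
   import matplotlib.pyplot as plt

   xs, ys = [], []
   for line in open(sys.argv[1]):
       fields = line.split()
       if len(fields) < 2:
           continue
       try:
           x, y = float(fields[0]), float(fields[1])
       except ValueError:           # skip headers/comments
           continue
       xs.append(x)
       ys.append(y)

   plt.plot(xs, ys)
   plt.xlabel('column 1')
   plt.ylabel('column 2')
   plt.savefig(sys.argv[1] + '.png')

For example, 'python plot-two-columns.py reads.hist' writes
'reads.hist.png' next to the input file.
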
5. Assembly-filtered k-mer spectrum
====================================

Do::

   cd /mnt
   mkdir kmercov
   cd kmercov

   load-into-counting.py -k 20 -x 1e7 -N 4 counts.kh /root/src/khmer/data/stamps-reads.fa.gz
   abundance-dist.py counts.kh /root/src/khmer/data/stamps-genomes.fa counts.out

The results will be in '/mnt/kmercov/counts.out'.

6. Partitioning
===============

Do::

   cd /mnt
   mkdir part
   cd part

   do-partition.py -k 32 -x 1e7 -N 4 stamps /root/src/khmer/data/stamps-reads.fa.gz
   extract-partitions.py -X 1000 stamps *.part

You will now have two files, stamps.group0000.fa and stamps.group0001.fa,
that represent the two "species" in the data set.

To examine them, let's graph their abundances (see the plotting sketch
at the end of these notes)::

   abundance-dist-single.py -k 20 -x 1e7 -N 4 stamps.group0000.fa group0.hist
   abundance-dist-single.py -k 20 -x 1e7 -N 4 stamps.group0001.fa group1.hist

The hist files will be in '/mnt/part/group0.hist' and
'/mnt/part/group1.hist'.

7. A small demo set for khmer protocols/mRNAseq
================================================

To get started on `the mRNAseq protocol `__ with a small data set, do::

   mkdir /data
   cd /data
   curl -O http://athyra.idyll.org/~t/mrnaseq-subset.tar
   tar xvf mrnaseq-subset.tar

and then start at "Link your data into a working directory".  (You'll
need to install the software at the top, too.)

8. A small demo set for khmer protocols/metagenomics
=====================================================

To get started on `the metagenomics protocol `__ with a small data set,
do::

   mkdir /data
   cd /data
   curl -O http://athyra.idyll.org/~t/metag-subset.tar
   tar xvf metag-subset.tar

and then start at "Link your data into a working directory".  (You'll
need to install the software at the top, too.)

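Plotting the partition abundance histograms from section 6
============================================================

As referenced in section 6, here is a minimal sketch (not part of khmer)
for graphing the k-mer abundance distributions of the two partitions on
one plot.  It assumes you run it from /mnt/part, that the .hist files
are whitespace-separated with the abundance in column 1 and the number
of k-mers at that abundance in column 2, and that the output filename is
just illustrative::

   #!/usr/bin/env python
   # Overlay the abundance histograms of the two partitions from
   # section 6.  Column layout of the .hist files is an assumption;
   # check the files themselves if the plot looks odd.
   import matplotlib
   matplotlib.use('Agg')            # write a PNG; no X11 needed
   import matplotlib.pyplot as plt

   def load_hist(filename):
       xs, ys = [], []
       for line in open(filename):
           fields = line.split()
           if len(fields) < 2:
               continue
           try:
               xs.append(float(fields[0]))
               ys.append(float(fields[1]))
           except ValueError:       # skip headers/comments
               continue
       return xs, ys

   for name in ('group0.hist', 'group1.hist'):
       xs, ys = load_hist(name)
       plt.plot(xs, ys, label=name)

   plt.xlabel('k-mer abundance')
   plt.ylabel('number of k-mers')
   plt.legend()
   plt.savefig('partition-abundances.png')

Comparing the two curves gives a quick look at how the two groups differ
in coverage.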