# DSK
| **Linux** | **Mac OSX** |
|-----------|-------------|
| [Build status](https://ci.inria.fr/gatb-core/view/DSK-gitlab/job/tool-dsk-build-debian7-64bits-gcc-4.7/) | [Build status](https://ci.inria.fr/gatb-core/view/DSK-gitlab/job/tool-dsk-build-macos-10.9.5-gcc-4.2.1/) |

[License: AGPL v3](http://www.gnu.org/licenses/agpl-3.0.en.html)

# What is DSK?
DSK is a k-mer counter for reads or genomes.
It takes as input a set of sequences in FASTA or FASTQ format (see "Input" section).
DSK outputs a set of solid kmers, i.e. kmers that occur at least a given minimum number of times in the input (`-abundance-min` parameter).
It also outputs the number of times these kmers occur.
See "Results visualization" section to learn how to use its output.
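
To make the notion of a solid kmer concrete, here is a minimal in-memory sketch. It is not DSK's algorithm: it counts kmers as they appear in the input (no reverse-complement handling), the function name `count_solid_kmers` is hypothetical, and treating the threshold as inclusive is an assumption.

```python
from collections import Counter


def count_solid_kmers(sequences, k, abundance_min):
    """Count every k-mer and keep those seen at least abundance_min times."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Assumption: the threshold is inclusive (count >= abundance_min).
    return {kmer: n for kmer, n in counts.items() if n >= abundance_min}


reads = ["ACGTACGT", "CGTA"]  # made-up toy input
solid = count_solid_kmers(reads, k=4, abundance_min=2)
for kmer in sorted(solid):
    print(kmer, solid[kmer])
```

A dictionary like this only works for small inputs. DSK exists precisely because real read sets do not fit such a naive approach.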
# Getting the latest source code
## Requirements
* CMake 3.1+; see http://www.cmake.org/cmake/resources/software.html
* A C++11-capable compiler (e.g. gcc 4.7+, clang 3.5+, Apple/clang 6.0+)
## Instructions
    # get a local copy of the source code
    git clone --recursive https://github.com/GATB/dsk.git

    # compile the code and run a simple test on your computer
    cd dsk
    sh INSTALL
## Testing
If you retrieved the software from a git clone, do the following AFTER compiling it:

    cd ../scripts  # we suppose you are in the build directory
    ./simple_test.sh

If you retrieved the software from a binary distribution:

    cd test
    ./simple_test.sh
## Using
Type `./dsk` for usage instructions.
## Input
Input files can be in FASTA or FASTQ format, gzipped or not.

To pass several files as input, separate the file names with commas (","), for example:

    ./dsk -file A1.fa,A2.fa,A3.fa -kmer-size 31

Alternatively, the input can be a file that lists the input files (one file name per line):

    ls -1 *.fastq > list_reads
    ./dsk -file list_reads
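
When DSK is launched from a script, the comma-separated form can be assembled programmatically. The sketch below only builds the argument list, using the `-file` and `-kmer-size` options shown above. The helper name `build_dsk_command` and the default `./dsk` path are assumptions.

```python
def build_dsk_command(files, kmer_size, dsk_path="./dsk"):
    """Return the argument list for one DSK run over several input files."""
    if not files:
        raise ValueError("at least one input file is required")
    # DSK expects several input files as a single comma-separated value.
    return [dsk_path, "-file", ",".join(files), "-kmer-size", str(kmer_size)]


command = build_dsk_command(["A1.fa", "A2.fa", "A3.fa"], kmer_size=31)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to actually start DSK.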
## Results visualization
DSK uses a binary format for its output.
By default, the format is HDF5, so you can use HDF5 tools to get ASCII output
from the HDF5 output (such tools are provided with the DSK distribution).
In particular, the HDF5 output of DSK holds two sets of data:

* the (kmer, abundance) pairs
* the histogram of abundances

You can dump the content of a dataset (kmers in partition 0, and the whole histogram) with:

    ext/gatb-core/bin/gatb-h5dump -y -d dsk/solid/0 output.h5
    ext/gatb-core/bin/gatb-h5dump -y -d histogram/histogram output.h5

To see the results as a list of "[kmer] [count]\n", use the `dsk2ascii` binary with the output file from dsk:

    dsk2ascii -file output.h5 -out output.txt
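
The text output is easy to post-process. The following sketch parses "[kmer] [count]" lines and derives a histogram of abundances from them. It assumes whitespace-separated fields, uses made-up sample lines, and the function names are hypothetical. With a real file, pass the open file object (for example `open("output.txt")`) instead of the sample list.

```python
from collections import Counter


def parse_kmer_counts(lines):
    """Parse 'kmer count' lines into a dict mapping kmer -> int count."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        counts[kmer] = int(count)
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct kmers having it."""
    return dict(sorted(Counter(counts.values()).items()))


sample = ["AAA 5\n", "ACG 2\n", "\n", "CCA 2\n"]  # made-up lines
counts = parse_kmer_counts(sample)
histogram = abundance_histogram(counts)
print(histogram)
```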

To plot the kmer coverage distribution:

    build/ext/gatb-core/bin/gatb-h5dump -y -d histogram/histogram output.h5 | grep "^\ *[0-9]" | tr -d " " | paste - - | gnuplot -p -e 'plot "-" with lines'

### Alternative ways to plot kmer coverage distributions
To plot the kmer coverage distribution, you can also use the `-histo` option and a simple R script provided in the `utils` directory:

    ./dsk -file A.fa -histo 1 -out outputfile
    Rscript utils/plot-histo.R outputfile.histo
    # the kmer count profile is plotted in the image file outputfile.histo.png

With the `-histo2D` option, you can also produce a kmer comparison plot between a genome assembly and a read set, for instance to visualize the k-mer duplication levels of the assembly. This follows an idea originally implemented in [KAT (Kmer Analysis Toolkit)](https://github.com/TGAC/KAT); see [the KAT documentation](http://kat.readthedocs.io/en/latest/walkthrough.html#genome-assembly-analysis-using-k-mer-spectra) for examples and for how to interpret such plots. Pass the assembly FASTA file and the read files together with `-file`, making sure the assembly file comes first in the list:

    ./dsk -file assembly.fa,reads1.fq,reads2.fq -histo2D 1 -out outputname
    # the kmer count matrix is written to the file outputname.histo2D,
    # where rows correspond to abundances in the read dataset and columns to abundances in the genome assembly (0 to 10)
    Rscript utils/plot-histo2D.R outputname.histo2D
    # the kmer comparison plot is written to the image file outputname.histo2D.png
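
To clarify what such a matrix contains, here is a small sketch that builds one from two kmer count dictionaries. It illustrates the idea only and does not reproduce the `.histo2D` file format. The function name `histo2d`, the `max_reads` row limit, and the choice to cap larger abundances at the last row or column are all assumptions.

```python
def histo2d(read_counts, assembly_counts, max_reads=5, max_assembly=10):
    """Build a matrix where cell [r][c] is the number of distinct kmers
    seen r times in the reads and c times in the assembly."""
    matrix = [[0] * (max_assembly + 1) for _ in range(max_reads + 1)]
    for kmer in set(read_counts) | set(assembly_counts):
        # Assumption: abundances above the limits fall into the last bin.
        r = min(read_counts.get(kmer, 0), max_reads)
        c = min(assembly_counts.get(kmer, 0), max_assembly)
        matrix[r][c] += 1
    return matrix


# Made-up counts for illustration.
read_counts = {"AAA": 3, "ACG": 3, "CCA": 1}
assembly_counts = {"AAA": 1, "ACG": 2, "GGT": 1}
matrix = histo2d(read_counts, assembly_counts)
for r, row in enumerate(matrix):
    print(r, row)
```

In this toy example, `ACG` lands in column 2, which is the kind of signal used to spot kmers duplicated in an assembly. `CCA` lands in column 0, meaning it is present in the reads but missing from the assembly.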
## Kmers and their reverse complements
DSK converts all kmers to their canonical representation with respect to reverse-complementation.
In other words, a kmer and its reverse complement are considered to be the same object.
For example, with k=3 and assuming the kmer AAA and its reverse complement TTT are both present in the input dataset, DSK will consider that one of them is the canonical kmer, e.g. AAA. If AAA is present 2 times and TTT 3 times, then DSK outputs that the count of AAA is 5 and won't return the count of TTT at all.
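
The merging of a kmer with its reverse complement can be sketched as follows. The choice of canonical form here (the lexicographically smaller of the two strings) is an assumption made only for illustration. It is not the ordering DSK uses, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    # Illustration only: pick the lexicographically smaller form.
    # DSK uses a different ordering to choose the canonical kmer.
    return min(kmer, reverse_complement(kmer))


def count_canonical(kmers):
    """Count kmers so that a kmer and its reverse complement share one entry."""
    return Counter(canonical(kmer) for kmer in kmers)


observed = ["AAA", "AAA", "TTT", "TTT", "TTT"]
merged = count_canonical(observed)
print(merged)  # Counter({'AAA': 5})
```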
Note: a canonical kmer is not necessarily the lexicographically smallest one! DSK uses a different ordering for faster performance. Specifically, DSK considers that A