Up to table of contents

Quick start

While there are many different partis subcommands, likely the first thing you want to do is run partition on a fasta input file /path/to/yourseqs.fa. The following command will first infer a set of parameters (including germline genes) on the sample, then group sequences into clonal families and annotate each family with V gene, naive sequence, etc:

/path/to/<partis_dir>/bin/partis partition --infname /path/to/yourseqs.fa --outfname /path/to/yourseqs-partition.yaml

This by default assumes input of single-chain positive sense human igh. If you have another species or locus, set the --species {human,mouse,macaque,c57bl,balbc} (details here) and/or --locus {tra,trb,trd,trg,igl,igk,igh} options. If you have different loci mixed together, you'll need to either set --paired-loci, or first run ./bin/split-loci.py, both of which split different loci into separate files (the former runs the latter in the course of handling data with pairing information). Both of these take the argument --reverse-negative-strands if you have a mix of positive and negative sense sequences. If you have heavy/light pairing information, for instance from 10x single cell data, it is important to incorporate it as described here in order to take advantage of (among other things) dramatically improved partitioning accuracy. If you're using Docker, and you mounted your host filesystem as described here, you should replace /path/to with the appropriate host mount point within Docker.
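As a rough sketch of the species and locus options above, the following builds a partition command for mouse kappa light chains. The paths are the same placeholders used earlier, and the variable names (PARTIS, INFILE, OUTFILE, CMD) are illustrative assumptions rather than anything partis defines. The command is only printed, so you can check it before running it.

```shell
# Placeholder paths from the text above; replace with your own.
PARTIS="/path/to/<partis_dir>/bin/partis"
INFILE="/path/to/yourseqs.fa"
OUTFILE="/path/to/yourseqs-partition.yaml"

# Override the defaults (human, igh) for a mouse kappa sample.
CMD="$PARTIS partition --infname $INFILE --outfname $OUTFILE --species mouse --locus igk"

# Print the command for inspection.
echo "$CMD"

# Once the paths are real (and contain no spaces), run it with:
# $CMD
```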

If you want it to run faster (by a factor of 5 to 10, at the cost of some accuracy), add --fast, which uses naive vsearch clustering. The number of processes on your local machine defaults to the number of CPUs, so it shouldn't need to be adjusted (with --n-procs N) unless you're running several things at once. To parallelize over many machines, the slurm and sge batch systems are currently supported (details here).
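If you are sharing a machine with other jobs, one way to combine --fast with a reduced --n-procs is sketched below. The "use half the CPUs" policy and the variable names are assumptions made for illustration, not a partis recommendation. The command is printed rather than executed.

```shell
# Placeholder paths from the text above; replace with your own.
PARTIS="/path/to/<partis_dir>/bin/partis"
INFILE="/path/to/yourseqs.fa"
OUTFILE="/path/to/yourseqs-partition.yaml"

# Number of online CPUs (fall back to 1 if getconf cannot report it).
NCPU=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Illustrative policy: leave half the CPUs for other jobs, but use at least one.
N_PROCS=$(( NCPU > 1 ? NCPU / 2 : 1 ))

CMD="$PARTIS partition --infname $INFILE --outfname $OUTFILE --fast --n-procs $N_PROCS"

# Print the command for inspection.
echo "$CMD"
```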

Typically, you can expect to annotate 10,000 sequences on an 8-core desktop in about five minutes, and partition in 25 minutes. If it's taking too long, or using too much memory, there are many ways to improve these by orders of magnitude by trading some accuracy, or by focusing only on specific aspects of the repertoire, described here, here and here.

In addition to any output files specified with --outfname (described here), partis writes to two directories on your file system. Temporary working files go in --workdir, which is entirely removed upon successful completion. The workdir defaults to a subdirectory of /tmp (/tmp/$USER/hmms/<random.randint>), and this default shouldn't need to be changed unless you're using a batch system to run on multiple machines, in which case it needs to be on a network mount that they can all see. Permanent parameter files are written to --parameter-dir, which defaults to a subdirectory of the current directory (see here).
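For a multi-machine run, setting both directories explicitly might look like the sketch below. The shared mount point and the directory names under it are hypothetical; use whatever network path all of your machines can see. The command is printed rather than executed.

```shell
# Placeholder paths from the text above; replace with your own.
PARTIS="/path/to/<partis_dir>/bin/partis"
INFILE="/path/to/yourseqs.fa"
OUTFILE="/path/to/yourseqs-partition.yaml"

# Hypothetical network mount visible to every machine in the batch system.
SHARED="/path/to/shared-mount"
WORKDIR="$SHARED/partis-work"            # temporary files, removed on success
PARAM_DIR="$SHARED/yourseqs-parameters"  # permanent parameter files

CMD="$PARTIS partition --infname $INFILE --outfname $OUTFILE --workdir $WORKDIR --parameter-dir $PARAM_DIR"

# Print the command for inspection.
echo "$CMD"
```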

You can add --debug 1 (or --debug 2) to print a lot of additional information about what's going on. The output of these should usually be viewed with less -RS either directly by piping | or after redirecting > to a log file (-S disables line wrapping -- use left/right arrows to move side-to-side).
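The two ways of viewing debug output can be sketched as follows. The log file path and the variable names are hypothetical, and the commands are printed rather than executed, since less is interactive.

```shell
# Placeholder paths from the text above; replace with your own.
PARTIS="/path/to/<partis_dir>/bin/partis"
INFILE="/path/to/yourseqs.fa"
OUTFILE="/path/to/yourseqs-partition.yaml"
LOG="/path/to/yourseqs-partition.log"   # hypothetical log location

CMD="$PARTIS partition --infname $INFILE --outfname $OUTFILE --debug 1"

# Option 1: pipe the debug output straight into less.
echo "$CMD | less -RS"

# Option 2: redirect to a log file, then page through it afterwards.
echo "$CMD > $LOG"
echo "less -RS $LOG"
```

Redirecting to a file is usually the more convenient choice for long runs, since the log survives after the pager is closed.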

For details on the large number of available partition options, run partis partition --help.

A variety of overview plots will be written to disk if you set --plotdir <plotdir>. Details on their content can be found here.

After you've partitioned your sample, you might want to view an ascii-art representation of the resulting clusters and annotations with view-output, infer trees, or calculate selection metrics to predict affinity with get-selection-metrics. You might also want to use the linearham package for accurate Bayesian inference of trees and naive sequences. And for rich, browser-based visualization of families, trees, and annotations, we recommend Olmsted.
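A minimal sketch of the viewing step, assuming view-output reads the partition file through the same --outfname option used when writing it (check partis view-output --help to confirm the options in your version). The paths are the placeholders used earlier, and the command is printed rather than executed.

```shell
# Placeholder paths from the text above; replace with your own.
PARTIS="/path/to/<partis_dir>/bin/partis"
OUTFILE="/path/to/yourseqs-partition.yaml"

# Assumption: view-output takes the partition output file via --outfname.
VIEW_CMD="$PARTIS view-output --outfname $OUTFILE"

# The ascii-art output is wide, so page it without line wrapping.
echo "$VIEW_CMD | less -RS"
```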