graph-gene Caller (ggCaller), combines gene annotation and pangenome clustering steps within population-wide de Bruijn Graphs. Using population-frequency information, ggCaller improves consistency of gene annotations, leading to more accurate clustering and significantly reduced run-times versus linear-genome based annotation and pangenome analysis.
ggcaller is written by Sam Horsfield, and in collaboration with Nicholas Croucher.
Split k-mer analysis (SKA) can be used to produce alignments from closely related sequence assemblies or
reads, quickly (because it is alignment-free) and with minimal fuss (due to the interface).
This enables downstream analysis such as phylogenetics or sequence completeness.
In collaboration with Simon Harris.
Library of sketching functions used to rapidly calculate core and accessory distances between bacterial genomes. Designed as a faster drop-in back-end for PopPUNK, you can also use to replace mash (100x speedup, further 50x with GPUs).
A rust rewrite of pp-sketchlib particularly targeted at very large datasets by using a new file format. To find nearest neighbours or
perform subset queries, this code is an improvement over pp-sketchlib.
Tools for bacterial genomic epidemiology. Quickly find core and accessory distances between whole-genome sequences, and use these to find genetically clusters. New data can be rapidly ‘queried’ against existing clusters, giving consistent nomenclature.
In collaboration with Nicholas Croucher.
Poplation analysis PIPEline. Designed to be run downstream of PopPUNK, to obtain subclusters and visualisations.
A Snakemake pipeline that requires some config modifications to run.
A suite of R packages to assist in fast, reproducible state-space models, particularly compartmental models used in epidemiology.
In collaboration with Rich Fitzjohn and RESIDE at MRC GIDA
odin: a domain-specific language that makes compartmental models easy to write, extend, and are fast to run:
dust: parallelised code for random number generation for Monte Carlo models, used to actually run state space models forward in time
mcstate: statistical support which allows running, forecasting and inference from these models
Comprehensive pangenome-wide association studies in microbes. Find genetic variation that is linked to a phenotype or clone of interest.
In collaboration with Marco Galardini.
Packages to count and determine presence of non-redundant sequence elements. Use ‘caller’ both to determine what the are unitigs in a population, and to quickly determine presence/absence in a new population. The ‘counter’ is no longer supported, but is an alternative to the first purpose.