Possible issue with taxa labelling in --lineage-clustering #95
Comments
Hi Mat, There looks to be something slightly odd going on - I will try to take a look at this in the next couple of days. Was the external clustering passed into the lineage assignment command? If you can let me know the command line syntax, that'd be great. Thanks, Nick.
Hi Nick, The command I've used to run the lineage clustering is
At Rank 4 poppunk is grouping around 400 genomes into one cluster, and 90 into a second (with a few others in a third). Those proportions are similar to the divisions in the tree, so my suspicion is that this is a relabelling problem, rather than with the clustering itself. However, it's worth considering if poppunk is identifying kmers that are not relevant. This particular dataset uses assemblies from direct sequencing (SureSelect-enriched metagenomic data), but the assemblies included all have good scores in CheckM (>95% completeness, <5% contamination). There should be virtually no accessory genome for this species (it's all core and extremely clonal). The tree is based on reference mapping + gubbins, with joint ancestral reconstruction performed after ML. At the least, I would have expected poppunk to be able to reproduce the two major lineages we see in all phylogenetic analyses, but my ultimate goal is to define a persistent and reproducible sublineage nomenclature (I currently use rpinecone, but the clusters there are dependent on the dataset). I pasted the command I used for adding the Mat |
Thanks Mat - I've been running some tests and it seems to be generating some sensible outputs at the moment. There were issues with this previously where we had a mismatch between the output order of sketchlib, and the input order expected by PopPUNK. It is possible these packages could be out of sync. The versions I've been working with are I will have a look at the external clustering "feature". |
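To illustrate how an ordering mismatch of the kind mentioned above can scramble names without changing the distances themselves, here is a small self-contained sketch. It does not use sketchlib or PopPUNK code; the helper `label_distances`, the isolate names and the distance values are all hypothetical.

```python
def label_distances(names, dist_rows):
    """Attach names to a square distance matrix, assuming row i belongs to names[i]."""
    return {
        (names[i], names[j]): dist_rows[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }


# Hypothetical example: the matrix was computed with samples in this order...
computed_order = ["isoA", "isoB", "isoC"]
matrix = [
    [0.0, 0.001, 0.020],
    [0.001, 0.0, 0.021],
    [0.020, 0.021, 0.0],
]
# ...but the consumer assumes a different sample order.
assumed_order = ["isoC", "isoA", "isoB"]

correct = label_distances(computed_order, matrix)
scrambled = label_distances(assumed_order, matrix)

# The set of distances is identical, so clustering and tree shape are unchanged,
# but the distance reported for a given named pair is now wrong.
print(sorted(correct.values()) == sorted(scrambled.values()))
print(correct[("isoA", "isoB")], scrambled[("isoA", "isoB")])
```

This matches the symptom in the thread: the clusters and tree topology look sensible, yet the names on the tips do not correspond to the expected isolates.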
The lack of external clustering is because the lineages are not yet included as a proper model (enhancement #79). In the meantime, it should be possible for the external clusters to be visualised alongside the lineage assignments using the
Ok, so I should be on the latest version of poppunk (pulled from the git repo on July 22nd immediately after John altered the setup.py script - commit# I think I'm running the latest pp-sketchlib on bioconda ( Regarding the external clustering file, it's not a huge issue for me, since I usually merge and analyse everything in R/ggtree, but more of an observation - the separate methods you suggest sound fine to me. Thanks.
If you have some examples on the Sanger filesystem, I can take a look at an example set directly, if that would be helpful?
Sure. The poppunk lineage db is at The original kmer database is at Finally, the ML tree (ref based) I'm comparing things against is |
Thanks - so I set up a fresh conda environment with the latest PopPUNK git clone and ran the clustering:
These results look quite sensible against the PopPUNK tree, which resembles your phylogeny in its structure: However, comparing a few taxa between the two trees suggests there might have been some scrambling of the names - this probably looks like a sketchlib/PopPUNK interface issue. @johnlees, do you have any suggestions for looking at these? |
Yeah, that looks like the tree and clusters I got from poppunk too - but as you say, the names seem scrambled. |
I've found this a bit of a frustrating issue, as it's been reported a few times, but I've never been able to replicate it myself. Perhaps @matbeale if you could give the location of your conda env I could try and step through the code myself to try and work it out? |
@johnlees The rank 1 clusters are OK I think - there's some polyphyly, but that's either monophyletic lineages arising within an ancestral lineage, or areas where the tree is not well resolved. As all the distances are quite short it might be possible to get better resolution with a higher sketch size. Just to check on where the problem might arise, I looked at the pairwise distances for two pairs of isolates: PHE140076A and TPA_BCC063 (close in PopPUNK - albeit on long-ish terminal branches - but not in ML tree), and PHE140076A and PHE170397A (close in ML tree, but not in PopPUNK). In the data shared by @matbeale, the distances are:
So it does appear that the lineages are reflecting the pairwise distances, but the distances/labels are behaving oddly. |
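The pair-by-pair comparison described above can be automated to find every pair that is close by one measure and distant by the other. This is only a rough sketch under stated assumptions: the function `discordant_pairs`, the `close_frac` parameter, the isolate names and the distance values are all hypothetical, and both inputs are assumed to be dictionaries keyed by a sorted tuple of two isolate names.

```python
def pair(a, b):
    """Order-independent key for a pair of isolate names."""
    return tuple(sorted((a, b)))


def discordant_pairs(dist_a, dist_b, close_frac=0.25):
    """Return pairs among the closest by one measure and the farthest by the other."""
    shared = sorted(dist_a.keys() & dist_b.keys())
    k = max(1, int(len(shared) * close_frac))

    def closest_and_farthest(dist):
        ordered = sorted(shared, key=lambda p: dist[p])
        return set(ordered[:k]), set(ordered[-k:])

    close_a, far_a = closest_and_farthest(dist_a)
    close_b, far_b = closest_and_farthest(dist_b)
    return sorted((close_a & far_b) | (far_a & close_b))


# Hypothetical tree-derived distances for four isolates.
tree = {
    pair("iso1", "iso2"): 1.0, pair("iso1", "iso3"): 10.0,
    pair("iso1", "iso4"): 11.0, pair("iso2", "iso3"): 10.5,
    pair("iso2", "iso4"): 11.5, pair("iso3", "iso4"): 2.0,
}
# The same distances with the names iso2 and iso3 swapped.
sketch = {
    pair("iso1", "iso3"): 1.0, pair("iso1", "iso2"): 10.0,
    pair("iso1", "iso4"): 11.0, pair("iso2", "iso3"): 10.5,
    pair("iso3", "iso4"): 11.5, pair("iso2", "iso4"): 2.0,
}

print(discordant_pairs(tree, sketch, close_frac=0.34))
```

Isolates that recur across many discordant pairs are good candidates for having received the wrong name.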
@johnlees my conda environment is at |
@nickjcroucher yes the population structure is extremely clonal. We see slightly more definition in the reference mapped tree, and in the long term this may mean that assembly-derived rank1 clusters are unsuitable (although that would be my preference) - alternate sketch sizes would be good. FYI, using a 10-snp threshold on the ref alignment tree in rpinecone, I get around 30 clusters (plus around 10 singletons) for the ML tree - I don't expect poppunk to precisely replicate those, but a similar level of resolution would be ideal for the downstream work I'm doing (and would enable a reproducible taxonomy for the field - given low substitution rates, 10 SNPs reflects around 50 years). |
@matbeale It's a bit off topic but I think trying to both increase the sketch size and the k-mer size range used might help with the tree definition - I think we went up to 100,000 for TB to get effectively SNP-level resolution. |
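As a rough illustration of why a larger sketch helps with very clonal data, the sketch below uses the Mash-style relationship between a Jaccard index and distance. This is an assumption for illustration only: it is not PopPUNK's own core/accessory estimator, and the k-mer length of 21 and the sketch size of 10,000 are arbitrary values chosen for the comparison. It estimates the distance implied by a single non-shared hash, which is roughly the finest difference a sketch of that size can register.

```python
import math


def mash_distance(jaccard, k):
    """Mash-style distance from a Jaccard index for k-mer length k."""
    if jaccard <= 0:
        return 1.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k


def smallest_resolvable_distance(sketch_size, k):
    """Distance implied when exactly one hash out of `sketch_size` is not shared."""
    jaccard = (sketch_size - 1) / sketch_size
    return mash_distance(jaccard, k)


# Compare an arbitrary smaller sketch with the 100,000 mentioned above.
for s in (10_000, 100_000):
    print(s, smallest_resolvable_distance(s, k=21))
```

Under this approximation the smallest resolvable distance shrinks roughly in proportion to the sketch size, which is consistent with needing a much larger sketch to separate isolates that differ by only a handful of SNPs.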
Can confirm this is a bug in sketchlib (sorry for the long delay!). I will have a fix soon.
This will be fixed in sketchlib v1.5.1 (tonight) and I will update PopPUNK to a new version to support this tomorrow, at which point I will let you know and close this issue. |
This should now be fixed if you use PopPUNK v2.2.0 and pp-sketchlib v1.5.1. The latter is available via conda already, the former should appear on conda shortly (or can be taken from master if needed sooner) |
Hi John,
As discussed, I've been running the --lineage-clustering option.
poppunk --lineage-clustering --ref-db strain_db --external-clustering Global_TPA_SS14-mapped_2020-06-09.noINDELS.fix-masked.gubbins.SNPs.pyjar.renamed.tre_rpinecone-lineages.csv --output db_lineage-clustering-rv --ranks 1,2,3,4 --distances strain_db/strain_db.dists
It seems to have defined clusters, but when applied to a tree (in ggtree) the lineages bear no relation to the phylogenetic structure. This could reflect generating a poppunk database from assembly core genome and then comparing to a very constrained reference alignment derived tree, but I would expect to see at least some definition. You suggested this may relate to taxa labelling in the output.
I also note that the --external-clustering option I used here didn't result in an output file.
Many thanks, Mat