Banded Alignments and long sequences #400
Comments
vsearch implements two pairwise alignment methods. The manpage states that the second method is triggered when aligning very long sequences:
Implementing a banded pairwise alignment method for high-similarity searches is an interesting idea, duly noted. In the meantime, a good way to speed things up is to limit the number of pairwise alignments you actually need to perform. Increasing the wordlength value could help there by slightly reducing the number of candidate sequences to align:
|
Hi @frederic-mahe, I think this is the linear-space algorithm, which was developed for the case where the matrices built when aligning two sequences are too large to fit into memory. Our case is different: many cells in the matrix will never be used but are still generated.
We need to produce a large amount of alignments for benchmarking our tool and for finding useful cutoffs. So limiting The idea with wordsize is cool. I will check how that impacts runtime. Do you know how words are calculated if IUPAC characters appear in the reference sequences?
|
Thanks for suggesting the idea of a banded alignment. This is something that we have and will consider. It is clear that full alignment is time-consuming with long sequences and a banded alignment could save considerable time (and memory). However, it is complex to implement in combination with the SIMD parallelisation performed during alignment by vsearch. Unfortunately, it is currently not possible to output the order of accepted and rejected hits. It could potentially be added as a kind of diagnostic option. Words containing ambiguous nucleotide symbols (those other than |
Hi!
I have a question regarding handling of sequences with different length.
I usually align sequences with a fixed identity cutoff of 97% and, depending on the sample, different read lengths.
vsearch is really fast when aligning short sequences (~100 bp) but becomes prohibitively slow when aligning longer sequences (~250 bp). I assume this is due to the matrix sizes in the DP alignment step. Can that be? One way to address this problem is to allow a banded alignment. Assuming that I look for an alignment with 97% identity for a read of length 250, the read can contain at most 250 * 0.03 = 7.5 differences, so the maximum band needed would be 8 on either side of the diagonal. In terms of memory consumption, the full matrix would have 250 * 250 = 62500 cells (the reference is of course longer, but I assume there is a smart way this is taken into account). The reduced matrix would have 250 * 17 = 4250 cells, since the band is 2 * 8 + 1 = 17 cells wide. Considering that you need multiple matrices, and that not only creating but also filling them takes time, I would assume that this could lower the runtime significantly. Is there an option that does something similar?
Thanks a lot,
Hans
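To make the band arithmetic above concrete, here is a minimal Python sketch of a banded edit-distance computation. It is only an illustration of the idea, not how vsearch aligns sequences: it uses unit edit costs instead of vsearch's scoring scheme, counts identity as edits over the read length, does no SIMD parallelisation, and the function names `band_half_width` and `banded_edit_distance` are made up for this example.

```python
import math


def band_half_width(read_length, min_identity):
    """Maximum number of edits allowed by the identity cutoff, rounded up."""
    return math.ceil(read_length * (1.0 - min_identity))


def banded_edit_distance(a, b, k):
    """Edit distance between a and b if it is <= k, otherwise None.

    Only cells with |i - j| <= k are filled, so roughly len(a) * (2k + 1)
    cells are computed instead of len(a) * len(b).
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the end cell lies outside the band
    inf = k + 1  # stands in for cells outside the band
    # Row 0 of the matrix, restricted to the band.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
                continue
            diag = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            up = prev.get(j, inf) + 1
            left = cur.get(j - 1, inf) + 1
            cur[j] = min(diag, up, left)
        prev = cur
    distance = prev.get(m, inf)
    return distance if distance <= k else None


k = band_half_width(250, 0.97)   # half-width of the band
cells_banded = 250 * (2 * k + 1)  # cells filled for a 250 bp read
cells_full = 250 * 250            # cells in the full square matrix
print(k, cells_banded, cells_full)
```

Dictionaries keep the sketch short; a real implementation would store each row in a fixed-size array of 2k + 1 entries. Because a path with d edits never strays more than d cells from the main diagonal, restricting the computation to the band does not change the result whenever the true distance is at most k.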