Roadmap / Todos #3
Comments
Hello! I've been thinking about what could be useful for a generic library for bioinformatics tasks. This is what I came up with:
I think it would be a good idea to see what other general bioinformatics libraries like BioPython have implemented. I hope I will be able to check this in a couple of days and add more suggestions and/or clarify what I wrote above. Also, the project's name "Rust-Bio, a rusty bioinformatics library" is probably a bit hard to understand at first glance. Googling "rust for bioinformatics" doesn't give you a link to this repo, and I think that search query is more frequent than other possible queries. May I suggest as a title something like "Rust-Bio, Rust for Bioinformatics", and after that, in the description, "Rust module for general bioinformatics tasks"? "BioRust" would probably be the best possible name because of various other libraries like BioPython, BioJava and BioRuby, but it has already been claimed: https://github.com/hirokai/BioRust and https://github.com/biorust . Feel free to comment on and criticise my suggestions, I'd be glad to discuss todos. |
Some more suggestions:
|
@parir Thanks! (: What algorithms for multiple alignments do you think would be the most useful? |
I agree with @VD-N. It would be good to look into the other Bio[programming language of your choice] frameworks to figure out common API design principles. |
@VD-N Tough question! I haven't worked with sequence alignments in the last 3-4 years, so I don't know the current state-of-the-art algorithms. Back then Muscle, MAFFT and ProbCons were supposed to be pretty good. ClustalW could also be included, because it's the classic one. |
Thank you, guys! |
Something similar to bx-python's clustertrees and intervaltrees, perhaps? Simple, but indispensable... |
Very good idea, I have added it to the list. |
An extremely common use case I had in Python was turning a bed file into a dict of cluster or intervaltrees, where the keys were chromosomes or chromosome/strand tuples. Might be nice to have. Of course this is trivial to code up oneself - is Rust-bio more about implementing the difficult stuff or also having convenience functions? |
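As a rough illustration of the use case described above, here is a minimal sketch that groups BED records by chromosome and answers overlap queries. It is not Rust-Bio's API: the names `Feature`, `bed_to_map` and `overlaps` are hypothetical, and a sorted `Vec` per chromosome stands in for a real interval tree to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// Half-open interval [start, end) with the optional BED name column.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Feature {
    pub start: u64,
    pub end: u64,
    pub name: String,
}

/// Parse BED text into a map from chromosome name to features sorted by start.
/// Empty lines, comments, track/browser headers and malformed lines are skipped.
pub fn bed_to_map(bed: &str) -> HashMap<String, Vec<Feature>> {
    let mut map: HashMap<String, Vec<Feature>> = HashMap::new();
    for line in bed.lines() {
        let line = line.trim();
        if line.is_empty()
            || line.starts_with('#')
            || line.starts_with("track")
            || line.starts_with("browser")
        {
            continue;
        }
        let mut fields = line.split_whitespace();
        let (chrom, start, end) = match (fields.next(), fields.next(), fields.next()) {
            (Some(c), Some(s), Some(e)) => (c, s, e),
            _ => continue,
        };
        let (start, end) = match (start.parse::<u64>(), end.parse::<u64>()) {
            (Ok(s), Ok(e)) => (s, e),
            _ => continue,
        };
        let name = fields.next().unwrap_or("").to_string();
        map.entry(chrom.to_string())
            .or_default()
            .push(Feature { start, end, name });
    }
    // Sorting by start lets queries stop scanning early.
    for features in map.values_mut() {
        features.sort_by_key(|f| f.start);
    }
    map
}

/// Return all features on `chrom` that overlap the half-open range [start, end).
/// This is a scan over a sorted vector, not a true interval tree query.
pub fn overlaps<'a>(
    map: &'a HashMap<String, Vec<Feature>>,
    chrom: &str,
    start: u64,
    end: u64,
) -> Vec<&'a Feature> {
    map.get(chrom)
        .map(|fs| {
            fs.iter()
                .take_while(|f| f.start < end)
                .filter(|f| start < f.end)
                .collect()
        })
        .unwrap_or_default()
}
```

Keying by a `(String, char)` chromosome/strand tuple instead would only change the map's key type; for large inputs a proper interval tree would replace the linear scan.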
The key idea is to have a toolbox for anything that reoccurs when doing bioinformatics in Rust. In this sense, your suggestion fits perfectly. |
Hi everyone,
let alphabet = alphabets::dna::iupac_alphabet();
let pos = suffix_array(text);
let bwt = bwt(text, &pos);
let fmindex = FMIndex::new(&bwt, 3, &alphabet);
is much more complicated for an intermediate user and takes more time to write and analyse than this:
let pos = suffix_array(text);
let bwt = bwt(text, &pos);
let fmindex = FMIndex::new(&bwt, 3, dna_iupac!()); |
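One way such a convenience macro could look is sketched below. This is only an illustration under stated assumptions: the `Alphabet` struct here is a hypothetical stand-in rather than Rust-Bio's actual type, and the symbol set is the 15 IUPAC nucleotide codes in upper and lower case. The macro returns an owned value, so a call site would pass `&dna_iupac!()` where a reference is expected.

```rust
/// Hypothetical stand-in for an alphabet type; not Rust-Bio's actual `Alphabet`.
#[derive(Debug)]
pub struct Alphabet {
    symbols: Vec<u8>,
}

impl Alphabet {
    /// Build an alphabet from a byte string, storing symbols sorted and deduplicated.
    pub fn new(symbols: &[u8]) -> Self {
        let mut symbols = symbols.to_vec();
        symbols.sort_unstable();
        symbols.dedup();
        Alphabet { symbols }
    }

    /// Check that every byte of `text` belongs to the alphabet.
    pub fn is_word(&self, text: &[u8]) -> bool {
        text.iter().all(|c| self.symbols.binary_search(c).is_ok())
    }

    /// Number of distinct symbols.
    pub fn len(&self) -> usize {
        self.symbols.len()
    }
}

/// Hypothetical shortcut macro: expands to a freshly built IUPAC DNA alphabet.
macro_rules! dna_iupac {
    () => {
        Alphabet::new(b"ACGTRYSWKMBDHVNacgtryswkmbdhvn")
    };
}
```

A plain function returning the alphabet would do the same job; the macro only saves the `use` and the intermediate binding.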
Hi, thanks for the useful suggestions! Let me answer piece by piece:
|
Oops, please pardon me for the late response!
|
Two more suggestions and one clarification to the previous one:
Also I would like to take this task: |
Regarding 2: The API I mean is this one: We have benchmarks now in the README.
Regarding 5: I agree, that should be easy to do. But if anybody wants to implement that, I would rather suggest doing it in a separate library. Most importantly, these clustering methods are useful for other fields as well, so why hide them in a bioinformatics library? Depending on an additional library with Cargo is extremely easy and does not make things more complicated for the user.
Regarding 6: Yes, implementing BAM/CRAM/BCF in pure Rust is surely possible. However, these custom binary formats are really complicated, and it would require quite some work. If anybody wants to try that, I would be happy to include the code or have it as an auxiliary project within Rust-Bio. But at the moment, I would prefer to focus on things that are not already possible using htslib.
Regarding 7: There certainly is some overhead, so you don't want to reinitialize, e.g., in an inner loop.
Regarding your second post
I have added your suggestions to the TODO list. |
Ok, thank you for your response! One more little suggestion: make |
Good idea. I have done that as suggested. |
Hello, I'm interested in rust-bio, but I have no experience with Rust. I only have experience with Biopython. Any suggestions? |
@rilut sure feel free to fork and try it out. I will mark it in the list. |
I'd also suggest SDP algorithms for sparse alignment - generally handy tools, especially when working with genomic assemblies or noisy long-read technologies. And much like @rilut , I'm fairly new to Rust and looking to learn more and help out. Will poke around a bit and see if anything jumps out at me. |
Thanks! I have added sparse alignment to the list! Looking forward to any contribution! |
This is more for rust-htslib, but it'd be nice to have an interface to the tabix stuff and potentially .csi. |
That's a good point, thanks! |
Greetings! I have just finished the basic workings of a transcriptome translator. The repo is available here: https://github.com/rundrop1/transcriptome_translator It currently generates all possible amino acid sequences from nucleotide sequences. Support for RNA is coming today. Getting output is currently hacky: the way I did it yesterday was to capture stdout by executing through vim, which requires a bit of trimming at the top of the file but otherwise works as a temporary solution. (Note that if you do it this way you will get a valid FASTA file.) I didn't look before making the parser, so I've implemented a FASTA parser with nom. The parser is currently only able to deserialize FASTA for further use. It also requires rust-phf, which currently requires a specific nightly. I also use memmap for loading the file. The file is not mutated during processing, and while the code is unsafe by memory-safety standards, in practice it is usable if you are sure your file will not be modified before the translator has a chance to complete. I'd like to know if the translator is of any use to your crate and how I could get it into a more usable state for acceptance into this toolkit. I'd also like to add features like automatically BLASTing results and am open to implementing others' ideas as well. I plan on adding features and otherwise maintaining 'transcriptome_translation' well into the future. |
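For readers unfamiliar with the task, generating "all possible" amino acid sequences from a nucleotide sequence is commonly done as a six-frame translation. The sketch below is not the code from the linked repository; it assumes the standard genetic code, treats `U` as `T` so RNA input also works, and uses hypothetical function names. Codons containing any other character are translated to `X`, and stop codons to `*`.

```rust
/// Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with bases ordered T, C, A, G.
const AMINO_ACIDS: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Map a nucleotide to its index in the T, C, A, G ordering (U is treated as T).
fn base_index(b: u8) -> Option<usize> {
    match b.to_ascii_uppercase() {
        b'T' | b'U' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Translate `seq` starting at offset `frame` (0, 1 or 2).
/// Trailing bases that do not form a full codon are ignored.
pub fn translate_frame(seq: &[u8], frame: usize) -> String {
    seq.get(frame..)
        .unwrap_or(&[])
        .chunks_exact(3)
        .map(|codon| {
            match (
                base_index(codon[0]),
                base_index(codon[1]),
                base_index(codon[2]),
            ) {
                (Some(a), Some(b), Some(c)) => AMINO_ACIDS[16 * a + 4 * b + c] as char,
                _ => 'X', // ambiguous or unknown base in this codon
            }
        })
        .collect()
}

/// Reverse complement as DNA; unknown characters become N.
pub fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b.to_ascii_uppercase() {
            b'A' => b'T',
            b'T' | b'U' => b'A',
            b'C' => b'G',
            b'G' => b'C',
            _ => b'N',
        })
        .collect()
}

/// Translate the three forward frames followed by the three reverse-complement frames.
pub fn six_frame(seq: &[u8]) -> Vec<String> {
    let rc = reverse_complement(seq);
    (0..3)
        .map(|f| translate_frame(seq, f))
        .chain((0..3).map(|f| translate_frame(&rc, f)))
        .collect()
}
```

Indexing a 64-byte table avoids a hash map lookup per codon, which is one alternative to a perfect-hash table for this job.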
Hi! |
Hey @johanneskoester, I'd be happy to implement a gtf/gff reader. |
Great, thank you. I have added you to the list! |
Great! Did you have any specific preferences for the implementation? I was just planning on starting by making it as similar as possible to what you have for FASTA. I'm still relatively new to Rust, so it'll probably be fairly iterative, if you don't mind giving input while I learn best practices. |
@anderspitman I've just created #rust-bio on irc.mozilla.org. If you care to join me, I can help answer your questions regarding Rust. It seems that the reader is not too complicated, and I can walk you through how to do some of the Rusty things. My implementation is verbose but easily readable, with quite a few comments to explain the exact steps taken. I am not a bioinformatics professional, but I essentially freelance for labs at a local university when computation can be applied to the questions they are trying to answer. @johanneskoester |
What are the thoughts on adding an AppVeyor build to check the library's build status on Windows systems? A template for doing this already exists at: https://github.com/japaric/trust This repo also includes a .travis.yml template for building on both Linux and MacOS, but that may muddy the waters on a TravisCI build failure being on Linux or MacOS. |
@rhagenson I don't think this is needed. Rust-Bio is pure Rust, hence it should work the same on all OSes. |
The wording there is what makes me want to add it whether it is necessary or not. The library "should" work the same no matter the OS, but having an extra badge tells potential users that, as far as we have tested, the library *does* work.
I'll let you make the final call, Johannes. There are certainly other things for me to work on but I am willing to get AppVeyor builds going if it is desired.
|
So, if there is a difference, it would be a bug in Rust, I think. I am hesitant to activate Windows builds because, to be fair, we would also have to activate OS X builds. This is 3x the build load we have now, for almost no benefit, since there is very unlikely to be a difference. I am also thinking about handling nature's resources responsibly here. What we could do is a weekly test of the master branch. |
I can appreciate the conservative approach and it certainly is a bit excessive to launch three builds with each update to master. I am not sure how we would automate a weekly Windows build test though. Do you know of a way to do this? I am not seeing anything to space builds out by a week with AppVeyor. |
If it is not yet possible, we could ask for a weekly build feature. No need to rush, windows builds can wait. |
The BioJulia project seems to have good ideas that could be copied. Search for BioJulia on https://julialang.org/blog/ for some interesting ideas (like how to store FASTA sequences efficiently in memory). BioJulia also has proper parsers for different bioinformatics formats, which could be useful to look at (when reimplementing parsers with nom):
https://homes.cs.washington.edu/~dcjones/biojl/parsing.html Some examples: https://github.com/BioJulia/GenomicFeatures.jl/tree/master/src/ |
Thanks! This is certainly something to consider! |
Has anyone written a Partial-Order Aligner in Rust? I was thinking of trying my hand at it over the break, but I'd hate to duplicate effort if someone has already started |
In fact, I want a partial-order aligner so much I, err, already wrote one: So I guess I volunteer for that? |
Great!! |
@johanneskoester I just introduced myself here #27 :) Do you have any recommendations for a short and sweet project? Something that would take a day or two, just to discover your code base and see what's going on! |
Hi and welcome @wizofe! Unfortunately, all the currently open todos are rather big. However, feel free to propose any textbook algorithm that is still missing. Another possibility is to write a file parser for a format that is not yet available in rust-bio or rust-htslib. |
Hi! I just introduced myself in the separate thread. I'd like to know what the plans are for including PDB format parsing, especially for proteins. |
Hi. I love dynamic programming algorithms, graph algorithms, and anything involving the concept of an approximation algorithm (TSP approximation). I have a few years of Haskell experience, but I have not touched Rust yet. I am interested in working at the intersection of algorithms, Rust, and bioinformatics, especially if I find something that brings together distributed algorithms and sequential algorithms to scale horizontally and vertically in combination. I also tend to enjoy short programming exercises like HackerRank or Codility, if that helps with what tasks you suggest. I'll try to look through the list of todos, but feel free to suggest some "new to Rust" tasks for me to get my feet wet. |
@brianjimenez welcome! The plan is: if there is somebody who wants to do the work, we are happy to include it. In general the dev model is that things are added as needed and whenever a volunteer does the coding. Sorry guys for the late response, I have been on vacation. |
@johanneskoester thanks a lot! I started reading the 2018 edition of the Rust book a few weeks ago and have been messing around with a dirty approach to PDB parsing here. Still a lot of Rust to learn, but I'll be happy to help with the PDB parts. |
I am looking for something that calculates the molecular weight of nucleic acids and polypeptides, and there doesn't seem to be anything that does that on crates.io right now. Is that in scope for this library? I'd be happy to implement it if so. |
@jimrybarski Yes, absolutely! |
I'm wondering if rust-overlaps could be a reference for the overlaps task? |
I would be happy to start writing more benchmark tests. I'll start with the functions related to sparse alignments. Should I write benchmarks for every non-trivial function, or should I work on a particular subset of functions? |
Hello! Is similarity clustering in the scope of this library (or rust-bio-tools)? It would be nice to know if someone is working on a UCLUST clone or any of the mmseqs2 clustering methods. |
Sure, any more general bioinformatics algorithms that can be used across tools should go into rust-bio, general data types can be defined in rust-bio-types and if you want to create some simple command-line tool based on rust-bio (and/or rust-htslib), you can do this in rust-bio-tools. Any bigger tools that use general rust-bio functionality but do some bigger amounts of extra work probably warrant a separate crate. As for understanding the algorithms, the links you provided point to papers that are probably a good point to start, and for mmseqs2 you could dig into the code -- it does not seem to contain too many comments, but looks well structured. In general, it seems to support a lot of different functionality and it's probably a good idea to pick some small functionality to start with. |
Do you have any plans to set up OpenCollective/GitHub Sponsors? I would certainly be happy to send a bit from time to time, like it's done with BioJulia:
|
Hey, @johanneskoester , could you elaborate on the point |
Hi, I would like to contribute to the Hidden Markov model algorithms, Viterbi and perhaps Baum-Welch. |
Regarding the use of the I found this wrapper on github associated to https://bccfe.ca/ |
This is a continuously updated list of Todos. If you have a suggestion, comment on the issue. I will update the list if the suggestion fits in.
Please note that none of the items are guaranteed to be implemented. If you want to make sure something is done, consider contributing the code to Rust-Bio; any help is welcome! Of course you will be listed as one of the authors.
If you want to implement an item, please post it in this thread and I will mark it in the list so that we don't duplicate work.
Changes to current code
New code
bio::alphabets