GSoC 2021, GA4GH and htsget-rs
TL;DR: https://github.com/umccr/htsget-rs
The Rust htsget server sketched a few months ago at UMCCR has come together thanks to GA4GH's generous allocation of two students to this effort. The goal was to nourish the Rust bioinformatics ecosystem while delivering an implementation of a GA4GH driver technology: the htsget protocol.
After a bit of a struggle finding suitable candidates:
> I didn't anticipate it'd take so much time to find eager @rustlang students for #GSoC https://t.co/LH20mjKI8T…
>
> Any takers out there? Ping me and let's talk :)
>
> — a.k.a brainstorm (@braincode) April 7, 2021
The work carried out by Daniel Castillo de la Rosa and Marko Malenic, under the mentoring of Christian Pérez-Llamas, Victor San Kho Lin, Michael Macias and Roman Valls Guimera (myself), has been published in this repo and is ready to use 🎉
To sum up, this project has:
- Implemented all major bioinformatic formats that htsget supports: VCF, BCF, BAM and CRAM… along with their corresponding indices. Non-bioinfo savvy readers: get acquainted with some of those formats here.
- Tested and reported bugs to a crucially important underlying Rust library: Noodles.
- Added a local htsget (HTTP) server that can be spawned for testing or on-premise deployment purposes (see the query sketch right after this list).
- Created a benchmark suite to spot performance regressions and compare against other third party implementations.
- Code-reviewed the implementations and documented the architecture and operation, including functional and integration tests.
- Prompted third party genome viewers to properly support GA4GH’s htsget specification in their implementations.
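To get a feel for the protocol, here's a minimal sketch of querying such a locally spawned server. The base URL, port and the "sample1" ID are assumptions rather than htsget-rs defaults; adjust them to your deployment. It uses the `reqwest` crate (with the "blocking" feature):

```rust
// Ask an htsget server for BAM records overlapping chr1:1-1000.
// Instead of the data itself, the server replies with a JSON "ticket":
// a list of URLs (and byte ranges) the client then fetches and concatenates.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "http://localhost:8080/reads/sample1?format=BAM&referenceName=chr1&start=1&end=1000";
    let ticket = reqwest::blocking::get(url)?.text()?;
    println!("{ticket}");
    Ok(())
}
```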
Also, to be clear, there’s work to be finished up:
- Proper local testing and deployment of AWS Lambdas. Unfortunately, there's a chain of upstream AWS dependencies that needs to be fixed first. To clarify: Rust lambdas can be deployed with SAM but not tested locally, which makes local develop/deploy iterations too cumbersome to be practical at the time of writing.
- S3 storage backend, supporting “non-immediately” accessible storage tiers such as AWS Glacier or Deep Archive.
- DRS compatible object ID resolver, to improve integration of other GA4GH-standardized information retrieval mechanisms.
And future avenues for improvement are (but not limited to):
- Other cloud-native deployments to major public cloud vendors such as Cloudflare, Google or Azure.
- Other (cloud-native?) storage backends such as MinIO.
- Profiling and optimization: cargo-flamegraph, cargo-instruments, etc.
- Additional HTTP server implementations such as Warp; currently only actix-web is implemented, but others can be built on top of the http-core abstraction.
Now, if you want more nitty-gritty details and ways to contribute, keep reading to see how Marko and Daniel implemented several of htsget-rs's underpinnings.
Marko Malenic
The “qualifying task” chosen by Marko was along the lines of the class attribute in the htsget spec, which lets clients request, for instance, only the header of a file, and he did a good job implementing it.
Marko took the initial BAM search module as a reference and implemented the CRAM search backend. After making some necessary changes to the local files storage backend, such as adding functionality to return object sizes, we realised that there was a significant amount of code replication across formats: at the end of the day we search bytes in every format, so there's a fair share of commonalities.
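As an illustration of that idea (the names here are hypothetical, not the actual htsget-rs API), a shared trait can capture the per-format differences, index parsing and byte-range calculation, while the rest of the search pipeline stays common:

```rust
use std::io;

/// A half-open byte range within a stored object.
pub struct BytesRange {
    pub start: u64,
    pub end: u64,
}

/// Common interface for format-specific searches: each format only has to
/// say how to read its index and how to turn a query into byte ranges.
pub trait Search {
    type Index;

    /// Load this format's index (e.g. BAI, CRAI, TBI, CSI).
    fn read_index(&self, key: &str) -> io::Result<Self::Index>;

    /// Map a genomic region onto byte ranges of the underlying file.
    fn get_byte_ranges(
        &self,
        index: &Self::Index,
        reference_name: &str,
        start: u32,
        end: u32,
    ) -> io::Result<Vec<BytesRange>>;
}
```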
Last but not least, as Noodles gained support for async methods, htsget-rs incorporated async facets as well. This is why Marko separated the async code from the blocking code, so that Daniel could easily benchmark both implementations and measure their impact.
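The shape of that split looks roughly like this; a sketch, not the actual htsget-rs module layout (the async side assumes the `tokio` crate with the "fs" feature):

```rust
// Keep a blocking and an async flavor of the same operation side by side,
// so the two can be benchmarked against each other.
pub mod blocking {
    use std::io;

    pub fn read_object(path: &str) -> io::Result<Vec<u8>> {
        // Synchronous read of a whole object.
        std::fs::read(path)
    }
}

pub mod non_blocking {
    use std::io;

    pub async fn read_object(path: &str) -> io::Result<Vec<u8>> {
        // Asynchronous counterpart; needs a running tokio runtime.
        tokio::fs::read(path).await
    }
}
```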
For a more detailed list of the particular work units and commits, there’s a gist for that.
Daniel Castillo de la Rosa
Daniel hit the ground running on the quals by filing a PR with the remaining htsget query builder methods implemented, followed by tests for the local storage backend.
Starting with the already existing BAM search support, Daniel gained a greater understanding of BAI (BAM index) VirtualOffsets and managed to optimise htsget's returned ranges accordingly. On top of that, during his main work on htsget-rs, he found several issues in the underlying library, Noodles, that will definitely help other bioinformatics developers using it in the future. We would like to thank our co-mentor Michael Macias, a.k.a @zaeleus, for his light-speed fixing of the reported bugs and implementation of new features that made this project possible.
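For the curious: a BAI index addresses data through 64-bit BGZF virtual offsets, where, per the SAM/BAM specification, the upper 48 bits point at a compressed BGZF block and the lower 16 bits at a position inside that block once decompressed. Decoding one is a couple of shifts:

```rust
// Decode a BGZF virtual file offset as found in BAI indices:
// upper 48 bits = byte offset of the compressed BGZF block,
// lower 16 bits = offset within that block after decompression.
fn decode_voffset(voffset: u64) -> (u64, u16) {
    let coffset = voffset >> 16;
    let uoffset = (voffset & 0xFFFF) as u16;
    (coffset, uoffset)
}

fn main() {
    // (4096 << 16) | 17: block starts at byte 4096, record at byte 17.
    let (coffset, uoffset) = decode_voffset((4096 << 16) | 17);
    println!("block at byte {coffset}, record at byte {uoffset} within it");
}
```

The byte ranges htsget hands back to clients are built from the compressed offsets, which is where tightening the returned ranges pays off.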
After BAM support was established and well tested, Daniel moved on to implementing VCF and its compressed binary counterpart, BCF, as their support landed in Noodles, engaging with other Noodles users along the way.
With all the major bioinformatics formats in the htsget spec implemented, Daniel went through the HTTP side of htsget: he added the http-actix-web HTTP server and an abstraction, http-core, on which other HTTP servers can be built, plus the GA4GH service-info endpoint, which gives a bird's-eye view of what an htsget endpoint provides, along with its documentation.
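Sketched below with serde is roughly what that payload carries. It's a simplified subset of the GA4GH service-info schema, and the structs are illustrative rather than htsget-rs's actual types (assumes the `serde` crate with the "derive" feature, plus `serde_json`):

```rust
use serde::Serialize;

// htsget-specific extension block of service-info.
#[derive(Serialize)]
struct Htsget {
    datatype: String,     // "reads" or "variants"
    formats: Vec<String>, // e.g. ["BAM", "CRAM"]
}

// Heavily trimmed service-info document; the real schema
// also carries organization, type, version, and more.
#[derive(Serialize)]
struct ServiceInfo {
    id: String,
    name: String,
    htsget: Htsget,
}

fn main() -> serde_json::Result<()> {
    let info = ServiceInfo {
        id: "org.example.htsget".into(),
        name: "htsget-rs demo".into(),
        htsget: Htsget {
            datatype: "reads".into(),
            formats: vec!["BAM".into(), "CRAM".into()],
        },
    };
    println!("{}", serde_json::to_string_pretty(&info)?);
    Ok(())
}
```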
Furthermore, htsget needs some middleware that disambiguates the names and locations of particular datasets, and that's what Daniel addressed with his regex-based ID resolver, mirroring what the GA4GH reference implementation does. In the future, other data/ID resolver mechanisms should be implemented for better interoperability with research institutes, industry and other htsget users at large.
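To make the idea concrete, here's a hypothetical regex-based resolver; names and structure are illustrative, not the htsget-rs API (assumes the `regex` crate). A pattern plus a substitution turn a public htsget ID into a storage key:

```rust
use regex::Regex;

// A pattern and a replacement template that maps matched IDs to paths.
struct RegexResolver {
    pattern: Regex,
    replacement: String,
}

impl RegexResolver {
    /// Resolve an ID to a storage path, or None if it doesn't match.
    fn resolve(&self, id: &str) -> Option<String> {
        if self.pattern.is_match(id) {
            Some(self.pattern.replace(id, self.replacement.as_str()).into_owned())
        } else {
            None
        }
    }
}

fn main() {
    let resolver = RegexResolver {
        // Map "project/sample" IDs onto a bucket-style key layout.
        pattern: Regex::new(r"^(?P<proj>\w+)/(?P<sample>\w+)$").unwrap(),
        replacement: "data/$proj/bam/$sample.bam".to_string(),
    };
    assert_eq!(
        resolver.resolve("umccr/sample1").as_deref(),
        Some("data/umccr/bam/sample1.bam")
    );
}
```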
Lastly, Daniel is working on adding benchmarks with the excellent criterion-rs crate, which provides a good base for comparing our implementation against itself (across code changes) and against other third-party implementations.
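A minimal criterion-rs benchmark looks like the following; the `search_all_records` workload is a placeholder of my own, not an htsget-rs function. It would live under benches/ and run with `cargo bench`:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder workload standing in for a real htsget search.
fn search_all_records() -> u64 {
    (0..10_000u64).sum()
}

fn bench_search(c: &mut Criterion) {
    // criterion samples this closure repeatedly and reports timing
    // statistics, flagging regressions across runs.
    c.bench_function("search all records", |b| b.iter(search_all_records));
}

criterion_group!(benches, bench_search);
criterion_main!(benches);
```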
Future
Beyond deploying htsget-rs for your own use case, here are some assorted TODOs and ideas that can be explored:
- Implement support for Crypt4GH-encrypted files.
- Implement other ID resolvers, possibly integrating well with DRS.
- Implement other Storage backends, ideally leading to a BioSerDe crate where all common cloud-native formats are supported out of the box: Parquet, ORC, json.gz… That would allow comparisons against more custom-designed storage representations such as pVCF, SAV and other upcoming formats proposed at GA4GH, which have relatively poor compatibility with the rest of the (big data) ecosystem and tooling found in today's commercial cloud systems.
One thing that prevents wider cloud-native adoption is formats that are not cloud-native:
- Big BAM/CRAM files throttle the throughput of S3 object-store transfers. Using a more object-storage-friendly backend could work much better with htsget and other bioinformatics analysis pipelines.
- Where does htsget fit into the greater picture, as middleware between DRS, clients and other third-party programs and hardware such as Illumina's DRAGEN and ICA service?