GSoC 2022, GA4GH and BioSerDe

TL;DR: Check out BioSerDe and its accompanying SerDe experiments with noodles-bed.

[Image: a DNA sequencing format screaming for help and much-needed change, by Salvador DALL-E]

GSoC 2022 and BioSerDe

The BioSerDe project, initially conceived in a noodles issue, is finally being worked on by UMCCR in partnership with GA4GH, through its GSoC proposals.

It is currently being developed by Gabriel Simonetto under the mentorship of Roman Valls Guimera, Michael Milton and Marko Malenic.

The mission is to have a safe and performant system to convert bioinformatics formats into alternative data representations.

Deciding on the rough shape of this “Bioinformatics Rosetta Stone”

How do we execute on that goal? Is it possible to build a Rosetta Stone that transforms, say, a BAM file into an Arrow data stream relatively effortlessly?

In the two months the project has been running, we have made various experiments. We started with initial attempts using protobuf files, but quickly realized we would be wasting a lot of potential, since this approach would force all formats to conform to a row-based representation.
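
As a rough illustration of why the layout matters (this is not code from the BioSerDe or noodles repositories, and it uses neither the protobuf nor the Arrow APIs), the sketch below stores the same BED-like features row-wise and column-wise. The names `BedRow`, `BedColumns`, `to_columns` and `total_length` are hypothetical and exist only for this example.

```rust
/// One BED3-style feature stored row-wise: all fields of a record sit together.
#[derive(Debug, Clone, PartialEq)]
struct BedRow {
    chrom: String,
    start: u64,
    end: u64,
}

/// The same features stored column-wise: one vector per field,
/// which is the general shape columnar formats such as Arrow favour.
#[derive(Debug, Default, PartialEq)]
struct BedColumns {
    chrom: Vec<String>,
    start: Vec<u64>,
    end: Vec<u64>,
}

/// Pivot row-oriented records into a column-oriented layout.
fn to_columns(rows: &[BedRow]) -> BedColumns {
    let mut cols = BedColumns::default();
    for row in rows {
        cols.chrom.push(row.chrom.clone());
        cols.start.push(row.start);
        cols.end.push(row.end);
    }
    cols
}

/// Sum of feature lengths, reading only the two columns it needs.
/// `saturating_sub` avoids a panic if a record has end < start.
fn total_length(cols: &BedColumns) -> u64 {
    cols.start
        .iter()
        .zip(&cols.end)
        .map(|(start, end)| end.saturating_sub(*start))
        .sum()
}
```

With the columnar layout, a query such as `total_length` never touches the `chrom` strings, which is the kind of potential a purely row-based intermediate representation would leave on the table.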

This started a discussion about the right way to define said representation, in which we explored many different options (1 2 3 4).

Ultimately, the conclusion so far from all of these experiments is that noodles needs to be modified to accommodate some of BioSerDe's needs. This brought us back to the noodles issue that started it all. In that discussion, Michael Macias, author of Noodles, is currently asking for a demonstration of how BioSerDe would use those changes in the noodles-bed crate, as the simplest starting point among the VCF/BCF/SAM/BAM/CRAM bioinformatics format salad.

Present

And that is where the project currently stands: over the last couple of weeks we have forked noodles under the umccr organization to make the changes needed for this demonstration, with the first bits of code already being worked on (5 6).

Future

The intention of this fork is to explore how an ergonomic SerDe implementation would fit at the intersection of Rust users and the bioinformatics community.

We think this approach is worth exploring because SerDe is a very well understood crate within the Rust ecosystem, and so, increasingly, is Noodles as a Rust alternative to htslib. We are also aware of the risks and upcoming challenges of taking this approach, namely:

  1. Not being merged or supported upstream by Noodles: To be fully clear, this is not a “hostile” fork by any stretch of the imagination. We aim for BioSerDe to be used by bioinformaticians and data scientists at large, and that is achieved by building community, not dividing it. Exploring the feasibility of embedding SerDe into Noodles helps us figure out limitations and drawbacks that we can solve, refactor or reconsider later on. Perhaps serde-remote is all we need for our use case in the end? Or maybe we’ll see a fitting trait architecture at the end of this journey, paving the way for future contributions?
  2. When not used appropriately, SerDe loads all bytes into memory. We know this does not scale well within the multi-gigabyte bioinformatics file format ecosystem. Instead, we need to convert between formats by streaming bytes from the underlying noodles structures. This could be eased by serde-transcoding capabilities or by third-party efforts such as tokio-serde, or by James Munns' experiments with async and postcard, a format primarily designed for embedded targets.
  3. Unknown unknowns: we’ll probably fill this up at the end of this GSoC edition with our learnings and insight.
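
To make the streaming concern in point 2 concrete, here is a minimal sketch that uses only the Rust standard library. It is a stand-in for the idea, not the SerDe, serde-transcode or noodles APIs. The function name `bed_to_json_lines` is hypothetical, and it assumes BED3-style, tab-separated input whose chromosome names need no JSON escaping.

```rust
use std::io::{self, BufRead, Write};

/// Convert BED3-style lines into one JSON object per line.
/// Only the current line is held in memory, so the input can be
/// arbitrarily large. Returns the number of records written.
fn bed_to_json_lines<R: BufRead, W: Write>(reader: R, mut writer: W) -> io::Result<u64> {
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        // Skip blank lines and comment lines.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut fields = line.split('\t');
        let (chrom, start, end) = match (fields.next(), fields.next(), fields.next()) {
            (Some(c), Some(s), Some(e)) => (c, s, e),
            _ => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "expected at least 3 tab-separated fields",
                ))
            }
        };
        let start: u64 = start
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        let end: u64 = end
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        // Assumption: `chrom` is written verbatim, with no JSON escaping.
        writeln!(
            writer,
            "{{\"chrom\":\"{}\",\"start\":{},\"end\":{}}}",
            chrom, start, end
        )?;
        count += 1;
    }
    Ok(count)
}
```

Because the function is generic over `BufRead` and `Write`, the same code could read from a buffered file or standard input and write to a file, a socket or an in-memory buffer, and memory use stays bounded by the longest line rather than by the file size.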

Tons to explore and implement! We’re excited to see the outcomes of the second GSoC 2022 term.

Join us

Interested in having your say? Reach out to us at the BioSerDe or umccr-noodles repos for discussion, issues and contributions!
