GSoC 2022, GA4GH and BioSerDe

TL;DR: Check out BioSerDe and its accompanying Serde experiments with noodles-bed.

[Image: "A DNA sequencing format screaming for help and much needed change", by Salvador DALL-E]

GSoC 2022 and BioSerDe

The BioSerDe project, initially conceived in a noodles issue, has been developed by UMCCR in partnership with GA4GH, through GA4GH’s GSoC project proposals.

The project was developed by Gabriel Simonetto with the mentorship of Roman Valls Guimera, Michael Milton and Marko Malenic.

The mission is to have a safe and performant system to convert bioinformatics formats into alternative data representations.

Deciding on the rough shape of this “Bioinformatics Rosetta Stone”

How do we execute on that goal? Is it possible to build a Rosetta Stone that transforms, say, a BAM file into an Arrow data stream relatively effortlessly?

In the first two months of the project, we ran various experiments. We started with protobuf files, but quickly realized we would be wasting a lot of potential, since this approach would force all formats to conform to a row-based representation.

This started a discussion about the right way to define that representation, during which we explored many different options (1 2 3 4).

Ultimately, these experiments led us to conclude that noodles itself had to be modified to accommodate some of BioSerDe’s needs. This brought us back to the noodles issue that started it all. In the current state of that discussion, Michael Macias, the author of noodles, has asked for a demonstration of how BioSerDe would use those changes in the noodles-bed crate, as the simplest starting point among the VCF/BCF/SAM/BAM/CRAM bioinformatics format salad.

Working with formats inside of noodles

Once we concluded we needed to fork noodles in order to unlock progress, we developed a plan to showcase how Serde functionalities could be added to the crate.

The idea here is to create a serde_json representation of BED, plus a Serializer and a Deserializer for the regular BED formats, in order to take advantage of the Serde ecosystem. This would let serde-transcode convert between the two formats without collecting the entire input into an intermediate form in memory.

After some time testing how the API and code could be structured, we arrived at Serde-based solutions that produce the desired behavior cleanly.

At this point, derive macros on top of the BED record struct were already enough to get a basic representation of BED in JSON.

All that was left was to write a Serializer and a Deserializer for the regular BED format. From this phase of the project onwards, one concern is always present in our decisions: how to reduce code duplication, given that parts of the serialization and deserialization behavior already exist in the noodles codebase, namely in the Display and FromStr traits.

This gives us two options: use the existing noodles functionality inside the Serializer and Deserializer, or reimplement that functionality in the Serializer and Deserializer and then call them from the original noodles entry points (Display calls the Serializer, instead of the Serializer calling Display). We initially went with the first idea, but later developments suggested we might have to change our perspective.
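
To make the two traits concrete, here is a minimal, std-only sketch of the kind of Display and FromStr pair that a Serializer and Deserializer could delegate to under the first option. The `Bed3` struct and its error type are hypothetical simplifications for illustration, not the actual noodles-bed record.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical, simplified BED3 record (not the noodles-bed type).
#[derive(Debug, PartialEq)]
struct Bed3 {
    chrom: String,
    start: u64, // 0-based start, as written in BED text
    end: u64,   // end position (exclusive)
}

/// Text -> record: the parsing a Deserializer could delegate to.
impl FromStr for Bed3 {
    type Err = String;

    fn from_str(line: &str) -> Result<Self, Self::Err> {
        let mut fields = line.trim_end().split('\t');
        let chrom = fields
            .next()
            .filter(|c| !c.is_empty())
            .ok_or("missing chrom")?
            .to_string();
        let start = fields
            .next()
            .ok_or("missing start")?
            .parse::<u64>()
            .map_err(|e| format!("invalid start: {}", e))?;
        let end = fields
            .next()
            .ok_or("missing end")?
            .parse::<u64>()
            .map_err(|e| format!("invalid end: {}", e))?;
        // Any further columns are ignored in this sketch.
        Ok(Bed3 { chrom, start, end })
    }
}

/// Record -> text: the formatting a Serializer could delegate to.
impl fmt::Display for Bed3 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}\t{}\t{}", self.chrom, self.start, self.end)
    }
}
```

With this pair in place, `"chr1\t100\t200".parse::<Bed3>()` yields a record and `to_string()` writes it back as a tab-separated line, which is exactly the behavior we would rather reuse than reimplement.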

Our initial research eventually led us to the serde-with crate, which has many features but, most importantly for us, lets a struct use its Display and FromStr implementations as its Serde behavior. This was very promising: it suddenly became quick to implement the plethora of BED definitions, since each BED version can take care of its own particularities. Even better, Display and FromStr are well defined for many noodles formats, so this architecture should work well for them too.

At this point, we had Serde functionality for all BED formats, as well as the custom serde_json representation. With this basic behavior working, it was time to look into what a conversion between two formats would look like.

Experimenting with conversions between two formats

As mentioned before, our goal was to use serde-transcode as our main conversion tool, since the BioSerDe project aims to be memory efficient during conversions.

However, a quick inspection made us realize that serde-transcode works by receiving only a Serializer and a Deserializer object. That is a problem, since our previous solution relied on passing a type annotation so that serde-with knows which Display and FromStr implementation to use.

At this point, we had a decision to make. Either we drop serde-transcode and write a custom function that allocates memory internally in order to build a typed value during the conversion, or we go back on our architecture decision and find a way for the Serializer and Deserializer to produce valid BED entries during serialization and deserialization, without the help of serde-with.
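
The dependency on a type annotation can be illustrated without Serde at all. In the std-only sketch below, `Bed3`, `Bed4` and `reencode` are hypothetical names introduced for illustration. The generic function can only pick a Display and FromStr implementation when the caller names the record type, which is the information a function taking only a Serializer and a Deserializer does not have.

```rust
use std::fmt::{self, Display};
use std::str::FromStr;

/// Hypothetical simplified records; noodles-bed models these differently.
#[derive(Debug, PartialEq)]
struct Bed3 { chrom: String, start: u64, end: u64 }

#[derive(Debug, PartialEq)]
struct Bed4 { core: Bed3, name: String }

impl FromStr for Bed3 {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, String> {
        let f: Vec<&str> = s.split('\t').collect();
        if f.len() < 3 {
            return Err(format!("expected at least 3 fields, got {}", f.len()));
        }
        let start = f[1].parse::<u64>().map_err(|_| "invalid start".to_string())?;
        let end = f[2].parse::<u64>().map_err(|_| "invalid end".to_string())?;
        Ok(Bed3 { chrom: f[0].to_string(), start, end })
    }
}

impl FromStr for Bed4 {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, String> {
        // The fourth column is the feature name; the first three are shared.
        let name = s.split('\t').nth(3).ok_or("missing name")?.to_string();
        let core: Bed3 = s.parse()?;
        Ok(Bed4 { core, name })
    }
}

impl Display for Bed3 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}\t{}\t{}", self.chrom, self.start, self.end)
    }
}

impl Display for Bed4 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}\t{}", self.core, self.name)
    }
}

/// Parse a line as `T`, then print it back. `T` cannot be inferred from the
/// arguments, so the caller must spell it out: `reencode::<Bed4>(line)`.
fn reencode<T>(line: &str) -> Result<String, T::Err>
where
    T: FromStr + Display,
{
    let record: T = line.parse()?;
    Ok(record.to_string())
}
```

The same input line gives different output depending on the annotation: `reencode::<Bed3>` drops the name column, while `reencode::<Bed4>` keeps it.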

An extensive discussion on this problem and decision can be found here.

Present

This is where the project stands at the moment: we are searching for ways to call noodles functionality from within the serializers. Ideally, all parsing should be done by already existing code, which should be possible by tracking the state of serialization and calling the appropriate functions at the right times.
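
One possible shape for that state tracking, sketched with std only and under heavy assumptions: the `RecordAssembler` below is a hypothetical stand-in for a serializer that receives a record field by field, buffers the fields, and hands the finished line to an existing FromStr implementation when the record ends. It is not the BioSerDe design, and `Bed3` is again a simplified placeholder type.

```rust
use std::str::FromStr;

/// Hypothetical simplified record standing in for existing parsing code.
#[derive(Debug, PartialEq)]
struct Bed3 { chrom: String, start: u64, end: u64 }

impl FromStr for Bed3 {
    type Err = String;
    fn from_str(line: &str) -> Result<Self, String> {
        let f: Vec<&str> = line.split('\t').collect();
        if f.len() != 3 {
            return Err(format!("expected 3 fields, got {}", f.len()));
        }
        Ok(Bed3 {
            chrom: f[0].to_string(),
            start: f[1].parse::<u64>().map_err(|e| e.to_string())?,
            end: f[2].parse::<u64>().map_err(|e| e.to_string())?,
        })
    }
}

/// Where the (hypothetical) serializer currently is in the stream.
enum State {
    BetweenRecords,
    InRecord(Vec<String>), // fields seen so far for the open record
}

struct RecordAssembler { state: State }

impl RecordAssembler {
    fn new() -> Self {
        Self { state: State::BetweenRecords }
    }

    /// Called when a new record starts.
    fn begin_record(&mut self) -> Result<(), String> {
        match self.state {
            State::BetweenRecords => {
                self.state = State::InRecord(Vec::new());
                Ok(())
            }
            State::InRecord(_) => Err("record already open".to_string()),
        }
    }

    /// Called once per field, in column order.
    fn field(&mut self, value: &str) -> Result<(), String> {
        match &mut self.state {
            State::InRecord(fields) => {
                fields.push(value.to_string());
                Ok(())
            }
            State::BetweenRecords => Err("field outside of a record".to_string()),
        }
    }

    /// Called when the record ends: reuse the existing FromStr parser.
    fn end_record(&mut self) -> Result<Bed3, String> {
        match std::mem::replace(&mut self.state, State::BetweenRecords) {
            State::InRecord(fields) => fields.join("\t").parse(),
            State::BetweenRecords => Err("no open record".to_string()),
        }
    }
}
```

The point of the sketch is that validation lives in one place (the FromStr implementation), while the assembler only decides when to call it.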

Future

The intention of this fork is to experiment with how an ergonomic Serde implementation would fit within the intersection of Rust users and the bioinformatics community.

We think this approach is worth exploring because Serde is a very well understood crate within the Rust ecosystem and so, increasingly, is noodles as a Rust alternative to htslib. We are also aware of the risks and upcoming challenges of taking this approach, namely:

  1. Not being merged or supported upstream by noodles: To be fully clear, this is not a “hostile” fork by any stretch of the imagination. We want BioSerDe to be used by bioinformaticians and data scientists at large, and that is achieved by building community, not dividing it. Exploring the feasibility of embedding Serde into noodles helps us figure out limitations and drawbacks we can solve, refactor or reconsider later on. Perhaps serde-remote is all we need for our use case in the end? Or maybe we’ll see a fitting trait architecture at the end of this journey, paving the way for future contributions?
  2. When not used appropriately, Serde loads all bytes into memory. We know this does not scale well within the multi-gigabyte bioinformatics file formats ecosystem. Instead, we need to convert between formats by streaming bytes from the underlying noodles structures. This could be eased by serde-transcode’s capabilities, by third-party efforts such as tokio-serde, or by James Munns’ experiments with async and postcard, a format primarily designed for embedded targets.
  3. Some experimentation is still needed to find the best way to merge noodles functionality with Serde functionality. Maybe there is still a way to use serde-with from inside the serializers. Or maybe, even with the hardships of implementing more intricate Serializers, that is still the best way to end up with performant code. Once this architecture is defined, discussions around the future of the crate should be resumed.
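
The streaming concern in point 2 can be sketched with the standard library alone. The function below is a hypothetical illustration, not BioSerDe or serde-transcode code: it converts BED3 text into one JSON object per line while holding only the current line in memory. The JSON is written by hand and assumes chromosome names need no escaping.

```rust
use std::io::{self, BufRead, Write};

/// Build an InvalidData error for a malformed column.
fn invalid(what: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, format!("invalid {}", what))
}

/// Stream BED3 lines from `input` to `output` as JSON lines.
/// Returns the number of records written. Only one line is buffered at a time.
fn bed3_to_json_lines<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<u64> {
    let mut count = 0;
    for line in input.lines() {
        let line = line?;
        // Skip blank lines and '#' comment lines.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut fields = line.split('\t');
        let (chrom, start, end) = match (fields.next(), fields.next(), fields.next()) {
            (Some(c), Some(s), Some(e)) => (c, s, e),
            _ => return Err(invalid("record: expected 3 fields")),
        };
        let start: u64 = start.parse().map_err(|_| invalid("start"))?;
        let end: u64 = end.parse().map_err(|_| invalid("end"))?;
        // Naive JSON: assumes `chrom` contains no quotes or backslashes.
        writeln!(
            output,
            "{{\"chrom\":\"{}\",\"start\":{},\"end\":{}}}",
            chrom, start, end
        )?;
        count += 1;
    }
    Ok(count)
}
```

Because the function is generic over `BufRead` and `Write`, the same code works on a multi-gigabyte file wrapped in a `BufReader` or on an in-memory byte slice, and memory use stays proportional to the longest line rather than to the file size.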

Tons to explore and implement! We’re excited to see where this goes.

Join us

Interested in having your say? Reach out to us at the BioSerDe or umccr-noodles repos for discussion, issues and contributions!
