GSoC 2022, GA4GH and BioSerDe (pt2)

TL;DR: Check out BioSerDe and its accompanying SerDe experiments with noodles-bed.

This is the second part of the previous post.

GSoC 2022 and BioSerDe

The BioSerDe project, initially conceived in a noodles issue, is finally being worked on by UMCCR in partnership with GA4GH through its GSoC proposals.

It is currently being developed by Gabriel Simonetto under the mentorship of Roman Valls Guimera, Michael Milton and Marko Malenic.

The mission is to have a safe and performant system to convert bioinformatics formats into alternative data representations.

…Previously

Once we concluded that we needed to fork noodles in order to unblock progress, we developed a plan to showcase how serde functionality could be added to the crate.

Working with formats inside of noodles

The idea here is to create a serde_json representation of BED, along with a Serializer and a Deserializer for the regular BED formats, so that we can take advantage of the serde ecosystem. That would let us use serde-transcode to convert between the two formats without collecting the entire input into an intermediate form in memory.

After some time exploring how the API and code should be structured, we arrived at a few serde-based solutions that produce the desired behavior in a clean way.

At this point, the derive macros on top of the BED record struct were already enough to give us a basic representation of BED in the JSON format.

All that was left to do was to write a Serializer and a Deserializer for the regular BED format. From this phase of the project onwards, one concern is always present in our decisions: how to reduce the code duplication that arises because parts of the serialization and deserialization behavior already exist in the noodles codebase, namely in the Display and FromStr traits.
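To make the Display and FromStr concern concrete, here is a minimal sketch of what such a pair looks like for a three-column BED line. The `Bed3` struct, the `ParseError` enum and the `parse_position` helper are hypothetical, simplified stand-ins written for this post; they are not the noodles-bed types, which handle many more fields and edge cases.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical, simplified BED3 record (not the noodles-bed type).
#[derive(Debug, PartialEq)]
struct Bed3 {
    chrom: String,
    start: u64,
    end: u64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField(&'static str),
    InvalidPosition(String),
}

/// Text output: the three fields separated by tabs.
impl fmt::Display for Bed3 {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}\t{}\t{}", self.chrom, self.start, self.end)
    }
}

/// Text input: the inverse of Display. Columns after the third are ignored here.
impl FromStr for Bed3 {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut fields = s.split('\t');
        let chrom = fields
            .next()
            .filter(|c| !c.is_empty())
            .ok_or(ParseError::MissingField("chrom"))?
            .to_string();
        let start = parse_position(fields.next(), "start")?;
        let end = parse_position(fields.next(), "end")?;
        Ok(Bed3 { chrom, start, end })
    }
}

/// Parses one numeric column, reporting which column was missing or malformed.
fn parse_position(field: Option<&str>, name: &'static str) -> Result<u64, ParseError> {
    let raw = field.ok_or(ParseError::MissingField(name))?;
    raw.parse()
        .map_err(|_| ParseError::InvalidPosition(raw.to_string()))
}
```

With this in place, `"chr1\t100\t200".parse::<Bed3>()` yields a record and `record.to_string()` gives the same line back. A serde Serializer and Deserializer for the same text format would have to encode exactly this knowledge a second time, which is the duplication we want to avoid.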

This gives us two options: use the noodles functionality inside the Serializer and Deserializer, or reimplement that functionality in the Serializer and Deserializer and then call it from the original noodles entry points (Display calls the Serializer, instead of the Serializer calling Display). We initially went with the first option, but later developments revealed that we might have to change our perspective.

Our initial research eventually turned up the serde-with crate, which has many features but, most importantly for us, allows a struct to use its Display and FromStr traits as its serde behavior. This was very promising: it suddenly became very quick to implement serde for the plethora of BED definitions, as each BED version can take care of its own particularities. Even better, Display and FromStr are well defined for many noodles formats, so this architecture should work well for them too.

At this point, we had serde functionality for all BED formats, as well as the custom serde_json representation. With this basic behavior working, it was time to look into what a conversion between two formats would look like.

Experimenting with conversions between two formats

As mentioned before, our goal was to use serde-transcode as our main conversion tool, since the BioSerDe project cares about being memory efficient during conversions.

However, a quick inspection made us realize that serde-transcode works by receiving only a Serializer and a Deserializer object. That is a problem, since our previous solution relied on passing a type annotation so that serde-with knows which Display and FromStr implementation to use.

At this point, we had a decision to make. Either we drop serde-transcode and write a custom function that allocates memory internally, so that it can name a type to use in the process, or we go back on our architecture decision and find a way for the Serializer and Deserializer to produce valid BED entries during serialization and deserialization without the help of serde-with.
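As a rough illustration of the first option, the sketch below shows a custom conversion function that depends on a concrete type. It uses only the standard library, the name `convert_lines` is hypothetical, and it is one possible reading of that option rather than BioSerDe code. Each line is parsed into a `T` and written back out, so the caller has to name `T`, which is exactly the information serde-transcode has no way to receive.

```rust
use std::fmt::Display;
use std::io::{self, BufRead, Write};
use std::str::FromStr;

/// Hypothetical helper: re-emits every input line through `T`'s
/// FromStr and Display implementations, one record at a time.
/// Returns the number of records written.
fn convert_lines<T, R, W>(reader: R, mut writer: W) -> io::Result<usize>
where
    T: FromStr + Display,
    R: BufRead,
    W: Write,
{
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        // An intermediate value of type `T` is allocated for each record.
        let record: T = line.parse().map_err(|_| {
            io::Error::new(
                io::ErrorKind::InvalidData,
                format!("invalid record: {}", line),
            )
        })?;
        writeln!(writer, "{}", record)?;
        count += 1;
    }
    Ok(count)
}
```

The type parameter decides the behavior: `convert_lines::<u64, _, _>` turns the input line `007` into `7`, while `convert_lines::<String, _, _>` passes it through unchanged. In our case `T` would be the record type of the chosen BED version.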

An extensive discussion on this problem and decision can be found here

Present

This is where the project stands at the moment: we are currently searching for ways to call upon noodles functionality from within the serializers. Ideally, all parsing should be done using already existing code, which should be possible by tracking the state of serialization and calling the appropriate functions at the right times.
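To give a feel for what "tracking the state" could mean, here is a small standard-library sketch. It is an assumption about the general shape of such a solution, not the approach the project has settled on, and `Expect` and `Bed3LineBuilder` are hypothetical names. The builder receives fields one at a time, much like a serializer receives values, and uses its current state to decide which existing parsing routine to call. Here `u64`'s FromStr stands in for the noodles code.

```rust
/// Hypothetical position within a BED3 record; not a noodles or BioSerDe type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Expect {
    Chrom,
    Start,
    End,
    Done,
}

/// Collects fields one at a time and uses its state to decide
/// how each one is checked and written.
struct Bed3LineBuilder {
    state: Expect,
    line: String,
}

impl Bed3LineBuilder {
    fn new() -> Self {
        Self { state: Expect::Chrom, line: String::new() }
    }

    /// Accepts the next field, or reports why it does not fit the current state.
    fn push(&mut self, field: &str) -> Result<(), String> {
        match self.state {
            Expect::Chrom => {
                if field.is_empty() {
                    return Err("chrom must not be empty".to_string());
                }
                self.line.push_str(field);
                self.state = Expect::Start;
            }
            Expect::Start | Expect::End => {
                // Stand-in for "existing code": reuse u64's FromStr for validation.
                let position: u64 = field
                    .parse()
                    .map_err(|_| format!("invalid position: {}", field))?;
                self.line.push('\t');
                self.line.push_str(&position.to_string());
                self.state = if self.state == Expect::Start {
                    Expect::End
                } else {
                    Expect::Done
                };
            }
            Expect::Done => {
                return Err("BED3 record already has three fields".to_string());
            }
        }
        Ok(())
    }

    /// Returns the finished line only once all three fields were supplied.
    fn finish(self) -> Result<String, String> {
        if self.state == Expect::Done {
            Ok(self.line)
        } else {
            Err(format!("record incomplete, still expecting {:?}", self.state))
        }
    }
}
```

Pushing `chr1`, `100` and `200` and then calling `finish` produces a tab-separated BED3 line, while a fourth field or a non-numeric position is rejected without changing the state.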

Join us

Interested in having your say? Reach out to us at the BioSerDe or umccr-noodles repos for discussion, issues and contributions!
