Integrating Flink and Kafka in the standard genomics pipeline

High-throughput DNA sequencing is a key data acquisition technology that enables dozens of important applications, from oncology to personalized diagnostics. We extended work presented last year by porting additional portions of the standard genomics data processing pipeline to Flink. Our Flink-based processor consists of two distinct specialized modules (a reader and a writer) that are loosely coupled via Kafka streams, allowing for easy composability and integration into existing Hadoop workflows. To extend our work we had to manage the dynamic creation and detection of data streams: the set of output files is not known to the writer in advance, which discovers it at run time. Particular care was needed to handle the finite nature of the genomic streams: since we reuse existing Hadoop output formats, we had to correctly propagate end-of-stream markers through Flink and Kafka so that the final output files are properly finalized.
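The interplay of dynamic stream detection and end-of-stream handling can be illustrated with a minimal sketch. This is a hypothetical simplification, not the actual pipeline code: the writer here is a plain Python loop standing in for the Flink writer module, stream names and the `EOS` sentinel are invented for illustration, and real output-format finalization is reduced to moving records into a `finalized` map.

```python
# Hypothetical sketch of the writer-side logic (not the actual pipeline code):
# streams are discovered on first sight, and an end-of-stream marker
# triggers finalization of that stream's output.

EOS = object()  # sentinel standing in for the end-of-stream marker

def run_writer(messages):
    """Consume (stream_name, payload) pairs from a finite message sequence.

    A payload of EOS closes the named stream; any other payload is appended
    to that stream's output, creating the stream if it is new.
    Returns a dict mapping each finalized stream name to its records.
    """
    open_streams = {}
    finalized = {}
    for name, payload in messages:
        if payload is EOS:
            # Finalization step: in the real pipeline this would close
            # the corresponding Hadoop output format instance.
            finalized[name] = open_streams.pop(name, [])
        else:
            # Dynamic detection: the first record seen for a name
            # implicitly creates the output stream.
            open_streams.setdefault(name, []).append(payload)
    return finalized

# Example: two streams interleaved, each terminated by its own marker.
msgs = [("out1", "rec_a"), ("out2", "rec_b"),
        ("out1", "rec_c"), ("out1", EOS), ("out2", EOS)]
print(run_writer(msgs))  # {'out1': ['rec_a', 'rec_c'], 'out2': ['rec_b']}
```

The key design point this illustrates is that correctness depends on every dynamically created stream eventually receiving its own marker; a stream left in `open_streams` at the end would correspond to an output file that is never finalized.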