Moving Beyond Moving Bytes

Streaming applications almost always require a schema, because the most interesting operations that can be applied to a data stream -- projection, scaling, aggregation, filtering, joining, streaming SQL -- all require knowing something about the types and values of the fields in your data; otherwise you're just moving bytes and counting anonymous things. This talk is an introduction to and overview of shared schema registries [1,2], with a demonstration of how they can be integrated into Apache Flink pipelines to centralize schema management and enable schema reuse across data flow systems (e.g., from Apache Kafka or Apache NiFi to Flink and back again). We will begin with a discussion of the shortcomings of the common practice of embedding schemas and generated classes in code projects, followed by an illustration of essential registry features (e.g., centralization, versioning, transformation, and validation) as they appear in both Confluent's and Hortonworks's schema registries. We'll close with a detailed look at how these schema registries can be integrated into Flink serializers, sources, and sinks.
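For a flavor of what registry-aware serialization looks like on the wire, here is a minimal Python sketch of the framing used by Confluent-style serializers: a magic byte followed by a 4-byte, big-endian schema ID that a consumer resolves against the registry before decoding the payload. The schema ID and payload bytes below are made-up placeholders, not real registry data.

```python
import struct

# Confluent-style wire format: 1 magic byte (0x00), then a 4-byte
# big-endian schema ID assigned by the registry, then the serialized
# (e.g., Avro) record bytes.
MAGIC_BYTE = 0

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized record with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte; not a registry-framed message")
    return schema_id, message[5:]

# Hypothetical schema ID and payload, for illustration only.
framed = frame(42, b"\x06foo")
schema_id, payload = unframe(framed)
```

Because only the small schema ID travels with each message, producers and consumers stay in sync through the registry rather than by embedding full schemas in every record or every code project.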

Speakers

Joey Frazee

Product Solutions Architect

Suneel Marthi

Senior Principal Engineer, Office of Technology, Red Hat, Inc.