Building composable systems: Why, How and Standards
Building an analytical engine often involves rebuilding the same few key components. Many engines rebuild their entire stack, from SQL parsing to execution, leading to duplicated effort and maintenance overhead. But what if we could take a more modular approach?
The data ecosystem, especially analytical engines, has seen rapid growth in recent years. Several specialized engines (e.g. Flink, Trino, DataFusion) and supporting tools (e.g. dbt, Dagster) have emerged. Yet some of these systems rebuild the same core components from scratch.
This post explores two papers that highlight an alternative: building systems from reusable components with well-defined interfaces.
Layers in a Typical Analytical Engine
Analytical engines are split into several common layers, with each layer handling a specific role in query processing. The key layers typically include:
Language front end: Parses user queries, typically in SQL, and converts them into the engine’s internal intermediate representation.
Intermediate representation (IR): An internal data structure used to represent queries.
Query Optimizer: Transforms the IR into an optimized execution plan to improve query performance and reduce resource usage.
Execution engine: Executes the optimized query plan, processing data at the node level. Serialization components often fall under this category as well.
Connector libraries: Manage data exchange between the engine and external systems such as storage services and databases.
Execution runtime: Handles distributed execution, including task scheduling and fault recovery in a multi-node setup.
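For concreteness, here is a minimal Rust sketch of these layers as interfaces. The trait and type names are hypothetical and not taken from any particular engine; they only illustrate where the boundaries sit.

```rust
// Hypothetical interfaces for the layers above; the names are illustrative
// and not taken from any particular engine.

/// Intermediate representation produced by the language front end.
pub struct LogicalPlan { /* relational operators, expressions, ... */ }

/// Optimized physical plan produced by the query optimizer.
pub struct PhysicalPlan { /* chosen algorithms, partitioning, ... */ }

/// Language front end: SQL text -> IR.
pub trait Frontend {
    fn parse(&self, sql: &str) -> Result<LogicalPlan, String>;
}

/// Query optimizer: IR -> optimized physical plan.
pub trait Optimizer {
    fn optimize(&self, plan: LogicalPlan) -> Result<PhysicalPlan, String>;
}

/// Connector library: exchanges data with external storage systems.
pub trait Connector {
    fn scan(&self, table: &str) -> Result<Vec<Vec<String>>, String>;
}

/// Execution engine: runs a physical plan on a single node.
pub trait Executor {
    fn execute(&self, plan: &PhysicalPlan, source: &dyn Connector) -> Result<u64, String>;
}

/// A toy driver that composes the layers; a real engine adds an execution
/// runtime on top for distributed scheduling and fault recovery.
pub fn run_query(
    frontend: &dyn Frontend,
    optimizer: &dyn Optimizer,
    executor: &dyn Executor,
    connector: &dyn Connector,
    sql: &str,
) -> Result<u64, String> {
    let logical = frontend.parse(sql)?;
    let physical = optimizer.optimize(logical)?;
    executor.execute(&physical, connector)
}
```

When a boundary like the IR follows a shared standard such as Substrait, an entire layer can be swapped out without touching the others.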
Most engines build all of these layers themselves. For example, DuckDB is an embedded database that implements each of these layers. Tightly integrated, monolithic systems have their reasons, such as performance, faster evolution, and, in some cases, form factor. However, most systems would benefit from a modular approach: reusing well-defined components can significantly reduce development effort.
Advantages of Reusable Components
Building analytical engines with reusable components and leveraging existing ones has several benefits:
Engineering Efficiency: Avoids duplicate effort by reusing sub-components. This requires the engine’s interfaces to follow standards.
Example: Velox accepts the Substrait intermediate representation as input, enabling Apache Gluten to standardize on Substrait as its IR and integrate with the Velox execution engine.
Better user experience: Users benefit from consistent semantics.
Example 1: Multiple engines, such as RisingWave and Feldera, adopt PostgreSQL's dialect, reducing the learning curve.
Example 2: Azure Event Hubs and WarpStream support the Kafka protocol, ensuring compatibility with existing tools.
Faster innovation: A smaller, modular codebase is easier to maintain, freeing up resources for new development.
Broader Adoption: Independent, well-defined components can be used in different contexts. Even if an entire product isn’t widely adopted, its sub-components might still gain traction.
Starting from scratch and getting an MVP going might look faster at first. However, reusing a widely adopted library pays off in the long run by reducing maintenance overhead and improving interoperability.
Meta’s Lakehouse Stack Consolidation Example
Shared Foundations: Modernizing Meta’s Data Lakehouse
Over time, Meta’s data lakehouse stack evolved organically, resulting in multiple specialized systems that did not share common components. This led to increased operational overhead for each of these systems and a fragmented user experience. To address these challenges, Meta undertook a consolidation effort, emphasizing shared components across its data infrastructure.
Key changes in Meta’s stack:
Interactive queries: Presto, backed by an SSD cache, serves interactive queries. Existing dashboards were updated to generate Presto SQL, and the migration took approximately two years.
Batch processing: Due to organic growth, HiveQL was widely used. Meta adopted the Presto SQL dialect as the longer-term standard; Spark with Presto's dialect was used for batch workloads, and Meta also explored running Presto on Spark.
Compute standardization: Velox emerged as the unified execution library. Serialization libraries were also standardized.
By consolidating systems and adopting reusable components, Meta reduced maintenance overhead and improved consistency across its analytical stack.
Examples of Reusable Components in Practice
Feldera
Feldera is an incremental computation engine based on the DBSP paper. It consists of two major components:
DBSP Rust library - Maintains materialized views incrementally.
SQL to DBSP compiler - Built using Apache Calcite, this compiler reuses Calcite’s SQL parsing and optimization capabilities to generate DBSP Rust code for execution.
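To give a flavor of what "incremental" means here, the sketch below maintains a GROUP BY count from a stream of weighted changes instead of recomputing the view from scratch. It is a conceptual illustration only, not the DBSP crate's actual API.

```rust
use std::collections::HashMap;

/// Conceptual sketch of incremental view maintenance (not the DBSP API):
/// the view `SELECT key, COUNT(*) ... GROUP BY key` is kept up to date by
/// applying only the changed rows, never by recomputing from scratch.
struct CountView {
    counts: HashMap<String, i64>,
}

impl CountView {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    /// Apply a batch of weighted changes: +1 for an insert, -1 for a delete.
    fn apply_delta(&mut self, delta: &[(String, i64)]) {
        for (key, weight) in delta {
            let new_count = {
                let entry = self.counts.entry(key.clone()).or_insert(0);
                *entry += weight;
                *entry
            };
            // Drop keys whose count returns to zero so the view stays compact.
            if new_count == 0 {
                self.counts.remove(key);
            }
        }
    }
}

fn main() {
    let mut view = CountView::new();
    view.apply_delta(&[("clicks".into(), 1), ("clicks".into(), 1), ("views".into(), 1)]);
    view.apply_delta(&[("clicks".into(), -1)]); // a retraction
    println!("{:?}", view.counts); // {"clicks": 1, "views": 1}
}
```

DBSP generalizes this idea to arbitrary relational (and recursive) queries over streams of weighted changes.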
Apache Gluten
Apache Gluten allows Apache Spark users to replace Spark's execution engine with optimized alternatives like Velox, which has resulted in 2x speed improvements for some queries. Gluten uses Substrait as the intermediate representation, enabling interoperability with different execution engines. Velox's availability as a standalone, reusable execution library that accepts Substrait as input made this integration possible.
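The benefit of a standard IR is that one component can produce a plan as plain bytes and another can consume it. Below is a minimal sketch of that round trip, assuming the `substrait` Rust crate's protobuf-generated `proto::Plan` type and a matching `prost` version; in Gluten itself the producer is JVM code and the consumer is Velox's C++ library.

```rust
// Sketch: serialize a Substrait plan in one component and decode it in another.
// Assumes the `substrait` crate (protobuf-generated types) and a matching
// `prost` version; real plans are built by a planner, not by hand.
use prost::Message;
use substrait::proto::Plan;

fn main() -> Result<(), prost::DecodeError> {
    // In practice this plan would be produced by a query planner
    // (e.g. Gluten translating a Spark physical plan).
    let plan = Plan::default();

    // Producer side: encode the plan to bytes.
    let bytes = plan.encode_to_vec();

    // Consumer side (e.g. an execution engine): decode the same bytes.
    let decoded = Plan::decode(bytes.as_slice())?;
    println!("relations in plan: {}", decoded.relations.len());
    Ok(())
}
```

Because the hand-off is just serialized protobuf, any engine that speaks Substrait can sit on either side of it.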
Delta Kernel
Delta Lake initially required each engine to reimplement its protocol, slowing the adoption of new protocol features. To address this, the Delta community introduced Delta Kernel, a lightweight library that makes writing Delta connectors easier. This approach allowed DuckDB to integrate with Delta Lake by building its Delta Lake extension on the Rust Delta Kernel.
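The sketch below illustrates the division of labor a kernel-style library enables: the kernel owns the protocol logic (log replay, checkpoints, and so on), while the host engine only supplies I/O and scans the files it is handed. The names are hypothetical and do not reflect Delta Kernel's actual API.

```rust
// Hypothetical sketch of a kernel-style split; not Delta Kernel's actual API.

/// Supplied by the host engine (DuckDB, DataFusion, ...): raw byte access.
pub trait FileSystem {
    fn read(&self, path: &str) -> Result<Vec<u8>, String>;
}

/// Returned by the kernel library after replaying the table's log.
pub struct Snapshot {
    /// Data files that make up this table version; the engine scans these.
    pub data_files: Vec<String>,
}

/// The kernel owns the protocol: it reads commits and checkpoints via the
/// engine-provided file system and resolves them into a Snapshot.
/// The log-replay logic itself is elided in this sketch.
pub fn resolve_snapshot(_fs: &dyn FileSystem, _table_uri: &str) -> Result<Snapshot, String> {
    Ok(Snapshot { data_files: Vec::new() })
}
```

Because the protocol logic lives in one library, a new protocol feature lands once in the kernel rather than once per connector.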
Arroyo
Arroyo is a stream processing engine that leverages Apache DataFusion for SQL parsing and query optimization.
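As a taste of what reusing DataFusion as a front end looks like, here is a minimal sketch that hands a SQL string to DataFusion and gets back an optimized logical plan, assuming the `datafusion` and `tokio` crates as dependencies. Arroyo's actual integration is more involved, adapting the resulting plans for streaming execution.

```rust
// Minimal sketch: let DataFusion handle SQL parsing, planning, and
// optimization, then inspect (or translate) the resulting logical plan.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // DataFusion parses and plans the query for us.
    let df = ctx.sql("SELECT 1 + 1 AS two").await?;

    // A host engine could walk this plan and map it onto its own operators.
    println!("{}", df.logical_plan().display_indent());

    // Or simply execute it with DataFusion's own engine.
    df.show().await?;
    Ok(())
}
```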
A few other common standards
Here are a few other attempts at standards in the analytical ecosystem, with varying levels of adoption.
Schema registry: A service that maps a schema ID to its schema for formats like Avro and Protobuf, typically used to interpret messages in Kafka-like systems. Confluent Schema Registry is an example (see the sketch after this list).
CloudEvents: A specification for describing event data in a common way. It provides a convention for how the schema ID and other properties can be encoded in Kafka messages and other systems.
Table formats: Specifications for how analytical tables are stored on object stores, such as the Delta, Apache Iceberg, and Apache Hudi specs. SlateDB could evolve into a similar specification for key-value stores.
Recap: A library that converts between the schemas of different systems, such as serialization formats and databases.
Iceberg REST Catalog: An API specification for Iceberg catalogs.
OpenTelemetry: A collection of specifications for telemetry data.
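As an example of how a schema registry is used in practice, Confluent's wire format prefixes each message payload with a magic byte and a 4-byte, big-endian schema ID that consumers resolve against the registry. Here is a minimal Rust sketch of extracting that ID; the registry lookup itself is left as a comment to keep the example dependency-free.

```rust
// Sketch: extract the schema ID from a Kafka message that uses Confluent's
// wire format (1 magic byte + 4-byte big-endian schema ID + encoded payload).

fn schema_id(message: &[u8]) -> Option<u32> {
    // Magic byte 0 marks the Confluent wire format.
    if message.len() < 5 || message[0] != 0 {
        return None;
    }
    let id_bytes: [u8; 4] = message[1..5].try_into().ok()?;
    Some(u32::from_be_bytes(id_bytes))
}

fn main() {
    // A fabricated message: magic byte, schema ID 42, then the encoded payload.
    let message = [0u8, 0, 0, 0, 42, /* payload bytes... */ 0x02];
    let id = schema_id(&message).expect("not in Confluent wire format");
    // A consumer would now fetch the writer schema from the registry, e.g.
    //   GET http://<registry-host>/schemas/ids/42
    // and use it to decode the remaining bytes.
    println!("schema id: {}", id);
}
```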
Summary
The data ecosystem continues to evolve. By splitting systems into well-defined, reusable components and leveraging standard interfaces, engineers can reduce development and maintenance overhead while fostering broader adoption. Real-world examples from Feldera, Apache Gluten, and Meta demonstrate the practical benefits of this approach, highlighting how modular design enables interoperability.