Streaming Materialized Views with MongoDB Atlas

Databases with stream processors are great at streaming materialized views.

Share

The Demo

This repository contains a demo I wrote that shows how Atlas Stream Processing can maintain a continuously updating materialized view in MongoDB Atlas. This showcases how, when paired with a database, a stream processor can tackle one of the more challenging problems in data infrastructure.

MongoDB documentation shows how query developers can use aggregation pipelines to materialize a view as a collection on demand. It works fine (I implemented it here, you can run that too) if the materialization interval is lax: hours and days not minutes and seconds. My streaming materialized view repository shows a pattern of building a stream processor that captures change stream events and continuously updates a collection. This points at a combination that is surprisingly rare, even today: a tightly integrated system where OLTP data can be directly used in streaming pipelines and materialized back into operational or analytical collections.

Views don't rematerialize themselves

Materializing a view is one of the oldest problems in data infrastructure: you have the data, but the operation that performs an aggregate rollup takes too long. You sit and wait for a stale result. If only you had the ability to pre-compute and store a query result to speed up reads.

Simple problem, right? Straightforward solution, right? Interestingly, no. The world of materialized views is fraught with tradeoff: functional capability or operational complexity?

The tradeoff goes away: Stream Processing answers the view materialization problem

When database developers define a materialized view, they have to make a tradeoff between freshness and latency. When a view is materialized by a stream processor, sink = viewQuery(source) at all times. The tradeoff goes away.

Streaming materialized views are event-driven, not schedule-driven. Traditionally, view refreshes are scheduled incremental events or manually invoked jobs. The developer or administrator has to tune refresh interval against business requirements. With a stream processor, the tradeoff goes away.

It's 2026 and we still struggle to materialize a view

Everyone who takes a stab at view materialization requires the user to constrain the query structure or accept the significantly larger operational costs that come with managing multiple systems. Databases can't do it without stream processors. Stream processors can do it, but they're useless without databases.

There are lots of attempts. None of them offer the complete picture. All of them trade something away. Some examples:

Relational databases such as Postgres let you specify a query as a materialized view. You usually have to materialize it on demand using REFRESH. Other databases offer syntactic sugar around how the database refreshes the view automatically. And you can always use something like cron to refresh the view. But the view never refreshes itself.

Materialize is not trying to be a general-purpose OLTP database like PostgreSQL or a broad analytical warehouse like Snowflake. It's very good at incremental view maintenance. It's not very good at high-volume point writes, transactional app backends, row-level CRUD workloads or serving user applications directly.

Continuous aggregates in TimescaleDB come close too. A continuous aggregate is an interface to a hypertable which combines view materialization with smart on-demand incremental refreshes execution to always provide the freshest result. But this requires the data to be organized in a certain way: a time-series specific hypertable with query friendly bucketing.

With FLIP-435, Apache Flink supports table materialization. Continuous mode handles algebraic and non-algebraic operations. But you own the state backend, TTL tuning, RocksDB configuration, and so on. Of course, you can pay to have this fully managed, but the tradeoff is that you pay. And you get a table materialization only exists in the context of a FlinkSQL query. Despite what our friends at DBEngines say, Flink isn't a database. Your stream processing job might be fast, but proper table materialization outside the context of a database with a hefty management and operational overhead just doesn't count.

MongoDB is uniquely suited for materialized views

MongoDB Atlas Stream Processing sits adjacent to MongoDB Atlas. The stream processor is already there, trivially connected to your data. Change streams are the event source: inserts, updates, deletes, pre- and post- images are already supported by the pattern. You don't need a Kafka cluster, another stream processor deployment, or a change data capture connector. The transformation language is a variant of the MongoDB aggregation pipeline, which MongoDB developers have known for years.

My repository includes documentation on the delta logic pattern necessary to build streaming materialized views. Take a look!

Oh and you can put it anywhere

A good chunk of streaming materialized view capabilities can be built today with Atlas Stream Processing. The view can be materialized as another MongoDB collection, on a Kafka topic, or (soon) as an Iceberg table, accessible to another analytic engine. I think you'll find the economics behind this approach are compelling.

The Gap Between a Pattern and a Primitive

Right now this is something you build, not something you declare. Right now the target collection is decoupled from the stream processor. Other things can write to it, and if you drop the collection the stream processor will keep running, recreating the collection with an incomplete aggregation. You can imagine a real streaming materialized view primitive, managing the source, the sink, and the processor together as a cohesive unit.