
Stateful Stream Processing with Digital Twins

What’s Wrong with Today’s Stateful Stream Processing?

The vast majority of stream-processing and complex event processing systems, such as Apache Storm and Software AG’s Apama, focus on extracting patterns contained within streaming data. They do not determine what the streaming data say about the devices or other data sources that generate this telemetry; that is, they don’t track the state of the data sources themselves.

For example, consider an IoT stream-processing application that monitors data from a temperature sensor to prevent the failure of a critical medical freezer. The application looks for unusual patterns in temperature changes, such as sudden spikes or a continuously upward trend. However, it does not take into account other information about the freezer, such as usage patterns and service history. Making use of this information would help the stream-processing application do a better job of predicting an impending failure.
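
To see the gap concretely, consider a minimal sketch (hypothetical code, not any particular product’s API) of the kind of stateless check such a pipeline applies. It sees only the readings themselves:

```java
// Hypothetical sketch: a stateless spike check sees only the readings.
public final class SpikeCheck {
    private static final double SPIKE_THRESHOLD = 5.0; // degrees

    // Compares two consecutive readings in isolation. It has no access
    // to the freezer's usage patterns or service history, so it cannot
    // judge whether this jump matters for this particular unit.
    public static boolean isSpike(double previousTemp, double currentTemp) {
        return currentTemp - previousTemp > SPIKE_THRESHOLD;
    }
}
```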

The following diagram depicts a typical stream-processing pipeline that analyzes event messages from a fleet of vehicles:

A typical stream-processing pipeline

The stream-processing steps (shown in the circles) must sort out messages from the combined data streams and extract useful patterns without the benefit of contextual information about each data source.

More recent stream-processing platforms, such as Apache Flink, have incorporated stateful stream processing into their architectures in the form of key-value stores or databases that applications can use to enhance their analyses.

A stateful stream-processing pipeline with attached storage
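
For example, a Flink application can attach keyed state to a stream so that each sensor’s previous reading is available when its next message arrives. The sketch below is simplified (SensorReading is an assumed message type, and the 5-degree threshold is illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Assumed message type for this sketch.
class SensorReading {
    public String sensorId;
    public double temperature;
}

// Keyed state in Flink: run this on a stream keyed by sensor ID, e.g.,
// readings.keyBy(r -> r.sensorId).flatMap(new SpikeAlert());
// Each key then gets its own ValueState holding the previous reading.
public class SpikeAlert extends RichFlatMapFunction<SensorReading, String> {
    private transient ValueState<Double> lastTemp;

    @Override
    public void open(Configuration parameters) {
        lastTemp = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTemp", Double.class));
    }

    @Override
    public void flatMap(SensorReading reading, Collector<String> out) throws Exception {
        Double previous = lastTemp.value(); // null for a key's first message
        if (previous != null && reading.temperature - previous > 5.0) {
            out.collect("Temperature spike on sensor " + reading.sensorId);
        }
        lastTemp.update(reading.temperature);
    }
}
```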

However, just adding an external data store doesn’t provide guidance about how to structure applications so that they do a better job of analyzing the data. Also, repeatedly reading and updating an external data store can significantly impact the performance of a stream-processing pipeline.

A Better Approach: Real-Time Digital Twins

The problem with stateful stream processing is the pipeline itself, which makes it difficult to keep track of the evolving state of data sources. The popular digital twin model provides a compelling answer to this challenge. Digital twins were conceived by Dr. Michael Grieves (U. Michigan) in 2002 to aid in the design and life-cycle management of new products by modeling their behavior in software. When used in stream processing, real-time digital twins can also track the state of individual data sources and efficiently process their messages. A collection of real-time digital twins can easily replace a stream-processing pipeline. For example, digital twins can process messages from vehicles in a fleet while maintaining important information about each vehicle:

Stream processing implemented with real-time digital twins

Real-time digital twins offer several advantages over a stateful stream-processing pipeline. They provide a convenient way to store state information about each data source, letting an application track all relevant information about each source’s evolving state so that it can analyze incoming events in a rich context and deliver high-quality insights, alerting, and feedback. For example, a digital twin can maintain an electric vehicle’s charging history to aid in evaluating battery telemetry. Digital twins also simplify application design by automatically separating messages from different data sources so that applications don’t have to; the hosting platform performs all message orchestration and delivery to digital twins.

Real-time digital twins boost performance in two ways. First, instead of moving stored data in and out of a stream-processing pipeline, the platform delivers each message to the digital twin that already holds the contextual data needed for analytics. Because the data doesn’t move, the application responds faster and incurs less networking overhead. Second, unlike a pipeline, digital twins can be replicated to handle many data sources without creating bottlenecks. When hosted in memory using an in-memory computing platform like ScaleOut Digital Twins™, digital twins can access stored data and process messages in a few milliseconds, and such platforms can host millions of digital twins on a scalable cluster of servers.

Leveraging Object-Oriented Programming

Real-time digital twins make it easy to develop applications for stateful stream processing by taking advantage of well-understood object-oriented programming techniques. A real-time digital twin is just a software object containing state data about its data source and a method for processing incoming messages. Analytics code can optionally incorporate machine-learning algorithms, and it can also reach out to databases to access and update historical data sets.

Depiction of a real-time digital twin software object
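
A brief sketch may help make this concrete. The class below is hypothetical (the FreezerTwin name, its fields, and its thresholds are illustrative, not ScaleOut’s actual API), but it shows the two ingredients: state data about one data source and a method that processes that source’s messages in context:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a real-time digital twin object: state about
// one data source plus a method for processing its incoming messages.
public class FreezerTwin {
    // State properties: the evolving context for this one freezer.
    private final Deque<Double> recentTemps = new ArrayDeque<>();
    private int daysSinceLastService = 0;
    private boolean alertRaised = false;

    // Message-processing method: the platform invokes this for every
    // message that arrives from this twin's data source.
    public void processMessage(double temperature) {
        recentTemps.addLast(temperature);
        if (recentTemps.size() > 100) {
            recentTemps.removeFirst();
        }

        // Analytics can weigh the new reading against stored context,
        // e.g., alert sooner when the unit is overdue for service.
        double threshold = (daysSinceLastService > 180) ? 2.0 : 5.0;
        double average = recentTemps.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(temperature);
        if (!alertRaised && temperature - average > threshold) {
            alertRaised = true;
            // raiseAlert(...) would notify operations staff (omitted here).
        }
    }
}
```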

The hosting platform creates an instance of a digital twin for each unique data source and delivers all messages from the data source to the digital twin for processing. Target applications, including telematics, security, logistics, IoT, and smart cities, typically use thousands of digital twins to handle the workload from many data sources.
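
The platform’s orchestration role is conceptually simple, as the following hypothetical sketch illustrates (it reuses the FreezerTwin class from the earlier example; a production platform would also distribute twins across a cluster and persist their state):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the hosting platform's orchestration role:
// one twin instance per unique data source, created on first contact.
public class TwinDispatcher {
    private final Map<String, FreezerTwin> twins = new ConcurrentHashMap<>();

    // Routes each message to its source's twin, creating the twin the
    // first time that data source is seen.
    public void deliver(String dataSourceId, double temperature) {
        FreezerTwin twin = twins.computeIfAbsent(dataSourceId, id -> new FreezerTwin());
        twin.processMessage(temperature);
    }
}
```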

Summing Up

When analyzing streaming data from many data sources, stateful stream-processing pipelines can create challenges for application developers and limit their ability to introspect on the dynamic state of data sources. Real-time digital twins reorganize both stream processing and state management in a manner that makes it easy to track each source’s evolving state. They simplify application design using object-oriented programming techniques, offload message orchestration, and boost both performance and scalability. Real-time digital twins provide a powerful new way to implement stateful stream processing for large, complex systems.

Want to see digital twins in action?

Schedule a customized demo here.

About The Author

William Bain, CEO at ScaleOut Software

Dr. William L. Bain is the founder and CEO of ScaleOut Software, which since 2003 has been developing software products designed to enhance operational intelligence within live systems using scalable, in-memory computing technology. Bill earned a Ph.D. in electrical engineering from Rice University. Over a 40-year career focused on parallel computing, he has contributed to advancements at Bell Labs Research, Intel, and Microsoft, and he holds several patents in computer architecture and distributed computing.