By clicking “Accept All Cookies”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.
Media

Integrating Data-Parallel Analytics into Stream-Processing Using an In-Memory Data Grid

William Bain, CEO at ScaleOut Software, Inc., spoke at the In-Memory Computing Summit Europe 2018 in London on June 25. His talk was titled, "Integrating Data-Parallel Analytics into Stream-Processing Using an In-Memory Data Grid." In use cases ranging from IoT to ecommerce, an ongoing challenge for stream-processing applications is to extract important insights from real-time systems as fast as possible and then generate effective feedback that optimizes operations or avoids costly failures.

Unlike popular software platforms for streaming analytics (e.g., Apache Storm, Flink, Spark Streaming, and legacy CEP), which focus on extracting value from unfiltered data streams, in-memory data grids (IMDGs) have opened the door to stateful stream processing that correlates event streams by data sources using a “digital twin” model and enables much deeper introspection on these data sources. This talk describes the next step in the evolution of the digital twin model made possible by IMDGs: the incorporation of real-time, data-parallel analytics that further enhances introspection by providing immediate feedback on aggregate behaviors.

As illustrated by several real-world applications, this new capability leverages the power of IMDGs to significantly increase the effectiveness of stream-processing applications. Stream-processing and data-parallel analytics have traditionally led parallel lives. As described by the Lambda architecture, streaming analytics are usually hosted in the “speed layer,” while data-parallel analytics are hosted in the “batch layer” with results appearing later in queries that merge these two views. Data-parallel analytics captures vital aggregate trends that can enhance introspection. Built by the necessity to host processing layers on different systems, the Lambda architecture fails to deliver real-time feedback to stream-processing applications from data-parallel analytics.

Consider a medical monitoring application that captures and analyzes telemetry from hundreds of thousands of monitoring devices. Using an IMDG to host a digital twin model of the patients enables the tracking of real-time state for each patient, and it automatically correlates incoming telemetry from the data sources to the respective digital twin models. This allows the application to analyze telemetry in real time with rich context that includes the patient’s medical history and recent events, allowing much deeper introspection than available in stateless stream-processing systems.

The next step is to extract and analyze salient state information from the real-time state of all digital twin models and feed the results back to these models for incorporation into the analysis algorithm. This offers the next level of introspection that considers dynamic, aggregate trends. For example, the medical application can average key parameters, such as heart rate, across all patients and pivot this data by region, age group, gender, etc. These results can be reported back to the digital twin models in real time to enrich the analysis of incoming telemetry.