Benchmark UC2: Downsampling
Another common use case for stream processing architectures is reducing the amount of events, messages, or measurements by aggregating multiple records within consecutive, non-overlapping time windows. Typical aggregations compute the average, minimum, or maximum of measurements within a time window or count the occurrence of same events. Such reduced amounts of data are required, for example, to save computing resources or to provide a better user experience (e.g., for data visualizations). When using aggregation windows of fixed size that succeed each other without gaps (called tumbling windows in many stream processing engines), the (potentially varying) message frequency is reduced to a constant value. This is also referred to as downsampling. Downsampling allows for applying many machine learning methods that require data of a fixed frequency.
Dataflow Architecture
The dataflow architecture first reads measurement data from an input stream and then assigns each measurement to a time window of fixed, but statically configurable size. Afterwards, an aggregation operator computes the summary statistics sum, count, minimum, maximum, average and population variance for a time window. Finally, the aggregation result containing all summary statistics is written to an output stream.
Further Reading
S. Henning and W. Hasselbring. “Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures”. In: Big Data Research 25. 2021. DOI: 10.1016/j.bdr.2021.100209.