Latency
Introduction
Many Flink applications have a requirement for low latency: for example, applications doing fraud or anomaly detection, or monitoring and alerting. In these situations, it's critical to have a way to measure and monitor the latency.
Latency is Measured with Histograms
A single latency measurement is a time interval, such as 50 milliseconds or 2 minutes, measuring the delay in processing a single event. But what you really want to know is the typical latency, combined with an understanding of how bad it sometimes gets.
Business requirements for latency are often expressed in terms of percentiles: for example, you might have a requirement that 99% of all events be processed within 1 second of their creation, i.e., a 99th percentile latency of at most 1 second.
Often it is helpful to have a more complete understanding of the frequency distribution of your job's latencies. This is why Flink uses histogram metrics to measure and report latencies. These histogram metrics report a collection of statistics that include:
- min
- max
- mean
- standard deviation
- median
- 75th percentile
- 90th percentile
- 95th percentile
- 98th percentile
- 99th percentile
- 99.9th percentile
By monitoring these statistics, you can gain a lot of insight into the processing delays in your pipeline(s).
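To make these statistics concrete, here is a minimal sketch (not taken from Flink's documentation) of querying them from a Histogram metric. In practice, metric reporters surface these values automatically; the summarize helper and its formatting are hypothetical, but getStatistics, getQuantile, and getMax are part of Flink's metrics API:

```java
import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.HistogramStatistics;

public class LatencySummary {
    // Summarize a latency histogram, e.g., one registered on an operator's
    // metric group (the sections below show how such a histogram is created).
    public static String summarize(Histogram latencyHistogram) {
        HistogramStatistics stats = latencyHistogram.getStatistics();
        return String.format(
                "median=%.0f ms, p99=%.0f ms, max=%d ms",
                stats.getQuantile(0.5),
                stats.getQuantile(0.99),
                stats.getMax());
    }
}
```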
Flink's Built-in Latency Metrics
Flink has a built-in system for gathering and reporting metrics, and Flink provides many interesting and useful metrics out of the box. These built-in metrics include some latency measurements, which you may or may not want to rely on.
However, Flink's built-in latency metrics are of limited utility because they don't measure the total latency. What they measure instead is the time it takes for special StreamElement objects called latency markers to traverse the job graph from the sources to each operator. These latency markers make that journey more quickly than application-level event objects, because the markers take a shortcut around each operator. An application's event objects are actively processed by the functions in the job (e.g., the process functions, windows, and joins), but these functions wouldn't know what to do with a latency marker, so once a marker reaches the head of an operator's input queue, it is immediately forwarded to the output queue.
In short, Flink's built-in latency metrics provide a lower bound on the true latency, but aren't complete, accurate measurements. Moreover, enabling these metrics can significantly degrade performance, making them even harder to interpret, and unsuitable for regular use in production environments.
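If you nevertheless want to experiment with the built-in latency metrics, they are disabled by default and are enabled by setting a latency marker interval. A minimal sketch, using an arbitrary interval of 1000 ms:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LatencyTrackingExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Emit a latency marker from each source every 1000 ms;
        // the default interval of 0 leaves latency tracking disabled.
        env.getConfig().setLatencyTrackingInterval(1000);
    }
}
```

The same setting is available as metrics.latency.interval in the Flink configuration, and metrics.latency.granularity controls how finely the resulting histograms are broken down.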
Using Custom Metrics for Latency Monitoring
You can obtain more meaningful latency measurements by instrumenting your jobs with custom latency metrics. See the Measuring Latency Recipe for an example.
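As a rough illustration of the idea (this is a sketch, not the recipe itself), a user function can compare each event's own timestamp with the current processing time and record the difference in a histogram. The Event type, its eventTimestamp field, the metric name, and the window size of 500 are all assumptions made for this example:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class LatencyMeasuringFunction extends ProcessFunction<Event, Event> {

    private transient Histogram latencyHistogram;

    @Override
    public void open(Configuration parameters) {
        // Keep a sliding window of the 500 most recent measurements (arbitrary size).
        latencyHistogram = getRuntimeContext()
                .getMetricGroup()
                .histogram("eventTimeLag", new DescriptiveStatisticsHistogram(500));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) {
        // How long ago (in ms) was this event created, according to its own timestamp?
        long lagMs = ctx.timerService().currentProcessingTime() - event.eventTimestamp;
        latencyHistogram.update(lagMs);
        out.collect(event);
    }
}
```

Placing such a function just in front of the sink gives a reasonable approximation of the end-to-end latency observed at that point in the job.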
Reducing Latency
There are many things you can do to reduce latency in your Flink jobs. Here are some of the most useful things to consider (a configuration sketch illustrating several of them follows the list):
- Optimize serialization.
- Use pre-aggregation (e.g., with windows).
- Organize your job topology to have as few keyBy and rebalance operations between the sources and sinks as possible. reinterpretAsKeyedStream may help with this.
- Reduce the network buffer timeout.
- Reduce the auto-watermarking interval.
- Reduce the checkpointing interval (for transactional sinks), perhaps in combination with using log-based checkpointing as offered by the changelog state backend. See Improving speed and stability of checkpointing with generic log-based incremental checkpoints for more details.
- Optimize garbage collection.
- Enable object reuse.
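Here is a minimal sketch showing how several of the knobs above can be set on the execution environment; the specific values are illustrative, not recommendations:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LowLatencyConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush network buffers after at most 5 ms instead of the default 100 ms.
        env.setBufferTimeout(5);

        // Generate watermarks more frequently than the default 200 ms.
        env.getConfig().setAutoWatermarkInterval(50);

        // Checkpoint more often, so transactional sinks commit results sooner.
        env.enableCheckpointing(10_000);

        // Avoid defensive copies between chained operators (safe only if your
        // functions don't cache or mutate input records).
        env.getConfig().enableObjectReuse();
    }
}
```

Most of these settings trade throughput or resource usage for latency, so they are worth benchmarking against your own workload.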
For an in-depth treatment of this topic, see Getting into Low-Latency Gears with Apache Flink – Part 1 and Part 2.