Skip to main content

Async I/O (DataStream API)

Motivation

synchronous I/O

The purpose of the Flink's Async I/O feature in the DataStream API is to help users avoid the problems that occur when a user function (e.g., a FlatMapFunction or ProcessFunction) does synchronous (blocking) I/O to fetch information from outside of Flink. This is problematic because:

  • The response time of the external system is likely to be high enough that any synchronous I/O operation will become a bottleneck.

  • While blocked waiting for the response, that task's resources (CPU and memory) are sitting idle.

  • In addition to causing poor resource utilization, this blocking behavior can prevent checkpointing from occuring in a timely fashion, leading to checkpoint timeouts.

asynchronous I/O

Ordered and Unordered Modes

Async I/O has two modes: ordered and unordered.

In ordered mode the async I/O operator respects the order in which events arrive, and emits results in the same order.

In unordered mode, results are emitted as soon as the results become available — except that for jobs using event time, the watermarks' semantics are preserved. The watermarks effectively act as barriers that limit the extent of the re-ordering.

The implementation of unordered mode groups stream records into segments separated by watermarks. Each segment is completely emitted before any entries from following segments are emitted. Thus, no stream record can be overtaken by a watermark, and no watermark can overtake a stream record. But reordering within a segment is possible (and likely).

Exactly-once Guarantees

The async I/O operator can guarantee exactly-once results. During checkpointing it snapshots enough of its state so that in the case of a restart, all requests whose results had not yet been emitted at the time of the checkpoint are re-issued.

Capacity, Timeout Handling, and Retries

These parameters control the behavior of the async I/O operator:

  • Capacity: The capacity determines the size of the fixed-length queue that the implementation of async I/O uses to do its bookkeeping about pending requests. This capacity is specified on a per-instance basis, so if you are concerned about the total load being placed on the external service you need to think about parallelism * capacity.

    Having an upper-limit on the number of in-flight requests ensures that the operator will at some point trigger backpressure, rather than accumulating an ever-growing backlog.

  • Timeout: This is the total time the operator will wait for a response before failing. This includes retries if an AsyncRetryStrategy has been provided.

  • AsyncRetryStrategy: You can specify under what conditions you want to retry (based on the result returned, or an exception), and how long to delay before trying again. There are built-in fixed-delay and exponential-backoff strategies.

    info

    Built-in support for retry handling was added in Flink 1.16. With earlier versions of Flink you have to handle this on your own.

References