QUIC FUTURE: Add concurrency architecture design document

Reviewed-by: Neil Horman <nhorman@openssl.org>
Reviewed-by: Saša Nedvědický <sashan@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/26025)
Hugo Landau 2024-04-24 13:38:27 +01:00 committed by Neil Horman
parent 15f859403e
commit 3686d215fe
2 changed files with 413 additions and 0 deletions

QUIC Concurrency Architecture
=============================
Introduction
------------
Most QUIC implementations in C are offered as a simple state machine without any
included I/O solution. Applications must do significant integration work to
provide the necessary infrastructure for a QUIC implementation to integrate
with. Moreover, blocking I/O at an application level may not be supported.
OpenSSL QUIC seeks to offer a QUIC solution which can serve multiple use cases:
- Firstly, it seeks to offer the simple state machine model and a fully
customisable network path (via a BIO) for those who want it;
- Secondly, it seeks to offer a turnkey solution with an in-the-box I/O
and polling solution which can support blocking API calls in a Berkeley
sockets-like way.
These usage modes are somewhat diametrically opposed. One involves libssl
consuming no resources but those it is given, with an application responsible
for synchronisation and a potentially custom network I/O path. This usage model
is not “smart”. Network traffic is connected to the state machine and state is
input and output from the state machine as needed by an application on a purely
non-blocking basis. Determining *when* to do anything is largely the
application's responsibility.
The other diametrically opposed usage mode involves libssl managing more things
internally to provide an easier-to-use solution. For example, it may involve
spinning up background threads to ensure connections are serviced regularly (as
in our existing client-side thread assisted mode).
In order to provide for these different use cases, the concept of concurrency
models is introduced. A concurrency model defines how “cleverly” the QUIC engine
will operate and how many background resources (e.g. threads, other OS
resources) will be established to support operation.
Concurrency Models
------------------
- **Unsynchronised Concurrency Model (UCM):** In the Unsynchronised Concurrency
Model, calls to SSL objects are not synchronised. There is no locking on any
APL call (the omission of which is purely an optimisation). The application is
either single-threaded or is otherwise responsible for doing synchronisation
itself.
Blocking API calls are not supported under this model. This model is intended
primarily for single-threaded use as a simple state machine by advanced
applications, and many such applications are likely to disable autoticking.
- **Contentive Concurrency Model (CCM):** In the
Contentive Concurrency Model, calls to SSL objects are wrapped in locks and
multi-threaded usage of a QUIC connection (for example, parallel writes to
different QUIC stream SSL objects belonging to the same QUIC connection) is
synchronised by a mutex.
This is contentive in the sense that if a large number of threads are trying
to write to different streams on the same connection, a large amount of lock
contention will occur. As such, this concurrency model will not scale or
provide good performance, at least within the context of concurrent use of a
single connection.
Under this model, APL calls by the application result in lock-wrapped
mutations of QUIC core objects (`QUIC_CHANNEL`, `QUIC_STREAM`, etc.) on the
same thread.
This model may be used either in a variant which does not support blocking
(NB-CCM) or which does support blocking (B-CCM). The blocking variant must
spin up additional OS resources to correctly support blocking semantics.
- **Thread Assisted Contentive Concurrency Model (TA-CCM):** This is currently
implemented by our thread assisted mode for client-side QUIC usage. It does
not realise the full state separation or performance of the Worker Concurrency
Model (WCM) below. Instead, it simply spawns a background thread which ensures
QUIC timer events are handled as needed. It makes use of the Contentive
Concurrency Model for performing that handling, in that it obtains a lock when
ticking a QUIC connection just as any call by an application would.
This mode is likely to be deprecated in favour of the full Worker Concurrency
Model (WCM), by which it will naturally be subsumed.
- **Worker Concurrency Model (WCM):** In the Worker Concurrency Model,
a background worker thread is spawned to manage connection processing. All
interaction with an SSL object goes through this thread in some way.
Interactions with SSL objects are essentially translated into commands and
handled by the worker thread. To optimise performance and minimise lock
contention, there is an emphasis on message passing over locking.
Internal dataflow for application data can be managed in a zero-copy way to
minimise the costs of this message passing.
Under this model, QUIC core objects (`QUIC_CHANNEL`, `QUIC_STREAM`, etc.) will
live solely on the worker thread and access to these objects by an application
thread will be entirely forbidden.
Blocking API calls are supported under this model.
These concurrency models are summarised as follows:
| Model | Sophistication | Concurrency | Blocking Supported | OS Resources | Timer Events | RX Steering | Core State Affinity |
|--------|----------------|-----------------------|--------------------|---------------------------|-----------------|-------------|----------------------|
| UCM | Lowest | ST only | No | None | App Responsible | None | App Thread |
| CCM | | MT (Contentive) | Optional | Mutex, (Notifier) | App Responsible | TBD | App Threads |
| TA-CCM† | | MT (Contentive) | Optional | Mutex, Thread, (Notifier) | Managed | TBD | App & Assist Threads |
| WCM | Highest | MT (High Performance) | Yes | Mutex, Thread, Notifier | Managed | Futureproof | Worker Thread |
† To eventually be deprecated in favour of WCM.
Legend:
- **Blocking Supported:** Whether blocking calls to e.g. `SSL_read` can be
supported. If this is listed as “optional”, extra resources are required to
support this under the listed model and these resources could be omitted if an
application indicates it does not need this functionality at initialisation
time.
- **OS Resources:** “Mutex” refers to mutex and condition variable resources.
“Notifier” refers to a kind of OS resource needed to allow one thread to wake
another thread which is currently blocking in an OS socket polling call such
as poll(2) (e.g. an eventfd or socketpair). Resources listed in parentheses in
the table above are required only if blocking support is desired.
- **Timer Events:** Is an application responsible for ensuring QUIC timeout
events are handled in a timely manner?
- **RX Steering:** The matter of RX steering will be discussed in detail in a
future document. Broadly speaking, RX steering concerns whether incoming
traffic for multiple different QUIC connections on the same local port (e.g.
for a server) can be vectored *by the OS* to different threads or whether the
demuxing of incoming traffic for different connections has to be done manually
on an in-process basis.
The WCM model most readily supports RX steering and is futureproof in this
regard. The feasibility of having the UCM and CCM models support RX steering
is left for future analysis.
- **Core State Affinity:** Which threads are allowed to touch the QUIC core
objects (`QUIC_CHANNEL`, `QUIC_STREAM`, etc.).
Architecture
------------
To recap, the API Personality Layer (APL) refers to the code in `quic_impl.c`
which implements the libssl API personality (`SSL_write`, etc.). The APL is
cleanly separated from the QUIC core implementation (`QUIC_CHANNEL`, etc.).
Since UCM is basically a slight optimisation of CCM in which unnecessary locking
is elided, discussion from here on will focus on CCM and WCM except where
there are specific differences between CCM and UCM.
Supporting both CCM and WCM creates significant architectural challenges. Under
CCM, QUIC core objects have their state mutated under lock by arbitrary
application threads and these mutations happen during APL calls. By contrast, a
performant WCM architecture requires that APL calls be recorded and serviced in
an asynchronous fashion involving message passing to a worker thread. This
threatens to require highly divergent dispatch architectures for the two
concurrency models.
As such, the concept of a **Concurrency Management Layer (CML)** is introduced.
The CML lives between the APL and the QUIC core code. It is responsible for
dispatching in-thread mutations of QUIC core objects when operating under CCM,
and for dispatching messages to a worker thread under WCM.
![Concurrency Models Diagram](images/quic-concurrency-models.svg)
There are two different CMLs:
- **Direct CML (DCML)**, in which core objects are worked on in the same thread
which made an APL call, under lock;
- **Worker CML (WCML)**, in which core objects are managed by a worker thread
with communication via message passing. This CML is split into a front end
(WCML-FE) and back end (WCML-BE).
The legacy thread assisted mode uses a bespoke method which is similar to the
approach used by the DCML.
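As a purely illustrative sketch, the following shows how an APL initialisation
path might select and start a CML, using the `ossl_cml_new_direct`,
`ossl_cml_new_worker`, `ossl_cml_start` and `ossl_cml_free` functions from the
API sketch later in this document. The `create_cml` helper, the
`concurrency_model` values and the choice of parameters are hypothetical.

```c
/*
 * Illustrative only: choose a CML implementation based on a (hypothetical)
 * concurrency model selection, using the ossl_cml_* API sketched later in
 * this document.
 */
enum { MODEL_UCM, MODEL_CCM, MODEL_WCM };

static QUIC_CML *create_cml(int concurrency_model, size_t num_workers)
{
    QUIC_CML *cml;

    switch (concurrency_model) {
    case MODEL_UCM:
        /* Unsynchronised: caller guarantees synchronisation, so elide locking. */
        cml = ossl_cml_new_direct(/*need_locking=*/0);
        break;
    case MODEL_CCM:
        /* Contentive: core objects mutated under lock on application threads. */
        cml = ossl_cml_new_direct(/*need_locking=*/1);
        break;
    case MODEL_WCM:
        /* Worker: core objects live on worker thread(s); message passing. */
        cml = ossl_cml_new_worker(num_workers);
        break;
    default:
        return NULL;
    }

    if (cml != NULL && !ossl_cml_start(cml)) {
        ossl_cml_free(cml);
        return NULL;
    }

    return cml;
}
```

Note how UCM and CCM differ only in whether the DCML performs locking, whereas
WCM substitutes the WCML entirely.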
CML Design
----------
The CML is designed to have as small an API surface area as possible to enable
unified handling of as many kinds of (APL) API operations as possible. The idea
is that complex APL calls are translated into simple operations on the CML.
At its core, the CML exposes some number of *pipes*. The number of pipes which
can be accessed via the CML varies as connections and streams are created and
destroyed. A pipe is a *unidirectional* transport for byte streams. Zero-copy
optimisations are expected to be implemented in future but are deferred.
The CML (`QUIC_CML`) allows the caller to refer to a pipe by providing an opaque
pipe handle (`QUIC_CML_PIPE`). If the pipe is a sending pipe, the caller can use
`ossl_cml_write` to try to add bytes to it. Conversely, if it is a receiving
pipe, the caller can use `ossl_cml_read` to try to read bytes from it.

The method `ossl_cml_block_until` allows the caller to block until at least one
of the provided pipe handles is ready. “Ready” means that at least one byte can
be written (for a sending pipe) or read (for a receiving pipe).
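For example (a minimal sketch, assuming the internal `OSSL_TIME` helpers
`ossl_time_now()`, `ossl_time_add()` and `ossl_ms2time()`, pipe handles obtained
as described below, and a hypothetical 100 ms timeout), a caller might wait for
either of two pipes to become ready as follows:

```c
/* Illustrative only: wait up to 100 ms for either pipe to become ready. */
QUIC_CML_PIPE pipes[2] = { notification_pipe, app_recv_pipe };
OSSL_TIME deadline = ossl_time_add(ossl_time_now(), ossl_ms2time(100));

if (!ossl_cml_block_until(cml, pipes, 2, deadline)) {
    /* Assumed failure return: deadline reached before any pipe became ready. */
}
```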
Note that there is only expected to be one `QUIC_CML` instance per QUIC event
processing domain (i.e., per `QUIC_DOMAIN` / `QUIC_ENGINE` instance). The CML
fully abstracts the QUIC core objects such as `QUIC_ENGINE` or `QUIC_CHANNEL` so
that the APL never sees them.
The caller retrieves a pipe handle using `ossl_cml_get_pipe`. This function
retrieves a pipe based on two values:
- a CML pipe class;
- a CML *selector*.
The CML selector is a tagged union structure which specifies what pipe is to be
retrieved. Abstractly, examples of selectors include:
```text
Domain ()
Listener (listener_id: uint)
Conn (conn_id: uint)
Stream (conn_id: uint, stream_id: u64)
```
In other words, the CML selector selects the “object” to retrieve a pipe from.
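The concrete C layout of `QUIC_CML_SELECTOR` is not pinned down here; one
possible encoding of the abstract selectors above (a sketch only, with
illustrative field and constant names) is a tagged union along these lines:

```c
/* Hypothetical encoding of the abstract selectors above; illustrative only. */
typedef struct quic_cml_selector_st {
    enum {
        QUIC_CML_SELECTOR_DOMAIN,
        QUIC_CML_SELECTOR_LISTENER,
        QUIC_CML_SELECTOR_CONN,
        QUIC_CML_SELECTOR_STREAM
    } type;
    union {
        struct { unsigned int listener_id; } listener;
        struct { unsigned int conn_id; } conn;
        struct { unsigned int conn_id; uint64_t stream_id; } stream;
    } u;
} QUIC_CML_SELECTOR;
```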
The CML pipe class is one of the following values:
- Request
- Notification
- App Send
- App Recv
The pipe classes available for a given selector vary. For example, the “App
Send” and “App Recv” pipes only exist on a stream, so it is invalid to request
such a pipe in conjunction with a different type of selector.
The “Request” and “App Send” classes expose send-only pipes, and the
“Notification” and “App Recv” classes expose receive-only pipes.
For any given CML selector, the Request pipe is used to send serialised commands
for asynchronous processing in relation to the entity selected by that selector.
Conversely, the Notification pipe returns asynchronous notifications. These
could be in relation to a previous command (e.g. indicating whether the command
succeeded), or unprompted notifications about other events.
The underlying pattern here is that there is a bidirectional channel for control
messages and a bidirectional channel for application data, each composed in turn
of two unidirectional pipes.
Pipe handles are stable for as long as the pipe they reference exists, so an APL
object can cache a pipe handle if desired.
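For instance (an illustrative sketch; the `APL_STREAM` structure and
`get_stream_pipes` helper are hypothetical), an APL stream object might fetch
its four pipe handles once using `ossl_cml_get_pipe`, as declared later in this
document, and cache them:

```c
/* Hypothetical APL-side cache of the four pipes belonging to one stream. */
typedef struct apl_stream_st {
    QUIC_CML_PIPE request_pipe;       /* control; send */
    QUIC_CML_PIPE notification_pipe;  /* control; recv */
    QUIC_CML_PIPE app_send_pipe;      /* data; send */
    QUIC_CML_PIPE app_recv_pipe;      /* data; recv */
} APL_STREAM;

static int get_stream_pipes(QUIC_CML *cml, const QUIC_CML_SELECTOR *sel,
                            APL_STREAM *s)
{
    /* Pipe handles are stable for the life of the stream, so cache them. */
    if (!ossl_cml_get_pipe(cml, QUIC_CML_CLASS_REQUEST, sel, &s->request_pipe)
        || !ossl_cml_get_pipe(cml, QUIC_CML_CLASS_NOTIFICATION, sel,
                              &s->notification_pipe)
        || !ossl_cml_get_pipe(cml, QUIC_CML_CLASS_APP_SEND, sel,
                              &s->app_send_pipe)
        || !ossl_cml_get_pipe(cml, QUIC_CML_CLASS_APP_RECV, sel,
                              &s->app_recv_pipe))
        return 0;

    return 1;
}
```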
All CML methods are thread safe. The CML implementation handles any necessary
locking internally.
The `ossl_cml_write_available` and `ossl_cml_read_available` calls determine the
number of bytes which can currently be written to a send-only pipe, or read from
a receive-only pipe, respectively.
**Race conditions.** Because these availability queries are separate calls from
`ossl_cml_write` and `ossl_cml_read`, the values they return may become out of
date before the caller has a chance to call `ossl_cml_write` or `ossl_cml_read`.
However, such changes are guaranteed to be monotonically in favour of the
caller; for example, the value returned by `ossl_cml_write_available` will only
ever increase asynchronously (and only decrease as a result of an
`ossl_cml_write` call). Conversely, the value returned by
`ossl_cml_read_available` will only ever increase asynchronously (and only
decrease as a result of an `ossl_cml_read` call). Assuming that only one thread
makes calls to CML functions at a given time *for a given pipe*, this therefore
poses no issue for callers.
Concurrent use of `ossl_cml_write` or `ossl_cml_read` for a given pipe is not
intended (and would not make sense in any case). The caller is responsible for
synchronising such calls.
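As an illustration (a sketch only, assuming `ossl_cml_write` and
`ossl_cml_write_available` as declared in the API sketch later in this document;
the `try_send` helper is hypothetical), a purely non-blocking writer can rely on
this monotonicity by writing at most the currently advertised number of bytes:

```c
/*
 * Illustrative non-blocking send: write at most as many bytes as the pipe
 * can currently accept. The availability value can only grow asynchronously,
 * so writing up to avail bytes cannot over-fill the pipe.
 */
static size_t try_send(QUIC_CML *cml, QUIC_CML_PIPE app_send_pipe,
                       const unsigned char *buf, size_t buf_len)
{
    size_t avail = ossl_cml_write_available(cml, app_send_pipe);
    size_t chunk = buf_len < avail ? buf_len : avail;

    if (chunk == 0 || !ossl_cml_write(cml, app_send_pipe, buf, chunk))
        return 0;

    return chunk; /* number of bytes accepted by the pipe */
}
```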
**Examples of pipe usage.** The application data pipes are used to serialise the
actual application data sent or received on a QUIC stream. The usage of the
request/notification pipes is more varied; they are used for control activity.
There is therefore a “control/data” separation here. The request and
notification pipes transport tagged unions. Abstractly, commands and
notifications might include the following (a possible encoding is sketched
after this list):
- Request: Reset Stream (error code: u64)
- Notification: Connection Terminated by Peer
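One possible C encoding of such messages (purely illustrative; the type and
field names are hypothetical and the actual encoding is internal to the CML)
might be:

```c
/* Hypothetical encodings of the abstract messages above; illustrative only. */
typedef struct quic_cml_request_st {
    enum { QUIC_CML_REQ_RESET_STREAM /* , ... */ } type;
    union {
        struct { uint64_t error_code; } reset_stream;
    } u;
} QUIC_CML_REQUEST;

typedef struct quic_cml_notification_st {
    enum { QUIC_CML_NOTIFY_CONN_TERMINATED_BY_PEER /* , ... */ } type;
} QUIC_CML_NOTIFICATION;
```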
**Example implementation of `SSL_write`.** An `SSL_write`-like API might be
implemented in the APL like this:
```c
int do_write(QUIC_CML *cml,
             QUIC_CML_PIPE notification_pipe,
             QUIC_CML_PIPE app_send_pipe,
             const void *buf, size_t buf_len)
{
    const unsigned char *p = buf;
    size_t bytes_written = 0;
    QUIC_CML_PIPE wait_pipes[2] = { notification_pipe, app_send_pipe };

    for (;;) {
        /* e.g. connection termination */
        process_any_notifications(notification_pipe);

        /* state checks, etc. */
        if (...->conn_terminated)
            return 0;

        if (buf_len == 0)
            return 1;

        /*
         * Assumes an ossl_cml_write variant which reports the number of bytes
         * actually consumed via *bytes_written.
         */
        if (!ossl_cml_write(cml, app_send_pipe, p, buf_len, &bytes_written))
            return 0;

        if (bytes_written == 0) {
            if (!should_block())
                break;

            /* Block until data can be sent or a notification arrives. */
            ossl_cml_block_until(cml, wait_pipes, 2, ossl_time_infinite());
            continue; /* try again */
        }

        p       += bytes_written;
        buf_len -= bytes_written;
    }

    return 1;
}
```
A sketch of the CML API follows:

```c
/*
 * Creates a new CML using the Direct CML (DCML) implementation. need_locking
 * may be 0 to elide mutex usage if the application is guaranteed to synchronise
 * access or is purely single-threaded.
 */
QUIC_CML *ossl_cml_new_direct(int need_locking);

/* Creates a new CML using the Worker CML (WCML) implementation. */
QUIC_CML *ossl_cml_new_worker(size_t num_worker_threads);

/*
 * Starts the CML operating. Idempotent after it returns successfully. For the
 * WCML this might e.g. start background threads; for the DCML it is likely to
 * be a no-op (but must still be called).
 */
int ossl_cml_start(QUIC_CML *cml);

/*
 * Begins the CML shutdown process. Returns 1 once shutdown is complete; may
 * need to be called multiple times until shutdown is done.
 */
int ossl_cml_shutdown(QUIC_CML *cml);

/*
 * Immediate free of the CML. This is always safe but may cause handling
 * of a connection to be aborted abruptly as it is an immediate teardown
 * of all state.
 */
void ossl_cml_free(QUIC_CML *cml);

/*
 * Retrieves a pipe for a logical CML object described by selector. The pipe
 * handle, which is stable over the life of the logical CML object, is written
 * to *pipe_handle. class_ is a QUIC_CML_CLASS value.
 */
enum {
    QUIC_CML_CLASS_REQUEST,       /* control; send */
    QUIC_CML_CLASS_NOTIFICATION,  /* control; recv */
    QUIC_CML_CLASS_APP_SEND,      /* data; send */
    QUIC_CML_CLASS_APP_RECV       /* data; recv */
};

int ossl_cml_get_pipe(QUIC_CML *cml,
                      int class_,
                      const QUIC_CML_SELECTOR *selector,
                      QUIC_CML_PIPE *pipe_handle);

/*
 * Returns the number of bytes a sending pipe can currently accept. The returned
 * value may increase over time asynchronously but will only decrease in
 * response to an ossl_cml_write call.
 */
size_t ossl_cml_write_available(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle);

/*
 * Appends bytes into a sending pipe by copying them. The buffer can be freed
 * as soon as this call returns.
 */
int ossl_cml_write(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle,
                   const void *buf, size_t buf_len);

/*
 * Returns the number of bytes a receiving pipe currently has waiting to be
 * read. The returned value may increase over time asynchronously but will only
 * decrease in response to an ossl_cml_read call.
 */
size_t ossl_cml_read_available(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle);

/*
 * Reads bytes from a receiving pipe by copying them.
 */
int ossl_cml_read(QUIC_CML *cml, QUIC_CML_PIPE pipe_handle,
                  void *buf, size_t buf_len);

/*
 * Blocks until at least one of the pipes in the array specified by
 * pipe_handles is ready, or until the deadline given is reached.
 *
 * A pipe is ready if:
 *
 * - it is a sending pipe and one or more bytes can now be written;
 * - it is a receiving pipe and one or more bytes can now be read.
 */
int ossl_cml_block_until(QUIC_CML *cml,
                         const QUIC_CML_PIPE *pipe_handles,
                         size_t num_pipe_handles,
                         OSSL_TIME deadline);
```