XNU use of Atomics and Memory Barriers
======================================

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order),
as this document builds on it and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<os/atomic_private.h>` header.

Note: Linux has thorough documentation around memory barriers
(`Documentation/memory-barriers.txt`); some of it is Linux specific,
but most of it is not and makes for a valuable read.

In the rest of this document we'll refer to the various memory orderings defined
by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code) and memory **fences**.

The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, the compiler will generate the
matching *seq_cst* atomic operations on your behalf.

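For instance, a plain increment of an `_Atomic` counter compiles to a
sequentially consistent read-modify-write even though nothing in the source
says so. This is a minimal sketch (the `stats_count` variable and
`record_event` function are illustrative, not from XNU):

```c
#include <stdatomic.h>

static long _Atomic stats_count;

void
record_event(void)
{
	/*
	 * Equivalent to:
	 *   atomic_fetch_add_explicit(&stats_count, 1, memory_order_seq_cst);
	 * and therefore emits full barriers on weakly ordered hardware.
	 */
	stats_count++;
}
```
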
The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers.

It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings; however, the compiler is entitled to perform a number of
optimizations with relaxed atomics that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform various optimizations it
does with other plain memory accesses, such as coalescing, reordering, or
hoisting them out of loops.

For example, when the compiler can know what `doit` is doing (which due to LTO
is almost always the case for XNU), it is allowed to transform this code:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
		atomic_store_explicit(progress, i, memory_order_relaxed);
	}
}
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
	}
	atomic_store_explicit(progress, steps, memory_order_relaxed);
}
```

How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. the memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended in new code to use the `_Atomic`
   qualifier.

2. `os_atomic_*` cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified.

3. `os_atomic_*` only comes with the explicit variants:
   orderings must be provided and can express either memory orders
   where the name is the same as in C11 without the `memory_order_` prefix,
   or a compiler barrier ordering: `compiler_acquire`, `compiler_release`,
   `compiler_acq_rel`.

4. `os_atomic_*` emits the proper compiler barriers that
   correspond to the requested memory ordering (using
   `atomic_signal_fence()`).

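As a hedged illustration of points 1 to 3, the progress loop from the previous
section could be written with `os_atomic_store` on a plain `long`, with the
ordering spelled out at the call site (`doit` remains a placeholder):

```c
#include <os/atomic_private.h>

extern void doit(int i);

void
perform_with_progress(int steps, long *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
		/*
		 * Performed as if `progress` were `_Atomic volatile`
		 * qualified: it cannot be coalesced or hoisted out of
		 * the loop by the compiler.
		 */
		os_atomic_store(progress, i, relaxed);
	}
}
```
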
Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<os/atomic_private.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this for example).

Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged; however, authors must
be aware that a direct access to such a variable will
result in quite heavy memory barriers.

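A hedged sketch of what that looks like in practice: qualify the field, but go
through `os_atomic_*` on hot paths so the ordering stays explicit and cheap
(the `struct job` type and its names are illustrative):

```c
#include <stdint.h>
#include <os/atomic_private.h>

struct job {
	uint32_t _Atomic refcnt;
};

void
job_retain(struct job *j)
{
	/* Explicit, inexpensive ordering. */
	os_atomic_inc(&j->refcnt, relaxed);

	/*
	 * By contrast, `j->refcnt++;` would be a direct access and would
	 * imply seq_cst, with the heavy barriers that entails.
	 */
}
```
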
The *consume* memory ordering should not be used
(see the *dependency* memory order later in this document).

**Note**: `<libkern/OSAtomic.h>` provides a bunch of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.

High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier,
with an optional barrier ordering. It is implemented with C11's
`atomic_signal_fence()`. The barrier ordering argument is optional
and defaults to the `acq_rel` compiler barrier (which prevents the
compiler from reordering code in any direction around this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()`, even on UP systems.

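A short, hedged sketch of the difference between the two (the variables and
function are illustrative):

```c
#include <os/atomic_private.h>

static int  side_effects;      /* plain variable */
static long _Atomic published; /* atomic flag    */

void
example_fences(void)
{
	side_effects = 1;

	/*
	 * Compiler barrier only: prevents the compiler from moving the
	 * store above or below this point, but emits no fence instruction.
	 */
	os_compiler_barrier();

	side_effects = 2;

	/*
	 * Memory fence: also orders the stores for other CPUs, and implies
	 * the equivalent compiler barrier.
	 */
	os_atomic_thread_fence(release);
	os_atomic_store(&published, 1, relaxed);
}
```
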
### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to get access to atomic loads and stores
that involve more costly codegen (such as compare exchange loops).

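Here is a hedged sketch of the distinction (the variables are illustrative, and
whether the wide variants actually need a compare exchange loop depends on the
target):

```c
#include <stdint.h>
#include <os/atomic_private.h>

static uint32_t _Atomic flags;  /* word sized */
static uint64_t _Atomic total;  /* may be wider than the native word */

void
example_loads_and_stores(void)
{
	os_atomic_init(&flags, 0);

	/* Guaranteed to compile to a single plain load and store. */
	uint32_t f = os_atomic_load(&flags, relaxed);
	os_atomic_store(&flags, f, relaxed);

	/*
	 * May require more expensive codegen (such as a compare exchange
	 * loop) on targets where 64-bit accesses are not single instructions.
	 */
	uint64_t t = os_atomic_load_wide(&total, relaxed);
	os_atomic_store_wide(&total, t, relaxed);
}
```
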
### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to an atomic and of `~value`).

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
  which returns the value stored at the specified location
  *before* the atomic operation took place,
- `os_atomic_${op}` (for example `os_atomic_add`) which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in-place addition `(*p += value)`, which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

For example:

```c
static int _Atomic i = 0;

printf("%d\n", os_atomic_inc_orig(&i)); // prints 0
printf("%d\n", os_atomic_inc(&i)); // prints 2
```

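The bit manipulation case mentioned above can likewise be spelled out as a
small test-and-set helper (a hedged sketch; the names are illustrative):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

#define FLAG_INITED 0x1u

static unsigned int _Atomic init_flags;

/* Returns true only for the caller that actually set FLAG_INITED. */
static bool
mark_inited(void)
{
	return (os_atomic_or_orig(&init_flags, FLAG_INITED, relaxed) & FLAG_INITED) == 0;
}
```
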
### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg` which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)`, which
   will atomically store `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`,
   which has an extra `orig_value` argument that must be a pointer to a local
   variable and will be filled with the current value at `address` whether the
   compare exchange was successful or not. In case of success, the loaded value
   will always be `expected`; in case of failure it will be filled with
   the current value, which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as it is what is used most of the time.

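As a hedged sketch of typical usage (the state machine and names here are
illustrative):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

enum { STATE_IDLE, STATE_RUNNING };

static int _Atomic state;

/* Returns true if this caller transitioned the state from IDLE to RUNNING. */
static bool
try_start(void)
{
	return os_atomic_cmpxchg(&state, STATE_IDLE, STATE_RUNNING, acquire);
}

/* Same transition, but also observes which value got in the way on failure. */
static bool
try_start_verbose(void)
{
	int observed;

	if (os_atomic_cmpxchgv(&state, STATE_IDLE, STATE_RUNNING, &observed, acquire)) {
		return true;
	}
	/* `observed` now holds the value that prevented the transition. */
	return false;
}
```
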
There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for terse and much more
readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS-loop in C11:

```c
int _Atomic *address;
int old_value, new_value;
bool success = false;

old_value = atomic_load_explicit(address, memory_order_relaxed);
do {
	if (!validate(old_value)) {
		break;
	}
	new_value = compute_new_value(old_value);
	success = atomic_compare_exchange_weak_explicit(address, &old_value,
	    new_value, memory_order_acquire, memory_order_relaxed);
} while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
	if (!validate(old_value)) {
		os_atomic_rmw_loop_give_up(break);
	}
	new_value = compute_new_value(old_value);
});
```

Unlike the C11 variant, it lets the reader know in program order that this will
be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort the LL/SC transaction they have opened).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as these may cause the store-conditional to always fail.
In particular, nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid; instead, an
`os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
before the loop should be used in this way:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

again:
success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
	if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
		os_atomic_rmw_loop_give_up({
			// Do whatever you need to do/store to central memory
			// that would cause the loop to always fail
			do_my_rmw_loop_breaking_store();

			// And only then redrive.
			goto again;
		});
	}
	if (!validate(old_value)) {
		os_atomic_rmw_loop_give_up(break);
	}
	new_value = compute_new_value(old_value);
});
```

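For a more concrete, hedged illustration, here is what a refcount-style
"increment unless zero" operation could look like with `os_atomic_rmw_loop`
(the function and variable names are illustrative):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

/* Returns true if the count was incremented, false if it was already zero. */
static bool
try_retain(int _Atomic *count)
{
	int old_count, new_count;

	return os_atomic_rmw_loop(count, old_count, new_count, acquire, {
		if (old_count == 0) {
			/* Exiting the loop scope must go through give_up. */
			os_atomic_rmw_loop_give_up(return false);
		}
		new_count = old_count + 1;
	});
}
```
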
### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways,
most compilers, clang included, implement it as equivalent
to `memory_order_acquire`. However, the underlying concept is useful
for certain algorithms.

As an attempt to provide a replacement for this, `<os/atomic_private.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier, which can be used as the root of a chain of hardware
dependencies that would otherwise pair with store-releases done at this address,
very much like what the *consume* memory order is intended to provide.

However, unlike the *consume* memory ordering, where the compiler has to follow
the dependencies, the *dependency* memory ordering relies on explicit
annotations of where the dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
struct foo {
	long value;
	long flag;
};

void
publish(struct foo *p, long value)
{
	p->value = value;
	os_atomic_store(&p->flag, 1, release);
}

bool
broken_read(struct foo *p, long *value)
{
	/*
	 * This isn't safe, as there's absolutely no hardware dependency involved.
	 * Using an acquire barrier would of course fix it but is quite expensive...
	 */
	if (os_atomic_load(&p->flag, relaxed)) {
		*value = p->value;
		return true;
	}
	return false;
}

bool
valid_read(struct foo *p, long *value)
{
	long flag = os_atomic_load(&p->flag, dependency);
	if (flag) {
		/*
		 * Further the chain of dependency to any loads through `p`
		 * which properly pair with the release barrier in `publish`.
		 */
		*value = os_atomic_load_with_dependency_on(&p->value, flag);
		return true;
	}
	return false;
}
```

There are 4 interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
   that should pair with a store or rmw with release semantics or stronger
   (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency` can be used to inject the dependency provided
   by a *dependency* load, or any other value that has had a dependency
   injected into it, into another value,

3. `os_atomic_load_with_dependency_on` to do an otherwise relaxed load
   that still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root to inject into multiple loads.

**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating, for example if the compiler can know that
the pointer can only take a couple of values and ditch all these manually
crafted dependency chains. Hopefully there will be a future C2Y standard that
provides a similar construct as a language feature instead.