doc/atomics.md

   1 XNU use of Atomics and Memory Barriers
   2 ======================================
   3
   4 Goal
   5 ----
   6
   7 This document discusses the use of atomics and memory barriers in XNU. It is
   8 meant as a guide to best practices, and warns against a variety of possible
   9 pitfalls in the handling of atomics in C.
  10
  11 It is assumed that the reader has a decent understanding of
  12 the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order)
  13 as this document builds on it, and explains the liberties XNU takes with said
  14 model.
  15
  16 All the interfaces discussed in this document are available through
  17 the `<machine/atomic.h>` header.
  18
  19 Note: Linux has thorough documentation around memory barriers
  20 (Documentation/memory-barriers.txt), some of which is Linux specific,
  21 but most is not and is a valuable read.
  22
  23
  24 Vocabulary
  25 ----------
  26
  27 In the rest of this document we'll refer to the various memory ordering defined
  28 by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.
  29
  30 `os_atomic` also tries to make the distinction between compiler **barriers**
  31 (which limit how much the compiler can reorder code), and memory **fences**.
  32
  33
  34 The dangers and pitfalls of C11's `<stdatomic.h>`
  35 -------------------------------------------------
  36
  37 While the C11 memory model has likely been one of the most important additions
  38 to modern C, in the purest C tradition, it is a sharp tool.
  39
  40 By default, C11 comes with two variants of each atomic "operation":
  41
  42 - an *explicit* variant where memory orderings can be specified,
  43 - a regular variant which is equivalent to the former with the *seq_cst*
  44   memory ordering.
  45
  46 When an `_Atomic` qualified variable is accessed directly without using
  47 any `atomic_*_explicit()` operation, then the compiler will generate the
  48 matching *seq_cst* atomic operations on your behalf.
  49
  50 The sequentially consistent world is extremely safe from a lot of compiler
  51 and hardware reorderings and optimizations, which is great, but comes with
  52 a huge cost in terms of memory barriers. It is also completely wasted when
  53 building for a non SMP configuration.
  54
  55
  56 It seems very tempting to use `atomic_*_explicit()` functions with explicit
  57 memory orderings, however, the compiler is entitled to perform a number of
  58 optimizations with relaxed atomics, that most developers will not expect.
  59 Indeed, the compiler is perfectly allowed to perform various optimizations it
  60 does with other plain memory accesess such as coalescing, reordering, hoisting
  61 out of loops, ...
  62
  63 For example, when the compiler can know what `doit` is doing (which due to LTO
  64 is almost always the case for XNU), is allowed to transform this code:
  65
  66 ```c
  67     void
  68     perform_with_progress(int steps, long _Atomic *progress)
  69     {
  70         for (int i = 0; i < steps; i++) {
  71             doit(i);
  72             atomic_store_explicit(progress, i, memory_order_relaxed);
  73         }
  74     }
  75 ```
  76
  77 Into this, which obviously defeats the entire purpose of `progress`:
  78
  79 ```c
  80     void
  81     perform_with_progress(int steps, long _Atomic *progress)
  82     {
  83         for (int i = 0; i < steps; i++) {
  84             doit(i);
  85         }
  86         atomic_store_explicit(progress, steps, memory_order_relaxed);
  87     }
  88 ```
  89
  90
  91 How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
  92 -----------------------------------------------------------
  93
  94 1. the memory locations passed to the various `os_atomic_*`
  95    functions do not need to be marked `_Atomic` or `volatile`
  96    (or `_Atomic volatile`), which allow for use of atomic
  97    operations in code written before C11 was even a thing.
  98
  99    It is however recommended in new code to use the `_Atomic`
 100    specifier.
 101
 102 2. `os_atomic_*` cannot be coalesced by the compiler:
 103    all accesses are performed on the specified locations
 104    as if their type was `_Atomic volatile` qualified.
 105
 106 3. `os_atomic_*` only comes with the explicit variants:
 107    orderings must be provided and can express either memory orders
 108    where the name is the same as in C11 without the `memory_order_` prefix,
 109    or a compiler barrier ordering `compiler_acquire`, `compiler_release`,
 110    `compiler_acq_rel`.
 111
 112 4. `os_atomic_*` elides barriers for non SMP configurations
 113    by default, however, it emits the proper compiler barriers
 114    that correspond to the requested memory ordering (using
 115    `atomic_signal_fence()`), even on UP configuration, so that
 116    the compiler cannot possibly reorder code on UP systems.
 117
 118
 119 Best practices for the use of atomics in XNU
 120 --------------------------------------------
 121
 122 For most generic code, the `os_atomic_*` functions from
 123 `<machine/atomic.h>` are the perferred interfaces.
 124
 125 `__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.
 126
 127 `<stdatomic.h>` functions may be used if:
 128
 129 - compiler coalescing / reordering is desired (refcounting
 130   implementations may desire this for example).
 131
 132 - defaulting to relaxed atomics for non SMP platforms doesn't make sense
 133   (such as device access which may require memory fences even on UP systems).
 134
 135
 136 Qualifying atomic variables with `_Atomic` or even
 137 `_Atomic volatile` is encouraged, however authors must
 138 be aware that a direct access to this variable will
 139 result in quite heavy memory barriers.
 140
 141 The *consume* memory ordering should not be used
 142 (See *dependency* memory order later in this documentation).
 143
 144 **Note**: `<libkern/OSAtomic.h>` provides a bunch of legacy
 145 atomic interfaces, but this header is considered obsolete
 146 and these functions should not be used in new code.
 147
 148
 149 High level overview of `os_atomic_*` interfaces
 150 -----------------------------------------------
 151
 152 ### Compiler barriers and memory fences
 153
 154 `os_compiler_barrier(mem_order?)` provides a compiler barrier,
 155 with an optional barrier ordering. It is implemented with C11's
 156 `atomic_signal_fence()`. The barrier ordering argument is optional
 157 and defaults to the `acq_rel` compiler barrier (which prevents the
 158 compiler to reorder code in any direction around this barrier).
 159
 160 `os_atomic_thread_fence(mem_order)` provides a memory barrier
 161 according to the semantics of `atomic_thread_fence()`. It always
 162 implies the equivalent `os_compiler_barrier()` even on UP systems.
 163
 164 ### Init, load and store
 165
 166 `os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
 167 facilities equivalent to `atomic_init`, `atomic_load_explicit`
 168 and `atomic_store_explicit` respectively.
 169
 170 Note that `os_atomic_load` and `os_atomic_store` promise that they will
 171 compile to a plain load or store. `os_atomic_load_wide` and
 172 `os_atomic_store_wide` can be used to have access to atomic loads and store
 173 that involve more costly codegen (such as compare exchange loops).
 174
 175 ### Basic RMW (read/modify/write) atomic operations
 176
 177 The following basic atomic RMW operations exist:
 178
 179 - `inc`: atomic increment (equivalent to an atomic add of `1`),
 180 - `dec`: atomic decrement (equivalent to an atomic sub of `1`),
 181 - `add`: atomic add,
 182 - `sub`: atomic sub,
 183 - `or`: atomic bitwise or,
 184 - `xor`: atomic bitwise xor,
 185 - `and`: atomic bitwise and,
 186 - `andnot`: atomic bitwise andnot (equivalent to atomic and of ~value),
 187 - `min`: atomic min,
 188 - `max`: atomic max.
 189
 190 For any such operation, two variants exist:
 191
 192 - `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
 193   which returns the value stored at the specified location
 194   *before* the atomic operation took place
 195 - `os_atomic_${op}` (for example `os_atomic_add`) which
 196   returns the value stored at the specified location
 197   *after* the atomic operation took place
 198
 199 This convention is picked for two reasons:
 200
 201 1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
 202    in place addition `(*p += value)` which returns the result of the
 203    operation and not the original value of `*p`.
 204
 205 2. Most subtle atomic algorithms do actually require the original value
 206    stored at the location, especially for bit manipulations:
 207    `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
 208    `*p |= bit` but also tell you whether `bit` was set in the original value.
 209
 210    Making it more explicit that the original value is used is hence
 211    important for readers and worth the extra five keystrokes.
 212
 213 Typically:
 214
 215 ```c
 216     static int _Atomic i = 0;
 217
 218     printf("%d\n", os_atomic_inc_orig(&i)); // prints 0
 219     printf("%d\n", os_atomic_inc(&i)); // prints 2
 220 ```
 221
 222 ### Atomic swap / compare and swap
 223
 224 `os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.
 225
 226 There are two variants of `os_atomic_cmpxchg` which are wrappers around
 227 `atomic_compare_exchange_strong_explicit`. Both of these variants will
 228 return false/0 if the compare exchange failed, and true/1 if the expected
 229 value was found at the specified location and the new value was stored.
 230
 231 1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)` which
 232    will atomically store `new_value` at `address` if the current value
 233    is equal to `expected`.
 234
 235 2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`
 236    which has an extra `orig_value` argument which must be a pointer to a local
 237    variable and will be filled with the current value at `address` whether the
 238    compare exchange was successful or not. In case of success, the loaded value
 239    will always be `expected`, however in case of failure it will be filled with
 240    the current value, which is helpful to redrive compare exchange loops.
 241
 242 Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
 243 specified, which only takes effect in case of a successful compare exchange.
 244 In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
 245 for the failure case ordering, as it is what is used most of the time.
 246
 247 There is no wrapper around `atomic_compare_exchange_weak_explicit`,
 248 as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.
 249
 250 ### `os_atomic_rmw_loop`
 251
 252 This expressive and versatile construct allows for really terse and
 253 way more readable compare exchange loops. It also uses LL/SC constructs more
 254 efficiently than a compare exchange loop would allow.
 255
 256 Instead of a typical CAS-loop in C11:
 257
 258 ```c
 259     int _Atomic *address;
 260     int old_value, new_value;
 261     bool success = false;
 262
 263     old_value = atomic_load_explicit(address, memory_order_relaxed);
 264     do {
 265         if (!validate(old_value)) {
 266             break;
 267         }
 268         new_value = compute_new_value(old_value);
 269         success = atomic_compare_exchange_weak_explicit(address, &old_value,
 270                 new_value, memory_order_acquire, memory_order_relaxed);
 271     } while (__improbable(!success));
 272 ```
 273
 274 `os_atomic_rmw_loop` allows this form:
 275
 276 ```c
 277     int _Atomic *address;
 278     int old_value, new_value;
 279     bool success;
 280
 281     success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
 282         if (!validate(old_value)) {
 283             os_atomic_rmw_loop_give_up(break);
 284         }
 285         new_value = compute_new_value(old_value);
 286     });
 287 ```
 288
 289 Unlike the C11 variant, it lets the reader know in program order that this will
 290 be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
 291 one has to jump to the end of the code to understand what it does.
 292
 293 Any control flow that attempts to exit its scope of the loop needs to be
 294 wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
 295 abort their opened LL/SC transaction).
 296
 297 Because these loops are LL/SC transactions, it is undefined to perform
 298 any store to memory (register operations are fine) within these loops,
 299 as these may cause the store-conditional to always fail.
 300 In particular nesting of `os_atomic_rmw_loop` is invalid.
 301
 302 Use of `continue` within an `os_atomic_rmw_loop` is also invalid, instead an
 303 `os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
 304 before the loop should be used in this way:
 305
 306 ```c
 307     int _Atomic *address;
 308     int old_value, new_value;
 309     bool success;
 310
 311 again:
 312     success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
 313         if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
 314             os_atomic_rmw_loop_give_up({
 315                 // Do whatever you need to do/store to central memory
 316                 // that would cause the loop to always fail
 317                 do_my_rmw_loop_breaking_store();
 318
 319                 // And only then redrive.
 320                 goto again;
 321             });
 322         }
 323         if (!validate(old_value)) {
 324             os_atomic_rmw_loop_give_up(break);
 325         }
 326         new_value = compute_new_value(old_value);
 327     });
 328 ```
 329
 330 ### the *dependency* memory order
 331
 332 Because the C11 *consume* memory order is broken in various ways,
 333 most compilers, clang included, implement it as an equivalent
 334 for `memory_order_acquire`. However, its concept is useful
 335 for certain algorithms.
 336
 337 As an attempt to provide a replacement for this, `<machine/atomic.h>`
 338 implements an entirely new *dependency* memory ordering.
 339
 340 The purpose of this ordering is to provide a relaxed load followed by an
 341 implicit compiler barrier, that can be used as a root for a chain of hardware
 342 dependencies that would otherwise pair with store-releases done at this address,
 343 very much like the *consume* memory order is intended to provide.
 344
 345 However, unlike the *consume* memory ordering where the compiler had to follow
 346 the dependencies, the *dependency* memory ordering relies on explicit
 347 annotations of when the dependencies are expected:
 348
 349 - loads through a pointer loaded with a *dependency* memory ordering
 350   will provide a hardware dependency,
 351
 352 - dependencies may be injected into other loads not performed through this
 353   particular pointer with the `os_atomic_load_with_dependency_on` and
 354   `os_atomic_inject_dependency` interfaces.
 355
 356 Here is an example of how it is meant to be used:
 357
 358 ```c
 359     struct foo {
 360         long value;
 361         long _Atomic flag;
 362     };
 363
 364     void
 365     publish(struct foo *p, long value)
 366     {
 367         p->value = value;
 368         os_atomic_store(&p->flag, 1, release);
 369     }
 370
 371
 372     bool
 373     broken_read(struct foo *p, long *value)
 374     {
 375         /*
 376          * This isn't safe, as there's absolutely no hardware dependency involved.
 377          * Using an acquire barrier would of course fix it but is quite expensive...
 378          */
 379         if (os_atomic_load(&p->flag, relaxed)) {
 380             *value = p->value;
 381             return true;
 382         }
 383         return false;
 384     }
 385
 386     bool
 387     valid_read(struct foo *p, long *value)
 388     {
 389         long flag = os_atomic_load(&p->flag, dependency);
 390         if (flag) {
 391             /*
 392              * Further the chain of dependency to any loads through `p`
 393              * which properly pair with the release barrier in `publish`.
 394              */
 395             *value = os_atomic_load_with_dependency_on(&p->value, flag);
 396             return true;
 397         }
 398         return false;
 399     }
 400 ```
 401
 402 There are 4 interfaces involved with hardware dependencies:
 403
 404 1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
 405    that should pair with a store or rmw with release semantics or stronger
 406    (release, acq\_rel or seq\_cst),
 407
 408 2. `os_atomic_inject_dependency` can be used to inject the dependency provided
 409    by a *dependency* load, or any other value that has had a dependency
 410    injected,
 411
 412 3. `os_atomic_load_with_dependency_on` to do an otherwise related relaxed load
 413    that still prolongs a dependency chain,
 414
 415 4. `os_atomic_make_dependency` to create an opaque token out of a given
 416    dependency root to inject into multiple loads.
 417
 418
 419 **Note**: this technique is NOT safe when the compiler can reason about the
 420 pointers that you are manipulating, for example if the compiler can know that
 421 the pointer can only take a couple of values and ditch all these manually
 422 crafted dependency chains. Hopefully there will be a future C2Y standard that
 423 provides a similar construct as a language feature instead.