XNU use of Atomics and Memory Barriers
======================================

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order)
as this document builds on it, and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<machine/atomic.h>` header.

Note: Linux has thorough documentation around memory barriers
(Documentation/memory-barriers.txt), some of which is Linux specific,
but most is not and is a valuable read.

In the rest of this document we'll refer to the various memory orderings defined
by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code), and memory **fences**
(which also constrain how the hardware may order memory accesses).


The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, then the compiler will generate the
matching *seq_cst* atomic operations on your behalf.

The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers. It is also completely wasted when
building for a non SMP configuration.

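
As a rough illustration of the two variants, here is a minimal sketch in plain
C11 (not XNU code). The `hits` counter and both function names are made up for
this example. The first function relies on the implicit *seq_cst* operations
the compiler generates for direct accesses, the second spells its orderings out.

```c
#include <stdatomic.h>

/* Hypothetical counter, used only for this illustration. */
static _Atomic int hits;

/* Direct accesses: the compiler generates seq_cst operations. */
int
bump_implicit(void)
{
    hits++;         /* a seq_cst read-modify-write */
    return hits;    /* a seq_cst load */
}

/* Same effect on the value, with the orderings chosen explicitly. */
int
bump_explicit(void)
{
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    return atomic_load_explicit(&hits, memory_order_relaxed);
}
```

Both functions leave the same value in `hits`; only the ordering guarantees
(and the barriers the compiler may have to emit) differ.
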

It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings, however, the compiler is entitled to perform a number of
optimizations with relaxed atomics, that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform various optimizations it
does with other plain memory accesses such as coalescing, reordering, hoisting
out of loops, and so on.

For example, when the compiler can know what `doit` is doing (which due to LTO
is almost always the case for XNU), it is allowed to transform this code:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
    for (int i = 0; i < steps; i++) {
        doit(i);
        atomic_store_explicit(progress, i, memory_order_relaxed);
    }
}
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
    for (int i = 0; i < steps; i++) {
        doit(i);
    }
    if (steps > 0) {
        /* only the last store of the original loop survives */
        atomic_store_explicit(progress, steps - 1, memory_order_relaxed);
    }
}
```


How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. the memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended in new code to use the `_Atomic`
   specifier.

2. `os_atomic_*` cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified.

3. `os_atomic_*` only comes with the explicit variants:
   orderings must be provided and can express either memory orders
   where the name is the same as in C11 without the `memory_order_` prefix,
   or a compiler barrier ordering such as `compiler_acquire` or
   `compiler_release`.

4. `os_atomic_*` elides barriers for non SMP configurations
   by default, however, it emits the proper compiler barriers
   that correspond to the requested memory ordering (using
   `atomic_signal_fence()`), even on UP configurations, so that
   the compiler cannot possibly reorder code on UP systems.

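
To give a feel for points 1 to 3, here is a small sketch of how such wrappers
could be layered on top of `<stdatomic.h>`. These `sketch_*` macros are
hypothetical and are not the real XNU implementation. They assume a GNU C
compiler (for `__typeof__`), assume that a type and its `_Atomic` counterpart
share the same representation (which C11 does not guarantee), and only accept
pointers to plain, non `_Atomic` objects.

```c
#include <stdatomic.h>

/* View a plain object as if it were `_Atomic volatile` qualified. */
#define sketch_atomic_ptr(p) \
    ((volatile _Atomic(__typeof__(*(p))) *)(p))

/* The ordering is mandatory and written without the memory_order_ prefix. */
#define sketch_atomic_load(p, order) \
    atomic_load_explicit(sketch_atomic_ptr(p), memory_order_##order)

/* Returns the value after the addition. */
#define sketch_atomic_add(p, v, order) \
    (atomic_fetch_add_explicit(sketch_atomic_ptr(p), (v), \
        memory_order_##order) + (v))

/* Pre-C11 style declaration: no _Atomic qualifier. */
static unsigned legacy_counter;

unsigned
sketch_demo(void)
{
    (void)sketch_atomic_add(&legacy_counter, 5, relaxed);
    return sketch_atomic_load(&legacy_counter, acquire);
}
```

Because every access goes through a `volatile` qualified pointer, the compiler
has to perform each of them, which is the property point 2 describes.
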

Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<machine/atomic.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this for example).

- defaulting to relaxed atomics for non SMP platforms doesn't make sense
  (such as device access which may require memory fences even on UP systems).

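
As an illustration of the refcounting case, here is a minimal sketch of a
reference count written directly against `<stdatomic.h>`. The `struct obj`
type and the `obj_*` functions are invented for this example and do not exist
in XNU. Since the accesses are not `volatile`, the compiler remains free to
combine adjacent retain and release operations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object. */
struct obj {
    atomic_uint refs;
};

void
obj_init(struct obj *o)
{
    atomic_init(&o->refs, 1);
}

void
obj_retain(struct obj *o)
{
    /* relaxed is enough: the caller already holds a reference */
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
bool
obj_release(struct obj *o)
{
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release) != 1) {
        return false;
    }
    /* order the earlier releases from other threads before teardown */
    atomic_thread_fence(memory_order_acquire);
    return true;
}
```
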

Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged, however authors must
be aware that a direct access to this variable will
result in quite heavy memory barriers.

The *consume* memory ordering should not be used
(See *dependency* memory order later in this documentation).

**Note**: `<libkern/OSAtomic.h>` provides a bunch of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.


High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier.
It is implemented with C11's `atomic_signal_fence()`.
The barrier ordering argument is optional
and defaults to the `acq_rel` compiler barrier (which prevents the
compiler from reordering code in any direction around this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()` even on UP systems.

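
The sketch below uses the two C11 primitives named above directly, since the
`os_*` wrappers are only available inside XNU. The `payload` and `ready`
variables and all three functions are hypothetical. It shows a release fence
pairing with an acquire fence to publish plain data, next to a compiler-only
barrier that emits no fence instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;         /* plain data, published through `ready` */
static atomic_bool ready;

void
producer(int value)
{
    payload = value;
    /* memory fence: the payload store is ordered before the flag store */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

bool
consumer(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed)) {
        return false;
    }
    /* pairs with the release fence in producer() */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}

void
compiler_only_barrier(void)
{
    /* constrains compiler reordering only, no hardware fence is emitted */
    atomic_signal_fence(memory_order_acq_rel);
}
```
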

### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to have access to atomic loads and stores
that involve more costly codegen (such as compare exchange loops).

### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to atomic and of ~value).

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
  which returns the value stored at the specified location
  *before* the atomic operation took place,
- `os_atomic_${op}` (for example `os_atomic_add`) which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in place addition `(*p += value)` which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

```c
static int _Atomic i = 0;

printf("%d\n", os_atomic_inc_orig(&i, relaxed)); // prints 0
printf("%d\n", os_atomic_inc(&i, relaxed)); // prints 2
```

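
For readers more familiar with `<stdatomic.h>`, the following sketch mirrors
both flavors in plain C11. The `FLAG_BUSY` bit and the two helper functions
are hypothetical. C11's `atomic_fetch_*` functions always return the original
value, so the "new value" flavor has to be rebuilt by reapplying the operand.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_BUSY 0x1u  /* hypothetical flag bit */

/* Returns true if this call set FLAG_BUSY, false if it was already set. */
bool
try_mark_busy(atomic_uint *flags)
{
    unsigned orig = atomic_fetch_or_explicit(flags, FLAG_BUSY,
        memory_order_relaxed);
    return (orig & FLAG_BUSY) == 0;
}

/* "New value" flavor built on the "original value" C11 primitive. */
unsigned
add_and_get(atomic_uint *p, unsigned v)
{
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed) + v;
}
```
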

### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg` which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)` which
   will atomically store `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`
   which has an extra `orig_value` argument which must be a pointer to a local
   variable and will be filled with the current value at `address` whether the
   compare exchange was successful or not. In case of success, the loaded value
   will always be `expected`, however in case of failure it will be filled with
   the current value, which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as it is what is used most of the time.

There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.

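
To make the semantics described in this section concrete, here is a sketch of
what the two variants amount to in plain C11 when the ordering is fixed to
*acquire*. Both `sketch_*` functions are hypothetical stand-ins written for
this example, and are not how XNU implements the real macros.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Roughly what os_atomic_cmpxchg(p, expected, new_value, acquire) means. */
bool
sketch_cmpxchg_acquire(atomic_int *p, int expected, int new_value)
{
    return atomic_compare_exchange_strong_explicit(p, &expected, new_value,
        memory_order_acquire,   /* ordering on success */
        memory_order_relaxed);  /* ordering on failure */
}

/* Same, but also reports the value that was found at `p`. */
bool
sketch_cmpxchgv_acquire(atomic_int *p, int expected, int new_value,
    int *orig_value)
{
    bool ok = atomic_compare_exchange_strong_explicit(p, &expected, new_value,
        memory_order_acquire, memory_order_relaxed);
    /* on failure, C11 overwrote `expected` with the current value */
    *orig_value = expected;
    return ok;
}
```
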

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for terse and much more
readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS-loop in C11:

```c
int _Atomic *address;
int old_value, new_value;
bool success = false;

old_value = atomic_load_explicit(address, memory_order_relaxed);
do {
    if (!validate(old_value)) {
        break;
    }
    new_value = compute_new_value(old_value);
    success = atomic_compare_exchange_weak_explicit(address, &old_value,
        new_value, memory_order_acquire, memory_order_relaxed);
} while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
    if (!validate(old_value)) {
        os_atomic_rmw_loop_give_up(break);
    }
    new_value = compute_new_value(old_value);
});
```

Unlike the C11 variant, it lets the reader know in program order that this will
be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort their opened LL/SC transaction).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as these may cause the store-conditional to always fail.
In particular nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid, instead an
`os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
before the loop should be used in this way:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

again:
success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
    if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
        os_atomic_rmw_loop_give_up({
            // Do whatever you need to do/store to central memory
            // that would cause the loop to always fail
            do_my_rmw_loop_breaking_store();

            // And only then redrive.
            goto again;
        });
    }
    if (!validate(old_value)) {
        os_atomic_rmw_loop_give_up(break);
    }
    new_value = compute_new_value(old_value);
});
```

### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways,
most compilers, clang included, implement it as an equivalent
for `memory_order_acquire`. However, its concept is useful
for certain algorithms.

As an attempt to provide a replacement for this, `<machine/atomic.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier, that can be used as a root for a chain of hardware
dependencies that would otherwise pair with store-releases done at this address,
very much like the *consume* memory order is intended to provide.

However, unlike the *consume* memory ordering where the compiler had to follow
the dependencies, the *dependency* memory ordering relies on explicit
annotations of when the dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
struct foo {
    long value;
    long _Atomic flag;
};

void
publish(struct foo *p, long value)
{
    p->value = value;
    os_atomic_store(&p->flag, 1, release);
}

bool
broken_read(struct foo *p, long *value)
{
    /*
     * This isn't safe, as there's absolutely no hardware dependency involved.
     * Using an acquire barrier would of course fix it but is quite expensive...
     */
    if (os_atomic_load(&p->flag, relaxed)) {
        *value = p->value;
        return true;
    }
    return false;
}

bool
valid_read(struct foo *p, long *value)
{
    long flag = os_atomic_load(&p->flag, dependency);
    if (flag) {
        /*
         * Further the chain of dependency to any loads through `p`
         * which properly pair with the release barrier in `publish`.
         */
        *value = os_atomic_load_with_dependency_on(&p->value, flag);
        return true;
    }
    return false;
}
```

There are 4 interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
   that should pair with a store or rmw with release semantics or stronger
   (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency` can be used to inject the dependency provided
   by a *dependency* load, or any other value that has had a dependency
   injected,

3. `os_atomic_load_with_dependency_on` to do an otherwise unrelated relaxed load
   that still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root to inject into multiple loads.

**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating, for example if the compiler can know that
the pointer can only take a couple of values and ditch all these manually
crafted dependency chains. Hopefully there will be a future C2Y standard that
provides a similar construct as a language feature instead.