]> git.saurik.com Git - apple/xnu.git/blob - doc/atomics.md
xnu-7195.81.3.tar.gz
[apple/xnu.git] / doc / atomics.md
1 XNU use of Atomics and Memory Barriers
2 ======================================
3
4 Goal
5 ----
6
7 This document discusses the use of atomics and memory barriers in XNU. It is
8 meant as a guide to best practices, and warns against a variety of possible
9 pitfalls in the handling of atomics in C.
10
11 It is assumed that the reader has a decent understanding of
12 the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order)
13 as this document builds on it, and explains the liberties XNU takes with said
14 model.
15
16 All the interfaces discussed in this document are available through
17 the `<os/atomic_private.h>` header.
18
19 Note: Linux has thorough documentation around memory barriers
20 (Documentation/memory-barriers.txt), some of which is Linux specific,
21 but most is not and is a valuable read.
22
23
24 Vocabulary
25 ----------
26
27 In the rest of this document we'll refer to the various memory ordering defined
28 by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.
29
30 `os_atomic` also tries to make the distinction between compiler **barriers**
31 (which limit how much the compiler can reorder code), and memory **fences**.
32
33
34 The dangers and pitfalls of C11's `<stdatomic.h>`
35 -------------------------------------------------
36
37 While the C11 memory model has likely been one of the most important additions
38 to modern C, in the purest C tradition, it is a sharp tool.
39
40 By default, C11 comes with two variants of each atomic "operation":
41
42 - an *explicit* variant where memory orderings can be specified,
43 - a regular variant which is equivalent to the former with the *seq_cst*
44 memory ordering.
45
46 When an `_Atomic` qualified variable is accessed directly without using
47 any `atomic_*_explicit()` operation, then the compiler will generate the
48 matching *seq_cst* atomic operations on your behalf.
49
50 The sequentially consistent world is extremely safe from a lot of compiler
51 and hardware reorderings and optimizations, which is great, but comes with
52 a huge cost in terms of memory barriers.
53
54
55 It seems very tempting to use `atomic_*_explicit()` functions with explicit
56 memory orderings, however, the compiler is entitled to perform a number of
57 optimizations with relaxed atomics, that most developers will not expect.
58 Indeed, the compiler is perfectly allowed to perform various optimizations it
59 does with other plain memory accesess such as coalescing, reordering, hoisting
60 out of loops, ...
61
62 For example, when the compiler can know what `doit` is doing (which due to LTO
63 is almost always the case for XNU), is allowed to transform this code:
64
65 ```c
66 void
67 perform_with_progress(int steps, long _Atomic *progress)
68 {
69 for (int i = 0; i < steps; i++) {
70 doit(i);
71 atomic_store_explicit(progress, i, memory_order_relaxed);
72 }
73 }
74 ```
75
76 Into this, which obviously defeats the entire purpose of `progress`:
77
78 ```c
79 void
80 perform_with_progress(int steps, long _Atomic *progress)
81 {
82 for (int i = 0; i < steps; i++) {
83 doit(i);
84 }
85 atomic_store_explicit(progress, steps, memory_order_relaxed);
86 }
87 ```
88
89
90 How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
91 -----------------------------------------------------------
92
93 1. the memory locations passed to the various `os_atomic_*`
94 functions do not need to be marked `_Atomic` or `volatile`
95 (or `_Atomic volatile`), which allow for use of atomic
96 operations in code written before C11 was even a thing.
97
98 It is however recommended in new code to use the `_Atomic`
99 specifier.
100
101 2. `os_atomic_*` cannot be coalesced by the compiler:
102 all accesses are performed on the specified locations
103 as if their type was `_Atomic volatile` qualified.
104
105 3. `os_atomic_*` only comes with the explicit variants:
106 orderings must be provided and can express either memory orders
107 where the name is the same as in C11 without the `memory_order_` prefix,
108 or a compiler barrier ordering `compiler_acquire`, `compiler_release`,
109 `compiler_acq_rel`.
110
111 4. `os_atomic_*` emits the proper compiler barriers that
112 correspond to the requested memory ordering (using
113 `atomic_signal_fence()`).
114
115
116 Best practices for the use of atomics in XNU
117 --------------------------------------------
118
119 For most generic code, the `os_atomic_*` functions from
120 `<os/atomic_private.h>` are the perferred interfaces.
121
122 `__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.
123
124 `<stdatomic.h>` functions may be used if:
125
126 - compiler coalescing / reordering is desired (refcounting
127 implementations may desire this for example).
128
129
130 Qualifying atomic variables with `_Atomic` or even
131 `_Atomic volatile` is encouraged, however authors must
132 be aware that a direct access to this variable will
133 result in quite heavy memory barriers.
134
135 The *consume* memory ordering should not be used
136 (See *dependency* memory order later in this documentation).
137
138 **Note**: `<libkern/OSAtomic.h>` provides a bunch of legacy
139 atomic interfaces, but this header is considered obsolete
140 and these functions should not be used in new code.
141
142
143 High level overview of `os_atomic_*` interfaces
144 -----------------------------------------------
145
146 ### Compiler barriers and memory fences
147
148 `os_compiler_barrier(mem_order?)` provides a compiler barrier,
149 with an optional barrier ordering. It is implemented with C11's
150 `atomic_signal_fence()`. The barrier ordering argument is optional
151 and defaults to the `acq_rel` compiler barrier (which prevents the
152 compiler to reorder code in any direction around this barrier).
153
154 `os_atomic_thread_fence(mem_order)` provides a memory barrier
155 according to the semantics of `atomic_thread_fence()`. It always
156 implies the equivalent `os_compiler_barrier()` even on UP systems.
157
158 ### Init, load and store
159
160 `os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
161 facilities equivalent to `atomic_init`, `atomic_load_explicit`
162 and `atomic_store_explicit` respectively.
163
164 Note that `os_atomic_load` and `os_atomic_store` promise that they will
165 compile to a plain load or store. `os_atomic_load_wide` and
166 `os_atomic_store_wide` can be used to have access to atomic loads and store
167 that involve more costly codegen (such as compare exchange loops).
168
169 ### Basic RMW (read/modify/write) atomic operations
170
171 The following basic atomic RMW operations exist:
172
173 - `inc`: atomic increment (equivalent to an atomic add of `1`),
174 - `dec`: atomic decrement (equivalent to an atomic sub of `1`),
175 - `add`: atomic add,
176 - `sub`: atomic sub,
177 - `or`: atomic bitwise or,
178 - `xor`: atomic bitwise xor,
179 - `and`: atomic bitwise and,
180 - `andnot`: atomic bitwise andnot (equivalent to atomic and of ~value),
181 - `min`: atomic min,
182 - `max`: atomic max.
183
184 For any such operation, two variants exist:
185
186 - `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
187 which returns the value stored at the specified location
188 *before* the atomic operation took place
189 - `os_atomic_${op}` (for example `os_atomic_add`) which
190 returns the value stored at the specified location
191 *after* the atomic operation took place
192
193 This convention is picked for two reasons:
194
195 1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
196 in place addition `(*p += value)` which returns the result of the
197 operation and not the original value of `*p`.
198
199 2. Most subtle atomic algorithms do actually require the original value
200 stored at the location, especially for bit manipulations:
201 `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
202 `*p |= bit` but also tell you whether `bit` was set in the original value.
203
204 Making it more explicit that the original value is used is hence
205 important for readers and worth the extra five keystrokes.
206
207 Typically:
208
209 ```c
210 static int _Atomic i = 0;
211
212 printf("%d\n", os_atomic_inc_orig(&i)); // prints 0
213 printf("%d\n", os_atomic_inc(&i)); // prints 2
214 ```
215
216 ### Atomic swap / compare and swap
217
218 `os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.
219
220 There are two variants of `os_atomic_cmpxchg` which are wrappers around
221 `atomic_compare_exchange_strong_explicit`. Both of these variants will
222 return false/0 if the compare exchange failed, and true/1 if the expected
223 value was found at the specified location and the new value was stored.
224
225 1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)` which
226 will atomically store `new_value` at `address` if the current value
227 is equal to `expected`.
228
229 2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`
230 which has an extra `orig_value` argument which must be a pointer to a local
231 variable and will be filled with the current value at `address` whether the
232 compare exchange was successful or not. In case of success, the loaded value
233 will always be `expected`, however in case of failure it will be filled with
234 the current value, which is helpful to redrive compare exchange loops.
235
236 Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
237 specified, which only takes effect in case of a successful compare exchange.
238 In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
239 for the failure case ordering, as it is what is used most of the time.
240
241 There is no wrapper around `atomic_compare_exchange_weak_explicit`,
242 as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.
243
244 ### `os_atomic_rmw_loop`
245
246 This expressive and versatile construct allows for really terse and
247 way more readable compare exchange loops. It also uses LL/SC constructs more
248 efficiently than a compare exchange loop would allow.
249
250 Instead of a typical CAS-loop in C11:
251
252 ```c
253 int _Atomic *address;
254 int old_value, new_value;
255 bool success = false;
256
257 old_value = atomic_load_explicit(address, memory_order_relaxed);
258 do {
259 if (!validate(old_value)) {
260 break;
261 }
262 new_value = compute_new_value(old_value);
263 success = atomic_compare_exchange_weak_explicit(address, &old_value,
264 new_value, memory_order_acquire, memory_order_relaxed);
265 } while (__improbable(!success));
266 ```
267
268 `os_atomic_rmw_loop` allows this form:
269
270 ```c
271 int _Atomic *address;
272 int old_value, new_value;
273 bool success;
274
275 success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
276 if (!validate(old_value)) {
277 os_atomic_rmw_loop_give_up(break);
278 }
279 new_value = compute_new_value(old_value);
280 });
281 ```
282
283 Unlike the C11 variant, it lets the reader know in program order that this will
284 be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
285 one has to jump to the end of the code to understand what it does.
286
287 Any control flow that attempts to exit its scope of the loop needs to be
288 wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
289 abort their opened LL/SC transaction).
290
291 Because these loops are LL/SC transactions, it is undefined to perform
292 any store to memory (register operations are fine) within these loops,
293 as these may cause the store-conditional to always fail.
294 In particular nesting of `os_atomic_rmw_loop` is invalid.
295
296 Use of `continue` within an `os_atomic_rmw_loop` is also invalid, instead an
297 `os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
298 before the loop should be used in this way:
299
300 ```c
301 int _Atomic *address;
302 int old_value, new_value;
303 bool success;
304
305 again:
306 success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
307 if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
308 os_atomic_rmw_loop_give_up({
309 // Do whatever you need to do/store to central memory
310 // that would cause the loop to always fail
311 do_my_rmw_loop_breaking_store();
312
313 // And only then redrive.
314 goto again;
315 });
316 }
317 if (!validate(old_value)) {
318 os_atomic_rmw_loop_give_up(break);
319 }
320 new_value = compute_new_value(old_value);
321 });
322 ```
323
324 ### the *dependency* memory order
325
326 Because the C11 *consume* memory order is broken in various ways,
327 most compilers, clang included, implement it as an equivalent
328 for `memory_order_acquire`. However, its concept is useful
329 for certain algorithms.
330
331 As an attempt to provide a replacement for this, `<os/atomic_private.h>`
332 implements an entirely new *dependency* memory ordering.
333
334 The purpose of this ordering is to provide a relaxed load followed by an
335 implicit compiler barrier, that can be used as a root for a chain of hardware
336 dependencies that would otherwise pair with store-releases done at this address,
337 very much like the *consume* memory order is intended to provide.
338
339 However, unlike the *consume* memory ordering where the compiler had to follow
340 the dependencies, the *dependency* memory ordering relies on explicit
341 annotations of when the dependencies are expected:
342
343 - loads through a pointer loaded with a *dependency* memory ordering
344 will provide a hardware dependency,
345
346 - dependencies may be injected into other loads not performed through this
347 particular pointer with the `os_atomic_load_with_dependency_on` and
348 `os_atomic_inject_dependency` interfaces.
349
350 Here is an example of how it is meant to be used:
351
352 ```c
353 struct foo {
354 long value;
355 long _Atomic flag;
356 };
357
358 void
359 publish(struct foo *p, long value)
360 {
361 p->value = value;
362 os_atomic_store(&p->flag, 1, release);
363 }
364
365
366 bool
367 broken_read(struct foo *p, long *value)
368 {
369 /*
370 * This isn't safe, as there's absolutely no hardware dependency involved.
371 * Using an acquire barrier would of course fix it but is quite expensive...
372 */
373 if (os_atomic_load(&p->flag, relaxed)) {
374 *value = p->value;
375 return true;
376 }
377 return false;
378 }
379
380 bool
381 valid_read(struct foo *p, long *value)
382 {
383 long flag = os_atomic_load(&p->flag, dependency);
384 if (flag) {
385 /*
386 * Further the chain of dependency to any loads through `p`
387 * which properly pair with the release barrier in `publish`.
388 */
389 *value = os_atomic_load_with_dependency_on(&p->value, flag);
390 return true;
391 }
392 return false;
393 }
394 ```
395
396 There are 4 interfaces involved with hardware dependencies:
397
398 1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
399 that should pair with a store or rmw with release semantics or stronger
400 (release, acq\_rel or seq\_cst),
401
402 2. `os_atomic_inject_dependency` can be used to inject the dependency provided
403 by a *dependency* load, or any other value that has had a dependency
404 injected,
405
406 3. `os_atomic_load_with_dependency_on` to do an otherwise related relaxed load
407 that still prolongs a dependency chain,
408
409 4. `os_atomic_make_dependency` to create an opaque token out of a given
410 dependency root to inject into multiple loads.
411
412
413 **Note**: this technique is NOT safe when the compiler can reason about the
414 pointers that you are manipulating, for example if the compiler can know that
415 the pointer can only take a couple of values and ditch all these manually
416 crafted dependency chains. Hopefully there will be a future C2Y standard that
417 provides a similar construct as a language feature instead.