XNU use of Atomics and Memory Barriers
======================================

Goal
----

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order),
as this document builds on it and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<os/atomic_private.h>` header.

Note: Linux has thorough documentation around memory barriers
(Documentation/memory-barriers.txt); some of it is Linux specific,
but most of it is not and is a valuable read.


Vocabulary
----------

In the rest of this document we'll refer to the various memory orderings defined
by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code) and memory **fences**.


The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, the compiler will generate the
matching *seq_cst* atomic operations on your behalf.

The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers.

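For instance, even an innocent-looking increment of an `_Atomic` variable is a
full *seq_cst* read-modify-write (a minimal sketch):

```c
#include <stdatomic.h>

static _Atomic long counter;

void
bump(void)
{
	/* equivalent to atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst) */
	counter++;
}
```
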
It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings; however, the compiler is entitled to perform a number of
optimizations with relaxed atomics that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform various optimizations it
does with other plain memory accesses, such as coalescing, reordering, hoisting
out of loops, ...

For example, when the compiler can know what `doit` is doing (which due to LTO
is almost always the case for XNU), it is allowed to transform this code:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
		atomic_store_explicit(progress, i, memory_order_relaxed);
	}
}
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
	}
	atomic_store_explicit(progress, steps, memory_order_relaxed);
}
```


How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. the memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for the use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended in new code to use the `_Atomic`
   specifier.

2. `os_atomic_*` cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified
   (see the sketch after this list).

3. `os_atomic_*` only comes with the explicit variants:
   orderings must be provided and can express either memory orders,
   where the name is the same as in C11 without the `memory_order_` prefix,
   or a compiler barrier ordering: `compiler_acquire`, `compiler_release`,
   `compiler_acq_rel`.

4. `os_atomic_*` emits the proper compiler barriers that
   correspond to the requested memory ordering (using
   `atomic_signal_fence()`).

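As a consequence of point 2, the `perform_with_progress` example from the
previous section keeps one store per iteration when written with the
`os_atomic_*` interfaces (a minimal sketch, assuming relaxed ordering is all
that `progress` needs):

```c
#include <os/atomic_private.h>

void
perform_with_progress(int steps, long *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
		/* accessed as if `_Atomic volatile`: cannot be coalesced or hoisted */
		os_atomic_store(progress, i, relaxed);
	}
}
```
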
Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<os/atomic_private.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this for example; see the sketch below).

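For instance, a retain operation on a refcounted object is a case where letting
the compiler merge adjacent increments is harmless and even desirable. A
hypothetical sketch (`my_ref` is not an actual XNU type), using plain C11
atomics:

```c
#include <stdatomic.h>

struct my_ref {
	_Atomic unsigned long mr_refcnt;
};

static inline void
my_ref_retain(struct my_ref *ref)
{
	/* relaxed is enough: taking a reference requires no ordering by itself */
	atomic_fetch_add_explicit(&ref->mr_refcnt, 1, memory_order_relaxed);
}
```
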
Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged, however authors must
be aware that a direct access to such a variable will
result in quite heavy memory barriers.

The *consume* memory ordering should not be used
(see the *dependency* memory ordering later in this document).

**Note**: `<libkern/OSAtomic.h>` provides a number of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.


High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier,
with an optional barrier ordering. It is implemented with C11's
`atomic_signal_fence()`. The barrier ordering argument is optional
and defaults to the `acq_rel` compiler barrier (which prevents the
compiler from reordering code in any direction around this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()`, even on UP systems.

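A minimal sketch of how a fence pairs with relaxed atomics (the `data`/`ready`
variables and functions are hypothetical):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

static long data;
static int _Atomic ready;

void
producer(void)
{
	data = 42;
	/* orders the write to `data` before the store to `ready` */
	os_atomic_thread_fence(release);
	os_atomic_store(&ready, 1, relaxed);
}

bool
consumer(long *out)
{
	if (os_atomic_load(&ready, relaxed)) {
		/* orders the load of `ready` before the read of `data` */
		os_atomic_thread_fence(acquire);
		*out = data;
		return true;
	}
	return false;
}
```
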
### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to have access to atomic loads and stores
that involve more costly codegen (such as compare exchange loops).

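A minimal sketch (assuming the `_wide` variants take the same arguments as
their plain counterparts):

```c
#include <stdint.h>
#include <os/atomic_private.h>

static uint64_t _Atomic total_bytes;

uint64_t
total_bytes_snapshot(void)
{
	/*
	 * `os_atomic_load` would compile to a plain load, which is only
	 * appropriate when the hardware can load this width atomically;
	 * the `_wide` variant may emit more expensive codegen instead
	 * (e.g. a compare exchange loop) where needed.
	 */
	return os_atomic_load_wide(&total_bytes, relaxed);
}
```
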
### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to an atomic and of `~value`),
- `min`: atomic min,
- `max`: atomic max.

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
  which returns the value stored at the specified location
  *before* the atomic operation took place,
- `os_atomic_${op}` (for example `os_atomic_add`) which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in place addition `(*p += value)`, which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

Typically:

```c
static int _Atomic i = 0;

printf("%d\n", os_atomic_inc_orig(&i)); // prints 0
printf("%d\n", os_atomic_inc(&i)); // prints 2
```

### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg` which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)`, which
   atomically stores `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`,
   which takes an extra `orig_value` argument: it must be a pointer to a local
   variable, and is filled with the value found at `address` whether the
   compare exchange was successful or not. In case of success, the loaded value
   will always be `expected`; in case of failure, it will be the current value,
   which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as it is what is used most of the time.

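For example, claiming exclusive ownership of a resource could look like this
(a minimal sketch with hypothetical names):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

static int _Atomic owner;	/* 0 means unowned */

bool
try_claim(int my_id)
{
	/* succeeds only if `owner` still holds 0 */
	return os_atomic_cmpxchg(&owner, 0, my_id, acquire);
}

bool
try_claim_report(int my_id, int *current_owner)
{
	/* on failure, `*current_owner` is filled with whoever beat us to it */
	return os_atomic_cmpxchgv(&owner, 0, my_id, current_owner, acquire);
}
```
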
There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for terse and much more
readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS-loop in C11:

```c
int _Atomic *address;
int old_value, new_value;
bool success = false;

old_value = atomic_load_explicit(address, memory_order_relaxed);
do {
	if (!validate(old_value)) {
		break;
	}
	new_value = compute_new_value(old_value);
	success = atomic_compare_exchange_weak_explicit(address, &old_value,
	    new_value, memory_order_acquire, memory_order_relaxed);
} while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
	if (!validate(old_value)) {
		os_atomic_rmw_loop_give_up(break);
	}
	new_value = compute_new_value(old_value);
});
```

Unlike the C11 variant, it lets the reader know in program order that this will
be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort their opened LL/SC transaction).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as these may cause the store-conditional to always fail.
In particular, nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid; instead, use
`os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
before the loop, as in this example:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

again:
success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
	if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
		os_atomic_rmw_loop_give_up({
			// Do whatever you need to do/store to central memory
			// that would cause the loop to always fail
			do_my_rmw_loop_breaking_store();

			// And only then redrive.
			goto again;
		});
	}
	if (!validate(old_value)) {
		os_atomic_rmw_loop_give_up(break);
	}
	new_value = compute_new_value(old_value);
});
```

### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways,
most compilers, clang included, implement it as an equivalent
of `memory_order_acquire`. However, the underlying concept is useful
for certain algorithms.

As an attempt to provide a replacement for this, `<os/atomic_private.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier, which can be used as the root of a chain of hardware
dependencies that pairs with store-releases done at this address,
very much like the *consume* memory order is intended to provide.

However, unlike the *consume* memory ordering, where the compiler has to track
the dependencies, the *dependency* memory ordering relies on explicit
annotations of where dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
struct foo {
	long value;
	long _Atomic flag;
};

void
publish(struct foo *p, long value)
{
	p->value = value;
	os_atomic_store(&p->flag, 1, release);
}


bool
broken_read(struct foo *p, long *value)
{
	/*
	 * This isn't safe, as there's absolutely no hardware dependency involved.
	 * Using an acquire barrier would of course fix it but is quite expensive...
	 */
	if (os_atomic_load(&p->flag, relaxed)) {
		*value = p->value;
		return true;
	}
	return false;
}

bool
valid_read(struct foo *p, long *value)
{
	long flag = os_atomic_load(&p->flag, dependency);
	if (flag) {
		/*
		 * Further the chain of dependency to any loads through `p`
		 * which properly pair with the release barrier in `publish`.
		 */
		*value = os_atomic_load_with_dependency_on(&p->value, flag);
		return true;
	}
	return false;
}
```

There are four interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
   which should pair with a store or RMW operation with release semantics or
   stronger (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency` to inject the dependency provided by a
   *dependency* load, or by any other value that has had a dependency
   injected,

3. `os_atomic_load_with_dependency_on` to perform an otherwise relaxed load
   that still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root, to inject into multiple loads.


**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating, for example if the compiler can know that
the pointer can only take a couple of values and ditch all these manually
crafted dependency chains. Hopefully there will be a future C2Y standard that
provides a similar construct as a language feature instead.