XNU use of Atomics and Memory Barriers
======================================

Goal
----

This document discusses the use of atomics and memory barriers in XNU. It is
meant as a guide to best practices, and warns against a variety of possible
pitfalls in the handling of atomics in C.

It is assumed that the reader has a decent understanding of
the [C11 memory model](https://en.cppreference.com/w/c/atomic/memory_order),
as this document builds on it and explains the liberties XNU takes with said
model.

All the interfaces discussed in this document are available through
the `<os/atomic_private.h>` header.

Note: Linux has thorough documentation around memory barriers
(Documentation/memory-barriers.txt); some of it is Linux specific,
but most of it is not and is a valuable read.


Vocabulary
----------

In the rest of this document we'll refer to the various memory orderings defined
by C11 as relaxed, consume, acquire, release, acq\_rel and seq\_cst.

`os_atomic` also tries to make the distinction between compiler **barriers**
(which limit how much the compiler can reorder code) and memory **fences**.


The dangers and pitfalls of C11's `<stdatomic.h>`
-------------------------------------------------

While the C11 memory model has likely been one of the most important additions
to modern C, in the purest C tradition, it is a sharp tool.

By default, C11 comes with two variants of each atomic "operation":

- an *explicit* variant where memory orderings can be specified,
- a regular variant which is equivalent to the former with the *seq_cst*
  memory ordering.

When an `_Atomic` qualified variable is accessed directly without using
any `atomic_*_explicit()` operation, the compiler will generate the
matching *seq_cst* atomic operations on your behalf.

The sequentially consistent world is extremely safe from a lot of compiler
and hardware reorderings and optimizations, which is great, but comes with
a huge cost in terms of memory barriers.

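For instance, even an innocent-looking increment of an `_Atomic` variable is a
full *seq_cst* read-modify-write (a minimal sketch):

```c
#include <stdatomic.h>

static _Atomic long counter;

void
bump(void)
{
	/* equivalent to atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst) */
	counter++;
}
```
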
It seems very tempting to use `atomic_*_explicit()` functions with explicit
memory orderings; however, the compiler is entitled to perform a number of
optimizations with relaxed atomics that most developers will not expect.
Indeed, the compiler is perfectly allowed to perform various optimizations it
does with other plain memory accesses, such as coalescing, reordering, hoisting
out of loops, ...

For example, when the compiler can know what `doit` is doing (which due to LTO
is almost always the case for XNU), it is allowed to transform this code:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
		atomic_store_explicit(progress, i, memory_order_relaxed);
	}
}
```

Into this, which obviously defeats the entire purpose of `progress`:

```c
void
perform_with_progress(int steps, long _Atomic *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
	}
	atomic_store_explicit(progress, steps, memory_order_relaxed);
}
```


How `os_atomic_*` tries to address `<stdatomic.h>` pitfalls
-----------------------------------------------------------

1. the memory locations passed to the various `os_atomic_*`
   functions do not need to be marked `_Atomic` or `volatile`
   (or `_Atomic volatile`), which allows for the use of atomic
   operations in code written before C11 was even a thing.

   It is however recommended in new code to use the `_Atomic`
   specifier.

2. `os_atomic_*` cannot be coalesced by the compiler:
   all accesses are performed on the specified locations
   as if their type was `_Atomic volatile` qualified
   (see the sketch after this list).

3. `os_atomic_*` only comes with the explicit variants:
   orderings must be provided and can express either memory orders,
   where the name is the same as in C11 without the `memory_order_` prefix,
   or a compiler barrier ordering: `compiler_acquire`, `compiler_release`,
   `compiler_acq_rel`.

4. `os_atomic_*` emits the proper compiler barriers that
   correspond to the requested memory ordering (using
   `atomic_signal_fence()`).

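As a consequence of point 2, the `perform_with_progress` example from the
previous section keeps one store per iteration when written with the
`os_atomic_*` interfaces (a minimal sketch, assuming relaxed ordering is all
that `progress` needs):

```c
#include <os/atomic_private.h>

void
perform_with_progress(int steps, long *progress)
{
	for (int i = 0; i < steps; i++) {
		doit(i);
		/* accessed as if `_Atomic volatile`: cannot be coalesced or hoisted */
		os_atomic_store(progress, i, relaxed);
	}
}
```
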
Best practices for the use of atomics in XNU
--------------------------------------------

For most generic code, the `os_atomic_*` functions from
`<os/atomic_private.h>` are the preferred interfaces.

`__sync_*`, `__c11_*` and `__atomic_*` compiler builtins should not be used.

`<stdatomic.h>` functions may be used if:

- compiler coalescing / reordering is desired (refcounting
  implementations may desire this for example; see the sketch below).

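For instance, a retain operation on a refcounted object is a case where letting
the compiler merge adjacent increments is harmless and even desirable. A
hypothetical sketch (`my_ref` is not an actual XNU type), using plain C11
atomics:

```c
#include <stdatomic.h>

struct my_ref {
	_Atomic unsigned long mr_refcnt;
};

static inline void
my_ref_retain(struct my_ref *ref)
{
	/* relaxed is enough: taking a reference requires no ordering by itself */
	atomic_fetch_add_explicit(&ref->mr_refcnt, 1, memory_order_relaxed);
}
```
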
Qualifying atomic variables with `_Atomic` or even
`_Atomic volatile` is encouraged, however authors must
be aware that a direct access to such a variable will
result in quite heavy memory barriers.

The *consume* memory ordering should not be used
(see the *dependency* memory ordering later in this document).

**Note**: `<libkern/OSAtomic.h>` provides a number of legacy
atomic interfaces, but this header is considered obsolete
and these functions should not be used in new code.


High level overview of `os_atomic_*` interfaces
-----------------------------------------------

### Compiler barriers and memory fences

`os_compiler_barrier(mem_order?)` provides a compiler barrier,
with an optional barrier ordering. It is implemented with C11's
`atomic_signal_fence()`. The barrier ordering argument is optional
and defaults to the `acq_rel` compiler barrier (which prevents the
compiler from reordering code in any direction around this barrier).

`os_atomic_thread_fence(mem_order)` provides a memory barrier
according to the semantics of `atomic_thread_fence()`. It always
implies the equivalent `os_compiler_barrier()`, even on UP systems.

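A minimal sketch of how a fence pairs with relaxed atomics (the `data`/`ready`
variables and functions are hypothetical):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

static long data;
static int _Atomic ready;

void
producer(void)
{
	data = 42;
	/* orders the write to `data` before the store to `ready` */
	os_atomic_thread_fence(release);
	os_atomic_store(&ready, 1, relaxed);
}

bool
consumer(long *out)
{
	if (os_atomic_load(&ready, relaxed)) {
		/* orders the load of `ready` before the read of `data` */
		os_atomic_thread_fence(acquire);
		*out = data;
		return true;
	}
	return false;
}
```
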
### Init, load and store

`os_atomic_init`, `os_atomic_load` and `os_atomic_store` provide
facilities equivalent to `atomic_init`, `atomic_load_explicit`
and `atomic_store_explicit` respectively.

Note that `os_atomic_load` and `os_atomic_store` promise that they will
compile to a plain load or store. `os_atomic_load_wide` and
`os_atomic_store_wide` can be used to have access to atomic loads and stores
that involve more costly codegen (such as compare exchange loops).

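A minimal sketch (assuming the `_wide` variants take the same arguments as
their plain counterparts):

```c
#include <stdint.h>
#include <os/atomic_private.h>

static uint64_t _Atomic total_bytes;

uint64_t
total_bytes_snapshot(void)
{
	/*
	 * `os_atomic_load` would compile to a plain load, which is only
	 * appropriate when the hardware can load this width atomically;
	 * the `_wide` variant may emit more expensive codegen instead
	 * (e.g. a compare exchange loop) where needed.
	 */
	return os_atomic_load_wide(&total_bytes, relaxed);
}
```
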
### Basic RMW (read/modify/write) atomic operations

The following basic atomic RMW operations exist:

- `inc`: atomic increment (equivalent to an atomic add of `1`),
- `dec`: atomic decrement (equivalent to an atomic sub of `1`),
- `add`: atomic add,
- `sub`: atomic sub,
- `or`: atomic bitwise or,
- `xor`: atomic bitwise xor,
- `and`: atomic bitwise and,
- `andnot`: atomic bitwise andnot (equivalent to an atomic and of `~value`),
- `min`: atomic min,
- `max`: atomic max.

For any such operation, two variants exist:

- `os_atomic_${op}_orig` (for example `os_atomic_add_orig`)
  which returns the value stored at the specified location
  *before* the atomic operation took place,
- `os_atomic_${op}` (for example `os_atomic_add`) which
  returns the value stored at the specified location
  *after* the atomic operation took place.

This convention is picked for two reasons:

1. `os_atomic_add(p, value, ...)` is essentially equivalent to the C
   in place addition `(*p += value)`, which returns the result of the
   operation and not the original value of `*p`.

2. Most subtle atomic algorithms do actually require the original value
   stored at the location, especially for bit manipulations:
   `(os_atomic_or_orig(p, bit, relaxed) & bit)` will atomically perform
   `*p |= bit` but also tell you whether `bit` was set in the original value.

   Making it more explicit that the original value is used is hence
   important for readers and worth the extra five keystrokes.

Typically:

```c
static int _Atomic i = 0;

printf("%d\n", os_atomic_inc_orig(&i)); // prints 0
printf("%d\n", os_atomic_inc(&i)); // prints 2
```

### Atomic swap / compare and swap

`os_atomic_xchg` is a simple wrapper around `atomic_exchange_explicit`.

There are two variants of `os_atomic_cmpxchg` which are wrappers around
`atomic_compare_exchange_strong_explicit`. Both of these variants will
return false/0 if the compare exchange failed, and true/1 if the expected
value was found at the specified location and the new value was stored.

1. `os_atomic_cmpxchg(address, expected, new_value, mem_order)`, which
   atomically stores `new_value` at `address` if the current value
   is equal to `expected`.

2. `os_atomic_cmpxchgv(address, expected, new_value, orig_value, mem_order)`,
   which takes an extra `orig_value` argument: it must be a pointer to a local
   variable, and is filled with the value found at `address` whether the
   compare exchange was successful or not. In case of success, the loaded value
   will always be `expected`; in case of failure, it will be the current value,
   which is helpful to redrive compare exchange loops.

Unlike `atomic_compare_exchange_strong_explicit`, a single ordering is
specified, which only takes effect in case of a successful compare exchange.
In C11 speak, `os_atomic_cmpxchg*` always specifies `memory_order_relaxed`
for the failure case ordering, as it is what is used most of the time.

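For example, claiming exclusive ownership of a resource could look like this
(a minimal sketch with hypothetical names):

```c
#include <stdbool.h>
#include <os/atomic_private.h>

static int _Atomic owner;	/* 0 means unowned */

bool
try_claim(int my_id)
{
	/* succeeds only if `owner` still holds 0 */
	return os_atomic_cmpxchg(&owner, 0, my_id, acquire);
}

bool
try_claim_report(int my_id, int *current_owner)
{
	/* on failure, `*current_owner` is filled with whoever beat us to it */
	return os_atomic_cmpxchgv(&owner, 0, my_id, current_owner, acquire);
}
```
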
There is no wrapper around `atomic_compare_exchange_weak_explicit`,
as `os_atomic_rmw_loop` offers a much better alternative for CAS-loops.

### `os_atomic_rmw_loop`

This expressive and versatile construct allows for terse and much more
readable compare exchange loops. It also uses LL/SC constructs more
efficiently than a compare exchange loop would allow.

Instead of a typical CAS-loop in C11:

```c
int _Atomic *address;
int old_value, new_value;
bool success = false;

old_value = atomic_load_explicit(address, memory_order_relaxed);
do {
	if (!validate(old_value)) {
		break;
	}
	new_value = compute_new_value(old_value);
	success = atomic_compare_exchange_weak_explicit(address, &old_value,
	    new_value, memory_order_acquire, memory_order_relaxed);
} while (__improbable(!success));
```

`os_atomic_rmw_loop` allows this form:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
	if (!validate(old_value)) {
		os_atomic_rmw_loop_give_up(break);
	}
	new_value = compute_new_value(old_value);
});
```

Unlike the C11 variant, it lets the reader know in program order that this will
be a CAS loop, and exposes the ordering upfront, while for traditional CAS loops
one has to jump to the end of the code to understand what it does.

Any control flow that attempts to exit the scope of the loop needs to be
wrapped with `os_atomic_rmw_loop_give_up` (so that LL/SC architectures can
abort their opened LL/SC transaction).

Because these loops are LL/SC transactions, it is undefined to perform
any store to memory (register operations are fine) within these loops,
as these may cause the store-conditional to always fail.
In particular, nesting of `os_atomic_rmw_loop` is invalid.

Use of `continue` within an `os_atomic_rmw_loop` is also invalid; instead, use
`os_atomic_rmw_loop_give_up(goto again)` jumping to an `again:` label placed
before the loop, as in this example:

```c
int _Atomic *address;
int old_value, new_value;
bool success;

again:
success = os_atomic_rmw_loop(address, old_value, new_value, acquire, {
	if (needs_some_store_that_can_thwart_the_transaction(old_value)) {
		os_atomic_rmw_loop_give_up({
			// Do whatever you need to do/store to central memory
			// that would cause the loop to always fail
			do_my_rmw_loop_breaking_store();

			// And only then redrive.
			goto again;
		});
	}
	if (!validate(old_value)) {
		os_atomic_rmw_loop_give_up(break);
	}
	new_value = compute_new_value(old_value);
});
```

### The *dependency* memory order

Because the C11 *consume* memory order is broken in various ways,
most compilers, clang included, implement it as an equivalent
of `memory_order_acquire`. However, the underlying concept is useful
for certain algorithms.

As an attempt to provide a replacement for this, `<os/atomic_private.h>`
implements an entirely new *dependency* memory ordering.

The purpose of this ordering is to provide a relaxed load followed by an
implicit compiler barrier, which can be used as the root of a chain of hardware
dependencies that pairs with store-releases done at this address,
very much like the *consume* memory order is intended to provide.

However, unlike the *consume* memory ordering, where the compiler has to track
the dependencies, the *dependency* memory ordering relies on explicit
annotations of where dependencies are expected:

- loads through a pointer loaded with a *dependency* memory ordering
  will provide a hardware dependency,

- dependencies may be injected into other loads not performed through this
  particular pointer with the `os_atomic_load_with_dependency_on` and
  `os_atomic_inject_dependency` interfaces.

Here is an example of how it is meant to be used:

```c
struct foo {
	long value;
	long _Atomic flag;
};

void
publish(struct foo *p, long value)
{
	p->value = value;
	os_atomic_store(&p->flag, 1, release);
}


bool
broken_read(struct foo *p, long *value)
{
	/*
	 * This isn't safe, as there's absolutely no hardware dependency involved.
	 * Using an acquire barrier would of course fix it but is quite expensive...
	 */
	if (os_atomic_load(&p->flag, relaxed)) {
		*value = p->value;
		return true;
	}
	return false;
}

bool
valid_read(struct foo *p, long *value)
{
	long flag = os_atomic_load(&p->flag, dependency);
	if (flag) {
		/*
		 * Further the chain of dependency to any loads through `p`
		 * which properly pair with the release barrier in `publish`.
		 */
		*value = os_atomic_load_with_dependency_on(&p->value, flag);
		return true;
	}
	return false;
}
```

There are four interfaces involved with hardware dependencies:

1. `os_atomic_load(..., dependency)` to initiate roots of hardware dependencies,
   which should pair with a store or RMW operation with release semantics or
   stronger (release, acq\_rel or seq\_cst),

2. `os_atomic_inject_dependency` to inject the dependency provided by a
   *dependency* load, or by any other value that has had a dependency
   injected,

3. `os_atomic_load_with_dependency_on` to perform an otherwise relaxed load
   that still prolongs a dependency chain,

4. `os_atomic_make_dependency` to create an opaque token out of a given
   dependency root, to inject into multiple loads.


**Note**: this technique is NOT safe when the compiler can reason about the
pointers that you are manipulating, for example if the compiler can know that
the pointer can only take a couple of values and ditch all these manually
crafted dependency chains. Hopefully there will be a future C2Y standard that
provides a similar construct as a language feature instead.