Commit | Line | Data |
---|---|---|
f427ee49 A |
1 | # XNU General purpose allocators |
2 | ||
3 | ## Introduction | |
4 | ||
5 | XNU provides two ways to allocate memory: | |
6 | - the VM subsystem that provides allocations at the granularity of pages (with | |
7 | `kernel_memory_allocate` and similar interfaces); | |
8 | - the zone allocator subsystem (`<kern/zalloc.h>`) which is a slab-allocator of | |
9 | objects of fixed size. | |
10 | ||
11 | This document describes all the allocator variants around the zone allocator, | |
12 | how to use them and what their security model is. | |
13 | ||
14 | In addition to that, `<kern/kalloc.h>` provides a variable-size general purpose | |
15 | allocator implemented as a collection of zones of fixed size, and overflowing to | |
16 | `kernel_memory_allocate` for allocations larger than a few pages (32KB when this | |
17 | document was being written but this is subject to change/tuning in the future). | |
18 | ||
19 | ||
20 | The Core Kernel allocators rely on the following headers: | |
21 | - `<kern/zalloc.h>` and `<kern/kalloc.h>` for their API surface, which most | |
22 | clients should find sufficient, | |
23 | - `<kern/zalloc_internal.h>` and `<kern/zcache_internal.h>` for interfaces that | |
24 | need to be exported for introspection and implementation purposes, and are not | |
25 | meant for general consumption. | |
26 | ||
27 | ## TL;DR | |
28 | ||
29 | This section gives a quick decision tree for choosing an allocation method, | |
30 | along with general best practices. The rest of the document goes into more detail | |
31 | and explains the rationale behind these | |
32 | recommendations. | |
33 | ||
34 | ### Which allocator to use, and other advice | |
35 | ||
36 | 1. If you are allocating memory that is never freed, use `zalloc_permanent*`. If | |
37 | the allocation is larger than a page, then it will use | |
38 | `kernel_memory_allocate` with the `KMA_PERMANENT` flag on your behalf. | |
39 | The allocation is assumed to always succeed (this is mostly reserved for early | |
40 | allocations that need to scale with the configuration of the machine and | |
41 | cannot be decided at compile time), and will be zeroed. | |
42 | ||
43 | 2. If the memory you are allocating is temporary and will not escape the scope | |
44 | of the syscall it's used for, use `kheap_alloc` and `kheap_free` with the | |
45 | `KHEAP_TEMP` heap. Note that temporary paths should use `zalloc(ZV_NAMEI)`. | |
46 | ||
47 | 3. If the memory you are allocating will not hold pointers, and even more so | |
48 | when the content of that piece of memory can be directly influenced by | |
49 | user-space, then use `kheap_alloc` and `kheap_free` with the | |
50 | `KHEAP_DATA_BUFFERS` heap. | |
51 | ||
52 | 4. In general we prefer zalloc or kalloc interfaces, and would like to abandon | |
53 | any legacy MALLOC/FREE interfaces over time. | |
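
The decision tree above can be summarized as a small C function. This is only a sketch that runs in user space: `struct alloc_traits`, `enum alloc_choice` and `choose_allocator()` are hypothetical names introduced for illustration and are not part of XNU. The order of the checks is also an assumption based on the order of the rules above.

```c
#include <stdbool.h>

/* Hypothetical description of an allocation site; not an XNU type. */
struct alloc_traits {
    bool never_freed;     /* the memory lives as long as the kernel does */
    bool syscall_scoped;  /* the memory is freed before the syscall returns */
    bool is_path;         /* the memory is a temporary path buffer */
    bool holds_pointers;  /* the payload contains kernel pointers */
};

/* Each value names the interface recommended by the list above. */
enum alloc_choice {
    USE_ZALLOC_PERMANENT,
    USE_ZV_NAMEI,
    USE_KHEAP_TEMP,
    USE_KHEAP_DATA_BUFFERS,
    USE_ZALLOC_OR_KALLOC,
};

static enum alloc_choice
choose_allocator(const struct alloc_traits *t)
{
    if (t->never_freed) {
        return USE_ZALLOC_PERMANENT;            /* rule 1 */
    }
    if (t->syscall_scoped) {
        /* rule 2: temporary paths go to the ZV_NAMEI zone view */
        return t->is_path ? USE_ZV_NAMEI : USE_KHEAP_TEMP;
    }
    if (!t->holds_pointers) {
        return USE_KHEAP_DATA_BUFFERS;          /* rule 3 */
    }
    return USE_ZALLOC_OR_KALLOC;                /* rule 4 */
}
```

For example, a pointer-free buffer that outlives the syscall that created it maps to `USE_KHEAP_DATA_BUFFERS`.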
54 | ||
55 | For all `kalloc` or `kheap_alloc` variants, this advice applies: | |
56 | ||
57 | - If your allocation has a fixed, sub-page size and is done with | |
58 | `Z_WAITOK` semantics (allocation can block), consider adding `Z_NOFAIL`; | |
59 | - If you `bzero` the memory on allocation, prefer passing `Z_ZERO`, which can be | |
60 | optimized away more often than not. | |
61 | ||
62 | ### Considerations for zones | |
63 | ||
64 | Performance-wise, it is problematic to make a zone when the kernel tends to have | |
65 | fewer than several pages' worth of elements allocated at all times (think commonly | |
66 | 200k+ objects). When a zone is underutilized, fragmentation becomes a | |
67 | problem. | |
68 | ||
69 | Zones with a really high traffic of allocation and frees should consider using | |
70 | zone caching, but this comes at a memory usage cost and needs to be evaluated. | |
71 | ||
72 | Security-wise, the following questions need answering: | |
73 | - Is this type "interesting" to confuse with another? If yes, having a separate | |
74 | zone allows for usage of `zone_require()` and will by default sequester the | |
75 | virtual address space; | |
76 | - Is this type holding user "bytes"? If yes, then it might be interesting to use | |
77 | a zone view (like the `ZV_NAMEI` one for paths) instead; | |
78 | - Is the type zeroed on allocation all the time? If yes, enabling | |
79 | `ZC_ZFREE_CLEARMEM` will likely add only a marginal incremental cost and can | |
80 | discover write-after-free bugs. | |
81 | ||
82 | ## Variants | |
83 | ||
84 | There are several allocation wrappers in XNU, present for various reasons | |
85 | ranging from additional accounting features (IOKit's `IONew`) and conformance to | |
86 | language requirements (the various C++ `new` operators) to organic historical | |
87 | reasons. | |
88 | ||
89 | `zalloc` and `kalloc` are considered the primitive allocation interfaces which | |
90 | are used to implement all the other ones. The following table documents all | |
91 | interfaces and their various properties. | |
92 | ||
93 | <table> | |
94 | <tr> | |
95 | <th>Interface</th> | |
96 | <th>Core XNU</th> | |
97 | <th>Private Export</th> | |
98 | <th>Public Export</th> | |
99 | <th>Comments</th> | |
100 | </tr> | |
101 | <tr><th colspan="5">Core primitives</th></tr> | |
102 | <tr> | |
103 | <th>zalloc</th> | |
104 | <td>Yes</td> | |
105 | <td>Yes</td> | |
106 | <td>No</td> | |
107 | <td> | |
108 | Because of their implementation, the number of zones is limited. | |
109 | ||
110 | Until this limitation is lifted, general exposure to arbitrary | |
111 | kernel extensions is problematic. | |
112 | </td> | |
113 | </tr> | |
114 | <tr> | |
115 | <th>kheap_alloc</th> | |
116 | <td>Yes</td> | |
117 | <td>No</td> | |
118 | <td>No</td> | |
119 | <td> | |
120 | This is the true core implementation of `kalloc`, see documentation about | |
121 | kalloc heaps. | |
122 | </td> | |
123 | </tr> | |
124 | <tr> | |
125 | <th>kalloc</th> | |
126 | <td>Yes</td> | |
127 | <td>Yes, Redirected</td> | |
128 | <td>No</td> | |
129 | <td> | |
130 | In XNU, `kalloc` is equivalent to `kheap_alloc(KHEAP_DEFAULT)`. | |
131 | <br /> | |
132 | In kernel extensions, `kalloc` is equivalent to `kheap_alloc(KHEAP_KEXT)`. | |
133 | <br /> | |
134 | Due to legacy contracts where allocation and deallocation happen on | |
135 | different sides of the XNU/Kext boundary, `kfree` allows freeing to | |
136 | either heap. New code should consider using the proper `kheap_*` variant | |
137 | instead. | |
138 | </td> | |
139 | </tr> | |
140 | ||
141 | <tr><th colspan="5">Popular wrappers</th></tr> | |
142 | <tr> | |
143 | <th>IOMalloc</th> | |
144 | <td>Yes</td> | |
145 | <td>Yes, Redirected</td> | |
146 | <td>Yes, Redirected</td> | |
147 | <td> | |
148 | `IOMalloc` is a straight wrapper around `kalloc` and behaves like | |
149 | `kalloc`. It does provide some debugging features integrated with `IOKit` | |
150 | and is the allocator that Drivers should use. | |
151 | <br/> | |
152 | Only kernel extensions that are providing core infrastructure | |
153 | (filesystems, sandbox, ...) and are out-of-tree core kernel components | |
154 | should use the primitive `zalloc` or `kalloc` directly. | |
155 | </td> | |
156 | </tr> | |
157 | <tr> | |
158 | <th>C++ new</th> | |
159 | <td>Yes</td> | |
160 | <td>Yes, Redirected</td> | |
161 | <td>Yes, Redirected</td> | |
162 | <td> | |
163 | C++'s various operators around `new` and `delete` are implemented by XNU. | |
164 | It redirects to the `KHEAP_KEXT` kalloc heap as there is no use of C++ | |
165 | default operator new in Core Kernel. | |
166 | <br/> | |
167 | When creating a subclass of `OSObject` with the corresponding IOKit macros, an | |
168 | `operator new` and an `operator delete` are provided for this object that will | |
169 | anchor this type to the `KHEAP_DEFAULT` heap when the class is defined in | |
170 | Core XNU, or to the `KHEAP_KEXT` heap when the class is defined in a | |
171 | kernel extension. | |
172 | </td> | |
173 | </tr> | |
174 | <tr> | |
175 | <th>MALLOC</th> | |
176 | <td>Yes</td> | |
177 | <td>Obsolete, Redirected</td> | |
178 | <td>No</td> | |
179 | <td> | |
180 | This is a legacy BSD interface that functions mostly like `kalloc`. | |
181 | For kexts, `FREE()` allows freeing to either `KHEAP_DEFAULT` or | |
182 | `KHEAP_KEXT` due to legacy interfaces that allocate on one side of the | |
183 | kext/core kernel boundary and free on the other. | |
184 | </td> | |
185 | </tr> | |
186 | ||
187 | <tr><th colspan="5">Obsolete wrappers</th></tr> | |
188 | <tr> | |
189 | <th>mcache</th> | |
190 | <td>Yes</td> | |
191 | <td>Kinda</td> | |
192 | <td>Kinda</td> | |
193 | <td> | |
194 | The mcache/mbuf subsystem is mostly used by the BSD networking subsystem. | |
195 | Code that is not interacting with these interfaces should not adopt | |
196 | mcaches. | |
197 | </td> | |
198 | </tr> | |
199 | <tr> | |
200 | <th>OSMalloc</th> | |
201 | <td>No</td> | |
202 | <td>Obsolete, Redirected</td> | |
203 | <td>Obsolete, Redirected</td> | |
204 | <td> | |
205 | `<libkern/OSMalloc.h>` is a legacy subsystem that is no longer | |
206 | recommended. It provides extremely slow and non scalable accounting | |
207 | and no new code should use it. `IOMalloc` should be used instead. | |
208 | </td> | |
209 | </tr> | |
210 | <tr> | |
211 | <th>MALLOC_ZONE</th> | |
212 | <td>No</td> | |
213 | <td>Obsolete, Redirected</td> | |
214 | <td>No</td> | |
215 | <td> | |
216 | `MALLOC_ZONE` used to be a weird wrapper around `zalloc` but with poorer | |
217 | security guarantees. It has been completely removed from XNU and should | |
218 | not be used. | |
219 | <br/> | |
220 | For backward compatibility reasons, it is still exported, but behaves | |
221 | exactly like `MALLOC` otherwise. | |
222 | </td> | |
223 | </tr> | |
224 | <tr> | |
225 | <th>kern_os_*</th> | |
226 | <td>No</td> | |
227 | <td>Obsolete, Redirected</td> | |
228 | <td>Obsolete, Redirected</td> | |
229 | <td> | |
230 | These symbols used to back the implementation of C++ `operator new` and | |
231 | are only kept for backward compatibility reasons. Those should not be used | |
232 | by anyone directly. | |
233 | </td> | |
234 | </tr> | |
235 | </table> | |
236 | ||
237 | ||
238 | ## The Zone allocator: concepts, performance and security | |
239 | ||
240 | Zones are created with `zone_create()`, and really meant never to be destroyed. | |
241 | Destructible zones are here for legacy reasons, and not all features are | |
242 | available to them. | |
243 | ||
244 | Zones allocate their objects from a specific fixed size map called the Zone Map. | |
245 | This map is subdivided in a few submaps that provide different security | |
246 | properties: | |
247 | ||
248 | - the VA Restricted map: it is used by the VM subsystem only, and allows for | |
249 | extremely tight packing of pointers used by the VM subsystem. This submap | |
250 | doesn't use sequestering. | |
251 | - the general map: it is used by default by zones, and on embedded | |
252 | defaults to using full VA sequestering (see below). | |
253 | - the "bag of bytes" map: it is used for zones that provide various buffers | |
254 | whose content is under the control of user-space. Segregating these | |
255 | allocations from the other submaps closes attacks using such allocations to | |
256 | spray kernel objects that live in the general map. | |
257 | ||
258 | It is worth noting that use of any allocation function in interrupt context is | |
259 | never allowed in XNU, as none of our allocators are re-entrant and interrupt | |
260 | safe. | |
261 | ||
262 | ### Basic features | |
263 | ||
264 | `<kern/zalloc.h>` defines several flags that can be used to alter the blocking | |
265 | behavior of `zalloc` and `kalloc`: | |
266 | ||
267 | - `Z_NOWAIT` can be used to require a fully non blocking behavior, which can be | |
268 | used for allocations under spinlock and other preemption disabled contexts; | |
269 | - `Z_NOPAGEWAIT` allows for the allocator to block (typically on mutexes), | |
270 | but not to wait for available pages if there are none; | |
271 | - `Z_WAITOK` means that the zone allocator can wait and block. | |
272 | ||
273 | It is worth noting that unless the zone is exhaustible or "special" (which is | |
274 | mostly the case for VM zones), then `zalloc` will never fail (but might block | |
275 | for arbitrarily long if the zone map is under a lot of pressure). This is not | |
276 | true of `kalloc` when the allocation is served by the VM. | |
277 | ||
278 | It is worth noting that `Z_ZERO` is provided so that the allocation returned by | |
279 | the allocator is always zeroed. This should be used instead of manual usage of | |
280 | `bzero` as the zone allocator is able to optimize it away when certain security | |
281 | features that already guarantee the zeroing are engaged. | |
282 | ||
283 | ||
284 | ### Zone Caching | |
285 | ||
286 | Zones that have relatively fast allocation/deallocation patterns can use zone | |
287 | caching (passing `ZC_CACHING` to `zone_create()`). This enables per-CPU caches, | |
288 | which hold onto several allocations per CPU. This should not be done lightly, | |
289 | especially for zones holding onto large elements. | |
290 | ||
291 | ### Type confusion (Zone Sequestering and `zone_require()`) | |
292 | ||
293 | In order to be slightly more resilient to Use after Free (UaF) bugs, XNU | |
294 | provides two techniques: | |
295 | ||
296 | - using the `ZC_SEQUESTER` flag to `zone_create()`; | |
297 | - manual use of `zone_require()` or `zone_id_require()`. | |
298 | ||
299 | The first form will cause the virtual address ranges that a given zone uses | |
300 | to never be returned to the system, which essentially pins this address range | |
301 | for holding allocations of this particular zone forever. When a zone is strongly | |
302 | typed, it means that only objects of that particular type can ever be located | |
303 | at this address. | |
304 | ||
305 | `zone_require()` is an interface that can be used prior to memory use to assert | |
306 | that the memory belongs to a given zone. | |
307 | ||
308 | Both these techniques can be used to dramatically reduce type confusion bugs. | |
309 | For example, the task zone uses both sequestering and judicious usage of | |
310 | `zone_require()` in crucial parts which makes faking a `task_t` and using it | |
311 | to confuse the kernel extremely difficult. | |
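
To make the idea behind sequestering and `zone_require()` concrete, here is a toy user-space model. It is a sketch under stated assumptions, not XNU's implementation: `struct toy_zone` and `toy_zone_contains()` are hypothetical names, the zone is modeled as a single contiguous range, and the check returns a boolean, whereas `zone_require()` is described above as an assertion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy zone: one fixed address range reserved for elements of one size. */
struct toy_zone {
    uintptr_t base;        /* start of the range pinned to this zone */
    size_t    elem_size;   /* fixed element size */
    size_t    elem_count;  /* number of element slots in the range */
};

/* Returns true only when ptr is the start of an element slot of this zone. */
static bool
toy_zone_contains(const struct toy_zone *z, const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t end = z->base + z->elem_size * z->elem_count;

    if (addr < z->base || addr >= end) {
        return false;  /* outside the range owned by the zone */
    }
    /* inside the range: also reject pointers into the middle of an element */
    return (addr - z->base) % z->elem_size == 0;
}
```

Because the range only ever holds elements of one type, a pointer that passes this check cannot be an object of another type, which is what makes a forged object hard to substitute.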
312 | ||
313 | When `zone_require()` can be used exhaustively in choke points, then | |
314 | sequestering is no longer necessary to protect this type. For example, users of | |
315 | `ipc_port_t` will take the `ip_lock()` or an `ip_reference()` prior to any | |
316 | interesting use. These primitives have been extended to include a | |
317 | `zone_id_require()` (the fastest existing form of `zone_require()`) which gives | |
318 | us an exhaustive protection. As a result, it allows us not to sequester the | |
319 | ports zone. This is interesting because userspace can cause spikes of | |
320 | allocations of ports, and not sequestering protects us from zone map exhaustion or, | |
321 | more generally, from the increased cost of describing the sequestered address space of this zone | |
322 | due to a high peak usage. | |
323 | ||
324 | ### Usage of Zones in IOKit | |
325 | ||
326 | IOKit is a subsystem that is often used by attackers, and reducing type | |
327 | confusion attacks against it is desirable. For this purpose, XNU exposes the | |
328 | ability to back a class with its own zone rather than allocating it from a kalloc heap. | |
329 | ||
330 | Using the `OSDefineMetaClassAndStructorsWithZone` or any other | |
331 | `OSDefineMetaClass.*WithZone` interface will cause the object's `operator new` | |
332 | and `operator delete` to back the storage of these objects with zones. This is | |
333 | available to first party kexts, and usage should be reserved for types that can | |
334 | easily be allocated by user-space and in quantities large enough that the | |
335 | induced fragmentation is acceptable. | |
336 | ||
337 | ### Auto-zeroing | |
338 | ||
339 | A lot of bugs come from partially initialized data, or write-after-free. | |
340 | To mitigate these issues, zones provide two levels of protection: | |
341 | ||
342 | - page clearing | |
343 | - element clear on free (`ZC_ZFREE_CLEARMEM`). | |
344 | ||
345 | Page clearing is used when new pages are added to the zone. The original version | |
346 | of the zone allocator would cram pages into zones without changing their | |
347 | content. Memory crammed into a zone is now cleared of its content. | |
348 | This helps mitigate leaking/using uninitialized data. | |
349 | ||
350 | Element clear on free is an increased protection that causes `zfree()` to erase | |
351 | the content of elements when they are returned to the zone. When an element is | |
352 | allocated from a zone with this property set, then the allocator will check that | |
353 | the element hasn't been tampered with before it is handed back. This is | |
354 | particularly interesting when the allocation codepath always clears the returned | |
355 | element: when using the `Z_ZERO` (resp. `M_ZERO`) with `zalloc` or `kalloc` | |
356 | (resp. `MALLOC`), then the zone allocator knows not to issue this extraneous | |
357 | zeroing. | |
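
The clear-on-free and check-on-allocation behavior described above can be illustrated with a one-element toy model. This is a user-space sketch only: `struct toy_slot`, `toy_free()` and `toy_alloc()` are hypothetical names, and the toy reports tampering by returning `NULL`, which is an assumption made for testability and says nothing about how XNU reacts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_ELEM_SIZE 32

/* Toy one-element "zone": a single slot plus a flag saying whether it is free. */
struct toy_slot {
    unsigned char bytes[TOY_ELEM_SIZE];
    bool          is_free;
};

static void
toy_slot_init(struct toy_slot *s)
{
    memset(s->bytes, 0, sizeof(s->bytes));
    s->is_free = true;
}

/* Free: erase the element so that stale contents do not survive. */
static void
toy_free(struct toy_slot *s)
{
    memset(s->bytes, 0, sizeof(s->bytes));
    s->is_free = true;
}

/*
 * Allocate: check that the element is still all zero. A non-zero byte
 * means something wrote to it after it was freed. Returns NULL when the
 * slot is in use or was tampered with.
 */
static void *
toy_alloc(struct toy_slot *s)
{
    if (!s->is_free) {
        return NULL;
    }
    for (size_t i = 0; i < sizeof(s->bytes); i++) {
        if (s->bytes[i] != 0) {
            return NULL;  /* write-after-free detected */
        }
    }
    s->is_free = false;
    return s->bytes;  /* already zeroed, so no extra zeroing is needed */
}
```

The last comment is the point made above about `Z_ZERO`: an element that was cleared on free and verified on allocation is already zero when it is handed out.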
358 | ||
359 | `ZC_ZFREE_CLEARMEM` at the time this document was written was default for any | |
360 | zone where elements are smaller than 2 cachelines. This technique is | |
361 | particularly interesting because the valid states of things such as locks, refcounts or pointers | |
362 | can't be all zero. It makes exploitation of a use-after-free more | |
363 | difficult when this is engaged. | |
364 | ||
365 | ### Poisoning | |
366 | ||
367 | The zone allocator also does statistical poisoning (see source for details). | |
368 | ||
369 | It also always zeroes the first 2 cachelines of any allocation on free, when | |
370 | `ZC_ZFREE_CLEARMEM` isn't engaged. It sometimes mitigates certain kinds of linear | |
371 | buffer overflows. It also can be leveraged by types that have refcounts or locks | |
372 | if those are placed "early" in the type definition, as zero is not a valid value | |
373 | for such concepts. | |
374 | ||
375 | ### Per-CPU allocations | |
376 | ||
377 | The zone allocator provides `ZC_PERCPU` as a way to declare a per-cpu zone. | |
378 | Allocations from this zone return NCPU elements with a known stride. | |
379 | ||
380 | It is expected that such allocations are not performed in a rapid pattern, and | |
381 | zone caching is not available for them (zone caching is actually implemented | |
382 | on top of a per-cpu zone). | |
383 | ||
384 | Usage of per-cpu zone should be limited to extremely performance sensitive | |
385 | codepaths or global counters due to the enormous amplification factor on | |
386 | many-core systems. | |
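
The "NCPU elements with a known stride" layout can be pictured with the following user-space sketch. All names (`struct toy_percpu`, `toy_percpu_alloc()`, `toy_percpu_slot()`) are hypothetical, and the 64-byte stride rounding is an assumption chosen for the example. The stride XNU actually uses is not described here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy per-CPU allocation: ncpu copies of one element, stride bytes apart. */
struct toy_percpu {
    unsigned char *base;
    size_t         stride;
    unsigned       ncpu;
};

/* Returns 0 on success, -1 when the backing memory cannot be obtained. */
static int
toy_percpu_alloc(struct toy_percpu *p, size_t elem_size, unsigned ncpu)
{
    /* Assumption: round the stride up to a multiple of 64 bytes. */
    size_t stride = (elem_size + 63) & ~(size_t)63;

    p->base = calloc(ncpu, stride);  /* one zeroed slot per CPU */
    if (p->base == NULL) {
        return -1;
    }
    p->stride = stride;
    p->ncpu = ncpu;
    return 0;
}

/* Address of the slot that belongs to a given CPU, or NULL if out of range. */
static void *
toy_percpu_slot(const struct toy_percpu *p, unsigned cpu)
{
    return cpu < p->ncpu ? p->base + (size_t)cpu * p->stride : NULL;
}

static void
toy_percpu_free(struct toy_percpu *p)
{
    free(p->base);
    p->base = NULL;
}
```

The total footprint is `ncpu * stride` bytes for every logical element, which is the amplification factor mentioned above.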
387 | ||
388 | ### Permanent allocations | |
389 | ||
390 | The kernel sometimes needs to provide persistent allocations that depend on | |
391 | parameters that aren't compile time constants, but will not vary over time (NCPU | |
392 | is an obvious example here). | |
393 | ||
394 | The zone subsystem provides a `zalloc_permanent*` family of functions that help | |
395 | allocate memory in such a fashion in a very compact way. | |
396 | ||
397 | Unlike the typical zone allocators, this allows for arbitrary sizes, in a | |
398 | similar fashion to `kalloc`. These functions will never fail (if the allocation | |
399 | fails, the kernel will panic), and always return zeroed memory. Trying to free | |
400 | these allocations results in a kernel panic. | |
401 | ||
402 | ||
403 | ## kalloc: a heap of zones | |
404 | ||
405 | Kalloc is a general malloc-like allocator that is backed by zones when the size | |
406 | of the allocation is sub-page (actually smaller than 32K at the time this | |
407 | document was written, but under KASAN or other memory debugging techniques, this | |
408 | limit for the usable payload might actually be lower). Larger allocations use | |
409 | `kernel_memory_allocate` (KMA). | |
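
The split between zone-backed and VM-backed requests can be sketched as a size-class lookup. This is an illustration under loud assumptions: `toy_kalloc_size_class()` is a hypothetical helper, the power-of-two classes and the 16-byte minimum are invented for the example, and treating a request of exactly 32K as zone-backed is a guess. Only the 32K figure comes from the text above, and it is subject to tuning.

```c
#include <stddef.h>

/* Assumed limit, taken from the 32K figure quoted above. */
#define TOY_KALLOC_MAX_ZONE_SIZE ((size_t)32 * 1024)
/* Assumed smallest size class; invented for this example. */
#define TOY_KALLOC_MIN_SIZE      ((size_t)16)

/*
 * Returns the size class that would serve a request, or 0 when the
 * request is too large for zones and would be served by the VM (KMA).
 * Assumption: classes are powers of two; the real classes may differ.
 */
static size_t
toy_kalloc_size_class(size_t size)
{
    size_t cls = TOY_KALLOC_MIN_SIZE;

    if (size > TOY_KALLOC_MAX_ZONE_SIZE) {
        return 0;  /* falls through to kernel_memory_allocate */
    }
    while (cls < size) {
        cls <<= 1;  /* next power-of-two class */
    }
    return cls;
}
```

A request of 1000 bytes lands in the 1024-byte class in this model, while a request one byte over the limit is handed to the VM.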
410 | ||
411 | The kernel calls the collection of zones that back kalloc a "kalloc heap", and | |
412 | provides 3 builtin ones: | |
413 | ||
414 | - `KHEAP_DEFAULT`, the "default" heap, is the one that serves `kalloc` in Core | |
415 | Kernel (XNU proper); | |
416 | - `KHEAP_KEXT`, the kernel extension heap, is the one that serves `kalloc` in | |
417 | kernel extensions (see "redirected" symbols in the Variants table above); | |
418 | - `KHEAP_DATA_BUFFERS` which is a special heap, which allocates out of the "User | |
419 | Data" submap, and is meant for allocation of payloads that hold no pointer and | |
420 | tend to be under the control of user space (paths, pipe buffers, OSData | |
421 | backing stores, ...). | |
422 | ||
423 | In addition to that, the kernel provides an extra "magical" kalloc heap: | |
424 | `KHEAP_TEMP`, it is for all purposes an alias of `KHEAP_DEFAULT` but enforces | |
425 | extra semantics: allocations and deallocations out of this heap must be | |
426 | performed "in scope". It is meant for allocations that are made to support a | |
427 | syscall, and that will be freed before that syscall returns to user-space. | |
428 | ||
429 | The usage of `KHEAP_TEMP` will ensure that there is no outstanding allocation at | |
430 | various points (such as return-to-userspace) and will panic the system if this | |
431 | property is broken. The `kheap_temp_debug=1` boot-arg can be used on development | |
432 | kernels to debug such issues when they occur. | |
433 | ||
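The "in scope" rule enforced for `KHEAP_TEMP` can be modeled with a simple counter of outstanding allocations. This is a user-space sketch with hypothetical names (`struct toy_thread`, `toy_temp_alloc()`, `toy_temp_free()`, `toy_return_to_user_ok()`). How XNU tracks these allocations is not shown here, and the toy returns a boolean where the kernel is described as panicking.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy per-thread bookkeeping for temporary allocations. */
struct toy_thread {
    unsigned temp_outstanding;  /* temporary allocations not yet freed */
};

static void *
toy_temp_alloc(struct toy_thread *t, size_t size)
{
    void *p = malloc(size);

    if (p != NULL) {
        t->temp_outstanding++;  /* the allocation is now "in scope" */
    }
    return p;
}

static void
toy_temp_free(struct toy_thread *t, void *p)
{
    if (p != NULL) {
        t->temp_outstanding--;
        free(p);
    }
}

/* Check made at "return to user space": true when nothing is outstanding. */
static bool
toy_return_to_user_ok(const struct toy_thread *t)
{
    return t->temp_outstanding == 0;
}
```

A syscall that allocates and frees within its own scope leaves the counter at zero, so the check passes. Forgetting the free makes the check fail at the boundary.
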
434 | As far as security policies are concerned, the default and kext heaps are fully | |
435 | segregated per size-class. The data buffers heap is isolated in the user data | |
436 | submaps, and hence can never produce addresses aliasing with any other kind of | |
437 | allocations in the system. | |
438 | ||
439 | ||
440 | ## Accounting (Zone Views and Kalloc Heap Aliases) | |
441 | ||
442 | The zone subsystem provides several accounting properties that are reported by | |
443 | the `zprint(1)` command. Historically, some zones have been introduced to help | |
444 | with accounting, at the cost of increased fragmentation (the more allocations | |
445 | are issued from the same zone, the lower the fragmentation). It is now possible | |
446 | to define zone views and kalloc heap aliases, which are two similar concepts for | |
447 | zones and kalloc heaps respectively. | |
448 | ||
449 | Zone views are declared (in headers) and defined (in modules) with | |
450 | `ZONE_VIEW_DECLARE` and `ZONE_VIEW_DEFINE`, and can be an alias either for | |
451 | another regular zone, or a specific zone of a kalloc heap. This is for example | |
452 | used for the `ZV_NAMEI` zone out of which temporary paths are allocated (this is | |
453 | an alias to the `KHEAP_DATA_BUFFERS` 1024 bytes zone). Extra accounting is | |
454 | issued for these views and is also reported by `zprint(1)`. | |
455 | ||
456 | In a similar fashion, `KALLOC_HEAP_DECLARE` and `KALLOC_HEAP_DEFINE` can be used | |
457 | to declare a kalloc heap alias that gets its own accounting. It is particularly | |
458 | useful to track leaks and various other things. | |
459 | ||
460 | The accounting of zone and heap views isn't free (and has a per-CPU cost) and | |
461 | should be used wisely. However, if the alternative is a fully separated zone, | |
462 | then the memory cost of the accounting would likely be dwarfed by the | |
463 | fragmentation cost of the new zone. | |
464 | ||
465 | At this time, views can only be made by Core Kernel. | |
466 |