fsck_hfs/dfalib/FixDecompsNotes.txt

#
#       File:           fsckFixDecompsNotes.txt
#
#       Contains:       Notes on fsckFixDecomps function and related tools
#
#       Copyright:      © 2002 by Apple Computer, Inc., all rights reserved.
#
#       CVS change log:
#
#               $Log: FixDecompsNotes.txt,v $
#               Revision 1.2  2002/12/20 01:20:36  lindak
#               Merged PR-2937515-2 into ZZ100
#               Old HFS+ decompositions need to be repaired
#
#               Revision 1.1.4.1  2002/12/16 18:55:22  jcotting
#               integrated code from text group (Peter Edberg) that will correct some
#               illegal names created with obsolete Unicode 2.1.2 decomposition rules
#               Bug #: 2937515
#               Submitted by: jerry cottingham
#               Reviewed by: don brady
#
#               Revision 1.1.2.1  2002/10/25 17:15:23  jcotting
#               added code from Peter Edberg that will detect and offer replacement
#               names for file system object names with pre-Jaguar decomp errors
#               Bug #: 2937515
#               Submitted by: jerry cottingham
#               Reviewed by: don brady
#
#               Revision 1.2  2002/10/16 20:17:21  pedberg
#               Add more notes
#
#               Revision 1.1  2002/10/16 06:53:54  pedberg
#               [3066897, 2937515] Start adding notes about this code
#
#       ---------------------------------------------------------------------------

Code provided per Radar #3066897 to fix bug #2937515.

The Unicode decomposition used to date for HFS+ volumes - as described in
  <http://developer.apple.com/technotes/tn/tn1150.html#CanonicalDecomposition>
  <http://developer.apple.com/technotes/tn/tn1150table.html>
- is based on a modified version of the decomposition rules for Unicode 2.1.2
(and even those were not correctly implemented for certain combinations of
multiple combining marks). Unicode has updated the decomposition and combining
mark reordering rules and data many times since then, but locked them down as
of Unicode 3.1, because Unicode 3.1 is the basis of the Unicode normalization
forms such as NFC and NFD. We began supporting these normalization forms in
Jaguar.

Because of this, the Apple Unicode cross-functional committee decided to make a
one-time change, updating the decomposition rules used for HFS+ volumes from the
Unicode 2.1.2 rules to the Unicode 3.1 rules. TEC and the kernel encoding
converters made this change in Jaguar. One other piece that was supposed to
happen was an enhancement to fsck to convert filenames on HFS+ volumes from the
old decomposition to the new.

That fsck change did not happen in Jaguar, and as a result there are bugs such
as 2937515 (in which users see partial garbage in filenames). The update
affects the decomposition of Greek characters - about 80 of them (18 of which
correspond to characters in MacGreek). It also affects the decomposition of a
few others: around 23 Latin-script characters and 18 Cyrillic characters (none
of which correspond to anything in a traditional Mac encoding), 8 Arabic
characters (5 of which do correspond to MacArabic characters), and 16 Indic,
Thai, & Lao characters (3 of which correspond to characters in Mac encodings).
It also potentially affects the ordering of all combining marks.

This directory contains code provided per 3066897 that fsck can use to address
this problem for HFS+ volumes.

----
A. Data structure

The data is organized into a two-level trie. The first level is a table that
maps the high-order 12 bits of a UniChar to an index value. The value is -1 if
no character with those high 12 bits has either a decomposition update or a
nonzero combining class; otherwise, it is an index into an array of ranges that
map the low 4 bits of the UniChar to the necessary data. There are two such
arrays of ranges; one provides the mappings to combining class values, the other
provides the mappings to decomposition update information. The latter is in the
form of an index into an array of sequences that contain an action code, an
optional list of additional characters that need to be matched for a complete
sequence match (in the case where a 2-element or 3-element sequence needs to be
updated), and the replacement decomposition sequence.

There is one additional twist for the first-level trie table. Since the
characters that have class or decomposition data are all in either the range
0x0000-30FF or the range 0xFB00-FFFF, we can save about 3K of space in the table
by eliminating the middle. Before looking up a UTF-16 character in the table, we
first add 0x0500 to it (with 16-bit wraparound); for characters in those two
ranges the resulting shifted UniChar is in the range 0x0000-35FF. So if the
shifted UniChar is >= 0x3600, we don't bother looking in the table.
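
The lookup described above can be sketched roughly as follows. This is only an
illustration, not the generated tables or the code in fsckFixDecomps.c: the
names (gFirstLevel, gClassRanges, setToyClass, initToyTables, combiningClass)
are hypothetical, the int8_t entry type is an assumption, each "range" is
simplified to a plain 16-entry array, and the table is filled with just three
combining marks instead of the data produced by fsckMakeDecompData.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    kShiftOffset   = 0x0500,  /* added to each UniChar, per the notes */
    kShiftLimit    = 0x3600,  /* shifted values at or above this are skipped */
    kToyRangeCount = 3        /* toy data only; the real count differs */
};

/* First level: high 12 bits of the shifted UniChar -> range index, or -1. */
static int8_t  gFirstLevel[kShiftLimit >> 4];
/* Second level: low 4 bits -> combining class (0 means "not combining"). */
static uint8_t gClassRanges[kToyRangeCount][16];

/* Hypothetical helper that records one character in the toy tables. */
static void setToyClass(uint16_t ch, int8_t range, uint8_t cls)
{
    uint16_t shifted = (uint16_t)(ch + kShiftOffset);
    assert(shifted < kShiftLimit && range >= 0 && range < kToyRangeCount);
    gFirstLevel[shifted >> 4] = range;
    gClassRanges[range][shifted & 0xF] = cls;
}

void initToyTables(void)
{
    memset(gFirstLevel, -1, sizeof gFirstLevel);   /* every byte 0xFF == -1 */
    memset(gClassRanges, 0, sizeof gClassRanges);
    setToyClass(0x0301, 0, 230);  /* COMBINING ACUTE ACCENT */
    setToyClass(0x0323, 1, 220);  /* COMBINING DOT BELOW */
    setToyClass(0xFE20, 2, 230);  /* COMBINING LIGATURE LEFT HALF (wraps) */
}

uint8_t combiningClass(uint16_t ch)
{
    /* 16-bit wraparound maps 0xFB00-FFFF to 0x0000-04FF. */
    uint16_t shifted = (uint16_t)(ch + kShiftOffset);
    int8_t range;

    if (shifted >= kShiftLimit)
        return 0;                          /* in the eliminated middle */
    range = gFirstLevel[shifted >> 4];
    if (range < 0)
        return 0;                          /* nothing in this block of 16 */
    return gClassRanges[range][shifted & 0xF];
}
```

After initToyTables(), combiningClass(0x0301) yields 230 and
combiningClass(0xFE20) is found through the wrapped index, while a character
such as 0x4E00 is rejected by the limit check without touching the table. Since
0x0500 has zero low bits, the low 4 bits of the shifted value equal those of the
original character.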

The table data is generated automatically by the fsckMakeDecompData tool; the
sources for this tool contain an array with the raw data for characters that
either have nonzero combining class or begin a sequence of characters that may
need to be updated. The tool generates the index, the two range arrays, and the
list of decomposition update actions.

----
B. Files

* fsckDecompDataEnums.h contains enums related to the data tables.

* fsckMakeDecompData.c contains the raw data source; when this tool is compiled
and run, it writes the contents of the binary data tables to standard output;
this output should be directed into a file named fsckDecompData.h.

* fsckFixDecomps.h contains the interface for the fsckFixDecomps function (and
related types).

* fsckFixDecomps.c contains the function code.

----
C. Function interface

The basic interface (defined in fsckFixDecomps.h) is:

Boolean fsckFixDecomps( ConstHFSUniStr255Param inFilename,
                        HFSUniStr255 *outFilename );

If inFilename needs updating and the function was able to do this without
overflowing the 255-character limit, it returns 1 (true) and outFilename
contains the updated filename. If inFilename did not need updating, or an update
would overflow the limit, the function returns 0 (false) and the contents of
outFilename are undefined.
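
A caller might honor this contract along the following lines. This is a sketch
only: the fsckFixDecomps body below is a STAND-IN written so the example
compiles on its own (it rewrites U+0340 to U+0300, an arbitrary rule chosen for
illustration that is not claimed to be in the real table), nameToUse is a
hypothetical helper, and the types are local copies using uint16_t rather than
the Apple headers.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned char Boolean;
struct HFSUniStr255 {
    uint16_t length;        /* number of unicode characters */
    uint16_t unicode[255];  /* unicode characters */
};
typedef struct HFSUniStr255 HFSUniStr255;
typedef const HFSUniStr255 *ConstHFSUniStr255Param;

/* STAND-IN for the real routine, for illustration only. */
Boolean fsckFixDecomps(ConstHFSUniStr255Param inFilename,
                       HFSUniStr255 *outFilename)
{
    Boolean changed = 0;
    uint16_t i;

    assert(inFilename->length <= 255);
    outFilename->length = inFilename->length;
    for (i = 0; i < inFilename->length; i++) {
        uint16_t ch = inFilename->unicode[i];
        if (ch == 0x0340) {     /* arbitrary illustrative rule */
            ch = 0x0300;
            changed = 1;
        }
        outFilename->unicode[i] = ch;
    }
    return changed;
}

/* Hypothetical caller: pick the name to use after the check. */
ConstHFSUniStr255Param nameToUse(ConstHFSUniStr255Param original,
                                 HFSUniStr255 *scratch)
{
    if (fsckFixDecomps(original, scratch))
        return scratch;     /* true: scratch holds the updated name */
    return original;        /* false: scratch is undefined, ignore it */
}
```

The point of the helper is that the output buffer is only ever read after a
true return; on a false return the caller keeps using the original name.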

The function needs a couple of types from Apple interface files (not standard C
ones): HFSUniStr255 and Boolean. For now these are defined in fsckFixDecomps.h
if NO_APPLE_INCLUDES is 1. For building with fsck_hfs, the appropriate includes
should be put into fsckFixDecomps.h.

For the record, hfs_format.h defines HFSUniStr255 as follows:

struct HFSUniStr255 {
        u_int16_t       length;         /* number of unicode characters */
        u_int16_t       unicode[255];   /* unicode characters */
};
typedef struct HFSUniStr255 HFSUniStr255;
typedef const HFSUniStr255 *ConstHFSUniStr255Param;

----
D. Function implementation

Characters that don't require any special handling have combining class 0 and do
not begin a decomposition sequence (of 1-3 characters) that needs updating. For
these characters, the function just copies them from inFilename to outFilename
and sets the pointer outNameCombSeqPtr to NULL (when this pointer is not NULL,
it points to the beginning of a sequence of combining marks that continues up to
the current character; if the current character is combining, it may need to be
reordered into that sequence). The copying operation is cheap, and postponing it
until we know the filename needs modification would make the code much more
complicated.

This copying operation may be invoked from many places in the code, some deeply
nested - any time the code determines that the current character needs no
special handling. For this reason it has a label (CopyBaseChar) and is located
at the end of the character processing loop; various places in the code use goto
statements to jump to it (this is a situation where they are justified).
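
The shape of such a loop can be sketched as below. This is a toy, not the code
in fsckFixDecomps.c: copyName and combRuns are hypothetical names, and the test
for "needs special handling" is reduced to "is in U+0300-036F and is not
U+034F" (COMBINING GRAPHEME JOINER, which has combining class 0). Only the
CopyBaseChar label and the role of the combining-sequence pointer follow the
notes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies in[0..inLen) to out and returns the number of UniChars written.
   *combRuns receives the number of combining-mark runs that were started. */
size_t copyName(const uint16_t *in, size_t inLen, uint16_t *out,
                size_t *combRuns)
{
    uint16_t *outPtr = out;
    uint16_t *combSeqPtr = NULL;    /* non-NULL while inside a mark run */
    size_t i;

    assert(in != NULL && out != NULL && combRuns != NULL);
    *combRuns = 0;

    for (i = 0; i < inLen; i++) {
        uint16_t ch = in[i];

        /* Quick rejection: toy stand-in for the first-level table check. */
        if (ch < 0x0300 || ch > 0x036F)
            goto CopyBaseChar;

        {
            /* A nested decision that also ends in the shared copy step. */
            if (ch == 0x034F)
                goto CopyBaseChar;

            if (combSeqPtr == NULL) {
                combSeqPtr = outPtr;    /* a new run of marks starts here */
                (*combRuns)++;
            }
            *outPtr++ = ch;     /* real code would reorder within the run */
            continue;           /* handled: skip the shared copy step */
        }

    CopyBaseChar:
        *outPtr++ = ch;
        combSeqPtr = NULL;      /* a base character ends any active run */
    }
    return (size_t)(outPtr - out);
}
```

Keeping the copy step at the bottom of the loop body means every "nothing
special here" exit, however deeply nested, performs the same copy and resets
the run pointer in one place.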

The main function loop has 4 sections.

First, it quickly determines if the high 12 bits of the character indicate that
it is in a range that has neither nonzero combining class nor any decomposition
sequences that need updating. If so, the code jumps straight to CopyBaseChar.

Second, the code determines if the character is part of a sequence that needs
updating. It checks if the current character has a corresponding action in the
replaceData array. If so, depending on the action, it may need to check for
additional matching characters in inFilename. If the sequence of 1-3 characters
is successfully matched, then a replacement sequence of 1-3 characters is
inserted at the corresponding position in outFilename. While this replacement
sequence is being inserted, each character must be checked to see if it has
nonzero combining class and needs reordering (some replacement sequences consist
entirely of combining characters and may interact with combining characters in
the filename before the updated sequence).
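
The matching step might look roughly like this. It is a sketch under
assumptions: ToyUpdate and applyUpdate are hypothetical, the record layout is
not the real replaceData format (which uses action codes), and the per-character
combining class check on the replacement is left out. The sample rule used with
it in the note after the code is an illustrative guess at the kind of update
involved, not an entry copied from the real table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical update record: match 1-3 characters, emit 1-3 characters. */
typedef struct {
    uint16_t first;         /* character that begins the sequence */
    uint8_t  matchCount;    /* 0-2 additional characters to match */
    uint16_t match[2];
    uint8_t  replCount;     /* 1-3 replacement characters */
    uint16_t repl[3];
} ToyUpdate;

/* Tries to match u at in[0..remaining). On a complete match, appends the
   replacement at out[*outLen] and returns the number of input characters
   consumed (1-3). Returns 0 on no match or an incomplete match, leaving out
   and *outLen untouched so the caller can fall back to ordinary handling. */
size_t applyUpdate(const ToyUpdate *u, const uint16_t *in, size_t remaining,
                   uint16_t *out, size_t *outLen)
{
    size_t needed = 1u + u->matchCount;
    size_t i;

    assert(u->matchCount <= 2 && u->replCount >= 1 && u->replCount <= 3);
    if (remaining < needed || in[0] != u->first)
        return 0;
    for (i = 0; i < u->matchCount; i++) {
        if (in[1 + i] != u->match[i])
            return 0;
    }
    memcpy(out + *outLen, u->repl, u->replCount * sizeof(uint16_t));
    *outLen += u->replCount;
    return needed;
}
```

For example, with a record { 0x03B1, 1, { 0x030D }, 2, { 0x03B1, 0x0301 } },
the input 03B1 030D is consumed as two characters and 03B1 0301 is written,
while 03B1 followed by any other character is reported as no match.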

Third, the code handles characters whose high-order 12 bits indicated that some
action was required, but that were not part of sequences that needed updating
(these may include characters that were examined in the second part but were
part of sequences that did not completely match, so there are also goto
fallthroughs to this code - labeled CheckCombClass - from the second part).
These characters have to be checked for nonzero combining class; if so, they are
reordered as necessary. Each time a new nonzero class character is encountered,
it is added to outFilename at the correct point in any active combining
character sequence (with other characters in the sequence moved as necessary),
so the sequence pointed to by outNameCombSeqPtr is always in correct order up to
the current character.
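
The reordering step amounts to an insertion into an already ordered run, which
can be sketched as follows. The names toyCombiningClass and insertCombining are
hypothetical, the class lookup is a four-entry toy rather than the generated
tables, and an index (seqStart) stands in for the outNameCombSeqPtr pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy class lookup; the real code would consult the generated tables. */
uint8_t toyCombiningClass(uint16_t ch)
{
    switch (ch) {
    case 0x0301: return 230;    /* COMBINING ACUTE ACCENT */
    case 0x0308: return 230;    /* COMBINING DIAERESIS */
    case 0x0323: return 220;    /* COMBINING DOT BELOW */
    case 0x0327: return 202;    /* COMBINING CEDILLA */
    default:     return 0;
    }
}

/* Adds combining mark ch to the run out[seqStart..*len), which is assumed to
   be in nondecreasing class order already. Marks with a strictly higher class
   are moved up one slot; marks of equal class keep their relative order. */
void insertCombining(uint16_t *out, size_t seqStart, size_t *len, uint16_t ch)
{
    uint8_t cls = toyCombiningClass(ch);
    size_t pos = *len;

    assert(cls != 0 && seqStart <= *len);
    while (pos > seqStart && toyCombiningClass(out[pos - 1]) > cls) {
        out[pos] = out[pos - 1];
        pos--;
    }
    out[pos] = ch;
    (*len)++;
}
```

Starting from the base character 0061 with the run beginning at index 1,
inserting 0301, 0323, 0327, and 0308 in that order leaves the buffer as
0061 0327 0323 0301 0308. Because each insertion only moves marks of strictly
higher class, the run stays in order after every character, which is the
invariant the notes describe.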

Finally, the fourth part has the default handlers to just copy characters to
outFilename.