Commit | Line | Data |
---|---|---|
1 | <HTML> | |
2 | <HEAD> | |
3 | <!-- This HTML file has been created by texi2html 1.54 | |
4 | from gettext.texi on 25 January 1999 --> | |
5 | ||
6 | <TITLE>GNU gettext utilities - Producing Binary MO Files</TITLE> | |
7 | <link href="gettext_7.html" rel=Next> | |
8 | <link href="gettext_5.html" rel=Previous> | |
9 | <link href="gettext_toc.html" rel=ToC> | |
10 | ||
11 | </HEAD> | |
12 | <BODY> | |
13 | <p>Go to the <A HREF="gettext_1.html">first</A>, <A HREF="gettext_5.html">previous</A>, <A HREF="gettext_7.html">next</A>, <A HREF="gettext_12.html">last</A> section, <A HREF="gettext_toc.html">table of contents</A>. | |
14 | <P><HR><P> | |
15 | ||
16 | ||
17 | <H1><A NAME="SEC32" HREF="gettext_toc.html#TOC32">Producing Binary MO Files</A></H1> | |
18 | ||
19 | ||
20 | ||
21 | <H2><A NAME="SEC33" HREF="gettext_toc.html#TOC33">Invoking the <CODE>msgfmt</CODE> Program</A></H2> | |
22 | ||
23 | ||
24 | <PRE> | |
25 | Usage: msgfmt [<VAR>option</VAR>] <VAR>filename</VAR>.po ... | |
26 | </PRE> | |
27 | ||
28 | <DL COMPACT> | |
29 | ||
30 | <DT><SAMP>`-a <VAR>number</VAR>'</SAMP> | |
31 | <DD> | |
32 | <DT><SAMP>`--alignment=<VAR>number</VAR>'</SAMP> | |
33 | <DD> | |
34 | Align strings to <VAR>number</VAR> bytes (default: 1). | |
35 | ||
36 | <DT><SAMP>`-h'</SAMP> | |
37 | <DD> | |
38 | <DT><SAMP>`--help'</SAMP> | |
39 | <DD> | |
40 | Display this help and exit. | |
41 | ||
42 | <DT><SAMP>`--no-hash'</SAMP> | |
43 | <DD> | |
44 | Binary file will not include the hash table. | |
45 | ||
46 | <DT><SAMP>`-o <VAR>file</VAR>'</SAMP> | |
47 | <DD> | |
48 | <DT><SAMP>`--output-file=<VAR>file</VAR>'</SAMP> | |
49 | <DD> | |
50 | Specify output file name as <VAR>file</VAR>. | |
51 | ||
52 | <DT><SAMP>`--strict'</SAMP> | |
53 | <DD> | |
54 | Direct the program to work strictly following the Uniforum/Sun | |
55 | implementation. Currently this only affects the naming of the output | |
56 | file. If this option is not given, the name of the output file is the | |
57 | same as the domain name. If the strict Uniforum mode is enabled, the | |
58 | suffix <TT>`.mo'</TT> is added to the file name if it is not already | |
59 | present. | |
60 | ||
61 | We find this behaviour of Sun's implementation rather silly and so by | |
62 | default this mode is <EM>not</EM> selected. | |
63 | ||
64 | <DT><SAMP>`-v'</SAMP> | |
65 | <DD> | |
66 | <DT><SAMP>`--verbose'</SAMP> | |
67 | <DD> | |
68 | Detect and diagnose input file anomalies which might represent | |
69 | translation errors. The <CODE>msgid</CODE> and <CODE>msgstr</CODE> strings are | |
70 | studied and compared. It is considered abnormal that one string | |
71 | starts or ends with a newline while the other does not. | |
72 | ||
73 | Also, if the string represents a format string used in a | |
74 | <CODE>printf</CODE>-like function both strings should have the same number of | |
75 | <SAMP>`%'</SAMP> format specifiers, with matching types. If the flag | |
76 | <CODE>c-format</CODE> or <CODE>possible-c-format</CODE> appears in the special | |
77 | comment <KBD>#,</KBD> for this entry a check is performed. For example, the | |
78 | check will diagnose using <SAMP>`%.*s'</SAMP> against <SAMP>`%s'</SAMP>, or <SAMP>`%d'</SAMP> | |
79 | against <SAMP>`%s'</SAMP>, or <SAMP>`%d'</SAMP> against <SAMP>`%x'</SAMP>. It can even handle | |
80 | positional parameters. | |
81 | ||
82 | Normally the <CODE>xgettext</CODE> program automatically decides whether a | |
83 | string is a format string or not. This algorithm is not perfect, | |
84 | though. It might regard a string as a format string though it is not | |
85 | used in a <CODE>printf</CODE>-like function and so <CODE>msgfmt</CODE> might report | |
86 | errors where there are none. Or the other way round: a string is not | |
87 | regarded as a format string but it is used in a <CODE>printf</CODE>-like | |
88 | function. | |
89 | ||
90 | To solve this problem the programmer can dictate the decision to the | |
91 | <CODE>xgettext</CODE> program (see section <A HREF="gettext_3.html#SEC17">Special Comments preceding Keywords</A>). The translator should not | |
92 | consider removing the flag from the <KBD>#,</KBD> line. This "fix" would be | |
93 | reversed again as soon as <CODE>msgmerge</CODE> is called the next time. | |
94 | ||
95 | <DT><SAMP>`-V'</SAMP> | |
96 | <DD> | |
97 | <DT><SAMP>`--version'</SAMP> | |
98 | <DD> | |
99 | Output version information and exit. | |
100 | ||
101 | </DL> | |
102 | ||
103 | <P> | |
104 | If the input file is <SAMP>`-'</SAMP>, standard input is read. If the output file | |
105 | is <SAMP>`-'</SAMP>, output is written to standard output. | |
106 | ||
107 | </P> | |
108 | ||
109 | ||
110 | <H2><A NAME="SEC34" HREF="gettext_toc.html#TOC34">The Format of GNU MO Files</A></H2> | |
111 | ||
112 | <P> | |
113 | The format of the generated MO files is best described by a picture, | |
114 | which appears below. | |
115 | ||
116 | </P> | |
117 | <P> | |
118 | The first two words serve to identify the file. The magic | |
119 | number will always signal GNU MO files. The number is stored in the | |
120 | byte order of the generating machine, so the magic number really is | |
121 | two numbers: <CODE>0x950412de</CODE> and <CODE>0xde120495</CODE>. The second | |
122 | word describes the current revision of the file format. For now the | |
123 | revision is 0. This might change in future versions, and ensures | |
124 | that the readers of MO files can distinguish new formats from old | |
125 | ones, so that both can be handled correctly. The version is kept | |
126 | separate from the magic number, instead of using different magic | |
127 | numbers for different formats, mainly because <TT>`/etc/magic'</TT> is | |
128 | not updated often. It might be better to have magic separated from | |
129 | internal format version identification. | |
130 | ||
131 | </P> | |
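<P>
The byte-order check implied by the two magic values can be sketched as
follows. This is a minimal illustration in Python, not code taken from GNU
<CODE>gettext</CODE>; the name <CODE>detect_byte_order</CODE> is a hypothetical
helper introduced only for this example.
</P>

```python
import struct

MO_MAGIC = 0x950412DE  # magic number as given in the text above


def detect_byte_order(header: bytes) -> str:
    """Return the struct prefix ('<' or '>') matching the file's byte order.

    Hypothetical helper: it decodes the first 32-bit word both ways and
    keeps whichever interpretation yields the documented magic number.
    """
    if len(header) < 4:
        raise ValueError("need at least 4 bytes to read the magic number")
    (as_little,) = struct.unpack("<I", header[:4])
    if as_little == MO_MAGIC:
        return "<"  # file was written by a little-endian machine
    (as_big,) = struct.unpack(">I", header[:4])
    if as_big == MO_MAGIC:
        return ">"  # file was written by a big-endian machine
    raise ValueError("magic number does not match a GNU MO file")
```

<P>
The returned prefix can then be reused for every later 32-bit word in the
file, including the revision word that follows the magic number.
</P>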
132 | <P> | |
133 | A number of pointers to later tables in the file follow, allowing | |
134 | for the extension of the prefix part of MO files without having to | |
135 | recompile programs reading them. This might become useful for later | |
136 | inserting a few flag bits, an indication of the charset used, new | |
137 | tables, or other things. | |
138 | ||
139 | </P> | |
140 | <P> | |
141 | Then, at offset <VAR>O</VAR> and offset <VAR>T</VAR> in the picture, two tables | |
142 | of string descriptors can be found. In both tables, each string | |
143 | descriptor uses two 32-bit integers, one for the string length, | |
144 | another for the offset of the string in the MO file, counting in bytes | |
145 | from the start of the file. The first table contains descriptors | |
146 | for the original strings, and is sorted so the original strings | |
147 | are in increasing lexicographical order. The second table contains | |
148 | descriptors for the translated strings, and is parallel to the first | |
149 | table: to find the corresponding translation one has to access the | |
150 | array slot in the second array with the same index. | |
151 | ||
152 | </P> | |
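<P>
As a rough sketch of how the two descriptor tables might be read, the
following Python function decodes the header words at the byte offsets
given in the picture (revision, <VAR>N</VAR>, <VAR>O</VAR>, <VAR>T</VAR>,
<VAR>S</VAR>, <VAR>H</VAR>) and then follows each length/offset pair. The
function name <CODE>read_mo_tables</CODE> is hypothetical, and the sketch
assumes a revision 0 file that fits in memory.
</P>

```python
import struct


def read_mo_tables(data: bytes):
    """Return (originals, translations) as two parallel lists of bytes.

    Hypothetical reader for the layout described in this section; it
    ignores the hash table and does no bounds checking beyond what
    struct itself performs.
    """
    (magic,) = struct.unpack_from("<I", data, 0)
    if magic == 0x950412DE:
        order = "<"
    elif magic == 0xDE120495:
        order = ">"
    else:
        raise ValueError("bad magic number")

    # Six 32-bit words follow the magic number, at bytes 4 through 27.
    revision, n, o, t, s, h = struct.unpack_from(order + "6I", data, 4)
    if revision != 0:
        raise ValueError("unsupported file format revision")

    def read_table(table_offset):
        strings = []
        for i in range(n):
            # Each descriptor is 8 bytes: string length, then file offset.
            length, offset = struct.unpack_from(
                order + "2I", data, table_offset + 8 * i)
            # The trailing NUL is not counted in the length.
            strings.append(data[offset:offset + length])
        return strings

    return read_table(o), read_table(t)
```

<P>
The translation of the string at index <VAR>i</VAR> in the first list is the
entry at the same index in the second list.
</P>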
153 | <P> | |
154 | Having the original strings sorted enables the use of simple binary | |
155 | search, for when the MO file does not contain a hashing table, or | |
156 | for when it is not practical to use the hashing table provided in | |
157 | the MO file. This also has another advantage, as the empty string | |
158 | in a PO file is usually <EM>translated</EM> by GNU <CODE>gettext</CODE> into | |
159 | some system information attached to that particular MO file, and the | |
160 | empty string necessarily becomes the first in both the original and | |
161 | translated tables, making the system information very easy to find. | |
162 | ||
163 | </P> | |
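<P>
The binary search over the sorted originals can be illustrated with a short
Python sketch. It assumes the two tables have already been loaded as parallel
lists of byte strings and that the sort order is plain byte-wise comparison;
the name <CODE>lookup</CODE> is hypothetical.
</P>

```python
from bisect import bisect_left


def lookup(originals, translations, msgid):
    """Binary-search the sorted originals and return the parallel translation.

    Returns None when msgid is not present. Assumes `originals` is sorted
    in increasing byte-wise order and `translations` has the same length.
    """
    i = bisect_left(originals, msgid)
    if i < len(originals) and originals[i] == msgid:
        return translations[i]
    return None
```

<P>
Because the empty string sorts before every other string, looking up
<CODE>b""</CODE> always lands on index 0, which is where the system
information is stored.
</P>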
164 | <P> | |
165 | The size <VAR>S</VAR> of the hash table can be zero. In this case, the | |
166 | hash table itself is not contained in the MO file. Some people might | |
167 | prefer this because a precomputed hashing table takes disk space, and | |
168 | does not win <EM>that</EM> much speed. The hash table contains indices | |
169 | to the sorted array of strings in the MO file. Conflict resolution is | |
170 | done by double hashing. The precise hashing algorithm used is fairly | |
171 | dependent on the GNU <CODE>gettext</CODE> code, and is not documented here. | |
172 | ||
173 | </P> | |
174 | <P> | |
175 | As for the strings themselves, they follow the hash table, and each | |
176 | is terminated with a <KBD>NUL</KBD>, and this <KBD>NUL</KBD> is not counted in | |
177 | the length which appears in the string descriptor. The <CODE>msgfmt</CODE> | |
178 | program has an option selecting the alignment for MO file strings. | |
179 | With this option, each string is separately aligned so it starts at | |
180 | an offset which is a multiple of the alignment value. On some RISC | |
181 | machines, a correct alignment will speed things up. | |
182 | ||
183 | </P> | |
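<P>
The arithmetic behind "a multiple of the alignment value" can be shown with a
one-line helper. This is only a sketch of the rounding step; the name
<CODE>align_up</CODE> is hypothetical, and how <CODE>msgfmt</CODE> fills the
padding bytes is not covered here.
</P>

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (alignment >= 1)."""
    if alignment < 1:
        raise ValueError("alignment must be a positive integer")
    return (offset + alignment - 1) // alignment * alignment
```

<P>
With the default alignment of 1 every offset is returned unchanged, so
strings are packed back to back.
</P>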
184 | <P> | |
185 | Nothing prevents an MO file from having embedded <KBD>NUL</KBD>s in strings. | |
186 | However, the program interface currently used already presumes | |
187 | that strings are <KBD>NUL</KBD> terminated, so embedded <KBD>NUL</KBD>s are | |
188 | somewhat useless. But the MO file format is general enough that other | |
189 | interfaces would be possible later if, for example, we ever want to | |
190 | implement wide characters right in MO files, where <KBD>NUL</KBD> bytes may | |
191 | accidentally appear. | |
192 | ||
193 | </P> | |
194 | <P> | |
195 | This particular issue has been strongly debated in the GNU | |
196 | <CODE>gettext</CODE> development forum, and it is to be expected that the MO file | |
197 | format will evolve or change over time. It is even possible that many | |
198 | formats may later be supported concurrently. But surely, we have to | |
199 | start somewhere, and the MO file format described here is a good start. | |
200 | Nothing is cast in concrete, and the format may later evolve fairly | |
201 | easily, so we should feel comfortable with the current approach. | |
202 | ||
203 | </P> | |
204 | ||
205 | <PRE> | |
206 | byte | |
207 | +------------------------------------------+ | |
208 | 0 | magic number = 0x950412de | | |
209 | | | | |
210 | 4 | file format revision = 0 | | |
211 | | | | |
212 | 8 | number of strings | == N | |
213 | | | | |
214 | 12 | offset of table with original strings | == O | |
215 | | | | |
216 | 16 | offset of table with translation strings | == T | |
217 | | | | |
218 | 20 | size of hashing table | == S | |
219 | | | | |
220 | 24 | offset of hashing table | == H | |
221 | | | | |
222 | . . | |
223 | . (possibly more entries later) . | |
224 | . . | |
225 | | | | |
226 | O | length & offset 0th string ----------------. | |
227 | O + 8 | length & offset 1st string ------------------. | |
228 | ... ... | | | |
229 | O + ((N-1)*8)| length & offset (N-1)th string | | | | |
230 | | | | | | |
231 | T | length & offset 0th translation ---------------. | |
232 | T + 8 | length & offset 1st translation -----------------. | |
233 | ... ... | | | | | |
234 | T + ((N-1)*8)| length & offset (N-1)th translation | | | | | | |
235 | | | | | | | | |
236 | H | start hash table | | | | | | |
237 | ... ... | | | | | |
238 | H + S * 4 | end hash table | | | | | | |
239 | | | | | | | | |
240 | | NUL terminated 0th string <----------------' | | | | |
241 | | | | | | | |
242 | | NUL terminated 1st string <------------------' | | | |
243 | | | | | | |
244 | ... ... | | | |
245 | | | | | | |
246 | | NUL terminated 0th translation <---------------' | | |
247 | | | | | |
248 | | NUL terminated 1st translation <-----------------' | |
249 | | | | |
250 | ... ... | |
251 | | | | |
252 | +------------------------------------------+ | |
253 | </PRE> | |
254 | ||
255 | <P><HR><P> | |
256 | <p>Go to the <A HREF="gettext_1.html">first</A>, <A HREF="gettext_5.html">previous</A>, <A HREF="gettext_7.html">next</A>, <A HREF="gettext_12.html">last</A> section, <A HREF="gettext_toc.html">table of contents</A>. | |
257 | </BODY> | |
258 | </HTML> |