Re: [Groff] Some thoughts on glyphs
Mon, 26 Aug 2002 12:32:10 +0200 (CEST)
In April I suggested extending the \[...] escape to support composite
characters and glyphs. I reexamined my old letter and found some
deficiencies, so here is my new proposal. Please comment.
Extending the \[...] escape to support composite characters and glyphs
2002, Aug. 25th
A *character* is an abstract entity used for input.
A *glyph* is a real instance of a character. For example, an upright
`A' and a slanted `A' are two different glyphs but the same character.
A single input character can consist of more than a single glyph.
Example: `U umlaut' can be composed from two glyphs (`U' and
`umlaut').
A single output glyph can consist of more than a single input
character. Example: `U' + `combining umlaut' can be represented by a
single `U umlaut' glyph. Two cases must be distinguished:
. Base glyphs with modifiers are partially handled by the input
engine. Unicode defines some rules for how composite characters
should behave; to normalize input, some reordering has to be done.
Since any modifier can be applied to any base glyph, a fallback
method must be provided to render such cases even if ugly output
is produced. I won't discuss this here.
. A *ligature* is a typographical enhancement not necessary to
comprehend the input (Arabic and Indic scripts are special; since
groff can't support them anyway I will ignore these issues here).
It's up to the font to handle ligatures.
[TeX uses the ligature mechanism for something different also,
namely as an input aid to enter special characters (example: ``
becomes the left double quotation mark).]
I won't discuss ligatures in this document. This needs a trivial
extension to the font file format only (which is unfortunately
non-trivial to implement).
In the near future, Unicode support will be added to groff. Unicode
defines a huge number of characters (currently more than 95000; about
20000 to come soon) which are mapped to even more glyphs -- perhaps
the most extreme example is Tamil, where about 50 characters map to
more than 3000 glyphs. We have to find a way to name those glyphs in
a systematic manner.
Current groff implementation
In AT&T troff, the \(xx escape directly accesses glyph `xx'. Since
the input character set is ASCII only, there is no need for naming
additional input characters.
In groff, the \[...] escape can address both input characters and
output glyphs: Using the \[charXXX] construction it is possible to
enter 8bit characters. For example, \[char65] is completely identical
to input character `A'.
This dualism has no negative impact on the suggested solution.
Extension 1: Glyph naming
The idea is to use a subset of Adobe's solution to this problem. The
algorithm for deriving glyph names can be found at
Adobe has already defined a small update to handle glyphs for Unicode
characters with code values larger than 0xFFFF; this is not published
yet on the web (but has been discussed on the OpenType list). Most
of the stuff below is based on that update.
. The set of groff glyph names which can't be algorithmically
derived will be frozen (referred to below as `groff glyph names').
groff defines around 400 such glyph names; an almost complete list
can be found in the groff_char.7 manual page.
. A glyph for Unicode character U+XXXX[X[X]] which is not a
composite character will be named `uXXXX[X[X]]'. `X' must be an
uppercase hexadecimal digit. Examples: u1234, u008E, u12DB8. The
largest Unicode value is 0x10FFFF. There must be at least four
`X' digits; if necessary, add leading zeroes (after the `u'). No
zero padding is allowed for character codes greater than 0xFFFF.
Surrogates (i.e., Unicode values greater than 0xFFFF represented
with character codes from the surrogate area U+D800-U+DFFF) are
not allowed either.
A glyph representing more than a single input character will be
named
`u' <component1> `_' <component2> `_' <component3> ...
For simplicity, all Unicode characters which are composites must
be decomposed maximally (this is normalization form KD in the
Unicode standard); for example, `u00CA_0301' is not a valid glyph
name since U+00CA (LATIN CAPITAL LETTER E WITH CIRCUMFLEX) can be
further decomposed into U+0045 (LATIN CAPITAL LETTER E) and U+0302
(COMBINING CIRCUMFLEX ACCENT). `u0045_0302_0301' is thus the
glyph name for U+1EBE, LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND
ACUTE.
. groff will maintain a table to decompose all algorithmically
derived glyph names which are themselves composites. For example,
`u0100' (LATIN CAPITAL LETTER A WITH MACRON) will be automatically
decomposed into `u0041_0304'. Additionally, a groff glyph name is
preferred to an algorithmically derived glyph name; groff will
also automatically do the mapping. Example: The glyph
`u0045_0302' will be mapped to `^E'.
. groff glyph names can't be used in composite glyph names; for
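
The naming rules of Extension 1 can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not
groff code: the function names `single_glyph_name' and
`composite_glyph_name' are made up for this example, and Python's
`unicodedata' module stands in for the decomposition table groff
would maintain itself.

```python
import unicodedata


def single_glyph_name(code_point):
    """Return the `uXXXX[X[X]]' name for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range: %X" % code_point)
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not allowed: %X" % code_point)
    # At least four uppercase hex digits; values above 0xFFFF need
    # five or six digits and therefore get no zero padding.
    return "u%04X" % code_point


def composite_glyph_name(text):
    """Return the glyph name for a character sequence.

    The input is decomposed maximally first (normalization form KD,
    as in the proposal), then the components are joined with `_'.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    parts = [single_glyph_name(ord(ch))[1:] for ch in decomposed]
    return "u" + "_".join(parts)


print(single_glyph_name(0x12DB8))            # u12DB8
print(composite_glyph_name("\u1EBE"))        # u0045_0302_0301
print(composite_glyph_name("\u00CA\u0301"))  # u0045_0302_0301
```

The preference for frozen groff glyph names (`u0045_0302' becoming
`^E') is deliberately left out of this sketch; it would be one more
table lookup on the result.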
Extension 2: Character naming
Well, this is not really an extension. For backwards compatibility,
the \[charXXX] feature will be preserved. I think it is safe to
assume that today all tools are 8bit-clean, so entering 8bit
characters is a non-issue. After transition to UTF8, entering
character codes greater than 0xFF is a non-issue also. So the only
question is how to enter Unicode values before the transition.
Answer: This is not possible, but it isn't necessary at all!
Normally, a user is not interested in a Unicode value but a glyph
which represents this character code, and this can be done with the
\[uXXXX] notation introduced above. In other words, there is no
need to add a new escape for entering Unicode values.
Extension 3: New syntax form of \[...]
The new syntax I propose is
\[<component1> <component2> ...]
groff resolves \[...] with more than a single component as follows:
. Any component which is found in the groff glyph list will be
converted to the `uXXXX' form.
. Any component `uXXXX' which is found in the list of decomposable
glyphs will be decomposed.
. The resulting elements are then concatenated with `_' in between,
dropping the leading `u' in all elements but the first.
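
The three steps above can be sketched as follows. Again this is a
rough Python illustration, not an implementation: `GROFF_GLYPHS' and
`DECOMPOSABLE' are tiny hypothetical excerpts of the two tables groff
would carry, and `resolve' is a name invented for this example. As
proposed, nothing is checked for existence.

```python
import re

# Hypothetical excerpt of the frozen groff glyph list (step 1).
GROFF_GLYPHS = {
    "A": "u0041",
    "E": "u0045",
    "ho": "u02DB",          # spacing ogonek
    "^E": "u0045_0302",
}

# Hypothetical excerpt of the list of decomposable glyphs (step 2).
DECOMPOSABLE = {
    "u00CA": "u0045_0302",
    "u0100": "u0041_0304",
}

# Shape of an algorithmically derived glyph name.
DERIVED = re.compile(r"u[0-9A-F]{4,6}(_[0-9A-F]{4,6})*")


def resolve(components):
    """Turn the components of \\[c1 c2 ...] into one glyph name."""
    parts = []
    for name in components:
        name = GROFF_GLYPHS.get(name, name)   # step 1: to uXXXX form
        name = DECOMPOSABLE.get(name, name)   # step 2: decompose
        parts.append(name)
    # Step 3: join with `_', dropping the leading `u' in all elements
    # but the first.  Names that are not of the derived form are
    # passed through untouched (an assumption of this sketch).
    tail = [p[1:] if DERIVED.fullmatch(p) else p for p in parts[1:]]
    return "_".join([parts[0]] + tail)


print(resolve(["A", "ho"]))         # u0041_02DB
print(resolve(["^E", "u0301"]))     # u0045_0302_0301
print(resolve(["u00CA", "u0301"]))  # u0045_0302_0301
```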
No check for the existence of any component (similar to .tr) will be
done. For example, consider \[A ho]:
`A' maps to `u0041', `ho' maps to `u02DB', thus the final glyph
name would be `u0041_02DB'. Note this is not the expected result:
The ogonek glyph `ho' is a spacing ogonek, but for a proper
composite a non-spacing ogonek (U+0328) is necessary. To avoid
adding another bunch of (simple) glyph names for non-spacing
accents I suggest that `ho' and friends can be mapped to
non-spacing variants with a new request like this:
.composite <glyph1> <glyph2>
This maps glyph name <glyph1> to glyph name <glyph2> if it is used
in \[...] with more than one component. Using
.composite ho u0328
we finally get `u0041_0328'. Again, this mapping is based on
glyph names only; no check for the existence of either glyph is
done.
\[E a^ aa]
\[E ^ ']
`^E' maps to `u0045_0302', thus the final glyph name is
`u0045_0302_0301' in all forms (I've omitted the necessary calls
to .composite).
BTW, it will not be possible to define glyphs with names like `A ho'
within a groff font file. This is not really a limitation; instead,
you have to define `u0041_0328'.
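
The effect of the proposed .composite request on the components of
\[...] can be pictured like this. It is a sketch only: the class name
`CompositeTable' and its methods are invented for illustration, and I
assume here that the mapping is applied to every component, since the
proposal does not say whether the base glyph is exempt.

```python
class CompositeTable:
    """Name-to-name mapping as set up by `.composite g1 g2'."""

    def __init__(self):
        self._map = {}

    def define(self, glyph1, glyph2):
        # Purely name based: neither glyph has to exist.
        self._map[glyph1] = glyph2

    def apply(self, components):
        # The mapping is only active if \[...] has more than one
        # component; a plain \[ho] keeps its spacing glyph.
        if len(components) < 2:
            return list(components)
        return [self._map.get(name, name) for name in components]


table = CompositeTable()
table.define("ho", "u0328")   # .composite ho u0328
table.define("a^", "u0302")   # .composite a^ u0302
table.define("aa", "u0301")   # .composite aa u0301

print(table.apply(["A", "ho"]))        # ['A', 'u0328']
print(table.apply(["E", "a^", "aa"]))  # ['E', 'u0302', 'u0301']
print(table.apply(["ho"]))             # ['ho']
```

The mapped components then go through the normal resolution of
\[...], which is how `A' plus `u0328' ends up as `u0041_0328'.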
I've completely dropped the idea that groff does something like
`\z\[ho]A' automatically if `\[A ho]' is not defined. Here is a revised
version of how a latin2 input encoding could be implemented, assuming
standard PS fonts:
.\" The rather generic .composite calls could be in a file
.\" `glyph.tmac' which is always loaded at start-up of groff.
.composite ho u0328
.composite ah u030C
.composite aa u0301
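.\" $1 = input character, $2 = glyph, $3 = optional fallback
.\" definition for $2 (used only if glyph $2 does not exist).
.de latin2-tr
.  trin \\$1\\$1
.  if !c\\$2 \
.    if (\\n[.$] == 3) \
.      char \\$2 \\$3
.  if !c\\$1 \
.    trin \\$1\\$2
..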
.latin2-tr \[char161] "\[A ho]" "\o'A\[ho]'"
.latin2-tr \[char162] \[ab]
.latin2-tr \[char163] \[/L]
.latin2-tr \[char164] \[Cs]
.latin2-tr \[char165] "\[L ah]" "\o'L\[ah]'"
.latin2-tr \[char166] "\[S aa]" "\o'S\[aa]'"
- Re: [Groff] Some thoughts on glyphs,
Werner LEMBERG <=