Re: [Groff] Some thoughts on glyphs

From: Werner LEMBERG
Subject: Re: [Groff] Some thoughts on glyphs
Date: Mon, 26 Aug 2002 12:32:10 +0200 (CEST)

Dear friends,

in April I suggested extending the \[...] escape to support composite
characters and glyphs.

I reexamined my old letter and found some deficiencies, so here is my
new proposal.  Please comment.


Extending the \[...] escape to support composite characters and glyphs

2002, Aug. 25th


A *character* is an abstract entity used for input.

A *glyph* is a real instance of a character.  For example, an upright
`A' and a slanted `A' are two different glyphs but the same character.

A single input character can consist of more than a single glyph.
Example: `U umlaut' can be composed from two glyphs (`U' and
`umlaut').

A single output glyph can consist of more than a single input
character.  Example: `U' + `combining umlaut' can be represented by a
single `U umlaut' glyph.  Two cases must be distinguished:

  . Base glyphs with modifiers are partially handled by the input
    engine.  Unicode defines rules for how composite characters
    should behave; to normalize input, some reordering has to be done.
    Since any modifier can be applied to any base glyph, a fallback
    method must be provided to render such cases even if ugly output
    is produced.  I won't discuss this here.

  . A *ligature* is a typographical enhancement not necessary to
    comprehend the input (Arabic and Indic scripts are special; since
    groff can't support them anyway I will ignore these issues here).

    It's up to the font to handle ligatures.

    [TeX uses the ligature mechanism for something different also,
    namely as an input aid to enter special characters (example: ``
    becomes the left double quotation mark).]

    I won't discuss ligatures in this document.  This needs a trivial
    extension to the font file format only (which is unfortunately
    non-trivial to implement).


In the near future, Unicode support will be added to groff.  Unicode
defines a huge number of characters (currently more than 95000; about
20000 to come soon) which are mapped to even more glyphs -- perhaps
the most extreme example is Tamil, where about 50 characters map to
more than 3000 glyphs.  We have to find a way to name those glyphs in
a systematic manner.

Current groff implementation

In AT&T troff, the \(xx escape directly accesses glyph `xx'.  Since
the input character set is ASCII only, there is no need for naming
additional input characters.

In groff, the \[...] escape can address both input characters and
output glyphs: Using the \[charXXX] construction it is possible to
enter 8bit characters.  For example, \[char65] is completely identical
to input character `A'.

This dualism has no negative impact on the suggested solution.

Extension 1: Glyph naming

The idea is to use a subset of Adobe's solution to this problem.  The
algorithm for deriving glyph names can be found at

Adobe has already defined a small update to handle glyphs for Unicode
characters with code values larger than 0xFFFF; this is not published
yet on the web (but has been discussed on the OpenType list).  Most
of the stuff below is based on that update.

  . The set of groff glyph names which can't be algorithmically
    derived will be frozen (referred to below as `groff glyph names').
    groff defines around 400 such glyph names; an almost complete list
    can be found in the groff_char.7 manual page.

  . A glyph for Unicode character U+XXXX[X[X]] which is not a
    composite character will be named `uXXXX[X[X]]'.  `X' must be an
    uppercase hexadecimal digit.  Examples: u1234, u008E, u12DB8.  The
    largest Unicode value is 0x10FFFF.  There must be at least four
    `X' digits; if necessary, add leading zeroes (after the `u').  No
    zero padding is allowed for character codes greater than 0xFFFF.
    Surrogates (i.e., Unicode values greater than 0xFFFF represented
    with character codes from the surrogate area U+D800-U+DFFF) are
    not allowed either.

    A glyph representing more than a single input character will be
    named

      `u' <component1> `_' <component2> `_' <component3> ...

    For simplicity, all Unicode characters which are composites must
    be decomposed maximally (this is normalization form KD in the
    Unicode standard); for example, `u00CA_0301' is not a valid glyph
    name since U+00CA (LATIN CAPITAL LETTER E WITH CIRCUMFLEX) can be
    further decomposed into U+0045 (LATIN CAPITAL LETTER E) and U+0302
    (COMBINING CIRCUMFLEX ACCENT).  `u0045_0302_0301' is thus the
    glyph name for U+1EBE (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND
    ACUTE).

  . groff will maintain a table to decompose all algorithmically
    derived glyph names which are composites themselves.  For example,
    `u0100' (LATIN CAPITAL LETTER A WITH MACRON) will be automatically
    decomposed into `u0041_0304'.  Additionally, a groff glyph name is
    preferred to an algorithmically derived glyph name; groff will
    also automatically do the mapping.  Example: The glyph
    `u0045_0302' will be mapped to `^E'.

  . groff glyph names can't be used in composite glyph names; for
    example, `^E_u0301' is invalid.
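
The naming rules above can be sketched in a few lines of Python.  This
is only an illustration, not groff code: the function names
`glyph_name' and `composite_glyph_name' are made up for this example,
and Python's unicodedata module stands in for the decomposition table
that groff would maintain itself.

```python
import unicodedata


def glyph_name(codepoint):
    """Return the `uXXXX[X[X]]' name for a single Unicode code point."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogates are not allowed")
    # %04X pads to four uppercase hex digits; values above 0xFFFF
    # come out with five or six digits and no extra padding.
    return "u%04X" % codepoint


def composite_glyph_name(text):
    """Name the glyph for `text' after maximal (NFKD) decomposition."""
    if not text:
        raise ValueError("at least one character is required")
    decomposed = unicodedata.normalize("NFKD", text)
    # Only the first component keeps the leading `u'.
    return "u" + "_".join("%04X" % ord(ch) for ch in decomposed)
```

With these definitions, glyph_name(0x8E) gives `u008E', and both
composite_glyph_name("\u1EBE") and composite_glyph_name("\u00CA\u0301")
give `u0045_0302_0301'.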

Extension 2: Character naming

Well, this is not really an extension.  For backwards compatibility,
the \[charXXX] feature will be preserved.  I think it is safe to
assume that today all tools are 8bit-clean, so entering 8bit
characters is a non-issue.  After transition to UTF8, entering
character codes greater than 0xFF is a non-issue also.  So the only
question is how to enter Unicode values before the transition.
Answer: This is not possible, but it isn't necessary at all!
Normally, a user is not interested in a Unicode value but a glyph
which represents this character code, and this can be done with the
\[uXXXX] notation introduced above.  In other words, there is no
need to add a new escape for entering Unicode values.

Extension 3: New syntax form of \[...]

The new syntax I propose is

  \[<component1> <component2> ...]

groff resolves \[...] with more than a single component as follows:

  . Any component which is found in the groff glyph list will be
    converted to the `uXXXX' form.

  . Any component `uXXXX' which is found in the list of decomposable
    glyphs will be decomposed.

  . The resulting elements are then concatenated with `_' in between,
    dropping the leading `u' in all elements but the first.

No check for the existence of any component (similar to .tr) will be
done.  Examples:

  \[A ho]

    `A' maps to `u0041', `ho' maps to `u02DB', thus the final glyph
    name would be `u0041_02DB'.  Note this is not the expected result:
    The ogonek glyph `ho' is a spacing ogonek, but for a proper
    composite a non-spacing ogonek (U+0328) is necessary.  To avoid
    adding another bunch of (simple) glyph names for non-spacing
    accents I suggest that `ho' and friends can be mapped to
    non-spacing variants with a new request like this:

      .composite <glyph1> <glyph2>

    This maps glyph name <glyph1> to glyph name <glyph2> if it is used
    in \[...] with more than one component.  Using

      .composite ho u0328

    we finally get `u0041_0328'.  Again, this mapping is based on
    glyph names only; no check for the existence of either glyph is
    done.

  \[^E u0301]
  \[^E aa]
  \[E a^ aa]
  \[E ^ ']

    `^E' maps to `u0045_0302', thus the final glyph name is
    `u0045_0302_0301' in all forms (I've omitted the necessary calls
    to .composite).
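
The resolution steps can be sketched in Python as follows.  This is a
rough model only, under several assumptions of mine: the three tables
are tiny illustrative subsets, the .composite mapping is applied to
every component after the first, a single-character component such as
`A' maps to its code point, and all names (`resolve', `GROFF_GLYPHS',
`COMPOSITE', `DECOMPOSABLE') are invented for this example.

```python
import re

# groff glyph name -> algorithmically derived name (subset)
GROFF_GLYPHS = {"ho": "u02DB", "^E": "u0045_0302"}
# Mappings as set up by `.composite <glyph1> <glyph2>' (subset)
COMPOSITE = {"ho": "u0328", "aa": "u0301", "a^": "u0302",
             "^": "u0302", "'": "u0301"}
# Decomposable code points -> maximal decomposition (subset)
DECOMPOSABLE = {"00CA": ["0045", "0302"], "0100": ["0041", "0304"]}

U_NAME = re.compile(r"u[0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*")


def to_u_form(name, is_base):
    """Convert one component to its `uXXXX' form."""
    if not is_base and name in COMPOSITE:
        return COMPOSITE[name]
    if U_NAME.fullmatch(name):
        return name
    if name in GROFF_GLYPHS:
        return GROFF_GLYPHS[name]
    if len(name) == 1:
        return "u%04X" % ord(name)
    raise ValueError("cannot convert %r" % name)


def resolve(components):
    """Return the glyph name for the contents of \\[...]."""
    codes = []
    for i, name in enumerate(components.split()):
        u_name = to_u_form(name, is_base=(i == 0))
        for code in u_name[1:].split("_"):
            # Decompose if listed, otherwise keep the code as is.
            codes.extend(DECOMPOSABLE.get(code, [code]))
    # Only the first element keeps the leading `u'.
    return "u" + "_".join(codes)
```

Under these assumptions resolve("A ho") yields `u0041_0328', and the
four forms `^E u0301', `^E aa', `E a^ aa', and `E ^ '' all yield
`u0045_0302_0301'.  As in the proposal, nothing here checks whether a
font actually provides the resulting glyph.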

BTW, it will not be possible to define glyphs with names like `A ho'
within a groff font file.  This is not really a limitation; instead,
you have to define `u0041_0328'.

I've completely dropped the idea that groff does something like
`\z\[ho]A' automatically if `\[A ho]' is not defined.  Here is a
revised version of how a latin2 input encoding could be implemented,
assuming
standard PS fonts:

  .\" The rather generic .composite calls could be in a file
  .\" `glyph.tmac' which is always loaded at start-up of groff.
  .composite ho u0328
  .composite ah u030C
  .composite aa u0301

  .de latin2-tr
  .  trin \\$1\\$1
  .  if !c\\$2 \
  .    if (\\n[.$] == 3) \
  .      char \\$2 \\$3
  .  if !c\\$1 \
  .    trin \\$1\\$2
  ..
  .latin2-tr \[char161] "\[A ho]" "\o'A\[ho]'"
  .latin2-tr \[char162] \[ab]
  .latin2-tr \[char163] \[/L]
  .latin2-tr \[char164] \[Cs]
  .latin2-tr \[char165] "\[L ah]" "\o'L\[ah]'"
  .latin2-tr \[char166] "\[S aa]" "\o'S\[aa]'"
