
From: Deri
Subject: Re: Why do Unicode Characters in the PDF Outline show up as, for example, [u1FOA1]?
Date: Mon, 10 Aug 2020 13:23:59 +0100

On Sunday, 9 August 2020 05:58:15 BST T. Kurt Bond wrote:
> Anyway, in the output file (attached to this e-mail) the Unicode
> characters show up fine in the body text, but in the PDF Outline the
> characters show up as [uXXXX] text instead of the actual character. Does
> anybody know why this is?  I know that if I do something similar for
> Heirloom troff the PDF Outline *does* contain the Unicode characters.

The PDF Reference defines text strings as follows:

=============================================================================

3.8.1 Text Strings

Certain strings contain information that is intended to be human-readable, such
as text annotations, bookmark names, article names, document information, and
so forth. Such strings are referred to as text strings. Text strings are
encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding
is a superset of the ISO Latin 1 encoding and is documented in Appendix D.
Unicode is described in the Unicode Standard by the Unicode Consortium (see the
Bibliography).

For text strings encoded in Unicode, the first two bytes must be 254 followed by
255, representing the Unicode byte order marker, U+FEFF. (This sequence
conflicts with the PDFDocEncoding character sequence thorn ydieresis, which is
unlikely to be a meaningful beginning of a word or phrase.) The remainder of the
string consists of Unicode character codes, according to the UTF-16 encoding
specified in the Unicode standard, version 2.0. Commonly used Unicode values
are represented as 2 bytes per character, with the high-order byte appearing
first in the string.

==============================================================================
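
To make the byte layout described in that excerpt concrete, here is a minimal Python sketch. The function name pdf_text_string is made up for illustration and is not part of groff or gropdf; it only shows the byte order marker (254, 255) followed by big-endian UTF-16.

```python
import codecs


def pdf_text_string(text: str) -> bytes:
    """Encode text as a Unicode PDF text string.

    Hypothetical helper: the byte order marker (bytes 254, 255) comes first,
    then the text in UTF-16 with the high-order byte first.
    """
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


# U+2640 is 0x26 0x40, so it follows the marker as the bytes 38, 64.
print(list(pdf_text_string("\u2640")))  # [254, 255, 38, 64]
```

Characters outside the Basic Multilingual Plane come out as a surrogate pair, i.e. four bytes after the marker instead of two.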

Since groff works internally with ASCII, the \[uXXXX] form of input is
converted to a separate node, which is a named glyph in the appropriate font.
In the groff_out format this appears as "Cu2640", for example, which tells the
output driver to look up the named glyph in a particular font.
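
As an illustration of what a glyph name like that carries, the following Python sketch maps a "Cu2640" style command back to the character it names. It is not how any groff output driver is written; glyph_command_to_text is a hypothetical name, and the sketch only handles uXXXX names (including composites joined with "_", such as u0041_0301), not the many other glyph names groff knows.

```python
import re

# Glyph names of the form uXXXX or uXXXX_YYYY, with 4 to 6 uppercase hex digits each.
UNICODE_GLYPH = re.compile(r"u[0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*")


def glyph_command_to_text(command: str) -> str:
    """Return the character(s) named by a groff_out 'C' command (hypothetical helper)."""
    if not command.startswith("C"):
        raise ValueError(f"not a C command: {command!r}")
    name = command[1:].strip()
    if not UNICODE_GLYPH.fullmatch(name):
        raise ValueError(f"not a uXXXX glyph name: {name!r}")
    # Drop the leading 'u' and turn each hex group into its code point.
    return "".join(chr(int(part, 16)) for part in name[1:].split("_"))


print(glyph_command_to_text("Cu2640") == "\u2640")  # True
```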

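This is only true for text which is destined for the output stream. Parameters
to .pdfhref are just treated as ASCII, i.e. PDFDocEncoding, so a \[uXXXX]
escape in an outline entry is not turned into a glyph and shows up as literal
[uXXXX] text.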
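
For comparison, this is roughly what a Unicode outline title would have to look like in the PDF. The Python sketch below is not what .pdfhref or gropdf does; outline_title_to_pdf_hex is a hypothetical name. It assumes the title reaches the converter with its \[uXXXX] escapes still in place, decodes them, and writes the result as a PDF hex string in UTF-16 with the leading FEFF marker.

```python
import re

# Matches groff-style escapes such as \[u2640] inside an otherwise ASCII title.
ESCAPE = re.compile(r"\\\[u([0-9A-Fa-f]{4,6})\]")


def outline_title_to_pdf_hex(title: str) -> str:
    """Decode \\[uXXXX] escapes and return a PDF hex string (hypothetical helper)."""
    text = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), title)
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


print(outline_title_to_pdf_hex(r"A\[u2640]"))  # <FEFF00412640>
```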

Cheers 

Deri


