bug-groff

[bug #65601] [troff] bogus 'bogus composite' errors introduced by commit 6008b6b7aa


From: G. Branden Robinson
Subject: [bug #65601] [troff] bogus 'bogus composite' errors introduced by commit 6008b6b7aa
Date: Wed, 1 May 2024 19:22:47 -0400 (EDT)

Update of bug #65601 (group groff):

                  Status:                    None => Need Info              
                 Summary: Bogus 'bogus composite' errors introduced by commit
6008b6b7aa => [troff] bogus 'bogus composite' errors introduced by commit
6008b6b7aa

    _______________________________________________________

Follow-up Comment #2:

[comment #0 original submission:]
> If preconv produces a valid composite character groff should not reject it.
> Of course if the composite is not available in any available font

Unfortunately that's not the way GNU _troff_ works.  (Or I'm not understanding
the bug report.)

The list of composite characters is global.

Here's what our Texinfo manual says in Git HEAD.


 -- Escape sequence: \[base-glyph combining-component ...]
...
     GNU 'troff' resolves '\[...]' with more than a single component as
     follows:

        * Any component that is found in the GGL [groff glyph list --GBR]
          is converted to the 'uXXXX' form.

        * Any component 'uXXXX' that is found in the list of
          decomposable glyphs is decomposed.

        * The resulting elements are then concatenated with '_' in
          between, dropping the leading 'u' in all elements but the
          first.

     No check for the existence of any component (similar to 'tr'
     request) is done.

     Examples:

     '\[A ho]'
          'A' maps to 'u0041', 'ho' maps to 'u02DB', thus the final
          glyph name would be 'u0041_02DB'.  This is not the expected
          result: the ogonek glyph 'ho' is a spacing ogonek, but for a
          proper composite a non-spacing ogonek (U+0328) is necessary.
          Looking into the file 'composite.tmac', one can find
          '.composite ho u0328', which changes the mapping of 'ho' while
          a composite glyph name is constructed, causing the final glyph
          name to be 'u0041_0328'.

     '\[^E u0301]'
     '\[^E aa]'
     '\[E a^ aa]'
     '\[E ^ ']'
          '^E' maps to 'u0045_0302', thus the final glyph name is
          'u0045_0302_0301' in all forms (assuming proper calls of the
          'composite' request).

     It is not possible to define glyphs with names like 'A ho' within a
     'groff' font file.  This is not really a limitation; instead, you
     have to define 'u0041_0328'.
...
 -- Request: .composite c1 c2
     Map ordinary or special character name C1 to C2 when C1 is a
     combining component in a composite character.  See above for
     examples.  This is a strict rewriting of the special character
     name; no check is performed for the existence of a glyph for
     either.  Typically, 'composite' is used to map a spacing character
     to a combining one.  A set of default mappings for many accents can
     be found in the file 'composite.tmac', loaded by the default
     'troffrc' at startup.

     You can obtain a report of mappings defined by 'composite' on the
     standard error stream with the 'pcomposite' request.  *Note
     Debugging::.
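
The resolution steps quoted above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not the code in input.cpp: the three tables are tiny hypothetical excerpts standing in for the groff glyph list, the list of decomposable glyphs, and the mappings set up by `composite` requests, and the handling of one-character components by code point is an assumption made to keep the tables small.

```python
import re

# Hypothetical excerpts; groff's real tables are far larger.
GGL = {"ho": "02DB", "^E": "00CA", "aa": "00B4", "a^": "005E"}
DECOMPOSE = {"00CA": ["0045", "0302"]}
# Models `.composite ho u0328` and friends, keyed by code point.
COMPOSITE = {"02DB": "0328", "00B4": "0301", "005E": "0302", "0027": "0301"}


def to_codes(component):
    """Convert one component to a list of hexadecimal code points."""
    if component in GGL:
        code = GGL[component]
    elif re.fullmatch(r"u[0-9A-F]{4,6}", component):
        code = component[1:]
    elif len(component) == 1:
        # Assumption: an ordinary character stands for its own code point.
        code = f"{ord(component):04X}"
    else:
        raise ValueError(f"unknown component: {component!r}")
    # Decompose if the code point is in the decomposable list.
    return DECOMPOSE.get(code, [code])


def resolve(body, composite=COMPOSITE):
    """Build the glyph name for the inside of a \\[...] escape sequence."""
    codes = []
    for index, component in enumerate(body.split()):
        expanded = to_codes(component)
        if index > 0:
            # Only combining components are rewritten, never the base.
            expanded = [composite.get(code, code) for code in expanded]
        codes.extend(expanded)
    # Join with '_' and keep a single leading 'u'.
    return "u" + "_".join(codes)


print(resolve("A ho"))                # u0041_0328
print(resolve("A ho", composite={}))  # u0041_02DB
print(resolve("E a^ aa"))             # u0045_0302_0301
```

Note that nothing here consults a font: as the manual says, no check for the existence of any component is done at this stage.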



> Personally I see little value in this error,

I do find value in it; in the ChangeLog entry, I provided my rationale.  In
the commit message I even provided exhibits of cases that should have produced
a diagnostic but did not.


    [troff]: Diagnose bogus composite character escape sequences.  That is,
    when a composite character escape sequence like \[a ~] has a bogus
    modifier (as opposed to base) character, meaning one that has not been
    defined as the source _or_ destination of a `composite` request, warn
    about it.  For instance, \[a $] is nonsense, barring a request like
    `.composite $ \[uFF00]`, which would map `$`, when used as a modifier
    character in a composite special character escape sequence, to U+FF00,
    which would be a modifier form of the dollar sign in an alternate
    universe.
...
    Input:
    .nf
    \[A a~]
    \[A ~]
    \[u0041_0301]
    \[u0041_007E] \" should fail because 007E is explicitly spacing
    \[u0041_0041] \" same reason, more obviously
    \[u0041_0301_0301] \" should fail, would have a different meaning
    \[u0041_007E_0301] \" both problems above
    
    groff 1.23.0 and earlier:
    $ groff -T ps -z EXPERIMENTS/composite_character_construction.groff
    troff:...:5: warning: special character 'u0041_007E' not defined
    troff:...:6: warning: special character 'u0041_0041' not defined
    troff:...:7: warning: special character 'u0041_0301_0301' not defined
    troff:...:8: warning: special character 'u0041_007E_0301' not defined
    $ groff -Tutf8 -z EXPERIMENTS/composite_character_construction.groff
    [no output due to Savannah #65109]
    
    Now:
    $ ./build/test-groff -T ps -z EXPERIMENTS/composite_character_construction.groff
    troff:...:5: warning: special character 'u0041_007E' not defined
    troff:...:6: error: cannot format glyph: 'u0041_0041' is not a valid composite character
    troff:...:7: warning: special character 'u0041_0301_0301' not defined
    troff:...:8: warning: special character 'u0041_007E_0301' not defined
    $ ./build/test-groff -T utf8 -z EXPERIMENTS/composite_character_construction.groff
    troff:...:6: error: cannot format glyph: 'u0041_0041' is not a valid composite character
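
The rule described in that commit message, that a modifier must be the source or the destination of some `composite` mapping, can be sketched as follows. This is a simplified model, not the formatter's actual code; the `COMPOSITE` table is a hypothetical three-entry excerpt keyed by code point, and the function names are made up for illustration.

```python
# Hypothetical excerpt of mappings established by `composite` requests:
# spacing tilde -> combining tilde, spacing acute -> combining acute, etc.
COMPOSITE = {"007E": "0303", "00B4": "0301", "0027": "0301"}


def is_known_modifier(code, composite=COMPOSITE):
    """A modifier is known if it is the source or destination of a mapping."""
    return code in composite or code in composite.values()


def bogus_modifiers(glyph_name, composite=COMPOSITE):
    """Return the modifier components of a uXXXX_YYYY name that are unknown."""
    if glyph_name.startswith("u"):
        glyph_name = glyph_name[1:]
    _base, *modifiers = glyph_name.split("_")
    return [m for m in modifiers if not is_known_modifier(m, composite)]


print(bogus_modifiers("u0041_0301"))  # []
print(bogus_modifiers("u0041_007E"))  # []
print(bogus_modifiers("u0041_0041"))  # ['0041']
```

Under this model only 'u0041_0041' is rejected, which agrees with the "Now:" transcript above. Adding an entry such as "0024": "FF00" to the table (the effect of the `.composite $ \[uFF00]` example) would make a dollar-sign modifier acceptable as well.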


> the existing error reporting of a special character not defined is more
> helpful since if you find a font which contains the correct glyph, the error
> will be gone.

Is this true in full generality?  Does it also apply to output devices that
don't even have a "charset" section in their fonts because they're "unicode"
[sic] devices?

groff_font(5):


     unicode
             The output device supports the complete Unicode repertoire.
             This directive is useful only for devices which produce
             character entities instead of glyphs.

             If unicode is present, no charset section is required in
             the font description files since the Unicode handling built
             into groff is used.  However, if there are entries in a
             font description file’s charset section, they either
             override the default mappings for those particular
             characters or add new mappings (normally for composite
             characters).

             The utf8, html, and xhtml output devices use this
             directive.


(I feel that that's a badly named directive.  As I understand it, it more
precisely means that a different glyph resolution mechanism is used--or none
at all, instead assuming that the device is happy to attempt to combine any
sequence of Unicode code points as a grapheme cluster.)

> I'm sure there are users capable of creating a font with all sorts of weird
> composite glyphs, why should we police what they can do?

Because we have no mechanism for defining font-specific composite character
*components*.  (Meaning: "foo" in `\[a foo]`; contrast with the composed
composite characters contemplated by the second paragraph of the "unicode"
directive description quoted above.)  Maybe we should, but that in turn would
mean having font-specific macro files that users' documents would need to
load.

And we'd probably need a tool to generate them.

It might be better (and more scalable) to ask the authors of such documents
to issue the `composite` requests themselves.  We can add commonly used ones
that we are presently missing to "composite.tmac".

Anticipating this problem, I stopped discouraging use of an existing mechanism
to delete composite character mappings, and I added a new request, `pcomposite`,
for reporting the ones the formatter knows about.

Or people can bypass this escape sequence syntax entirely and spell their
grapheme clusters in Unicode directly as is already supported.

Our Texinfo manual again:


   * A glyph representing more than a single input character is named

          'u' COMPONENT1 '_' COMPONENT2 '_' COMPONENT3 ...

     Example: 'u0045_0302_0301'.
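
To make the naming scheme concrete, here is a small Python sketch that derives such a name from a Unicode string. It uses the standard `unicodedata` module's canonical decomposition (NFD) purely as a stand-in; I am not claiming that groff or preconv decompose input this way, and the function name is invented for illustration.

```python
import unicodedata


def composite_glyph_name(text):
    """Name a grapheme cluster in the 'u' COMPONENT1 '_' COMPONENT2 ... form."""
    # Canonically decompose so that precomposed characters are split
    # into a base character followed by combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "u" + "_".join(f"{ord(ch):04X}" for ch in decomposed)


print(composite_glyph_name("\u1EBE"))   # u0045_0302_0301
print(composite_glyph_name("\u00CA"))   # u0045_0302
print(composite_glyph_name("A\u0328"))  # u0041_0328
```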


There may be an opportunity for some terminological revision here.  This
section of the manual is one of those I haven't finished my first revision
pass on yet.  I still have things to learn.  Maybe you can shed some light
where things are dark for me.


commit 2c76a931b81b1e22dd419c7027d3517325c23193
Author: G. Branden Robinson <g.branden.robinson@gmail.com>
Date:   Wed Jan 17 14:02:28 2024 -0600

    [troff]: Fix Savannah #64937 (del composite char).
    
    * src/roff/troff/input.cpp (map_composite_character): Stop throwing
      diagnostic message when `composite` request invoked with only one
      argument.  This has long worked just fine to delete a composite
      character mapping.  That is something a (rare) user might conceivably
      want to do.
    
    Fixes <https://savannah.gnu.org/bugs/?64937>.

commit e958bb4fc65326dd9cd0d775e96aff15e944795e
Author: G. Branden Robinson <g.branden.robinson@gmail.com>
Date:   Wed Jan 17 13:49:40 2024 -0600

    [troff]: Implement new `pcomposite` request.
    
    * src/roff/troff/input.cpp (report_composite_characters): Add.
      (init_input_requests): Wire up `pcomposite` request name to
      `report_composite_characters()`.
    
    * doc/groff.texi (Colors, Debugging):
    * man/groff.7.man (Request short reference, Debugging):
    * man/groff_diff.7.man (New requests, Debugging):
    * NEWS: Document it.




    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65601>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/



