[bug #62830] [PATCH] [grops] support CJK fonts encoded in UTF16

bug-groff

From:	TANAKA Takuji
Subject:	[bug #62830] [PATCH] [grops] support CJK fonts encoded in UTF16
Date:	Sat, 15 Apr 2023 03:18:21 -0400 (EDT)

Follow-up Comment #7, bug #62830 (project groff):

I updated my patch.

1. Font description files

There is a precedent for Japanese font description files in groff, written by
Japanese developers (Fumitoshi UKAI et al.):

https://answers.launchpad.net/ubuntu/+source/groff/1.18.1.1-12

They defined font descriptions named "M" and "G" for Japanese.
M : Japanese Mincho style
G : Japanese Gothic style

"M" and "G" are possible candidates,
but I wonder whether Chinese and Korean users might feel uncomfortable with
them. That is why I proposed the font descriptions "JPM" and "JPG", and the
CK fonts.


2. src/devices/grohtml/post-html.cpp:

2a & 2e. Encoding US-ASCII or UTF-8, -U option

I tried a three-level option setting:
 -U0 : US-ASCII : use named character references or numerical character
       references
 -U1 : UTF-8 (partial) : use named character references for known characters,
       UTF-8 literals for unknown characters (default)
 -U2 : UTF-8 (full) : use UTF-8 literals
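
To make the three levels concrete, here is a minimal sketch of how one
character might be emitted under each setting. It is not taken from the patch:
`EncodingLevel`, `encode_html_char`, `named_ref`, `utf8_literal` and
`numeric_ref` are hypothetical names, the table of named references is a
two-entry stand-in, and the hexadecimal form of the numerical reference is an
assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical levels mirroring -U0, -U1 and -U2.
enum class EncodingLevel { Ascii = 0, Utf8Partial = 1, Utf8Full = 2 };

// Stand-in for the driver's table of named character references;
// returns nullptr when the character has no known name.
static const char *named_ref(uint32_t cp)
{
  switch (cp) {
  case 0x00A9: return "&copy;";
  case 0x2014: return "&mdash;";
  default: return nullptr;
  }
}

// Encode one Unicode scalar value as UTF-8.
static std::string utf8_literal(uint32_t cp)
{
  std::string s;
  if (cp < 0x80) {
    s += static_cast<char>(cp);
  } else if (cp < 0x800) {
    s += static_cast<char>(0xC0 | (cp >> 6));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    s += static_cast<char>(0xE0 | (cp >> 12));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    s += static_cast<char>(0xF0 | (cp >> 18));
    s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return s;
}

// Numerical character reference; the hexadecimal form is an assumption.
static std::string numeric_ref(uint32_t cp)
{
  char buf[16];
  std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(cp));
  return buf;
}

std::string encode_html_char(uint32_t cp, EncodingLevel level)
{
  assert(cp <= 0x10FFFF);
  // ASCII passes through unchanged; escaping of '<', '>' and '&' is omitted.
  if (cp < 0x80)
    return std::string(1, static_cast<char>(cp));
  const char *name = named_ref(cp);
  switch (level) {
  case EncodingLevel::Ascii:
    return name ? std::string(name) : numeric_ref(cp);
  case EncodingLevel::Utf8Partial:
    return name ? std::string(name) : utf8_literal(cp);
  case EncodingLevel::Utf8Full:
    return utf8_literal(cp);
  }
  return utf8_literal(cp);  // not reached for valid levels
}
```

With this sketch, U+3042 (HIRAGANA LETTER A) has no entry in the stand-in
table, so it becomes a numerical reference at level 0 and a UTF-8 literal at
levels 1 and 2, while U+2014 keeps its named reference at levels 0 and 1.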


2b. `to_utf8_string`.

I have moved it to libgroff/font.cpp as a trial.

2c. switching text styling properties.

I have removed the function from my patch.

2d. `to_unicode`.

I have renamed to_unicode() to to_numerical_char_ref().


3. src/devices/grops/ps.cpp

3a. I have renamed is_utf16 to is_utf16be.

3b. I have replaced wchar_t with uint16_t.
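
As an illustration of why a fixed 16-bit type suits this job (wchar_t is 32
bits wide on typical Unix-like systems and 16 bits on Windows), here is a
minimal sketch that turns code points into UTF-16 code units held in uint16_t
and prints them big-endian as a PostScript hexadecimal string. The function
names `to_utf16` and `utf16be_hex_string` are hypothetical, and emitting a
`<...>` hex string is an assumption, not necessarily what the patch does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Split one Unicode scalar value into one or two UTF-16 code units.
std::vector<uint16_t> to_utf16(uint32_t cp)
{
  // Surrogate code points and values above U+10FFFF are not encodable.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000)
    return { static_cast<uint16_t>(cp) };
  cp -= 0x10000;
  return { static_cast<uint16_t>(0xD800 | (cp >> 10)),     // high surrogate
           static_cast<uint16_t>(0xDC00 | (cp & 0x3FF)) }; // low surrogate
}

// Format the code units most significant byte first (UTF-16BE)
// as a PostScript hexadecimal string.
std::string utf16be_hex_string(const std::vector<uint32_t> &text)
{
  std::string out = "<";
  for (uint32_t cp : text)
    for (uint16_t unit : to_utf16(cp)) {
      char buf[5];
      std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(unit));
      out += buf;
    }
  out += ">";
  return out;
}
```

For example, U+3042 yields `<3042>`, and U+20BB7, which lies outside the Basic
Multilingual Plane, yields the surrogate pair `<D842DFB7>`.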

3c. PostScript name and encoding.

For CJK fonts, the encoding is always stated explicitly in the PostScript font
name, which has the structure
(Specific font name)-(style)(-(character set))-(encoding)-(direction).
For example:

/Ryumin-Light-Identity-H
/Ryumin-Light-UniJIS-UTF16-H
/Ryumin-Light-UniJIS-UTF8-H
/Ryumin-Light-EUC-H
/Ryumin-Light-RKSJ-H
/GothicBBB-Medium-Identity-H
/GothicBBB-Medium-UniJIS-UTF16-H
/GothicBBB-Medium-UniJIS-UTF8-H
/GothicBBB-Medium-EUC-H
/GothicBBB-Medium-RKSJ-H

This is a sample PostScript file:
https://github.com/t-tk/PostScript-CJK-samples/blob/master/box-multi.eps

Therefore, I think it is reasonable to obtain the encoding information from
the PostScript font name.
I guess most PostScript interpreters do so.
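
A minimal sketch of such a check, assuming the name is split on hyphens as in
the examples above: it reports whether a given component (for instance
"UTF16") appears as a whole hyphen-delimited part of the font name. The
function `name_has_component` is hypothetical and is not the test used in the
patch.

```cpp
#include <cassert>
#include <string>

// Return true if `component` occurs in `ps_name` as a complete
// hyphen-delimited part, e.g. "UTF16" in "Ryumin-Light-UniJIS-UTF16-H".
bool name_has_component(const std::string &ps_name,
                        const std::string &component)
{
  assert(!component.empty());
  std::string::size_type pos = 0;
  while ((pos = ps_name.find(component, pos)) != std::string::npos) {
    const std::string::size_type end = pos + component.size();
    const bool starts_part = (pos == 0) || (ps_name[pos - 1] == '-');
    const bool ends_part = (end == ps_name.size()) || (ps_name[end] == '-');
    if (starts_part && ends_part)
      return true;
    ++pos;  // keep searching after a partial match
  }
  return false;
}
```

With the names listed above, `name_has_component(name, "UTF16")` is true only
for the two `-UniJIS-UTF16-H` fonts, and `name_has_component(name, "H")`
identifies the horizontal writing direction.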


4. src/include/font.h, src/libs/libgroff/font.cpp

I removed the "ENABLE_UCSRANGE" macro from my patch.


5. smoke tests.

I replaced the UTF-8 literals with octal code expressions.
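
For illustration, assuming "octal code expression" means a backslash followed
by three octal digits for each non-ASCII byte of the UTF-8 text, the
conversion can be sketched as follows. The helper `to_octal_escapes` is
hypothetical and is not part of the patch or the test scripts.

```cpp
#include <cstdio>
#include <string>

// Rewrite every non-ASCII byte of a UTF-8 string as \ooo (three octal
// digits); ASCII bytes are copied unchanged.
std::string to_octal_escapes(const std::string &utf8)
{
  std::string out;
  for (unsigned char byte : utf8) {
    if (byte < 0x80) {
      out += static_cast<char>(byte);
    } else {
      char buf[5];
      std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(byte));
      out += buf;
    }
  }
  return out;
}
```

For example, U+3042 is the UTF-8 byte sequence E3 81 82, which this sketch
writes as `\343\201\202`.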


(file #54631)

    _______________________________________________________

Additional Item Attachment:

File name: cjk-ps-html_20230415.patch     Size:86 KB
   
<https://file.savannah.gnu.org/file/cjk-ps-html_20230415.patch?file_id=54631>



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?62830>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
