Re: Unicode I/O for GM2

From:	Alice Osako
Subject:	Re: Unicode I/O for GM2
Date:	Sun, 24 Mar 2024 12:24:22 -0400
User-agent:	Mozilla Thunderbird

Benjamin Kowarsch:

On Sun, 24 Mar 2024 at 11:42, Alice Osako wrote:

I am trying to find a suitable way to approach both file and console I/O for my UNICODE library. I am concerned that, especially for console I/O, I may have to write an external C module, accessed through an FFI, that wraps the wchar_t functions and handles the transitions to and from the Unicode character types.

This is concerning, as I would rather find a native Modula-2 solution if possible, but in this instance I don't see a way to do so. Most of the existing I/O operations for numbers auto-convert the values into numeric text strings; the only exception to this is the ISO RawIO and SRawIO libraries.

Unfortunately, while I tried to use [S]RawIO to implement ReadUtf8Buffer and WriteUtf8Buffer, it didn't have the desired behavior with console output. I've checked in the two experimental procedures, if anyone is curious, but the salient point is that it doesn't work as intended. In any case, it would have been specific to ISO support.

I am not sure I understand (1) what you are trying to do and (2) what the problem is.

I am assuming, though, that you have UTF-8 input that you want to read into an ARRAY OF UNICHAR in memory, and conversely an ARRAY OF UNICHAR that you want to write out as UTF-8 output, where UNICHAR is in UCS-4.

Essentially, yes, though as things stand I am only handling the case of individual characters in the form of a UTF8Buffer; I was planning to expand on those base procedures once I had them working.
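
To make the conversion under discussion concrete, here is a minimal C sketch of turning one UCS-4 code point into UTF-8 octets. It is not taken from the UNICODE library: the name utf8_encode and its signature are hypothetical, and C is used only because the thread already mentions a possible C wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes 1 to 4 octets into out and returns how many were written.
 * Returns 0 for surrogates and for values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                        /* 7 bits: one octet */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 11 bits: two octets */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {     /* surrogates are not encodable */
        return 0;
    }
    if (cp < 0x10000) {                     /* 16 bits: three octets */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 21 bits: four octets */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For example, U+0141 encodes to the two octets C5 81, and both have to reach the terminal for the character to display.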

While the immediate concern is the test module, which is meant to show that the test characters are correctly manipulated by displaying them to the console, there is a general need for an I/O library for both file and console I/O.

I've tried to solve this problem a few different ways, first using the ISO RawIO operations Read and Write, then with the GCC Base library operations ReadNBytes and WriteNBytes. While I have not yet tested how they work for file I/O, for console I/O each character is truncated so that only the first byte of the wide character is displayed:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0: 'a' [U+0061] is a valid codepoint; is a printable ASCII character; is in the BMP. -> 97 -> 'a'
1: 'a' [U+0061] is a valid codepoint; is a printable ASCII character; is in the BMP. -> 97 -> 'a'
2: 'a' [U+0061] is a valid codepoint; is a printable ASCII character; is in the BMP. -> 97 -> 'a'
3: ' ' [U+0120] is a valid codepoint; is not a printable ASCII character; is in the BMP. -> 32 -> ' '
4: '�' [U+00C1] is a valid codepoint; is not a printable ASCII character; is in the BMP. -> 193 -> '�'
5: '�' [U+00C1] is a valid codepoint; is not a printable ASCII character; is in the BMP. -> 193 -> '�'
6: 'A' [U+0141] is a valid codepoint; is not a printable ASCII character; is in the BMP. -> 65 -> 'A'
7: '' [U+FFFD] is a valid codepoint; is not a printable ASCII character; is in the BMP. -> 29 -> ''
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current versions of the modules are

https://github.com/Schol-R-LEA/UNICODE-For-Modula-2/blob/main/defs/UniTextIO.def
https://github.com/Schol-R-LEA/UNICODE-For-Modula-2/blob/main/defs/SUniTextIO.def

https://github.com/Schol-R-LEA/UNICODE-For-Modula-2/blob/main/impls/UniTextIO.mod
https://github.com/Schol-R-LEA/UNICODE-For-Modula-2/blob/main/impls/SUniTextIO.mod

For that, all you need are procedures for reading and writing one or more octets, like procedures ReadOctet, ReadOctets, WriteOctet and WriteOctets in my BasicFileIO library.

https://github.com/m2sf/m2bsk/blob/master/src/lib/IO/BasicFileIO.def
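
As a rough illustration of what sits on top of octet-level reads, the following C sketch decodes one UTF-8 sequence from a buffer, assuming the octets have already been read by some octet-level procedure. The name utf8_decode and its signature are hypothetical; nothing here is part of BasicFileIO or the UNICODE library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n octets available).
 * Stores the code point in *cp and returns the number of octets consumed.
 * Returns 0 if the sequence is truncated, malformed, overlong,
 * a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;

    unsigned char lead = s[0];
    size_t len;
    uint32_t value, min;    /* min = smallest code point allowed for this length */

    if (lead < 0x80)                { len = 1; value = lead;        min = 0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; min = 0x10000; }
    else return 0;          /* stray continuation octet or invalid lead octet */

    if (n < len) return 0;  /* sequence is cut off */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation octet */
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (value < min) return 0;                          /* overlong form */
    if (value > 0x10FFFF) return 0;                     /* out of range */
    if (value >= 0xD800 && value <= 0xDFFF) return 0;   /* surrogate */

    *cp = value;
    return len;
}
```

A caller would loop over the buffer, advancing by the returned count and treating 0 as an error; how to recover from a malformed sequence is left open here.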

There are several implementations of this library, all conforming to the same interface in

https://github.com/m2sf/m2pp/tree/master/src/imp/BasicFileIO

Note that the I/O library evolved while I was working on M2PP, and some parts of it haven't been backported to M2BSK yet.

Ah, I was only aware of the M2BSK code; I will take a look at the M2PP code and see if I can get that to work for my purposes.

I suggest you think about my proposal to move the library out into a separate repo, which would then be a good opportunity to consolidate it. At that point, there would be two possibilities going forward: (1) you simply link to the I/O library from your Unicode project, or (2) we could merge the Unicode library into the I/O library and add another layer on top for reading and writing Unicode text.

I think I will do that, yes. Strictly speaking, I should be testing the character manipulation separately from the I/O, anyway, but at the time it seemed expedient just to print the results out as shown above.
