poke-devel

Re: Questions about how to implement pickles/utf8.pk


From: Jose E. Marchesi
Subject: Re: Questions about how to implement pickles/utf8.pk
Date: Thu, 17 Sep 2020 00:58:05 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

Hi Mohammad.

> Hi,
>
> I tried to write a pickle to poke UTF8 (`utf8.pk`). I came up with two
> different types:
>
>
> ```poke
> /*
>  * len from    to       byte[0]   byte[1]   byte[2]   byte[3]
>  * 1   U+0000  U+007F   0xxxxxxx
>  * 2   U+0080  U+07FF   110xxxxx  10xxxxxx
>  * 3   U+0800  U+FFFF   1110xxxx  10xxxxxx  10xxxxxx
>  * 4   U+10000 U+10FFFF 11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
>  *
>  * ref: https://en.wikipedia.org/wiki/UTF-8
>  */
> ```

I would use something in this spirit:

deftype UTF8_CodePoint = uint<21>;

deftype UTF8 =
  struct
  {
    byte head;

    defvar bytes
      = (head >= 0b1111_0000 ? 4
         : head >= 0b1110_0000 ? 3
         : head >= 0b1100_0000 ? 2
         : 1);
           
    byte[bytes - 1] tail;

    method get_value = UTF8_CodePoint:
      {
        defvar size = tail'length + 1;

        [... decode value from head and bytes and return it ...]
      }

    method set_value = (UTF8_CodePoint value) void:
      {
        [... set head and tail ...]
      }
      
    method _print = void:
      {
        printf ("#<%v>", get_value);
      }
  };
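
For illustration only, here is a sketch in C (not Poke) of the
arithmetic that the elided bodies of get_value and set_value would have
to perform, following the table quoted at the top.  The names utf8_len,
utf8_decode and utf8_encode are made up for this example, and nothing
is validated: stray continuation bytes, overlong forms and surrogates
are all accepted as-is.

```c
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by the head byte, using the bit patterns
   0xxxxxxx, 110xxxxx, 1110xxxx and 11110xxx from the table.  */
static size_t utf8_len (uint8_t head)
{
  return head >= 0xF0 ? 4 : head >= 0xE0 ? 3 : head >= 0xC0 ? 2 : 1;
}

/* Decode the single sequence starting at S.  No validation.  */
static uint32_t utf8_decode (const uint8_t *s)
{
  /* Payload bits carried by the head byte, indexed by length - 1.  */
  static const uint8_t head_mask[] = { 0x7F, 0x1F, 0x0F, 0x07 };
  size_t n = utf8_len (s[0]);
  uint32_t value = s[0] & head_mask[n - 1];

  /* Each continuation byte contributes its low six bits.  */
  for (size_t i = 1; i < n; i++)
    value = (value << 6) | (s[i] & 0x3F);
  return value;
}

/* Encode VALUE (assumed to be <= 0x10FFFF) into OUT and return the
   number of bytes written.  */
static size_t utf8_encode (uint32_t value, uint8_t out[4])
{
  /* Marker bits of the head byte, indexed by length - 1.  */
  static const uint8_t head_bits[] = { 0x00, 0xC0, 0xE0, 0xF0 };
  size_t n = value < 0x80 ? 1 : value < 0x800 ? 2 : value < 0x10000 ? 3 : 4;

  /* Fill the continuation bytes from the end, six bits at a time.  */
  for (size_t i = n - 1; i > 0; i--)
    {
      out[i] = 0x80 | (value & 0x3F);
      value >>= 6;
    }
  out[0] = head_bits[n - 1] | value;
  return n;
}
```

With these definitions U+20AC encodes to the three bytes E2 82 AC, and
decoding those bytes gives back 0x20AC.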

> ## Question 1
>
> How can I figure out the active field in a union?
>
> For these types (`UTF8_*`) I can find it by `size` attribute of the instance.
> But I think there should be a general mechanism.

There is a general mechanism: accessing a field that doesn't exist in a
struct (or union) raises E_elem.  You can try-catch it and do the right
thing.  Look at the existing pickles.  Example:

deftype Dwarf_Initial_Length =
  union
  {
    struct
    {
      uint<32> marker : (marker == 0xffff_ffff
                         && dwarf_set_bits (64));
      offset<uint<64>,B> length;
    } l64;

    offset<uint<32>,B> l32 : (l32 < 0xffff_fff0#B
                              && dwarf_set_bits (32));

    method value = offset<uint<64>,B>:
      {
        try return l32;
        catch if E_elem { return l64.length; }
      }

    method _print = void:
      {
        print ("#<");
        try printf ("%v", l64.length);
        catch if E_elem { printf ("%v", l32); }
        print (">");
      }
  };
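
To make the two alternatives concrete, the same decision can be
written out by hand in C.  This is only a sketch of the rule the union
encodes (a first word of 0xffffffff announces a 64-bit length, and
values below 0xfffffff0 are 32-bit lengths).  The names read_u32le,
read_u64le and dwarf_initial_length are invented here, little-endian
data is assumed, and the remaining reserved values are simply rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read a little-endian 32-bit value (byte order is an assumption).  */
static uint32_t read_u32le (const uint8_t *p)
{
  return (uint32_t) p[0] | (uint32_t) p[1] << 8
         | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
}

/* Read a little-endian 64-bit value.  */
static uint64_t read_u64le (const uint8_t *p)
{
  return (uint64_t) read_u32le (p) | (uint64_t) read_u32le (p + 4) << 32;
}

/* Store the length in *LENGTH and the offset size (32 or 64) in *BITS.
   Return false if the first word is in the reserved range.  */
static bool dwarf_initial_length (const uint8_t *p, uint64_t *length,
                                  int *bits)
{
  uint32_t first = read_u32le (p);

  if (first == 0xffffffffu)
    {
      /* Corresponds to the l64 alternative: marker, then 64-bit length.  */
      *length = read_u64le (p + 4);
      *bits = 64;
      return true;
    }
  if (first < 0xfffffff0u)
    {
      /* Corresponds to the l32 alternative.  */
      *length = first;
      *bits = 32;
      return true;
    }
  return false;
}
```

In Poke the union picks the alternative while mapping, so the method
only has to ask which field is there; in C the choice has to be
returned explicitly, here through BITS.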

> ## Question 2
>
> If I want to define a `decode` method for `UTF8_1` instead of the
> `utf8_decode` function, how can I access the `size` attribute?

You can't access attributes of the object in a method.  We would need to
introduce support for `self' or similar.  Which is doable of course.

> ## Question 3
>
> I prefer `UTF8_2` over `UTF8_1`, because I always have to deal with only
> one field. From the user POV, it's an array with variable length (1-4).
>
> How can I access the `d` field?
> Or if you think that my question is insane, could you please explain why?

Your question is not insane :)

Right now poke doesn't perform any flattening of structs.  My plan is to
implement it, but until that happens I am afraid you will need to deal
with intermediate names for the different alternatives.

> ## Question 4
>
> I cannot write `utf8_encode` function for `UTF8_1`, because union construction
> does not work (like the problem for pinned structs [Bug 26527][2]).

I am not aware of any problem with union constructors.  In what way do
they not work?


