Re: Questions about how to implement pickles/utf8.pk
From: Jose E. Marchesi
Subject: Re: Questions about how to implement pickles/utf8.pk
Date: Thu, 17 Sep 2020 00:58:05 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

Hi Mohammad.
> Hi,
>
> I tried to write a pickle to poke UTF8 (`utf8.pk`). I came up with two
> different types:
>
>
> ```poke
> /*
> * len from to byte[0] byte[1] byte[2] byte[3]
> * 1 U+0000 U+007F 0xxxxxxx
> * 2 U+0080 U+07FF 110xxxxx 10xxxxxx
> * 3 U+0800 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
> * 4 U+10000 U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> *
> * ref: https://en.wikipedia.org/wiki/UTF-8
> */
> ```
>
I would use something in this spirit:

  deftype UTF8_CodePoint = uint<21>;

  deftype UTF8 =
    struct
    {
      byte head;

      /* The leading bits of the head byte give the total number of
         bytes in the sequence.  */
      defvar bytes
        = (head >= 0b11110000 ? 4
           : head >= 0b11100000 ? 3
           : head >= 0b11000000 ? 2
           : 1);

      byte[bytes - 1] tail;

      method get_value = UTF8_CodePoint:
      {
        defvar size = tail'size + 1;
        [... decode value from head and bytes and return it ...]
      }

      method set_value = (UTF8_CodePoint value) void:
      {
        [... set head and tail ...]
      }

      method _print = void:
      {
        printf ("#<%v>", get_value);
      }
    };

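The decoding and encoding steps left open above are just bit shuffling. As a rough sketch of the logic, written in Python rather than Poke so that it can be run and checked on its own: the names `sequence_length`, `decode` and `encode` are made up for illustration, and no validation is done (continuation bytes, overlong forms and surrogates are not checked).

```python
def sequence_length(head):
    """Total number of bytes announced by a UTF-8 lead byte."""
    if head >= 0b11110000:
        return 4
    if head >= 0b11100000:
        return 3
    if head >= 0b11000000:
        return 2
    return 1


def decode(head, tail):
    """Combine the payload bits of head and tail into a code point."""
    n = len(tail) + 1
    # A 1-byte sequence carries 7 payload bits in the head; the head of
    # an n-byte sequence (n >= 2) carries 7 - n payload bits.
    value = head & (0x7F if n == 1 else 0xFF >> (n + 1))
    for b in tail:
        # Each continuation byte contributes its low 6 bits.
        value = (value << 6) | (b & 0x3F)
    return value


def encode(value):
    """Split a code point into (head, tail), the inverse of decode."""
    if value < 0x80:
        return value, b""
    n = 2 if value < 0x800 else 3 if value < 0x10000 else 4
    tail = bytearray()
    for _ in range(n - 1):
        # Peel off 6 bits at a time, least significant group first.
        tail.insert(0, 0x80 | (value & 0x3F))
        value >>= 6
    # The head starts with n one-bits followed by a zero bit.
    marker = (0xFF << (8 - n)) & 0xFF
    return marker | value, bytes(tail)
```

For instance, `decode(0xE2, bytes([0x82, 0xAC]))` yields 0x20AC, the euro sign. The same shifts and masks would go into `get_value` and `set_value`.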
> ## Question 1
>
> How can I figure out the active field in a union?
>
> For these types (`UTF8_*`) I can find it by `size` attribute of the instance.
> But I think there should be a general mechanism.
There is a general mechanism: accessing a field that doesn't exist in a
struct (or union) raises E_elem. You can try-catch it and do the right
thing. Look at the existing pickles. Example:

  deftype Dwarf_Initial_Length =
    union
    {
      struct
      {
        uint<32> marker : (marker == 0xffff_ffff
                           && dwarf_set_bits (64));
        offset<uint<64>,B> length;
      } l64;

      offset<uint<32>,B> l32 : (l32 < 0xffff_fff0#B
                                && dwarf_set_bits (32));

      method value = offset<uint<64>,B>:
      {
        try return l32;
        catch if E_elem { return l64.length; }
      }

      method _print = void:
      {
        print ("#<");
        try printf ("%v", l64.length);
        catch if E_elem { printf ("%v", l32); }
        print (">");
      }
    };

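The same try-and-fall-back pattern can be modelled outside poke. Below is a minimal Python sketch, not the pickle itself: `ElemError` is a made-up stand-in for E_elem, `InitialLength` is a toy model of the union above, and little-endian input is an assumption of the sketch.

```python
import struct


class ElemError(Exception):
    """Stand-in for E_elem: the requested alternative is absent."""


class InitialLength:
    """Toy model of a union with a 32-bit and a 64-bit alternative."""

    def __init__(self, data):
        # Assumption: the data is little-endian.
        (first,) = struct.unpack_from("<I", data, 0)
        self._l32 = None
        self._l64 = None
        if first == 0xFFFFFFFF:
            # Marker found: the real length is the next 8 bytes.
            (self._l64,) = struct.unpack_from("<Q", data, 4)
        elif first < 0xFFFFFFF0:
            self._l32 = first
        else:
            raise ValueError("no alternative matches")

    @property
    def l32(self):
        if self._l32 is None:
            raise ElemError("l32")
        return self._l32

    @property
    def l64(self):
        if self._l64 is None:
            raise ElemError("l64")
        return self._l64

    def value(self):
        # Try one alternative and fall back to the other one.
        try:
            return self.l32
        except ElemError:
            return self.l64
```

The caller never asks which alternative is active. It simply tries one and handles the exception.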
> ## Question 2
>
> If I want to define a `decode` method for `UTF8_1` instead of the
> `utf8_decode` function, how can I access the `size` attribute?
You can't access attributes of the object in a method. We would need to
introduce support for `self' or similar. Which is doable of course.
> ## Question 3
>
> I prefer the `UTF8_2` over the `UTF8_1`, because I always have to deal with
> only one field. From the user POV, it's an array with variable length (1-4).
>
> How can I access the `d` field?
> Or if you think that my question is insane, could you please explain why?
Your question is not insane :)
Right now poke doesn't perform any flattening of structs. My plan is to
implement it, but until that happens I am afraid you will need to deal
with intermediate names for the different alternatives.
> ## Question 4
>
> I cannot write `utf8_encode` function for `UTF8_1`, because union construction
> does not work (like the problem for pinned structs [Bug 26527][2]).
I am not aware of any problem with union constructors. In what way do
they not work?