Sun, 09 Jan 2005 14:40:41 +0100
There seems to be a simple way to extend Bison to Unicode. Essentially, it
amounts to giving meaning to the '...' construct for Unicode characters. One
way is to treat its contents as a UTF-8 multibyte sequence, which Bison would
then treat as a sequence of character tokens. If the .y grammar file is
assumed to be in UTF-8, then all that is needed is to give 'c1 ... ck' meaning
for a suitable multibyte sequence c1 ... ck, by merely translating it into the
character token sequence 'c1' ... 'ck'.
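To make the translation concrete, here is a minimal sketch in C of how the
body of a multibyte literal could be rewritten as one character token per
byte. The function name expand_utf8_literal and the '\xNN' output spelling
are my own assumptions for illustration; nothing like this exists in Bison.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Bison: rewrite the bytes c1 ... ck of a
   UTF-8 literal body as the character token sequence 'c1' ... 'ck', spelling
   each byte as a hex escape. Returns the number of tokens written, or -1 if
   the output buffer is too small. */
int expand_utf8_literal(const char *body, char *out, size_t outsz)
{
    assert(body != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    size_t used = 0;
    int count = 0;
    out[0] = '\0';

    for (const unsigned char *p = (const unsigned char *)body; *p; ++p) {
        /* One token per byte, separated by single spaces. */
        int n = snprintf(out + used, outsz - used, "%s'\\x%02X'",
                         count ? " " : "", (unsigned)*p);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
        ++count;
    }
    return count;
}
```

For example, the two-byte UTF-8 encoding of U+00E9 would come out as two
character tokens, so a grammar rule mentioning that literal would simply see
two consecutive byte tokens.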
As for the yylex handshaking, I see two possibilities. The first is a UTF-8
mode, where the bytes of a multibyte sequence are returned one by one, in a
succession of yylex calls. The second is a Unicode mode, where yylex returns
the full Unicode number in UTF-32. Bison would then start its token numbers
at a number higher than 0x10FFFF, the highest possible Unicode number. If a
Unicode number is returned by yylex, then the Bison parser translates it into
a UTF-8 sequence, which is then processed as normal.
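The Unicode mode could be sketched roughly as follows in C. The names
UNICODE_MAX, FIRST_NAMED_TOKEN and utf8_encode are hypothetical, chosen only
for this illustration; the encoding itself is standard UTF-8. The idea is
that any yylex return value up to 0x10FFFF is taken as a Unicode number and
expanded into the bytes the parser would then consume as character tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed numbering: named tokens start just above the Unicode range. */
enum { UNICODE_MAX = 0x10FFFF, FIRST_NAMED_TOKEN = UNICODE_MAX + 1 };

/* Hypothetical helper: encode one Unicode number as UTF-8.
   Writes 1 to 4 bytes into out and returns the length, or 0 if cp is not
   a valid Unicode scalar value (surrogate or above 0x10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > UNICODE_MAX || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

In the UTF-8 mode none of this is needed on the parser side, since yylex
already hands over the bytes one per call.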
- UTF-8/Unicode Bison,
Hans Aberg <=