speechd-discuss

Chinese with IBM TTS driver


From: Hynek Hanke
Subject: Chinese with IBM TTS driver
Date: Fri, 16 Mar 2007 20:23:43 +0100

Tomas Cerha writes:
> I'm attaching the log file with debug level 5 which I got from the
> Chinese user.

Hi Tomas,

I don't believe the input I can see in the logfile is UTF-8. Unidesc
reports:

# unidesc speechd.log
       0            3235        Basic Latin
       Invalid UTF-8 code encountered at line 54, character 3236, byte 3236.
The sequence is not a valid UTF-8 character because
the first byte, value 0xE9, bit pattern 11101001,
requires 2 continuation bytes, but of the immediately
following bytes, byte 2, value 0x3F, bit pattern
11101001 is not a valid continuation byte, since
its high bits are not 10.
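
The same kind of check can be reproduced without unidesc. The following is only a minimal sketch in Python: the function names are made up for illustration, and the sample data is a stand-in with the same shape as the report above (3235 ASCII bytes followed by 0xE9 0x3F), not the actual log. Note that 0x3F is 00111111 in binary (ASCII '?'), so the second bit pattern printed in the report above looks like a repeat of the first byte's pattern.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte_value) of the first invalid UTF-8 sequence.

    The offset is 0-based. Returns None if the data decodes cleanly.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def bits(value: int) -> str:
    """Format a byte value as an 8-digit bit pattern."""
    return format(value, "08b")


# Hypothetical stand-in for the log contents: ASCII text, then a 3-byte
# lead byte (0xE9) followed by a byte that is not a continuation byte.
sample = b"a" * 3235 + b"\xe9\x3f"

problem = first_invalid_utf8(sample)
if problem is not None:
    offset, value = problem
    print(f"invalid sequence at byte offset {offset}: "
          f"0x{value:02X} ({bits(value)})")
    print(f"next byte: 0x{sample[offset + 1]:02X} "
          f"({bits(sample[offset + 1])})")
```

Running this against the real log (read in binary mode) would show whether the file contains further invalid sequences after the first one is fixed or skipped.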

It is clear why the Speech Dispatcher internal routines then mangle it
even further: they assume UTF-8 input. It would probably be much better
if they reported a failure instead of passing on garbage. I think we
should also reconsider the decision not to check UTF-8 validity directly
on socket input for performance reasons.
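
To illustrate what checking at the socket boundary could look like, here is a rough sketch in Python. It is not Speech Dispatcher code (the server is written in C, where glib's g_utf8_validate() would presumably be the natural tool); the class name Utf8Gate and its interface are invented for this example. It uses an incremental decoder because a multi-byte character can be split across two reads from a socket, and it raises an error on invalid input rather than substituting anything.

```python
import codecs


class Utf8Gate:
    """Validate a stream of byte chunks as UTF-8 (hypothetical helper)."""

    def __init__(self):
        # errors="strict" makes the decoder raise instead of replacing
        # bad bytes with a substitute character.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def feed(self, chunk: bytes, final: bool = False) -> str:
        """Decode one chunk; raise ValueError if it is not valid UTF-8.

        Pass final=True with the last chunk so that a truncated
        multi-byte sequence at the end is also reported.
        """
        try:
            return self._decoder.decode(chunk, final)
        except UnicodeDecodeError as err:
            self._decoder.reset()
            raise ValueError(f"invalid UTF-8 in input: {err.reason}") from err


# A valid character split across two chunks is accepted.
gate = Utf8Gate()
encoded = "中".encode("utf-8")
text = gate.feed(encoded[:1]) + gate.feed(encoded[1:], final=True)

# The byte pair from the report above is rejected with an error.
try:
    Utf8Gate().feed(b"\xe9\x3f", final=True)
    rejected = False
except ValueError:
    rejected = True
```

The cost is one pass over each incoming message, which is the performance trade-off mentioned above.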

So three possibilities remain:
        1) Input is correct, logging is wrong.
        2) Input is incorrect because spd-say contains a bug.
        3) spd-say is not being fed UTF-8 input.
I think (1) is highly unlikely, as logging is well tested with UTF-8 and
I don't see how a different character range would lead to a bug unless
there is a bug in glib itself. (2) is somewhat tested, but I don't know
to what extent. So I think a bug in spd-say or incorrect use of spd-say
is the most likely explanation. It definitely has nothing to do with the
IBM TTS module.
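
As an illustration of possibility (3), the sketch below shows how the same Chinese text is valid UTF-8 in one encoding and invalid in another. GBK is used purely as an assumed example of a legacy Chinese encoding; nothing in this thread says which encoding the user's terminal or locale actually used. The helper name is_valid_utf8 is made up for this example.

```python
import locale


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "中文"

as_utf8 = text.encode("utf-8")   # what a UTF-8 locale would produce
as_gbk = text.encode("gbk")      # assumed example of a legacy encoding

print("UTF-8 bytes valid as UTF-8:", is_valid_utf8(as_utf8))
print("GBK bytes valid as UTF-8:  ", is_valid_utf8(as_gbk))

# The encoding the current locale prefers; on the user's machine this
# is one of the things worth checking.
print("locale encoding:", locale.getpreferredencoding(False))
```

If the user's shell runs in a non-UTF-8 locale, text typed on the command line would reach the client in that encoding unless something converts it first.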

Tomas, can you please describe exactly how the user invoked spd-say? You
said in a different post that you had ensured UTF-8 was being used.

With regards,
Hynek Hanke



