speechd-discuss

Idea: language plugins


From: Hynek Hanke
Subject: Idea: language plugins
Date: Mon, 03 Mar 2008 17:45:25 +0100

Hello Bohdan, Milan, Jonathan,

first of all, thanks for your ideas. Processing of such things
as numbers, punctuation or abbreviations in the text
to be synthesized often cannot be reduced to plain text
to plain text substitution. Substituting the left parenthesis
symbol in a sentence with the words 'left parenthesis' doesn't
give a reasonable sentence for subsequent processing in the
synthesizer (from the point of view of syntactic analysis,
intonation etc.). The same goes for e.g. replacing a telephone
number with the names of the corresponding digits
(very different intonation).
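
To illustrate the problem with plain text substitution, here is a
minimal Python sketch of the naive approach. The tables and the
function name naive_substitute are hypothetical, made up for this
illustration; they are not part of Speech Dispatcher or of any
synthesizer.

```python
# Hypothetical, deliberately naive substitution tables (illustration only).
SYMBOL_NAMES = {"(": "left parenthesis", ")": "right parenthesis"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def naive_substitute(text: str) -> str:
    """Replace symbols and ASCII digits with their spoken names."""
    pieces = []
    for ch in text:
        if ch in SYMBOL_NAMES:
            pieces.append(" " + SYMBOL_NAMES[ch] + " ")
        elif ch in "0123456789":
            pieces.append(" " + DIGIT_NAMES[int(ch)] + " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join("".join(pieces).split())


print(naive_substitute("Call 555 (home)"))
```

The result is a flat string of words: nothing in it marks the digits
as a telephone number or the parenthesised part as an aside, so the
synthesizer has nothing left to base grouping or intonation on.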

If good quality is desired, the output of such processing must
be at the phoneme level or at some intermediate level, so doing it
in a place outside the synthesizer is not very convenient.

Also, to determine the correct meaning of a given piece of text,
the context is important -- so some kind of syntactic analysis
is needed. Since the TTS needs to do some basic analysis
itself to be able to provide correct intonation, this seems to be yet
another reason why such things belong to the TTS,
not to upper layers.

There is yet another thing -- since context-based interpretations
of the meaning of pieces of text will always be inaccurate, the
intention of course is to accompany the input text with explicit
information (e.g. in the form of SSML) whenever possible. In the future,
as accessibility support improves, such things as telephone numbers
will increasingly be SSML-marked as telephone numbers. Again,
such information is intended to be dealt with by the TTS system.
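
As a sketch of what such explicit marking could look like, the Python
snippet below wraps a telephone number in an SSML say-as element.
The helper name mark_telephone is hypothetical, and the interpret-as
value "telephone" is an assumption: SSML itself leaves the set of
values open, a W3C note suggests this one, and whether a given
synthesizer honours it varies.

```python
import xml.etree.ElementTree as ET


def mark_telephone(before: str, number: str, after: str) -> str:
    """Build a small SSML document with the number explicitly marked."""
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = before
    # "telephone" is an assumed interpret-as value; support is engine-specific.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "telephone"})
    say_as.text = number
    say_as.tail = after
    return ET.tostring(speak, encoding="unicode")


print(mark_telephone("Call ", "555 0100", " today."))
```

With the number marked like this, no guessing from context is needed;
how it is actually spoken is left to the TTS system.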

This functionality belongs to the text-to-phoneme part of the whole
process, somewhere after SSML processing and syntactic analysis.
As such, attempts to do it in Speech Dispatcher or TTS API Provider
can be a useful hack for some synthesizers, but not a good
and general solution. Further, I do not really expect much willingness
to work on these incomplete solutions, since it is hard to do
(it is language-specific, must be nearly completely SSML-aware,
must avoid index marking problems etc.) without breaking
existing functionality.

I agree, though, that it is not a good idea to do difficult things twice
or more (to have many TTS systems, each of them implementing
some form of the above functionality). We should ideally be
aiming at some common speech synthesizer framework.

Festival already supports much of the above functionality
in a more or less advanced way, and since it is extensible, I think
it might be a good tool for such things. Perhaps Festival could
be used for text-to-phoneme processing for eSpeak in
a similar way to how it can be used to do this for Mbrola?

With regards,
Hynek Hanke


 


