
Re: [gnuspeech-contact] TRM as backend for festival


From: David Hill
Subject: Re: [gnuspeech-contact] TRM as backend for festival
Date: Sun, 11 Feb 2007 16:31:47 -0800

It would probably help your understanding if you were to read the Monet manual.  You wrote (see below):

But I have no idea how Monet
reproduces consonants. There are examples, but no trm files for them.

The .trm files are associated strictly with the tube model ("trm" = "tube resonance model") and are saved and used by the "Synthesiser" application, a GUI application for playing with the tube -- but only with steady-state configurations.  (You should probably read that manual as well.)  Consonants are mostly created by the dynamics of vocal-tract changes, though there are also some continuant sounds involving frication (e.g. /s/), and even for these the transitional cues are important.  Thus it is impossible to create consonants from .trm files alone.  They were really only useful for exploring the vocal tract configurations that define the "postures" (loosely related to "phones") used as anchor points for the varying speech parameters.

The dynamic information needed for complete speech is created from these quasi-steady-state values representing vocal tract postures, plus context-sensitive rules for moving from posture to posture, according to timing information that reflects the rhythmic character of British English.  This information is all held within "diphones.monet" (the rules are actually more complex than diphones in many cases and include triphones and even tetraphones).  Monet has the algorithms to use this information appropriately.  Intonation is applied by varying the pitch (F0) parameter of the stream of tube parameters generated on this basis, according to a model of British English intonation based on work by M.A.K. Halliday and elaborated by our own studies.  These variations are added to small pitch changes created at the posture (segmental) level by constrictions in the vocal tract -- so-called "micro-intonation" -- which provide additional cues for the identification of consonants.  Many of the relevant papers are available on my university web site.
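To make the posture-to-parameter idea concrete, here is a minimal sketch of turning two posture value sets into per-frame tube parameters, assuming a simple linear transition.  It is only an illustration: the parameter names and numbers are invented, and the real rules use context-sensitive transition profiles and rhythm-derived timing rather than a straight line.

/* Illustrative only: linear interpolation between two vocal-tract "postures"
   to produce per-frame tube parameters.  The real Monet rules use
   context-sensitive (diphone/triphone/tetraphone) transition profiles and
   rhythm-derived timing, and add micro-intonation at the segmental level;
   the parameter values here are invented. */
#include <stdio.h>

#define N_PARAMS 4   /* real tube model takes more (radii, glottal, frication, velum) */

typedef struct {
    double p[N_PARAMS];   /* quasi-steady-state values for one posture */
} Posture;

/* Emit one parameter frame every frame_ms while moving from posture a to b. */
static void transition(const Posture *a, const Posture *b,
                       double duration_ms, double frame_ms)
{
    for (double t = 0.0; t < duration_ms; t += frame_ms) {
        double w = t / duration_ms;                 /* 0 -> 1 across the move */
        for (int i = 0; i < N_PARAMS; i++)
            printf("%g ", (1.0 - w) * a->p[i] + w * b->p[i]);
        printf("\n");                               /* one frame for the tube */
    }
}

int main(void)
{
    Posture vowel = { { 120.0, 1.2, 1.6, 0.8 } };   /* made-up "vowel" posture */
    Posture nasal = { { 118.0, 0.9, 0.1, 1.5 } };   /* made-up "nasal" posture */
    transition(&vowel, &nasal, 80.0, 4.0);          /* 80 ms move, 4 ms frames */
    return 0;
}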

The "oi" sound is just a succession of vowel sounds with a varying pitch, so a series of what appear to be .trm values will work.  To produce speech, you need to be able to construct a more complex set of varying parameters reflecting the reality of speech.  This is what Monet does.  This is the part of Monet that needs to be extracted if all you wish to do is convert sound specifications to a speech waveform specification.  The current Monet does much more since it allows you to create the databases as well as listen to the speech that can then be produced.  The extracted part (non-interactive) that would simply use the databases to convert streams of posture symbols to an output waveform is what we call "Real-time Monet".  It has not been ported from the original NeXT implementation yet.

david

On Feb 11, 2007, at 1:06 PM, Nickolay V. Shmyrev wrote:

On Sat, 10/02/2007 at 15:53 -0800, David Hill wrote:
I have tried accessing the samples you provided.  Only one of them
loaded and played.  It did not sound anything like speech.  The TRM is
simply the waveguide model of an acoustic tube, with control regions
applied according to the Distinctive Region Model developed by Carré,
based on earlier work by Fant.  The underlying theory is outlined in
the paper "Real-time articulatory speech-synthesis-by-rules" on my
university web site and referenced from the gnuspeech project site
(see below for the university web site URL).  Manuals for
"Synthesiser" and "Monet" also appear on that web site, towards the
end of section E of the published papers page.  In the Monet manual
there is a table showing the equivalences between IPA symbols and the
Monet symbols.  This should allow you to translate into the Festival
set.
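
(As an aside, the "waveguide model of an acoustic tube" is, in its textbook form, a chain of Kelly-Lochbaum scattering junctions.  The fragment below is only a generic illustration of that idea, with made-up section areas and end reflections; it is not the gnuspeech TRM code, which also models the nasal branch, losses, radiation, noise sources and the DRM control regions.)

/* Generic Kelly-Lochbaum sketch: the tube as a chain of waveguide sections. */
#include <stdio.h>

#define N 8   /* number of cylindrical sections */

int main(void)
{
    double area[N] = { 1.0, 1.2, 1.5, 2.0, 2.5, 2.0, 1.5, 1.0 };  /* cm^2, made up */
    double fwd[N] = { 0 }, bwd[N] = { 0 };   /* pressure waves per section */
    double k[N - 1];                         /* reflection coeff. per junction */

    for (int i = 0; i < N - 1; i++)
        k[i] = (area[i] - area[i + 1]) / (area[i] + area[i + 1]);

    for (int t = 0; t < 200; t++) {
        double in = (t == 0) ? 1.0 : 0.0;    /* unit impulse at the glottis end */
        double new_fwd[N], new_bwd[N];

        new_fwd[0] = in + 0.9 * bwd[0];      /* nearly closed glottis: reflect ~ +1 */
        for (int i = 0; i < N - 1; i++) {    /* scatter at each junction */
            new_fwd[i + 1] = (1.0 + k[i]) * fwd[i] - k[i] * bwd[i + 1];
            new_bwd[i]     = k[i] * fwd[i] + (1.0 - k[i]) * bwd[i + 1];
        }
        new_bwd[N - 1] = -0.9 * fwd[N - 1];  /* open lips: reflect ~ -1 */

        for (int i = 0; i < N; i++) { fwd[i] = new_fwd[i]; bwd[i] = new_bwd[i]; }
        printf("%g\n", fwd[N - 1]);          /* wave at the lip end (impulse response) */
    }
    return 0;
}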

Ok, thanks, I'll do that.

Monet is an interactive tool for developing data sets for arbitrary
languages.  Real-time Monet (which has not yet been ported) is the
heart of a daemon that uses these data sets to convert text to speech.
It is a stripped down version of Monet and it would be really nice if
someone would take on that task (please ;-).  Without the data sets,
and the algorithms for manipulating the parameter tracks, you don't
have a speech synthesiser, you have a rather specialised trumpet!

Well, I can do that.  I just need more explanation.  Is it the part Steve
split out into the Framework dir?  Currently Monet compiles fine, only the
gorm files are missing.  I don't think sound output is required, btw; it's
enough to be able to save an audio file.

The data sets developed for synthesis in "diphones.monet" were
developed based on several years of research in which British English
speech was analysed for sound data, rhythmic (duration) data, and
intonation data.  This research is reported in other papers on the
site.

Btw, have you heard about the MOSHA database?
It seems that Alan has already used it in unit-selection synthesis.  It's
not free, I suppose, which is why that work still isn't available.  If it
were possible to generate a set of prompts (around 1000 would be enough, I
suppose) with Monet and then process the coefficients with unit selection,
that would be an interesting thing.

If you would like to hear some samples of gnuspeech, go to my
university web site:

Yeah, I've downloaded them, but the problem is that I can reproduce
vowels, like the "oi" example you've sent.  But I have no idea how Monet
reproduces consonants.  There are examples, but no .trm files for them.
And the examples I have (for instance the one Steve kindly sent to me)
sound like a trumpet, as you've noticed :)  That's why I suspect there is
a bug in the TRM that makes consonant generation impossible.






