
Re: [gnuspeech-contact] TRM as backend for festival


From: David Hill
Subject: Re: [gnuspeech-contact] TRM as backend for festival
Date: Sat, 10 Feb 2007 15:53:07 -0800

I have tried accessing the samples you provided. Only one of them loaded and played, and it did not sound anything like speech. The TRM is simply the waveguide model of an acoustic tube, with control regions applied according to the Distinctive Region Model developed by Carré, based on earlier work by Fant. The underlying theory is outlined in the paper "Real-time articulatory speech-synthesis-by-rules", which is on my university web site and referenced from the gnuspeech project site (see below for the university web site URL). Manuals for "Synthesiser" and "Monet" are also on that web site, towards the end of section E of the published papers page. The Monet manual contains a table showing the equivalences between IPA symbols and the Monet symbols, which should allow you to translate them into the Festival set.
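As a rough illustration of what "waveguide model of an acoustic tube" means, here is a generic lossless Kelly-Lochbaum scattering junction between two adjacent tube sections. It is a textbook sketch under stated assumptions (pressure waves, no junction losses, areas taken as pi times radius squared), not code from the TRM, whose source remains the authority; the function names are hypothetical.

```python
import math


def reflection_coefficient(r_left: float, r_right: float) -> float:
    """Pressure reflection coefficient where section r_left meets section r_right.

    With cross-sectional areas A = pi * r**2, k = (A1 - A2) / (A1 + A2).
    The pi cancels, but is kept to show that areas, not radii, matter.
    Hypothetical helper, not part of the gnuspeech sources.
    """
    a_left = math.pi * r_left ** 2
    a_right = math.pi * r_right ** 2
    return (a_left - a_right) / (a_left + a_right)


def scatter(f_in: float, b_in: float, k: float) -> tuple[float, float]:
    """Lossless Kelly-Lochbaum junction (illustrative, assumes no junction loss).

    f_in: right-going wave arriving from the left section
    b_in: left-going wave arriving from the right section
    Returns (f_out, b_out): the wave transmitted into the right section and
    the wave sent back into the left section.
    """
    f_out = (1 + k) * f_in - k * b_in
    b_out = k * f_in + (1 - k) * b_in
    return f_out, b_out
```

For example, in the first attached block radius2 = 0.8 and radius3 = 0.4, so `reflection_coefficient(0.8, 0.4)` gives (0.64 - 0.16) / (0.64 + 0.16) = 0.6 (up to floating-point rounding): a sharp narrowing reflects much of the incident wave. A full tube chains one such junction between each pair of sections, with a delay line for each section.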

Monet is an interactive tool for developing data sets for arbitrary languages. Real-time Monet, a stripped-down version of Monet, is the heart of a daemon that uses these data sets to convert text to speech. It has not yet been ported, and it would be really nice if someone would take on that task (please ;-). Without the data sets, and the algorithms for manipulating the parameter tracks, you don't have a speech synthesiser; you have a rather specialised trumpet!

The data sets in "diphones.monet" were developed from several years of research in which British English speech was analysed for sound data, rhythmic (duration) data, and intonation data. This research is reported in other papers on the site.

If you would like to hear some samples of gnuspeech, go to my university web site:


Click on "Published papers" in the left menu, then select the first paper in section "B. National and international invited contributions ..." (it is the one referred to above).

At the bottom of the left-side menu on the resulting page you will find a whole bunch of examples of gnuspeech synthesis, some short, some long.

The tube resonance model parameters are specified in the source code for the TRM. I attach a sample set of parameters: 24 fixed (utterance-rate) parameters, followed by 6 blocks of 16 values for the so-called "speech-rate" parameters. The utterance represented is "oi", lasting about 1.5 seconds (the input control rate is 4 Hz and there are 6 blocks).
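The 1.5-second figure is simple arithmetic: each block is one control-rate frame lasting 1/4 s, so 6 blocks span 6 / 4 = 1.5 s. The sketch below shows that calculation, plus one plausible way of turning control-rate frames into a sample-rate parameter track by linear interpolation. The interpolation scheme, the 44100 Hz output rate, and the function names are assumptions for illustration, not a description of what the TRM source does.

```python
def utterance_duration(num_blocks: int, control_rate_hz: float) -> float:
    """Each block is one control-rate frame lasting 1 / control_rate_hz seconds."""
    return num_blocks / control_rate_hz


def interpolate_track(frames: list[float], control_rate_hz: float,
                      sample_rate_hz: int) -> list[float]:
    """Expand control-rate frames into a sample-rate parameter track.

    Assumed (hypothetical) scheme: during control period i the value ramps
    linearly from frames[i] towards frames[i + 1]; the final frame is held
    for its whole period, so the track lasts len(frames) control periods.
    """
    samples_per_frame = round(sample_rate_hz / control_rate_hz)
    out: list[float] = []
    for i, start in enumerate(frames):
        end = frames[i + 1] if i + 1 < len(frames) else start
        step = (end - start) / samples_per_frame
        out.extend(start + n * step for n in range(samples_per_frame))
    return out
```

Applied to the first column of the six attached blocks (GlottalPitch, 10.0 falling to 7.0) at an assumed 44100 Hz, `interpolate_track(pitch, 4, 44100)` produces 66150 samples, i.e. 1.5 s.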

The 16 speech-rate parameters are, in order:

GlottalPitch, GlottalVolume, AspirationVolume, FricativeVolume, FricativePosition, FricativeCentreFrequency, 
FricativeBandWidth, radius1, radius2, radius3, radius4, radius5, radius6, radius7, radius8, velumRadius
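The attached file can be read with a few lines of Python, assuming it starts directly with the 24 fixed values (the title and dashed lines shown in this message stripped off), that text after a `;` is a comment, and that every remaining non-blank line is one block of the 16 speech-rate parameters in the order listed above. `parse_trm_file` and its return layout are hypothetical; the TRM's own input reader is the reference for the real format.

```python
# Speech-rate parameter names, in the column order given in this message.
SPEECH_RATE_NAMES = [
    "GlottalPitch", "GlottalVolume", "AspirationVolume", "FricativeVolume",
    "FricativePosition", "FricativeCentreFrequency", "FricativeBandWidth",
    "radius1", "radius2", "radius3", "radius4", "radius5", "radius6",
    "radius7", "radius8", "velumRadius",
]
NUM_FIXED = 24  # fixed (utterance-rate) parameters at the top of the file


def parse_trm_file(text: str) -> tuple[list[float], list[dict[str, float]]]:
    """Split a TRM parameter file into fixed values and speech-rate blocks.

    Hypothetical helper, not part of the gnuspeech sources.
    """
    # Drop ';' comments, then drop lines that are now empty.
    lines = [ln.split(";", 1)[0].strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]
    if len(lines) < NUM_FIXED:
        raise ValueError(f"expected {NUM_FIXED} fixed parameters, found {len(lines)}")
    fixed = [float(ln.split()[0]) for ln in lines[:NUM_FIXED]]
    blocks = []
    for ln in lines[NUM_FIXED:]:
        values = [float(v) for v in ln.split()]
        if len(values) != len(SPEECH_RATE_NAMES):
            raise ValueError(f"expected 16 values, got {len(values)}: {ln!r}")
        blocks.append(dict(zip(SPEECH_RATE_NAMES, values)))
    return fixed, blocks
```

With the attached file, `fixed[0]` is the input control rate (4) and `len(blocks)` is 6, which reproduces the 1.5-second duration quoted above.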

I hope this helps.

--------
David Hill

Simplicity, patience, compassion. These three are your greatest treasures  (Tao Te Ching #67)
---------

On Feb 8, 2007, at 9:20 AM, Nickolay V. Shmyrev wrote:

Heh, since it seems it would be hard to build Monet on Linux, I've tried to adapt the TRM to work with Festival. Actually, it seems to me this would be interesting work, since Festival predicts intonation and duration much more precisely and is able to produce very good annotations.


----------
Parameter set for TRM "oi" (1.5 seconds, falling pitch during diphthong)
------------------------------------------------------------------------------------------------

4         ; input control rate (1 - 1000 Hz)
60.0      ; master volume (0 - 60 dB)
1         ; number of sound output channels (1 or 2)
0.0       ; stereo balance (-1 to +1)
0         ; glottal source waveform type (0 = pulse, 1 = sine)
40.0      ; glottal pulse rise time (5 - 50 % of GP period)
22.0      ; glottal pulse fall time minimum (5 - 50 % of GP period)
45.0      ; glottal pulse fall time maximum (5 - 50 % of GP period)
2.50      ; glottal source breathiness (0 - 10 % of GS amplitude)
10.0      ; nominal tube length (10 - 20 cm)
32        ; tube temperature (25 - 40 degrees Celsius)
1.00      ; junction loss factor (0 - 5 % of unity gain)
3.05      ; aperture scaling radius (3.05 - 12 cm)
0.75      ; mouth aperture coefficient (0 - 0.99)
0.72      ; nose aperture coefficient (0 - 0.99)
1.35      ; radius of nose section 1 (0 - 3 cm)
1.96      ; radius of nose section 2 (0 - 3 cm)
1.91      ; radius of nose section 3 (0 - 3 cm)
1.3       ; radius of nose section 4 (0 - 3 cm)
0.73      ; radius of nose section 5 (0 - 3 cm)
1500.0    ; throat lowpass frequency cutoff (50 - Nyquist Hz)
6.0       ; throat volume (0 - 48 dB)
1         ; pulse modulation of noise (0 = off, 1 = on)
48.0      ; noise crossmix offset (30 - 60 dB)
10.0    0.0     0.0     0.0     4.0     4400    600     0.8     0.8     0.4     0.4     1.78    1.78    1.26    0.8     0.0
9.5     54.0    0.0     0.0     4.0     4400    600     0.8     0.8     0.4     0.4     1.78    1.78    1.26    0.8     0.0
9.0     60.0    0.0     0.0     4.0     4400    600     0.8     0.8     0.6     0.6     1.58    1.58    1.13    1.01    0.1
8.5     60.0    0.0     0.0     4.0     4450    550     0.8     0.8     1.28    1.28    1.0     1.0     1.0     0.8     1.0
8.0     54.0    0.0     0.0     4.0     4500    500     0.8     0.8     1.68    1.58    0.8     0.8     0.5     0.4     1.0
7.0     51.0    0.0     0.0     4.0     4500    500     0.8     0.8     1.78    1.78    0.2     0.2     0.4     0.0     1.0

----------





