[gnuspeech-contact] Status of GnuSpeech


From: D.R. Hill
Subject: [gnuspeech-contact] Status of GnuSpeech
Date: Tue, 28 Sep 2004 17:13:38 -0600 (MDT)

Hi Lee,

Thanks for your query.

Articulatory synthesis is a method of producing synthetic speech from a waveguide (or "tube") model of the human vocal & nasal tracts. Conventional synthesis either takes small segments of real speech (usually represented in Linear Predictive Coded form, with the pitch effect removed) and concatenates them before re-imposing some pitch (intonation) contour; or it sends parameters to a set of bandpass filters to vary their frequencies, and feeds a voicing waveform and/or suitable noise through them (either in parallel or in series), with further filtering to account for things like the radiation impedance of the lips. DECTalk, as used by Stephen Hawking, is an example of the latter; the method dates back to the 1950s and is called "formant synthesis". The concatenation method is called "concatenative synthesis". Both methods have their problems and advantages.
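
To make the "formant synthesis" idea a little more concrete, here is a minimal source-filter sketch in C. It is not DECTalk's implementation (nor ours) -- just an illustration: a crude glottal pulse train is fed through three second-order resonators in cascade, and the formant frequencies, bandwidths and source are placeholder values I picked for an /a/-like vowel.

    /* Minimal "formant synthesis" (source-filter) sketch.  Illustrative
       only: a crude glottal pulse train shaped by three fixed resonators
       in cascade.  The numbers are placeholders, not DECTalk or gnuspeech
       data. */
    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define SR 16000.0                 /* sample rate, Hz */

    typedef struct { double a, b, c, y1, y2; } Resonator;

    /* Standard second-order digital resonator:
       y[n] = a*x[n] + b*y[n-1] + c*y[n-2] */
    static void res_set(Resonator *r, double f, double bw) {
        r->c = -exp(-2.0 * M_PI * bw / SR);
        r->b =  2.0 * exp(-M_PI * bw / SR) * cos(2.0 * M_PI * f / SR);
        r->a =  1.0 - r->b - r->c;
    }

    static double res_run(Resonator *r, double x) {
        double y = r->a * x + r->b * r->y1 + r->c * r->y2;
        r->y2 = r->y1;
        r->y1 = y;
        return y;
    }

    int main(void) {
        Resonator f1 = {0}, f2 = {0}, f3 = {0};
        res_set(&f1,  700.0,  90.0);   /* F1 */
        res_set(&f2, 1200.0, 110.0);   /* F2 */
        res_set(&f3, 2600.0, 160.0);   /* F3 */

        double f0 = 110.0, phase = 0.0;
        for (int n = 0; n < (int)SR; n++) {            /* one second */
            phase += f0 / SR;
            if (phase >= 1.0) phase -= 1.0;
            double source = (phase < 0.05) ? 1.0 : 0.0;  /* crude pulses */
            double y = res_run(&f3, res_run(&f2, res_run(&f1, source)));
            printf("%f\n", y);         /* raw sample values as text */
        }
        return 0;
    }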

You will find papers relevant to understanding the details and advantages of "articulatory synthesis" based on the tube model amongst those available on my university web site:

        http://www.cpsc.ucalgary.ca/~hill

You can also find a wealth of excellent information by going to Julius Smith's website at Stanford University:

        http://ccrma.stanford.edu/~jos

and link elsewhere from there, if necessary.

We call the "tube model" an "articulatory synthesiser" because it is controlled using what is called the "Distinctive Region Model" due to Rene Carre at the ENST in Paris who built it on the basis of work in 1973 at the Speech Technology Lab, KTH, Stockholm by Gunnar Fant and his colleagues.

The essence of the control method is to vary the diameter of each of eight "distinctive" regions of the tube, as happens in the real vocal tract. The regions are defined by the "formant sensitivity analysis" carried out by Fant, which showed that a constriction in each region has a specific, independent effect on the values of the three "formants", or resonant peaks in the speech spectrum, that determine the identity of the speech sounds. Carre showed that these regions also correspond fairly closely to the distribution of the articulators in the real human vocal tract, so that our articulators are appropriately positioned to effect just the constrictions required by the DRM. There is provision in our scheme for mapping the DRM regions onto specific articulatory gestures but, so far, we have simply used the DRM regions directly, together with information on rhythm and intonation derived from research at the U of C and other places.

Those who have heard the speech comment that it is the best they have heard, though in fact I think it still needs a lot of improvement -- we only have a first cut at the "posture" (phone articulation) data so far. The same listeners tell us it is much less tiring to listen to than conventional synthetic speech (by which they mean formant synthesis -- in practice, concatenative synthesis tends to be confined to rather short utterances for things like telephone intercepts).
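
If you want to see the kind of arithmetic a waveguide tube involves, the C sketch below implements a toy eight-section scattering-junction ("Kelly-Lochbaum") tube. It is emphatically not our Tube Resonance Model -- the area profile, end reflections and excitation are placeholders I made up -- but it shows the essential point: the only control parameters are the cross-sectional areas of the regions, which a DRM-style controller would vary continuously as the articulation changes.

    /* Toy 8-section Kelly-Lochbaum waveguide ("tube") sketch.  All
       numbers are placeholders; the point is that the areas A[] are
       the controls, as in the DRM description above. */
    #include <stdio.h>
    #include <string.h>

    #define NSEC 8             /* eight regions                                */
    #define SR   16000         /* ~1 sample of travel per ~2.2 cm section      */
    #define KGLOTTIS  0.97     /* nearly closed glottal end (pressure waves)   */
    #define KLIP     -0.85     /* open, lossy lip end                          */

    int main(void) {
        /* Illustrative area profile (cm^2); a real system would drive
           these continuously from the region parameters. */
        double A[NSEC] = { 1.5, 1.3, 1.0, 0.8, 1.2, 2.0, 2.8, 3.0 };

        double right[NSEC] = {0}, left[NSEC] = {0};   /* travelling waves */
        double f0 = 110.0, phase = 0.0;

        for (int n = 0; n < SR; n++) {                /* one second */
            double nr[NSEC], nl[NSEC];

            /* crude glottal pulse train as the excitation */
            phase += f0 / SR;
            if (phase >= 1.0) phase -= 1.0;
            double source = (phase < 0.05) ? 0.5 : 0.0;

            /* glottis end: inject source, reflect most of the return wave */
            nr[0] = source + KGLOTTIS * left[0];

            /* scattering at the junctions between adjacent sections */
            for (int i = 0; i < NSEC - 1; i++) {
                double k = (A[i] - A[i + 1]) / (A[i] + A[i + 1]);
                double w = k * (right[i] - left[i + 1]);
                nr[i + 1] = right[i]    + w;
                nl[i]     = left[i + 1] + w;
            }

            /* lip end: partial inverted reflection; the rest radiates */
            nl[NSEC - 1] = KLIP * right[NSEC - 1];
            double out = (1.0 + KLIP) * right[NSEC - 1];

            memcpy(right, nr, sizeof right);          /* one-sample delay */
            memcpy(left,  nl, sizeof left);           /* per section      */
            printf("%f\n", out);                      /* raw samples      */
        }
        return 0;
    }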

The "articulatory" synthesis allows a wide variety of different voice types to by used, simply by varying things like the tube length, pitch, breathiness (especially for female speech) and so on, and the rhythm and intonation models are based on a generalised abstraction of real speech.

I hope this meets your initial needs for information. I attach a .snd file that provides a comparison of the word "hello" spoken by male, female and child voices emanating from the tube. If your system doesn't like .snd files, just change the extension to .au -- they are the same format.
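
The reason the rename works is that NeXT .snd and Sun .au files share the same header layout: the big-endian magic ".snd" followed by data offset, data size, encoding code, sample rate and channel count. If you are curious, a few lines of C (the program below is just my quick sketch, and the file name is only a suggestion) will print those fields for the attached file:

    /* Print the NeXT/Sun audio header fields of a .snd or .au file. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t read_be32(FILE *f) {
        uint8_t b[4];
        if (fread(b, 1, 4, f) != 4) return 0;
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    }

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s file.snd\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror(argv[1]); return 1; }

        uint32_t magic  = read_be32(f);   /* 0x2E736E64 == ".snd"            */
        uint32_t offset = read_be32(f);   /* byte offset to the sample data  */
        uint32_t size   = read_be32(f);   /* data size in bytes              */
        uint32_t enc    = read_be32(f);   /* 1 = mu-law, 3 = 16-bit linear   */
        uint32_t rate   = read_be32(f);   /* sample rate in Hz               */
        uint32_t chans  = read_be32(f);   /* channel count                   */
        fclose(f);

        printf("magic 0x%08X  offset %u  size %u  encoding %u  rate %u  channels %u\n",
               magic, offset, size, enc, rate, chans);
        return magic == 0x2E736E64 ? 0 : 2;
    }

For example (names are mine):

    cc -o sndinfo sndinfo.c && ./sndinfo helloComparison.snd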

I should add that we have a complete database and system capable of producing continuous speech running under NeXTSTEP 3.x; this is what is available under the GPL (check the CVS repository) and what is being ported to other systems, particularly GnuStep and OS X.

Unfortunately, GnuStep is not widely available and is still somewhat immature, so progress on that front is slow -- the various system components include some that are GUI-intensive. I am considering building from scratch for Linux, without using GnuStep, but that is a *major* programming effort.

Thank you for your interest.

All good wishes.

david
-----
David R. Hill, Computer Science, U. Calgary  | Imagination is more
Calgary, AB, Canada T2N 1N4 Ph: 604-947-9362 | important than knowledge.
address@hidden OR address@hidden|         (Albert Einstein)
http://www.cpsc.ucalgary.ca/~hill            | Kill your television!
----
Lee Butterman wrote:


From address@hidden Tue Sep 28 07:59:15 2004
Date: Tue, 28 Sep 2004 09:54:22 -0400
From: Lee Butterman <address@hidden>
To: address@hidden
Subject: status of gnuspeech

Hi, I was wondering about two things.  First, I've never heard articulatory
synthesis before, so I was wondering if you had any pre-synthesized examples
just to demonstrate how it sounds.  Secondly, what's the status of gnuspeech?
Will there ever be, say, an MBROLA-like interface, where you've got some
detachable module that takes phonemes (along with tables/models of their
positions in the mouth?) and then synthesizes speech?

Thanks so much,
Lee


Attachment: helloComparison.snd
Description: Basic audio

