This is a collection of examples of synthetic affective speech conveying an emotion or natural expression,
maintained by Felix Burkhardt.
Some of these samples are direct copies from natural data,
others are generated by expert rules or derived from databases. The emotional labels "anger", "fear", "joy" and
"sad" are my (short) designators for "the big four" basic emotions,
not necessarily the authors' ones.
Examples of German actors
simulating emotional arousal can be found here.
I recently gave a talk on emotional speech synthesis.
Examples of German text-to-speech synthesizers can be found here.
Please feel encouraged to let me know about your own or missing
attempts to simulate emotional speech!
(fxburk@gmail.com)
N. Audibert, V. Aubergé, A. Rilliard |
 |
ICP |
2006 |
Copy prosody and intensity from satisfied and sad speech onto neutral
speech using the PSOLA technique (a toy sketch of the idea follows this entry). See the article "The Prosodic Dimensions
of Emotion in Speech: the Relative Weights of Parameters", Interspeech
2005 (Lisbon), for details |
 |
- |
 |
 |
- |
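To make the copy-prosody mechanism concrete, here is a minimal, hedged TD-PSOLA-style sketch (pitch marks and target periods are assumed to be given, e.g. extracted with Praat; this illustrates the general idea, not the authors' implementation):

```python
# Hedged sketch of the copy-prosody idea (NOT the authors' code): grains of a
# neutral signal are re-spaced by overlap-add so the output follows the pitch
# periods measured on an emotional utterance.
import numpy as np

def psola_copy_pitch(signal, src_marks, tgt_periods):
    """signal: neutral waveform; src_marks: pitch-mark sample positions in it;
    tgt_periods: pitch periods (in samples) taken from the emotional speech."""
    out = np.zeros(int(sum(tgt_periods) + 3 * max(tgt_periods)))
    pos = float(tgt_periods[0])
    for i, period in enumerate(tgt_periods):
        m = src_marks[min(i, len(src_marks) - 1)]     # naive grain selection
        half = int(period)
        lo, hi = max(m - half, 0), min(m + half, len(signal))
        grain = signal[lo:hi] * np.hanning(hi - lo)   # two-period Hann window
        start = int(pos) - (m - lo)                   # centre the grain at pos
        out[start:start + len(grain)] += grain        # overlap-add
        pos += period                                 # next target pitch mark
    return out[:int(pos)]
```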
Stephan
Baldes |
 |
DFKI |
1999 |
Rule-based emotion simulation with Entropic's formant TTS engine TrueTalk, based on Cahn's Affect Editor
approach |
- |
 |
- |
 |
 |
Roberto Barra Chicote |
 |
Polytechnic Univ. of Madrid |
2010 |
HMM-based emotional speech synthesis, i.e. emotional prosodic and source models applied to HMM-coded speech data.
See the article
"Roberto Barra-Chicote, Junichi Yamagishi, Simon King, Juan Manuel Montero, Javier Macias-Guarasa, Analysis of
statistical parametric and unit selection speech synthesis systems applied to emotional speech, Speech
Communication, Volume 52, Issue 5, 2010" for details.
|
 |
 |
 |
 |
 |
Murtaza
Bulut, Carlos Busso, Serdar Yildirim, Abe Kazemzadeh, Chul Min Lee, Sungbok Lee, Shrikanth Narayanan |
 |
USC/Sail |
2005 |
Emotional voice conversion by changing prosody
(TD-PSOLA) and spectrum (LPC modification). The samples
demonstrate neutral to target-emotion conversion. See the article
"Investigating the role of phoneme-level modifications in emotional
speech resynthesis", Proc. Interspeech 2005, for
details |
|
|
|
 |
- |
Murtaza
Bulut, Shri
Narayanan, Ann Syrdal |
 |
USC/Sail,
AT&T
|
2002 |
Diphone synthesis with hand-crafted diphones
and copy prosody for the appropriate emotion. See the article "Expressive
Speech Synthesis Using a Concatenative Synthesizer", Proc. ICSLP 2002, for
details |
 |
 |
 |
 |
- |
Felix Burkhardt |
 |
Deutsche Telekom Laboratories |
2005 |
emofilt: rule-based
simulation (prosody only) with MBROLA; a toy sketch of such a rule follows this entry's voice list.
German male voice (de6) (neutral prosody from
txt2pho) |
 |
 |
 |
 |
 |
English male voice (en1) |
|
 |
 |
 |
 |
Spanish male voice (es2) |
|
|
 |
 |
|
Mandarin Chinese female voice (cn1) |
 |
 |
 |
 |
 |
French male voice (fr1) |
 |
 |
 |
 |
 |
Greek male voice (gr2) |
 |
 |
 |
 |
 |
Dutch male voice (nl2) |
 |
 |
 |
 |
 |
Hungarian male voice (hu1) |
 |
 |
 |
 |
 |
Italian male voice (it3) |
 |
 |
 |
 |
 |
Turkish male voice (tr1) |
|
 |
 |
 |
 |
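The toy sketch below illustrates the kind of prosody rule such a filter can apply to an MBROLA .pho file, where each line reads "phoneme duration-in-ms (position-percent F0-Hz)*". It is a hedged illustration of the principle, not emofilt's actual rule set:

```python
# Toy "joy" rule on MBROLA .pho input: raise F0 targets, shorten durations.
# Illustrative only; emofilt's real rules are considerably more elaborate.
def emotionalize_pho(pho_lines, f0_scale=1.3, dur_scale=0.8):
    out = []
    for line in pho_lines:
        fields = line.split()
        if not fields or line.startswith(";"):   # keep comments/empty lines
            out.append(line)
            continue
        phon, dur, rest = fields[0], int(fields[1]), fields[2:]
        # rest alternates pitch-target position (%) and F0 value (Hz)
        pitch = [v if i % 2 == 0 else str(int(int(v) * f0_scale))
                 for i, v in enumerate(rest)]
        out.append(" ".join([phon, str(int(dur * dur_scale))] + pitch))
    return out

# "h 80 20 110 80 120" -> "h 64 20 143 80 156"
print("\n".join(emotionalize_pho(["_ 50", "h 80 20 110 80 120", "a 120 50 130"])))
```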
1998 |
emofilt old version |
 |
 |
 |
 |
 |
2000 |
emoSyn 1: rule-based simulation with a formant synthesizer (Sensyn version); the neutral
sentence is
copy synthesis, more examples here |
 |
 |
 |
 |
 |
2000 |
emoSyn 2: copy
synthesis with a formant synthesizer (Iles
& Simmons version) |
orig. |
 |
 |
 |
 |
synth. |
 |
 |
 |
 |
1998 |
esps2mbrola: prosody-copy synthesis with MBROLA |
orig. |
 |
 |
 |
 |
synth. |
 |
 |
 |
 |
Joao P. Cabral |
 |
L2F INESC-ID Lisboa |
2005 |
Transformation of neutral speech using
LP-PSOLA. Changes pitch, duration and energy as well as
voice quality (by transforming the residual). Emotion rules were
derived from the literature. See the article "Pitch-Synchronous
Time-Scaling for Prosodic and Voice Quality Transformations",
Proc. Interspeech 2005, for details.
male voice |
 |
 |
 |
 |
 |
female voice |
 |
 |
 |
 |
 |
Janet
Cahn |
 |
MIT |
1989 ? |
Affect
Editor: rule-based emotion simulation with DecTalk |
- |
 |
 |
 |
 |
Piero Cosi, Fabio
Tesser, Roberto
Gretter, Carlo Drioli,
Graziano Tisato |
 |
ISTC-SPFD |
2004 |
Emotive
Mbrola: Italian concatenation synthesis with the Festival speech
synthesis framework and MBROLA voices.
The prosody was learned from an emotional database (CART). An article
appeared at Interspeech 2005.
male voice |
 |
 |
 |
 |
 |
female voice |
 |
 |
 |
 |
 |
with manipulation of voice
quality |
male voice |
 |
 |
 |
 |
female voice |
 |
 |
 |
 |
CosyVoice |
|
FunAudioLLM |
2025 |
Version 2.0, end-to-end emotional speech synthesis |
 |
 |
 |
 |
- |
Coqui.ai |
 |
Coqui |
2023 |
Ana Florence (one of many voices): Deep learning based latent space synthesis. |
 |
 |
 |
 |
Björn Granström,
Rolf Carlson ?
|
 |
KTH |
1998 ? |
KTH Royal
Institute of Technology (orig. link broken): Swedish copy
formant synthesis. |
 |
 |
 |
 |
- |
Akemi Iida |
 |
ATR |
2000 ? |
Chatr
Emotion: Japanese concatenation
synthesis using emotional databases with CHATR. See the article "A Speech Synthesis System with Emotion for
Assisting Communication", Proc. ISCA Workshop on Speech and Emotion, Belfast, 2000. Since then CHATR
has been expanded to NATR |
- |
 |
 |
 |
- |
new CHATR with Emotion |
- |
 |
 |
 |
- |
David from IRCAM |
 |
IRCAM |
2015 |
Not a speech synthesizer, but re-synthesis to "emotionalize" existing speech by pitch-shifting, filtering, and
adding vibrato and inflections; see the
article "DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in
running speech",
Behavior Research Methods (2017), for details |
 |
- |
 |
 |
 |
Gregor Hofer |
 |
Univ. Edinburgh |
2004 |
Gregor Hofer's master's
thesis: a unit-selection database recorded in neutral as well as
happy and angry styles. More
samples. |
 |
 |
 |
- |
- |
half-emotional by mixing units from neutral and
emotional data |
 |
 |
- |
- |
Ignasi
Iriondo, Francesc Alías, Javier Melenchón, M. Angeles
Llorca |
 |
Univ. Ramon
Llull |
2003 |
Catalan diphone synthesis with emotion rules; see the
article "Modeling and Synthesizing Emotional Speech for Catalan
Text-to-Speech Synthesis" (Proc. ADS 2004) for details |
- |
 |
 |
 |
 |
Syaheerah L. Lutfi |
 |
Polytechnic Univ. of Madrid |
2006 |
Template/rule-based synthesis of prosody parameters using Mbrola synthesis, see the article "Syaheerah L.
Lutfi, Raja N. Ainon and Zuraidah M. Don. eXpressive Text Reader Automation Layer (eXTRA): Template driven
‘Emotion Layer’ in Malay Concatenated Synthesized Speech, Proc. of International Conference on
Speech Databases and Assessment (Oriental COCOSDA 06), Penang, Malaysia, Dec 2006" for details.
|
 |
 |
- |
- |
- |
Cynthia Breazeal |
 |
MIT |
2000 |
Kismet:
rule-based emotion simulation with DecTalk (like the Affect
Editor) |
 |
 |
 |
 |
 |
with words |
 |
 |
 |
 |
 |
Keisuke Miyanaga, Makoto Tachibana,
Junichi Yamagishi, Koji Onishi, Takashi Masuko,
Takao
Kobayashi
|
 |
Tokyo Institute of
technology, Kobayashi Lab. |
2004 |
HMM (data-based) modeling of emotional expression
(spectral and prosodic), which enables mixing; a toy moment-matching sketch follows this entry. See the article "HMM-Based
Speech Synthesis with Various Speaking Styles Using Model
Interpolation" or further demos:
each emotion individually modeled |
 |
|
|
|
- |
emotion as a contextual factor (like
phonetic/linguistic factors) |
|
|
 |
 |
- |
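A toy sketch of the interpolation idea (simple moment matching over Gaussian output distributions; the names and the exact scheme here are assumptions for illustration, not the authors' code):

```python
# Mix the Gaussian output distributions of per-style models with weights
# to obtain intermediate expressivity (simple moment matching).
import numpy as np

def interpolate_gaussians(means, variances, weights):
    """means, variances: (n_models, dim); weights: (n_models,), summing to 1."""
    w = np.asarray(weights)[:, None]
    mu = (w * means).sum(axis=0)
    # E[x^2] of the mixture minus the squared mixed mean
    var = (w * (variances + means ** 2)).sum(axis=0) - mu ** 2
    return mu, var

# e.g. a 50/50 neutral/angry blend of one stream's Gaussians:
# mu, var = interpolate_gaussians(np.stack([mu_neu, mu_ang]),
#                                 np.stack([var_neu, var_ang]), [0.5, 0.5])
```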
Juan M. Montero
Martínez |
 |
Univ. Madrid |
1998 ? |
montero
1: rule-based emotion simulation with a Spanish
diphone synthesizer |
 |
 |
 |
 |
- |
montero
2: rule-based emotion simulation with the
KTH formant synthesizer |
 |
 |
 |
 |
- |
Shinya Mori, Tsuyoshi Moriyama, Shinji Ozawa |
 |
Dept.
of Media and Image Technology, Tokyo Polytechnic University |
2008 |
PSOLA transformation based on database-trained prosody modification
rules. The algorithm allows for graded expressions.
See the article "Emotional Speech Synthesis Using
Subspace Constraints in Prosody", Proc. ICME 2006, or follow this
link |
half |
 |
 |
 |
- |
full |
 |
 |
 |
- |
Iain
Murray |
 |
Univ. Dundee |
1989 ? |
HAMLET:
rule-based emotion simulation with formant speech synthesis |
 |
 |
 |
 |
 |
1997 |
Laureate:
hand-optimized concatenative TTS |
 |
 |
 |
 |
 |
1997 |
LAERTES:
rule-based simulation using Laureate |
 |
 |
 |
 |
 |
Pierre-Yves Oudeyer |
 |
Sony |
2000 |
cartoon speech:
nonsense speech for Sony pet robots based on concatenative
synthesis. Article: Oudeyer P-Y., "The Synthesis of Cartoon
Emotional Speech", Proc. of the 1st International Conference on
Speech Prosody, Aix-en-Provence, eds. B. Bel, I. Marlien, 2002 |
low intensity |
 |
 |
 |
- |
high intensity |
 |
 |
 |
- |
with Japanese words |
 |
 |
 |
- |
? |
 |
IBM Tokyo Research
Laboratory |
1996 |
ProTalker from
IBM Tokyo Research Laboratory. |
 |
 |
 |
- |
- |
S. R. M. Prasanna and D. Govind |
 |
Indian Institute of Technology Guwahati |
2010 |
Modeling pitch and excitation (voice source and EGG estimated with the zero-frequency filtering approach) with
linear prediction,
using EMO-DB. See the article "Analysis of Excitation Source Information in Emotional Speech",
Interspeech 2010, Makuhari, for details.
|
- |
 |
 |
- |
 |
Erhard Rank/Hannes
Pirker |
 |
ÖFAI |
1998 |
VieCtoS:
demos of the demisyllable LPC-synthesizer VieCtoS of the Austrian
Research Institute for Artificial Intelligence (ÖFAI) copying
emotional speech |
 |
 |
 |
 |
 |
Marc
Schröder |
 |
DFKI |
2002 |
While working on the NECA project
and his Ph.D. thesis, Marc developed a system capable of
producing emotional speech based on a description in emotional
dimensions (arousal, valence, potency). I tried to map his results
to basic emotions. The system is based on MBROLA as
DSP
and MARY as NLP. In order to
control voice quality, six databases for MBROLA were developed: for
a male and a female speaker each, a normal, a tense and a lax voice. |
- |
 |
 |
 |
 |
Ingmar Steiner, Marc Schröder and Marcela
Charfuelan |
 |
Mary DFKI Pavoque project |
2011 |
Non-uniform unit selection with a database annotated with expressive styles; see the article
"Steiner, Ingmar, Marc Schröder, Marcela Charfuelan (2010): Symbolic vs. acoustics-based style control for
expressive unit selection, 7th ISCA Workshop on Speech Synthesis, Kyoto, Japan" for details.
|
 |
 |
 |
 |
- |
Jun Sato
|
|
? |
1998 ? |
Japanese demo of emotional synthetic speech
generated by artificial neural nets. Article: J. Sato and S. Morishima,
"Emotion modeling in speech production using emotion space", in
Proc. IEEE Int. Workshop on Robot and Human Communication, Tsukuba,
Japan, Nov. 1996 |
 |
 |
 |
 |
- |
Oytun Türk and Marc Schröder |
 |
DFKI |
2008 |
GMM-based voice conversion, i.e. the spectral envelope of neutral speech is transferred to the
target emotional speech in order to simulate emotional voice quality without having to record a whole emotional
database; a hedged sketch of the classic joint-GMM mapping follows this entry. See the article "A Comparison of Voice Conversion Methods for Transforming Voice Quality in Emotional
Speech Synthesis", Proc. Interspeech 2008, Brisbane.
|
orig.  |
 |
 |
 |
- |
synth. (GMM method) |
 |
 |
 |
- |
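A hedged sketch of the classic joint-GMM regression that such conversion systems typically use (scikit-learn based; parallel, time-aligned source/target frames X and Y are assumed; this illustrates the general technique, not the authors' exact method):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, k=8):
    """X, Y: (n_frames, d) aligned source/target spectral features."""
    gmm = GaussianMixture(n_components=k, covariance_type="full")
    gmm.fit(np.hstack([X, Y]))                 # model p(x, y) jointly
    return gmm

def convert(gmm, X, d):
    """Map each source frame x to E[y | x] under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    Sxx, Syx = gmm.covariances_[:, :d, :d], gmm.covariances_[:, d:, :d]
    # responsibilities of each component given x (under the x-marginal)
    logp = np.stack([multivariate_normal.logpdf(X, mu_x[k], Sxx[k])
                     for k in range(gmm.n_components)], axis=1)
    logp += np.log(gmm.weights_)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    Y = np.zeros((len(X), d))
    for k in range(gmm.n_components):
        reg = Syx[k] @ np.linalg.inv(Sxx[k])   # cross-covariance regression
        Y += post[:, [k]] * (mu_y[k] + (X - mu_x[k]) @ reg.T)
    return Y
```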
E. Zovato,
A. Pacchiotti, S. Quazza, S. Sandri |
 |
Loquendo |
2004 |
From Loquendo. Rule-based prosody
manipulation (PSOLA-like) of a non-uniform unit-selection engine; see
"Towards emotional
speech synthesis: a rule based approach", presented at the 5th
ISCA Workshop on Speech Synthesis in Pittsburgh, 2004. Other examples from Loquendo
|
 |
 |
 |
 |
- |
Kun Zhou |
 |
National University of Singapore |
2023 |
Deep learning approach, featuring mixed emotions; see
the paper Speech Synthesis with Mixed Emotions
|
- |
 |
 |
 |
- |
Yuexin Cao |
 |
Ping An Technology |
2020 |
Deep learning approach combining VAEs with GANs; the emotion labels used for the conversion are
supervised. See
the paper
Nonparallel Emotional Speech Conversion Using VAE-GAN
|
 |
 |
 |
 |
- |
Acapela
|
|
Acapela |
2013 |
Non-uniform unit-selection synthesis with different emotional and style-varying voices.
UK English Voice
|
Elizabeth |
Peter neutral |
Peter happy |
Peter sad |
|
|
|
 |
|
2013 |
Non-uniform unit-selection synthesis with different emotional and style-varying voices.
French Voice
|
Antoine neutral |
Antoine happy |
Antoine sad |
Antoine from afar |
Antoine up close |
|
|
|
 |
 |
|
2013 |
Non-uniform unit-selection synthesis with different emotional and style-varying voices.
US English Voice
|
Will neutral |
Will happy |
Will sad |
Will bad guy |
Will from afar |
Will up close |
|
|
|
 |
 |
 |
|
Baird, Amiriparian and Schuller |
 |
University of Augsburg |
2020 |
WaveNet (deep learning) based approach to generate new emotional speech, trained on the Italian DEMoS corpus
to enrich the training samples; please see the article
for details |
Simulation of joy |
Simulation of sadness |
 |
 |
|
Greg Beller
|
|
IRCAM |
2005 |
Transforms real voices via a content-based
transformation with a phase-vocoder algorithm.
Time-stretch and transposition coefficients change over the utterance
depending on the expressivity and the context of the units
(whether they are part of an accented syllable or whether they are
consonants, for instance).
|
original |
sad transformation |
bored transformation |
frightened transformation |
|
|
|
 |
|
2009 |
Synthesis of laughter based on a spring model and the repetition of
specific phones. Examples found here
|
synthesized laughter |
|
|
? |
 |
Cepstral |
2004 |
Non-uniform unit selection. Databases recorded in a
certain style. |
Damian: dark personality |
Duchess: sensitive voice |
Shouty: non-sensitive voice |
 |
|
 |
|
? |
 |
ETI Eloquence (as Eloquent belongs to Scansoft
now, the link is broken) |
1998 |
Rule-based formant synthesis. The emotional
expression was hand-optimized (demonstration from Eloquent) |
expressive male |
expressive female |
 |
 |
|
Ellen Eide et al |
 |
IBM
Watson research Center |
2004 |
Non-uniform unit selection trained with an
expressive prosody model, paralinguistic events and expressive
units. Described in the article "A Corpus-Based Approach to
<AHEM/> Expressive Speech Synthesis" |
good news |
bad news |
question |
other |
For comparison: IBM CTTS taken from the website. Note
that the research engine is more advanced than the
product. |
|
|
|
 |
 |
|
Chengyi Wang et al. |
|
Microsoft VALL-E |
2023 |
Neural speech synthesis that takes a text and a short (3 sec) speech sample as input and generates
audio with the textual content in the style and voice of the speech sample |
Angry |
Sleepy |
Neutral |
Amused |
Disgusted |
human:
|
|
|
|
|
tts:
|
|
|
|
|
|
IBM Watson |
 |
IBM
Watson Developer Cloud |
2016 |
Watson voice Allison: Non-uniform unit-selection in three styles: Good News, Apology, and Uncertain
|
apology |
uncertain |
good news |
 |
 |
|
|
Eva Szekely et al |
 |
University College Dublin |
2011 |
HMM synthesis based on expressive speech extracted automatically from audiobook readings by
clustering glottal source parameters. Described in the article
"Clustering Expressive Speech Styles in Audiobooks Using Glottal Source Parameters" |
soft style |
tense style |
expressive style |
|
|
 |
|
Enrico Zovato et al |
 |
Loquendo |
2004 |
Non-uniform unit-selection enriched by
paralinguistic events and expressive units. Other examples from Loquendo. |
German (Kathrin)/(Stefan) |
French (Juliette) |
English (Simon) |
/
|
 |
 |
|
F. Malfrere |
 |
MBROLIGN
from TCTS Lab, Mons |
1999 |
data-based prosody synthesis with MBROLA
(the
decision-tree algorithm was trained on a database spoken with the
appropriate affect) |
neutral |
nervous |
astonished |
shy |
|
|
 |
|
|
Google Deepmind WaveNet |
 |
Google Deepmind WaveNet
|
2016 |
Synthesis by PCM value prediction from a deep neural net; a toy sketch of the sampling loop follows this entry. See the paper "WaveNet: A Generative Model for
Raw Audio" by van den Oord et al. for details. Samples for models trained on three different speakers. One sample
is speech without given text, computed by probability from previous samples. |
Speaker 1 |
Speaker 2 |
Speaker 3 |
No text given |
 |
 |
 |
|
|
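In spirit, the generation loop looks like the toy sketch below (an illustration only, not DeepMind's code; `net` stands in for the trained dilated-convolution stack):

```python
import numpy as np

def generate(net, n_samples, receptive_field=1024, n_classes=256):
    """Draw each 8-bit mu-law sample from a categorical distribution
    conditioned on the previous samples, then decode back to PCM."""
    audio = [n_classes // 2]                    # start at mu-law "zero"
    for _ in range(n_samples):
        logits = net(audio[-receptive_field:])  # (n_classes,) scores
        p = np.exp(logits - logits.max())
        p /= p.sum()                            # softmax over PCM classes
        audio.append(np.random.choice(n_classes, p=p))
    return mu_law_decode(np.array(audio[1:]), n_classes)

def mu_law_decode(codes, n_classes=256):
    mu = n_classes - 1
    y = 2.0 * codes / mu - 1.0                  # codes back to [-1, 1]
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```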
Kaiser and Schallner |
 |
Nuremberg Institute for Market Decisions |
2020 |
The system uses a Tacotron (deep learning) based architecture utilizing Global Style Tokens (GST) and
text prediction of style embeddings (TPSE) for an emotional speech synthesizer, used to investigate speech
assistants in German; see the paper
for a description.
|
|
Ivo-Software |
 |
Ivona |
2011 |
Non-uniform Selection Synthesis, Character voice |
Chipmunk |
|
|
John Stallo |
 |
Metaface
from Curtin University |
2000 |
Implementation of the Virtual Human Markup Language, which
includes emotional expression. TTS based on Festival and
John Stallo's work on "Simulating Emotional Speech
for a Talking Head", which implements prosody rules. |
happy-cry |
happy-go-lucky |
so-afraid |
|
 |
|
|
Jürgen Trouvain, Sarah Schmidt, Marc Schröder, William Barry |
 |
Personality modeling
from Saarland
University |
2006 |
Modeling of different personality types with the DFKI Mary
system. Modification was done with respect to pitch range and level,
speech tempo and voice loudness. See the article "Modelling
personality features by changing prosody in synthetic speech",
Proc. Speech Prosody 2006, Dresden, for details. |
neutral |
competent |
excited |
rugged |
sincere |
sophisticated |
|
 |
|
|
|
 |
|
Meta (facebook) |
 |
dGSLM
|
2022 |
Generating chit-chat, trained from human conversations with (textless) transformer models. |
dialog generative spoken language modeling |
|
|
? |
 |
ModelTalker
from the University of Delaware |
1998 |
biphone synthesis (the biphone inventory is searched for the
best diphones at synthesis time). The emotional expression was
generated by prosody rules |
neutral |
happy |
surprised |
frustrated |
sad |
contradictive |
assertive |
 |
 |
 |
 |
 |
 |
 |
|
Nick Campbell, further samples |
 |
NATR from ATR |
2004 |
Non-uniform unit selection working on a very large
database (recordings of a woman over 3 years) including affective
labeling and extralinguistic sounds. See the article
"Extra-Semantic Protocols; Input Requirements for the Synthesis of
Dialogue Speech" (Proc. ADS 2004) for details. |
sample of a telephone conversation between NATR
(female voice) and a young man (originally talking to his
friend). |
|
|
? |
 |
Rhetorical, now owned by Scansoft. Eduardo |
2002 |
non-uniform unit-selection synthesis by Rhetorical; a character
for a Jim Beam marketing campaign, manually optimized |
in conversation with Rhetorical's voice "American
valley girl" |
 |
|
Marc
Schröder |
 |
DFKI |
2002 |
more examples (see above) |
excited |
scared |
sad |
angry 1 |
angry 2 |
angry 3 |
bored |
content 1 |
content 2 |
content 3 |
happy |
 |
 |
|
|
 |
|
 |
 |
 |
 |
 |
|
MARY
|
 |
DFKI |
2005 |
Limited domain unit-selection with emotional units. |
Standard EmotionML sample with the Pavoque database from the MARY 5.1 version |
excited soccer reports |
 |
 |
|
Mariet Theune |
 |
Human
Media Interaction lab from Univ. Twente |
2006 |
Extracting prosodic rules to enhance the speaking style for storytelling
and applying them to TTS. See the article "Generating Expressive Speech for Storytelling
Applications" by Mariet Theune, Koen Meijs, Dirk Heylen and Roeland Ordelman, IEEE Transactions on
Audio, Speech and Language Processing 14(4) |
no suspense |
sudden climax |
increasing climax |
|
 |
 |
|
Olivier Rosec and Yannis
Agiomyrgiannakis |
 |
Voice transformation based on an ARX-LF vocoder from Orange Labs, Lannion |
2009 |
Voice transformation based on parameter modification with a combination of LF excitation and a
harmonic model.
See the article: Yannis Agiomyrgiannakis, Olivier Rosec: "ARX-LF-based source-filter methods for voice
modification and
transformation", Proc. ICASSP 09
|
man original |
man whisper |
man breathy |
man tense |
woman original |
woman as man |
woman as child |
 |
 |
 |
|
 |
|
|
|
Raul Fernandez and Bhuvana
Ramabhadran, IBM TJ Watson research center |
 |
Emphatic Speech |
2007 |
Classifying emphatic speech in a non-uniform unit selection database. See the article "Automatic
Exploration of Corpus-Specific Properties for Expressive Text-to-Speech." by Raul Fernandez and Bhuvana
Ramabhadran. Proc. 6th ISCA workshop on speech synthesis, Bonn, 2007 |
baseline neutral units with normal text |
baseline plus collected emphatic units with marked emphasis text |
mined emphasis corpus with marked emphasis text |
 |
 |
 |
|
Shiva Sundaram and Shrikanth Narayanan |
 |
SAIL |
2007 |
Synthesis of laughter based on a mass-spring model and LPC synthesis. See the article
Sundaram, S. and Narayanan, S., "Automatic acoustic synthesis of
human-like laughter", JASA, No. 1, pp. 527-535, 2007.
|
synthesized laughter |
|
|
Speechconcept/CereProc |
 |
Cerevoice |
2009 |
Non-uniform unit selection synthesis with emotional expression, Courtesy of Speechconcept |
Voice Alex |
Voice Gudrun |
Voice Nick |
Voice Saskia |
 |
 |
 |
 |
|
Tacotron |
 |
Tacotron with Global Style
Tokens |
2018 |
Deep artificial neural nets, described in the paper "Style Tokens:
Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis". In short: so-called style
tokens are learned from reference samples for a specific speaking style in an unsupervised manner and can then be
used as embeddings in a neural net that transforms text to spectrograms (a toy sketch of the token lookup follows this entry). The reference samples or text don't need
to be part of the original training, and they can be applied to different degrees. |
Style 1 |
Style 2 |
Style 3 |
Style 4 |
Style 5 |
 |
 |
 |
 |
 |
|
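A minimal sketch of the token lookup (shapes and names are illustrative assumptions, not the paper's exact architecture): a reference encoding attends over the learned token bank, and the attention-weighted sum is the style embedding fed to the text-to-spectrogram net.

```python
import numpy as np

def style_embedding(ref_encoding, tokens, W_q, W_k):
    """ref_encoding: (d_ref,) from a reference-audio encoder;
    tokens: (n_tokens, d_tok) learned global style token bank."""
    q = ref_encoding @ W_q                       # query from reference audio
    keys = tokens @ W_k                          # keys from the token bank
    scores = keys @ q / np.sqrt(len(q))          # scaled dot-product attention
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # attention weights over tokens
    return w @ tokens                            # weighted sum = style embedding
```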
Talkback |
 |
Talkback
2000 (original link broken) |
1998 |
concatenation synthesis, demonstration from
Talking Technologies |
expressive male |
expressive female |
 |
 |
|
Tuomo Raitio and Antti Suni |
 |
Glott-HMM synthesis |
2013 |
Statistical parametric speech synthesis, the shouted voice was created by adapting the statistical speech
models. Part of an analysis-by-synthesis experiment described in the article "Analysis and Synthesis of Shouted
Speech" by Tuomo Raitio et al, at Interspeech 2013. |
female normal |
female shouting |
male normal |
male shouting |
 |
 |
|
 |
|
Voxygen |
 |
Voxygen |
2012 |
French non-uniform unit-selection, different expressive voices |
Voice "Papi" reading an arbitrary news item. |
Dark Vador |
Yeti |
Witch |
|
 |
 |
 |
|
Yossef Ben-Ezra ? |
 |
Vivotext |
2012 |
Technology unknown, demonstration from webpage |
happy child |
sad child |
 |
 |
|
Lyrebird |
 |
Synthesis based on a deep artificial neural net adapted to a new speaker's voice characteristics (this page's
author) from only about 100 short samples
|
Preserving Word-level Emphasis in Speech-to-speech Translation |
|
Preserving word-level emphasis in speech-to-speech translation, example English to Japanese using Linear
Regression HSMMs. As described here, or see the Interspeech 2015
article "Preserving Word-level Emphasis in Speech-to-speech Translation using Linear Regression HSMMs" by
Quoc Truong Do, Shinnosuke Takamichi, Sakriani Sakti, Graham Neubig, Tomoki Toda and Satoshi Nakamura |
English native speaker |
Japanese translation without emphasis |
Japanese translation with emphasis |
 |
|
 |
|
Accent cloning based on articulatory adaptation |
 |
Copying accent from a native speaker to a learning non-native speaker by driving a (copy-synthesis)
articulatory synthesizer for the learner with gestures from the native speaker. As described here or see the Interspeech 2015 article
"Articulatory-based conversion of foreign accents with deep neural networks" by Sandesh Aryal and Ricardo
Gutierrez-Osuna
|
MFCC resynthesized native speaker |
MFCC resynthesized learning speaker |
DNN based accent conversion |
 |
 |
 |
|
Voice cloning |
 |
A multilingual speech synthesizer that learned one speaker's voice.
From Microsoft research (Frank Soong) article |
English (source) |
Italian |
Mandarin |
Spanish |
 |
 |
 |
 |
|
Vocal Tract Lab |
 |
articulatory synthesizer singing "Dona Nobis Pacem" |
|
Glove Talk
|
 |
formant synthesis controlled by a data glove; for further information see Sidney S. Fels and Geoffrey E. Hinton,
"Glove-TalkII: A neural network interface which maps gestures to parallel formant speech synthesizer controls",
IEEE Transactions on Neural Networks, Vol. 9, No. 1, pp. 205-212, 1998, or see Sageev Oore demonstrate Glove Talk (43 MB)
|
|
Pavarobotti |
 |
the singing robot: articulatory synthesis |
|
HAL |
 |
From the film "2001: A Space Odyssey", the ideal
(NOT a synthesizer but an actor!) |
content |
sad/threatening |
regretful |
 |
 |
 |
|
First song in
synthetic speech |
 |
"Bicycle Built for Two", Bell Laboratories, by Louis
Gerstman and Max Mathews, 1961. This song was reprised by the `decorticated' Hal. |
 |
ESPER |
 |
Extracting Speaker Information From Children's
Stories for Speech Synthesis: non-uniform unit selection |
 |
Feelix
from Lola
Canamero and Jacob
Fredslund |
 |
Emotional music by Rasmus Lunding for a Lego robot displaying
emotions by facial expression. |
 |
Luong Hieu-Thi et al |
 |
University of Science, Ho Chi Minh City, Vietnam,
2016.
Experiment to construct a DNN-based synthesizer that takes phonetic/linguistic features as input and outputs acoustic
features usable by a STRAIGHT synthesizer, from limited data of about 100 speakers, adding speaker ID, age
and gender information to the input and then testing several configurations with respect to speaker adaptation,
morphing, and age and gender conversion by interchanging the relevant features and keeping the others fixed
(a toy sketch of the input-code idea follows this entry). See the paper "Adapting and Controlling DNN-based Speech Synthesis using Input Codes" for details.
Synthesized samples are from the 112-dimension Discriminant Condition configuration. |
synthesized with model |
original/new speaker |
original/gender change |
original/younger |
original/older |
|
/
|
/
|
/
|
/
|
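A toy sketch of the input-code idea (names, encodings and scalings are assumptions for illustration, not the paper's exact configuration): speaker, age and gender codes are concatenated to the linguistic features, so changing only the code at synthesis time converts speaker, age or gender while keeping the rest fixed.

```python
import numpy as np

def make_input(linguistic, speaker_id, age, gender, n_speakers=100):
    """Concatenate control codes to the per-frame linguistic feature vector."""
    spk = np.zeros(n_speakers)
    spk[speaker_id] = 1.0                         # one-hot speaker code
    return np.concatenate([linguistic, spk,
                           [age / 100.0],         # crude age normalization
                           [float(gender)]])      # e.g. 0 = female, 1 = male
```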
|
Name |
Description |
Timeframe |
Partners |
Semaine |
The Semaine project is an EU-FP7 1st-call STREP project and aims to build a SAL, a Sensitive Artificial
Listener: a multimodal dialogue system which can
interact with humans through a virtual character,
sustain an interaction with a user for some time, and
react appropriately to the user's non-verbal behaviour.
In the end, this SAL system will be released to a large extent as an open-source research tool to the community.
|
2008-2011 |
DFKI, CNRS, Imperial, Queen's Belfast, Uni. Twente, TU Munich |
Speech and Emotion |
We study the effects of emotional state on speech, as well as the effects of emotional speech on listeners.
This is achieved by recording skin conductance, goose bumps, blood pressure, and other peripheral measures of
emotional state as well as vocal parameters. The outcome of these analyses is not only useful for basic research
in emotion psychology, but also for clinical research and forensic studies, in the areas of developmental and
pedagogical psychology, as well as in industrial and organizational psychology. |
10/2007-7/2008 |
Univ. Kiel, LMU Munich |
HUMAINE |
HUMAINE (Human-Machine
Interaction Network on Emotion) is a Network of Excellence in the EU's
Sixth Framework Programme, in the IST (Information Society
Technologies) Thematic Priority. HUMAINE aims to lay the
foundations for European development of systems that can register,
model and/or influence human emotional and emotion-related states
and processes - 'emotion-oriented systems'. |
2004-2008 |
Queen's University, Belfast, Deutsches
Forschungszentrum für Künstliche Intelligenz GmbH,
Institute of Communication and Computer Systems - National
Technical University of Athens, Université de Genève,
FPSE, University of Hertfordshire, Istituto Trentino Di Cultura,
Université de Paris VIII, Österreichische
Studiengesellschaft für Kybernetik, Kungliga Tekniska
Högskolan, Stockholm, Universität Augsburg,
Università Degli Studi di Bari, Ecole Polytechnique
Fédérale de Lausanne,
Friedrich-Alexander-Universität, Università Degli Studi
di Genova, University of Haifa, Imperial College of Science,
Technology and Medicine, Inesc Id - Instituto de Engenharia de
Sistemas e Computadores: Investigação e Desenvolvimento
em Lisboa, King's College, London, Centre National De La Recherche
Scientifique, University of Oxford, University of Salford, Tel Aviv
University, Trinity College, La Cantoche Production, France
Télécom SA, T-Systems Nova GmbH, Instituto Superior
Técnico, Lisbon,
University of Southern California,
University of Zagreb,
University of Twente,
University of Sheffield,
University of Leeds,
Institute of Cognitive Sciences and Technologies, Rome.
|
Ermis |
Emotionally Rich Man-machine Intelligent
System, EU project. Scope: the development of a prototype
system for human-computer interaction that can interpret its users'
attitude or emotional state, e.g. activation/interest, boredom, and
anger, in terms of their speech and/or their facial gestures and
expressions. Adopted technologies: linguistic and paralinguistic
speech analysis and robust speech recognition, facial expression
analysis, and interpretation of the user's emotional state using
hybrid neuro-fuzzy techniques, while being in accordance with the
MPEG-4 standard. |
2002-2005 |
ALTEC S.A., ILSP, ICCS-NTUA, EYETRONICS EYE, KUL
Belgium, Queens University Belfast, KCL UK, MIT Germany, FRANCE
TELECOM, BRITISH TELECOM |
JST/CREST
ESP |
(Japan Science and Technology) / (Core Research
for Evolutional Science and Technology) Expressive Speech
Processing. The goal of the five-year ESP Project is to produce
a corpus of natural daily speech in order to design speech
technology applications that are sensitive to the various ways in
which people use changes in speaking style and voice quality to
signal the intentions underlying each utterance, i.e., to add
information to spoken utterances beyond that carried by the text or
the words in the speech alone. The corpus is to include emotional
speech, but also samples to illustrate attitudinal aspects of
speech, such as politeness, hesitation, friendliness, anger, and
social-distance. The most obvious applications of the resulting
technology will be in speech synthesis, but the research also
involves speech-recognition technology for the labeling and
annotation of the speech databases, and the development of a
grammar of spoken language in order to take into account
supra-linguistic (i.e., paralinguistic and extralinguistic)
information. |
2000-2005 |
ATR, NAIST Graduate Institute, and Kobe
University |
PF-Star |
Preparing future multisensorial interaction
research, EU-project. Scope: PF-STAR intends to contribute to
establish future activities in the field of Multisensorial and
Multilingual communication (Interface Technologies) on firmer bases
by providing technological baselines, comparative evaluations, and
assessment of prospects of core technologies, which future research
and development efforts can build from. To this end, the project
will address three crucial areas: technologies for speech-to-speech
translation, the detection and expressions of emotional states, and
core speech technologies for children. For each of them, promising
technologies/approaches will be selected, further developed and
aligned towards common baselines. The results will be assessed and
evaluated with respect to both their performances and future
prospects. |
2002-2004 |
ITC - irst, Italy, RWTH Computer Science
Department, Germany, Institute for Pattern Recognition of
Friedrich-Alexander-Universität Erlangen-Nürnberg,
Germany, Interactive Systems Laboratories at Universitaet
Karlsruhe, Germany, Kungl Tekniska Högskolan, Sweden,
Department of Electronic, Electrical & Computing Engineering of
the University of Birmingham, United Kingdom, Istituto di Scienze e
Tecnologie della Cognizione, Sezione di Padova - "Fonetica e
Dialettologia", Italy |
Neca |
European Project: NECA promotes the concept of
multi-modal communication with animated synthetic personalities. A
particular focus in the project lies on communication between
animated characters that exhibit credible personality traits and
affective behavior. The key challenge of the project is the
fruitful combination of different research strands including
situation-based generation of natural language and speech,
semiotics of non-verbal expression in situated social
communication, and the modelling of emotions and personality. |
2001-2003 |
OFAI - Austrian Research Institute for Artificial
Intelligence, DFKI German Research Center for Artificial
Intelligence, FREESERVE UK internet portal, Information Technology
Research Institute at the University of Brighton, Sysis Interactive
Simulations AG, Institute of Phonetics at the University of the
Saarland |
Interface |
European Project to make man/machine interaction
more natural. The objective of the project is to define new models
and implement advanced tools for audio-video analysis, synthesis
and representation in order to provide essential technologies for
the implementation of large-scale virtual and augmented
environments. |
2000-2002 |
DIST - University of Genoa, L&H Belgium, Image
Coding Group - Linköping University Sweden, Universitat
Politecnica de Catalunya, Ecole Polytechnique Fédérale de
Lausanne, Université de Genève, FPSE, Informatics and
Telematics Institute Greece, Tecnologia Automazione Uomo,
University of Maribor Slovenia, Curtin University of Technology
Australia, Umea University Sweden, Centre National de la Recherche
Scientifique France, W Interactive SA Fra |
Mimic |
Voice, Accent and Emotion Adaptive Text to
Speech Synthesis. The objective of this project is to develop a
speaker-adaptive text to speech synthesis with applications to high
quality automatic voice dialogue, personalised voice for the
disabled, broadcast studio voice processing, interpreted telephony,
very low bit rate phonetic speech coding, and multimedia
communication. |
2000 ? |
Brunel University, London |
TU-Berlin |
DFG project "Phonetische Reduktion und
Elaboration bei emotionaler Sprechweise" (phonetic reduction and
elaboration in emotional speaking style): recording and analysis
of a German emotional database.
1998-2000 |
TU-Berlin |
EMOVOX |
Voice variability related to speaker emotional
state in Automatic Speaker Verification, extending the VeriVox
approach. This project aims to systematically explore the effects
of transient speaker-state changes on the acoustic speech signal,
by plotting the space of speaker state-dependent variation in
speech. Based upon the knowledge gained by analyzing the speech
recorded from speakers in a number of induced cognitive and
emotional states, new methods of structured training in ASV systems
will be developed. To collect emotional speech data for the Emovox
project, we will create an interactive computer program designed to
induce various target emotions and stress in speakers. |
1999-2000 ? |
Université de Genève, FPSE |
SUSAS |
Speech Under Simulated and Actual
Stress. This database was put together by Duke University and
Air Force Research Laboratory for researchers who are interested in
the characteristics and effects of speech under stress on speech
processing recognizers. SUSAS was created at the Robust Speech
Processing Laboratory in the Department of Electrical and Computer
Engineering at Duke University. |
1997 ? |
Univ. of Colorado at Boulder, CSLR, RSPL |
VeriVox |
Voice Variability in Speaker Verification.
The main aim of VeriVox is to improve the reliability of automatic
speaker verification (ASV), by developing novel,
phonetically-informed methods for coping with the variation in a
speaker's voice. |
1996 ? |
Queen Margaret College, Edinburgh, IKP, Rheinische
Friedrich-Wilhelms-Universität Bonn, Faculté de
Psychologie et des Sciences de l'Education, University of Geneva,
University of Cambridge, Trinity College, Dublin, University of
Cambridge, Laboratoire Parole et Langage, CNRS Université de
Provence, Ensigma,CNRS, Délégation Régionale
Normandie |
Univ. of
Reading |
the Emotion in Speech Project. Of the many
types of suprasegmental and affective information that have been
found to occur in speech, relatively few have been coded in such a
way as to permit inclusion of them in large-scale machine-readable
speech databases. However, as the demand for more natural and
unconstrained speech material grows, it becomes increasingly
necessary to look at ways of doing this. This project brought
together expertise in phonetics and phonology and in cognitive
psychology in order to examine emotional speech and to produce a
database of such speech to put alongside the emotionally neutral
material found in most spoken language databases. |
1995-2000 ? |
University of Reading, Department of Psychology at
the University of Leeds |
VAESS |
Voices, Attitudes and Emotions in Speech
Synthesis. The aim of the VAESS project is to develop a fully
portable (hand-held) communicator with versatile, high quality
speech output. By combining the latest advances in speech
technology withstate-of-the-art hardware, the capabilities of
current speech prostheses will be extended. A deliberate choice was
made to base the communicator around a standard personal computer.
The potential uses of the portable communicator are then increased
to encompass those of any equivalent computer; there are
significant benefits from this in the workplace where standard
computer applications are needed. The VAESS Project is funded by
the European Union Technology Initiative for the Disabled and
Elderly Programme (TIDE). |
1995 ? |
Sheffield University, Center for
PersonKommunikation at the University of Aalborg. Department of
Speech and Music Acoustics at KTH in Stockholm, Telia Promoter
Infovox AB in Stockholm, BiDesign Ltd in Tamworth, Barnsley
District General Hospital NHS Trust |
VOX
- 6298 |
The Analysis and Synthesis of Speaker
Characteristics. The VOX Working Group is investigating speech
databases with different types of speakers, different affective
conditions of emotion and attitude, and different casual versus
careful styles of speaking: each considered with reference to
acoustic, perceptual and physiological representation. Speech
synthesis can be used to empirically test such
characterisations. |
1992-1995 |
Université de Genève,
IKP-Universität Bonn, CNRS-Institut de Phonetique, CNRS -
LIMSI, Trinity College Dublin, KTH, University of Cambridge,
University of Reading, University of Sheffield |