

Expressive Synthetic Speech

emotional faces (pictures taken from Paul Ekman)

last update: March 23rd 2023

This is a collection of examples of synthetic affective speech conveying an emotion or natural expression, maintained by Felix Burkhardt. Some of these samples are direct copies from natural data; others are generated by expert rules or derived from databases. The emotional labels "anger", "fear", "joy" and "sad" are my (short) designators for "the big four" basic emotions, not necessarily the authors' ones.
Examples of German actors simulating emotional arousal can be found here.
I recently gave a talk on emotional speech synthesis.
Examples of German text-to-speech synthesizers can be found here.
Please feel encouraged to let me know about your own or missing attempts to simulate emotional speech! (fxburk@gmail.com)


All audio files are in MP3 format (128 or 64 kbit/s)

author visual affil. year (approx) description neutral anger joy sad fear
N. Audibert, V. Aubergé, A. Rilliard ICP logo ICP 2006 Copy prosody and intensity from satisfied and sad speech to neutral speech using PSOLA technique. See the article "The Prosodic Dimensions of Emotion in Speech: the Relative Weights of Parameters", Interspeech 2005 (Lisbon), for details mp3 - mp3 mp3 -
Stephan Baldes DFKI logo DFKI 1999 Rule based emotion simulation with Entropic's formant TTS engine TrueTalk, based on Cahn's affect editor approach - mp3 - mp3 mp3
Roberto Barra Chicote gth logo Polytechnic Univ. of Madrid 2010 HMM based emotional speech synthesis, i.e. emotional prosodic and source models applied to HMM-coded speech data. See the article "Roberto Barra-Chicote, Junichi Yamagishi, Simon King, Juan Manuel Montero, Javier Macias-Guarasa, Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech, Speech Communication, Volume 52, Issue 5, 2010" for details. mp3 mp3 mp3 mp3 mp3
Murtaza Bulut, Carlos Busso, Serdar Yildirim, Abe Kazemzadeh, Chul Min Lee, Sungbok Lee, Shrikanth Narayanan emotion logo by M. Bulut USC/Sail 2005 Emotional voice-conversion by changing prosody (TD-PSOLA) and spectrum (LPC modification). The samples demonstrate neutral to target-emotion conversion. See the article "Investigating the role of phoneme-level modifications in emotional speech resynthesis", Proc Interspeech 2005 for details mp3 mp3 mp3 mp3 -
Murtaza Bulut, Shri Narayanan, Ann Syrdal emotion logo by M. Bulut USC/Sail, AT&T 2002 Diphone synthesis done with hand-crafted diphones and copy-prosody for the appropriate emotion. See the article "Expressive Speech Synthesis Using a Concatenative Synthesizer", Proc. ICSLP 2002, for details mp3 mp3 mp3 mp3 -
Felix Burkhardt emofilt logo Deutsche Telekom Laboratories 2005 emofilt: rule-based simulation (prosody only) with MBROLA (a minimal sketch of this kind of rule-based prosody manipulation appears below the table).
German male voice (de6) (neutral prosody txt2pho)
mp3 mp3 mp3 mp3 mp3
English male voice (en1) mp3 mp3 mp3 mp3 mp3
Spanish male voice (es2) mp3 mp3 mp3 mp3 mp3
Mandarin Chinese female voice (cn1) mp3 mp3 mp3 mp3 mp3
French male voice (fr1) mp3 mp3 mp3 mp3 mp3
Greek male voice (gr2) mp3 mp3 mp3 mp3 mp3
Dutch male voice (nl2) mp3 mp3 mp3 mp3 mp3
Hungarian male voice (hu1) mp3 mp3 mp3 mp3 mp3
Italian male voice (it3) mp3 mp3 mp3 mp3 mp3
Turkish male voice (tr1) mp3 mp3 mp3 mp3 mp3
1998 emofilt old version mp3 mp3 mp3 mp3 mp3
2000 emoSyn 1: rule based simulation with formant-synthesizer (Sensyn Version), the neutral sentence is copy-synthesis, more examples here mp3 mp3 mp3 mp3 mp3
2000 emoSyn 2: copy synthesis with formant-synthesizer (Iles & Simmons Version) orig. mp3 mp3 mp3 mp3
synth. mp3 mp3 mp3 mp3
1998 esps2mbrola: prosody-copy synthesis with MBROLA orig. mp3 mp3 mp3 mp3
synth. mp3 mp3 mp3 mp3
Joao P. Cabral INESC-ID logo L2F INESC-ID Lisboa 2005 Transformation of neutral speech using LP-PSOLA. Changes pitch, duration and energy as well as voice quality (by transforming the residual). Emotion rules were derived from the literature. See the article "Pitch-Synchronous Time-Scaling for Prosodic and Voice Quality Transformations", Proc. Interspeech 2005 for details.
male voice
mp3 mp3 mp3 mp3 mp3
female voice mp3 mp3 mp3 mp3 mp3
Janet Cahn Affect Editor logo MIT 1989 ? Affect Editor: rule based emotion simulation with DecTalk - mp3 mp3 mp3 mp3
Piero Cosi, Fabio Tesser, Roberto Gretter, Carlo Drioli, Graziano Tisato ISTC logo ISTC-SPFD 2004 Emotive Mbrola: Italian concatenation synthesis with the Festival speech synthesis framework and MBROLA voices. The prosody was learned from an emotional database (CART). An article appeared at Interspeech 2005.
male voice
mp3 mp3 mp3 mp3 mp3
female voice mp3 mp3 mp3 mp3 mp3
with manipulation of voice quality male voice mp3 mp3 mp3 mp3
female voice mp3 mp3 mp3 mp3
Coqui.ai cartoon pictures of emotions from KTH Coqui 2023 Ana Florence (one of many voices): Deep learning based latent space synthesis. mp3 mp3 mp3 mp3
Björn Granström, Rolf Carlson ? cartoon pictures of emotions from KTH KTH 1998 ? KTH Royal Institute of Technology (orig. link broken): Swedish copy formant synthesis. mp3 mp3 mp3 mp3 -
author visual affil. year (approx) description neutral anger joy sad fear
Akemi Iida Chatr logo ATR 2000 ? Chatr Emotion: Japanese concatenation synthesis using emotional databases with CHATR. See the article "A Speech Synthesis System with Emotion for Assisting Communication", Proc. ISCA Workshop on Speech and Emotion, Belfast, 2000. Since then CHATR has been expanded into NATR - mp3 mp3 mp3 -
new CHATR with Emotion - mp3 mp3 mp3 -
David from IRCAM logo IRCAM 2015 Not a speech synthesizer, but re-synthesis to "emotionalize" existing speech by pitch-shifting, filtering, adding vibrato and inflections (a crude sketch of this kind of signal-level transformation appears below the table). See the article "DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in running speech", Behaviour Research Methods (2017) for details mp3 - mp3 mp3 mp3
Gregor Hofer Univ. Edinburgh school of informatics logo Univ. Edinburgh 2004 Gregor Hofer's master's thesis, unit selection database recorded in neutral as well as happy and angry style. More samples. mp3 mp3 mp3 - -
half-emotional by mixing units from neutral and emotional data mp3 mp3 - -
Ignasi Iriondo, Francesc Alías, Javier Melenchón, M. Angeles Llorca La Salle logo Univ. Ramon Lull 2003 Catalan diphone synthesis with emotion rules, see article "Modeling and Synthesizing Emotional Speech for Catalan Text-to-Speech Synthesis" (Proc. ADS 2004) for details - mp3 mp3 mp3 mp3
Syaheerah L. Lutfi logo Polytechnic Univ. of Madrid 2006 Template/rule-based synthesis of prosody parameters using Mbrola synthesis, see the article "Syaheerah L. Lutfi, Raja N. Ainon and Zuraidah M. Don. eXpressive Text Reader Automation Layer (eXTRA): Template driven ‘Emotion Layer’ in Malay Concatenated Synthesized Speech, Proc. of International Conference on Speech Databases and Assessment (Oriental COCOSDA 06), Penang, Malaysia, Dec 2006" for details. mp3 mp3 - - -
Cynthia Breazeal picture of Kismet MIT 2000 Kismet: rule based emotion simulation with DecTalk (like Affect Editor) mp3 mp3 mp3 mp3 mp3
with words mp3 mp3 mp3 mp3 mp3
Keisuke Miyanaga, Makoto Tachibana, Junichi Yamagishi, Koji Onishi, Takashi Masuko, Takao Kobayashi kobayashi lab logo Tokyo Institute of Technology, Kobayashi Lab. 2004 HMM (data-based) modeling of emotional expression (spectral and prosodic), enables mixing: see the article HMM-Based Speech Synthesis with Various Speaking Styles Using Model Interpolation or further demos: each emotion individually modeled mp3 mp3 mp3 mp3 -
emotion as contextual factor (like phonetic/linguistic factors) mp3 mp3 mp3 mp3 -
Juan M. Montero Martínez speechgroup uni madrid logo Univ. Madrid 1998 ? montero 1: rule based emotion simulation with Spanish diphone-synthesizer mp3 mp3 mp3 mp3 -
montero 2: rule based emotion simulation with KTH-formant-synthesizer mp3 mp3 mp3 mp3 -
Shinya Mori, Tsuyoshi Moriyama, Shinji Ozawa t-kougei logo Dept. of Media and Image Technology, Tokyo Polytechnic University 2008 PSOLA transformation based on database-trained prosody modification rules. The algorithm includes the possibility of graded expressions. See the article "Emotional Speech Synthesis Using Subspace Constraints in Prosody", Proc. ICME 2006, or follow this link half mp3 mp3 mp3 -
full mp3 mp3 mp3 -
Iain Murray HAMLET logo Univ. Dundee 1989 ? HAMLET: rule-based simulated emotions with formant speech synthesis mp3 mp3 mp3 mp3 mp3
1997 Laureate: hand optimized concatenation tts mp3 mp3 mp3 mp3 mp3
1997 LAERTES: rule-based simulation using laureate mp3 mp3 mp3 mp3 mp3
Pierre-yves Oudeyer sony cartoon image Sony 2000 cartoon speech: nonsense speech for Sony pet robots based on concatenative synthesis. Article: Oudeyer P-Y. "The Synthesis of Cartoon Emotional Speech", Proc. of the 1st International Conference on Prosody, Aix-en-Provence, eds. B. Bel; I. Marlien. 2002 low intensity mp3 mp3 mp3 -
high intensity mp3 mp3 mp3 -
with Japanese words mp3 mp3 mp3 -
? protalker robot image IBM Tokyo Research Laboratory 1996 ProTalker, from the IBM Tokyo Research Laboratory. mp3 mp3 mp3 - -
S. R. M. Prasanna and D. Govind Öfai logo Indian Institute of Technology Guwahati 2010 Modeling pitch and excitation (voice source and EGG estimated with the zero frequency filtering approach) with linear prediction using the EMO-DB. See the article "Analysis of Excitation Source Information in Emotional Speech" from Interspeech 2010, Makuhari, for details. - mp3 mp3 - mp3
Erhard Rank/Hannes Pirker Öfai logo ÖFAI 1998 VieCtoS: demos of the demisyllable LPC-synthesizer VieCtoS of the Austrian Research Institute for Artificial Intelligence (ÖFAI) copying emotional speech mp3 mp3 mp3 mp3 mp3
Marc Schröder MARY logo DFKI 2002 While working on the NECA project and his Ph.D. thesis, Marc developed a system capable of producing emotional speech based on a description in emotional dimensions (arousal, valence, potency). I tried to map his results to basic emotions. The system is based on MBROLA as DSP and MARY as NLP. In order to control voice quality, six databases for MBROLA were developed: normal, tense and lax voice for a male and a female speaker each. - mp3 mp3 mp3 mp3
Ingmar Steiner and Marc Schröder, and Marcela Charfuelan MARY logo Mary DFKI Pavoque project 2011 Non-uniform unit selection with a database annotated with expressive styles, see the article "Steiner, Ingmar, Marc Schröder, Marcela Charfuelan (2010) "Symbolic vs. acoustics-based style control for expressive unit selection" 7th ISCA Workshop on Speech Synthesis, Kyoto, Japan", for details. mp3 mp3 mp3 mp3 -
Jun Sato ? 1998 ? Japanese demo of emotional synthetic speech generated by artificial neural networks. Article: J. Sato and S. Morishima, "Emotion modeling in speech production using emotion space", in Proc. IEEE Int. Workshop on Robot and Human Communication, Tsukuba, Japan, Nov. 1996 mp3 mp3 mp3 mp3 -
Oytun Türk and Marc Schröder dfki logo DFKI 2008 GMM based voice conversion, i.e. the spectral envelope of neutral speech gets transferred to the target emotional speech in order to simulate emotional voice quality without having to record a whole emotional database. See the article "A Comparison of Voice Conversion Methods for Transforming Voice Quality in Emotional Speech Synthesis", Proc. Interspeech 2008, Brisbane. orig. mp3 mp3 mp3 mp3 -
synth. (GMM method) mp3 mp3 mp3 -
E. Zovato, A. Pacchiotti, S. Quazza, S. Sandri Loquendo logo Loquendo 2004 From Loquendo. Rule-based prosody PSOLA-like manipulation of a non-uniform unit-selection engine, see Towards emotional speech synthesis: a rule based approach. Presented at the 5th ISCA Workshop on Speech Synthesis in Pittsburgh 2004. other examples from Loquendo mp3 mp3 mp3 mp3 -
Kun Zhou logo missing National University of Singapore 2023 Deep learning approach, actually featuring mixed emotions, see the paper Speech Synthesis with Mixed Emotions - mp3 mp3 mp3 -
Yuexin Cao logo missing Ping An Technology 2020 Deep learning approach, combining VAEs with GANs; the emotion labels used for the conversion are supervised. See the paper Nonparallel Emotional Speech Conversion Using VAE-GAN mp3 mp3 mp3 mp3 -
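To make the rule-based prosody approaches in the table above (e.g. emofilt on top of MBROLA) more concrete, here is a minimal Python sketch that rewrites an MBROLA .pho file by scaling phoneme durations and F0 targets. The scaling factors and the "joy-like" setting are illustrative assumptions, not the actual emofilt rule set.

```python
# Minimal sketch of rule-based prosody modification on an MBROLA .pho file.
# The factors below are illustrative assumptions, not the emofilt rules.

def modify_pho(lines, dur_factor=0.8, f0_factor=1.3):
    """Scale phoneme durations and F0 targets of MBROLA .pho lines.

    Each .pho line has the form:  phoneme duration [pos f0 pos f0 ...]
    where pos is a percentage of the phoneme duration and f0 is in Hz.
    dur_factor < 1 and f0_factor > 1 give a rough "joy-like" colouring.
    """
    out = []
    for line in lines:
        parts = line.split()
        if not parts or line.startswith(";"):      # keep comments and empty lines
            out.append(line)
            continue
        phoneme, duration = parts[0], float(parts[1])
        targets = [float(x) for x in parts[2:]]
        new = [phoneme, str(int(round(duration * dur_factor)))]
        for i, value in enumerate(targets):
            if i % 2 == 0:                          # position in % of duration: unchanged
                new.append(str(int(round(value))))
            else:                                   # F0 target in Hz: scaled
                new.append(str(int(round(value * f0_factor))))
        out.append(" ".join(new))
    return out

if __name__ == "__main__":
    neutral = ["; neutral input", "_ 100", "h 60 50 110", "a 120 20 115 80 120", "_ 100"]
    for line in modify_pho(neutral):
        print(line)
```

The modified .pho file would then be rendered with an MBROLA voice, e.g. `mbrola de6 joyful.pho joyful.wav` (file names assumed).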
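Several of the entries above (e.g. DAVID or the PSOLA-based transformations) modify existing recordings rather than synthesising speech from scratch. The sketch below shows the general idea with off-the-shelf global pitch shifting and time stretching from librosa; the parameter values and the input file name are illustrative assumptions, and the result is far cruder than the published systems, which vary the modifications over time and add effects such as vibrato and filtering.

```python
# Rough illustration of "emotionalising" an existing recording by global
# pitch and tempo changes; published systems are much more sophisticated.
import librosa
import soundfile as sf

y, sr = librosa.load("neutral.wav", sr=None)          # hypothetical input recording

# Crude global settings (assumed values, not taken from any of the papers).
settings = {
    "joy_like": {"n_steps": 2.0, "rate": 1.1},        # higher pitch, faster
    "sad_like": {"n_steps": -2.0, "rate": 0.85},      # lower pitch, slower
}

for name, p in settings.items():
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=p["n_steps"])
    stretched = librosa.effects.time_stretch(shifted, rate=p["rate"])
    sf.write(f"{name}.wav", stretched, sr)
```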

examples not simulating the "big four"

author visual affil. year (approx) description samples
Acapela Acapela Acapela 2013 Non-uniform unit selection synthesis with different emotional and style-varying voices.
UK English voices: Elizabeth; Peter neutral; Peter happy; Peter sad
mp3 mp3 mp3 mp3
2013 Non-uniform unit selection synthesis with different emotional and style-varying voices.
French voice Antoine: neutral; happy; sad; from afar; up close
mp3 mp3 mp3 mp3 mp3
2013 Non-uniform unit selection synthesis with different emotional and style-varying voices.
US English voice Will: neutral; happy; sad; bad guy; from afar; up close
mp3 mp3 mp3 mp3 mp3 mp3
Baird, Amiriparian and Schuller logo University of Augsburg 2020 WaveNet (deep learning) based approach to generate new emotional speech, trained on the Italian DEMoS corpus in order to enrich the training samples; see the article for details
Simulation of joy Simulation of sadness
mp3 mp3
Greg Beller Greg Beller IRCAM 2005 Transforms real voices via a content-based transformation with a phase-vocoder algorithm. Time-stretch and transposition coefficients change over the utterance depending on the expressivity and the context of the units (e.g. whether they are part of an accented syllable or whether they are consonants).
original; sad transformation; bored transformation; frightened transformation
mp3 mp3 mp3 mp3
2009 Synthesis of laughter based on a spring model and the repetition of specific phones. Examples found here
synthesized laughter
mp3
? Cepstral logo Cepstral 2004 Non-uniform unit selection. Databases recorded with a certain style.
Damian: dark personality Duchess: sensitive voice Shouty: non-sensitive voice
mp3 mp3 mp3
? Eloquent logo ETI Eloquence (as Eloquent belongs to Scansoft now, the link is broken) 1998 Rule-based formant synthesis. The emotional expression was hand-optimized (demonstration from Eloquent)
expressive male expressive female
mp3 mp3
Ellen Eide et al IBM logo IBM Watson research Center 2004 Non-uniform unit-selection trained with an expressive prosody model, paralinguistic events and expressive units. Described in the article A Corpus-Based Approach to <AHEM/> Expressive Speech Synthesis
good news bad news question other for comparison: IBM-CTTS taken from the website. Note that the research engine is more advanced technology than the product.
mp3 mp3 mp3 mp3 mp3
Chengyi Wang et al. Microsoft VALL-E 2023 Neural speech synthesis that takes a text and a short (3 sec) speech sample as input and generates audio with the textual content in the style and voice of the speech sample
Angry Sleepy Neutral Amused Disgusted
human: mp3 mp3 mp3 mp3 mp3
tts: mp3 mp3 mp3 mp3 mp3
IBM Watson IBM logo IBM Watson Developer Cloud 2016 Watson voice Allison: Non-uniform unit-selection in three styles: Good News, Apology, and Uncertain
apology uncertain good news
mp3 mp3 mp3
Eva Szekely et al UCD logo University College Dublin 2011 HMM synthesis based on expressive speech extracted automatically from audiobook readings by clustering glottal source parameters. Described in the article Clustering Expressive Speech Styles in Audiobooks Using Glottal Source Parameters
soft style tense style expressive style
mp3 mp3 mp3
Enrico Zovato et al Loquendo logo Loquendo 2004 Non-uniform unit-selection enriched by paralinguistic events and expressive units. Other examples from Loquendo.
German (Kathrin)/(Stefan) French (Juliette) English (Simon)
mp3 / mp3 mp3 mp3
F. Malfrere MBROLIGN logo MBROLIGN from TCTS Lab, Mons 1999 data-based prosody synthesis with MBROLA (the decision-tree algorithm was trained on a database spoken with the appropriate affect)
neutral nervous astonished shy
mp3 mp3 mp3 mp3
Google Deepmind WaveNet logo Google Deepmind WaveNet 2016 Synthesis by PCM value prediction from a deep neural net. See the paper "WaveNet: A Generative Model for Raw Audio" by van den Oord et al. for details (a small sketch of the mu-law companding used in the paper appears below the table). Samples for models trained on three different speakers. One sample is speech without a given text, generated sample by sample from the model's predictive distribution.
Speaker 1 Speaker 2 Speaker 3 No text given
mp3 mp3 mp3 mp3
Kaiser and Schallner logo Nuremberg Institute for Market Decisions 2020 The system uses a Tacotron (deep learning) based architecture utilizing Global Style Tokens (GST) and text prediction for style embeddings (TPSE) for an emotional speech synthesizer, built to investigate speech assistants in German; see the paper for a description.
Excited Happy Uninvolved
mp3 mp3 mp3
Ivo-Software logo Ivona 2011 Non-uniform unit selection synthesis, character voice
Chipmunk
mp3
John Stallo Metaface face Metaface from Curtin University 2000 Implementation of Virtual Human Markup Language, which includes emotional expression. TTS based on Festival and John Stallo's work on "Simulating Emotional Speech for a Talking Head" which implements prosody rules.
happy-cry happy-go-lucky so-afraid
mp3 mp3 mp3
Jürgen Trouvain, Sarah Schmidt, Marc Schröder, William Barry logo Personality modeling from Saarland University 2006 Modeling of different personality types with the DFKI Mary system. Modification was done with respect to pitch range and level, speech tempo and voice loudness. See the article "Modelling personality features by changing prosody in synthetic speech", Proc. Speech Prosody 2006, Dresden, for details.
neutral competent excited rugged sincere sophisticated
mp3 mp3 mp3 mp3 mp3 mp3
Meta (facebook) logo dGSLM 2022 Generating chit-chat, trained on human conversations with (textless) transformer models.
dialog generative spoken language modeling
mp3
? modeltalker logo ModelTalker from the University of Delaware 1998 Biphone synthesis (the biphone inventory is searched for the best diphones at synthesis time). The emotional expression was generated by prosody rules
neutral happy surprised frustrated sad contradictive assertive
mp3 mp3 mp3 mp3 mp3 mp3 mp3
Nick Campbell. further samples NATR logo NATR from ATR 2004 Non-uniform unit-selection working on a very large database (recording a woman for 3 years) including affective labeling and extralinguistic sounds. See the article "Extra-Semantic Protocols; Input Requirements for the Synthesis of Dialogue Speech" (Proc. ADS 2004) for details.
sample of a telephone conversation between NATR (female voice) and a young man (originally talking to his friend).
mp3
? image of eduardo Rhetorical, now owned by Scansoft. Eduardo 2002 non-uniform unit-selection synthesis done by Rhetorical, character for a Jim Beam marketing campaign, manually optimized
in conversation with Rhetorical's voice "American valley girl"
mp3
Marc Schröder DFKI logo DFKI 2002 more examples (see above)
excited scared sad angry 1 angry 2 angry 3 bored content 1 content 2 content 3 happy
mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3
MARY DFKI logo DFKI 2005 Limited domain unit-selection with emotional units.
standard EmotionML sample with Pavoque database from the Mary 5.1 version excited soccer reports
mp3 mp3
Mariet Theune human media interaction logo Human Media Interaction lab from Univ. Twente 2006 Extracting prosodic rules to enhance the speaking style of storytellers and applying them to TTS. See the article "Generating Expressive Speech for Storytelling Applications" by Mariet Theune, Koen Meijs, Dirk Heylen and Roeland Ordelman. IEEE Transactions on Audio, Speech and Language Processing 14(4)
no suspense sudden climax increasing climax
mp3 mp3 mp3
Olivier Rosec and Yannis Agiomyrgiannakis orange logo Voice transformation based on an ARX-LF vocoder from Orange Labs, Lannion 2009 Voice transformation based on parameter modification with a combination of an LF excitation model and a harmonic model. See the article: Yannis Agiomyrgiannakis, Olivier Rosec: "ARX-LF-based source-filter methods for voice modification and transformation", Proc. ICASSP 09
man original man whisper man breathy man tense woman original woman as man woman as child
mp3 mp3 mp3 mp3 mp3 mp3 mp3
Raul Fernandez and Bhuvana Ramabhadran, IBM TJ Watson research center ibm logo Emphatic Speech 2007 Classifying emphatic speech in a non-uniform unit selection database. See the article "Automatic Exploration of Corpus-Specific Properties for Expressive Text-to-Speech." by Raul Fernandez and Bhuvana Ramabhadran. Proc. 6th ISCA workshop on speech synthesis, Bonn, 2007
baseline neutral units with normal text baseline plus collected emphatic units with marked emphasis text mined emphasis corpus with marked emphasis text
mp3 mp3 mp3
Shiva Sundaram and Shrikanth Narayanan sail logo SAIL 2007 Synthesis of laughter based on a mass-spring model and LPC synthesis. See the article Sundaram, S. and Narayanan, S., "Automatic acoustic synthesis of human-like laughter", JASA, no. 1, pp. 527-535, 2007.
synthesized laughter
mp3
Speechconcept/CereProc logo Cerevoice 2009 Non-uniform unit selection synthesis with emotional expression, Courtesy of Speechconcept
Voice Alex Voice Gudrun Voice Nick Voice Saskia
mp3 mp3 mp3 mp3
Tacotron Google logo Tacotron with Global Style Tokens 2018 Deep artificial neural nets, described in the paper Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. In short: so-called style tokens are learned from reference samples for a specific speaking style in an unsupervised manner and can then be used as embeddings in a neural net that transforms text to spectrograms. The reference samples or text don't need to be part of the original training, and the styles can be applied to varying degrees (a toy sketch of the style-token combination step appears below the table).
Style 1 Style 2 Style 3 Style 4 Style 5
mp3 mp3 mp3 mp3 mp3
Talkback talkingtechnologies logo Talkback 2000 (original link broken) 1998 concatenation synthesis, demonstration from Talking Technologies
expressive male expressive female
mp3 mp3
Tuomo Raitio and Antti Suni hut.fi logo Glott-HMM synthesis 2013 Statistical parametric speech synthesis, the shouted voice was created by adapting the statistical speech models. Part of an analysis-by-synthesis experiment described in the article "Analysis and Synthesis of Shouted Speech" by Tuomo Raitio et al, at Interspeech 2013.
female normal female shouting male normal male shouting
mp3 mp3 mp3 mp3
Voxygen logo Voxygen 2012 French non-uniform unit-selection, different expressive voices
Voice "Papi" reading an arbitrary news item. Dark Vador Yeti Witch
mp3 mp3 mp3 mp3
Yossef Ben-Ezra ? logo Vivotext 2012 Technology unknown, demonstration from webpage
happy child sad child
mp3 mp3
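As a small technical aside to the WaveNet entry above: the WaveNet paper models 8-bit mu-law-companded sample values rather than raw 16-bit PCM. The numpy sketch below shows only that companding step (not the autoregressive network itself).

```python
# Mu-law companding as used to quantise audio into 256 classes in the
# WaveNet paper; the autoregressive network itself is not sketched here.
import numpy as np

def mu_law_encode(x, mu=255):
    """Map samples in [-1, 1] to integer classes 0..mu."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(classes, mu=255):
    """Invert mu_law_encode (up to quantisation error)."""
    compressed = 2.0 * classes.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * ((1.0 + mu) ** np.abs(compressed) - 1.0) / mu

if __name__ == "__main__":
    x = np.sin(np.linspace(0, 2 * np.pi, 16))
    codes = mu_law_encode(x)
    print(codes)                                        # integers in 0..255
    print(np.max(np.abs(x - mu_law_decode(codes))))     # small reconstruction error
```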

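The Global Style Token idea behind the Tacotron GST and the Kaiser/Schallner entries can be summarised in a few lines: a reference encoding attends over a small bank of learned token embeddings, and the attention-weighted sum serves as a style embedding that conditions the decoder. The numpy sketch below shows only that weighted-sum step with random stand-in vectors; the reference encoder, the multi-head attention and the training are omitted.

```python
# Toy illustration of the Global Style Token (GST) combination step:
# a style embedding is an attention-weighted sum over learned token vectors.
# Random vectors stand in for the learned tokens and the reference encoding.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 10, 256

style_tokens = rng.normal(size=(num_tokens, dim))   # learned token bank (stand-in)
reference = rng.normal(size=(dim,))                  # reference-encoder output (stand-in)

# Single-head dot-product attention over the token bank.
scores = style_tokens @ reference / np.sqrt(dim)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

style_embedding = weights @ style_tokens             # conditioning vector for the decoder
print(weights.round(3))                              # how much each token contributes
print(style_embedding.shape)                         # (256,)
```

At synthesis time a style can be selected either by feeding a reference utterance or by setting the token weights directly, which is what makes the degree of a style controllable.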
related examples

Lyrebird logo image Synthesis based on a deep artificial neural net adapted to a new speaker's voice characteristics (here: this page's author) from just about 100 short samples mp3
Preserving Word-level Emphasis in Speech-to-speech Translation logo image Preserving Word-level Emphasis in Speech-to-speech Translation example English to Japanese using Linear Regression HSMMs. As described here or see the Interspeech 2015 article Preserving Word-level Emphasis in Speech-to-speech Translation using Linear Regression HSMMs by Quoc Truong Do, Shinnosuke Takamichi, Sakriani Sakti, Graham Neubig, Tomoki Toda and Satoshi Nakamura
English native speaker Japanese translation without emphasis Japanese translation with emphasis
mp3 mp3 mp3
Accent cloning based on articulatory adaptation logo image Copying the accent of a native speaker to a learning non-native speaker by driving a (copy-synthesis) articulatory synthesizer for the learner with gestures from the native speaker (a toy sketch of the underlying regression-mapping idea appears at the end of this list). As described here or see the Interspeech 2015 article Articulatory-based conversion of foreign accents with deep neural networks by Sandesh Aryal and Ricardo Gutierrez-Osuna
MFCC resynthesized native speaker MFCC resynthesized learning speaker DNN based accent conversion
mp3 mp3 mp3
Voice cloning MSR image A multilingual speech synthesizer that learned one speaker's voice. From Microsoft Research (Frank Soong) article
English (source) Italian Mandarin Spanish
mp3 mp3 mp3 mp3
Vocal Tract Lab image of Vocal Tract Lab articulatory synthesizer singing "Dona nobis pacem" mp3
Glove Talk image of glovetalk formant synthesis controlled by a data glove; for further information see Sidney S. Fels and Geoffrey E. Hinton, "Glove-TalkII: A neural network interface which maps gestures to parallel formant speech synthesizer controls", IEEE Transactions on Neural Networks, vol. 9, no. 1, pp. 205-212, 1998, or see Sageev Oore demonstrate Glove Talk (43 MB) mp3
Pavarobotti image of pavarobotti the singing robot: articulatory synthesis mp3
HAL HAL image From the film "2001: A Space Odyssey", the ideal (NOT a synthesizer but an actor!)
content sad/threatening regretful
mp3 mp3 mp3
First song in synthetic speech image of pavarobotti "Bicycle Built for Two", Bell Laboratories, by Louis Gerstman and Max Mathews, 1961. This song was reprised by the `decorticated' Hal. mp3
ESPER image of pavarobotti Extracting Speaker Information From Children's Stories for Speech Synthesis: non-uniform unit selection mp3
Feelix from Lola Canamero and Jacob Fredslund image of Feelix Emotional music by Rasmus Lunding for a LEGO robot displaying emotions through facial expressions. mp3
Luong Hieu-Thi et al University of sciences Vietnam logo University of Science, Ho Chi Minh City, Vietnam, 2016.
An experiment to construct a DNN-based synthesizer that takes phonetic and linguistic features as input and outputs acoustic features usable by a STRAIGHT synthesizer, trained on limited data from about 100 speakers. Speaker ID, age and gender information is added to the input, and several configurations are tested with respect to speaker adaptation, morphing, and age and gender conversion by interchanging the relevant features while keeping the others fixed. See the paper "Adapting and Controlling DNN-based Speech Synthesis using Input Codes" for details. Synthesized samples are from the 112-dimension Discriminant Condition configuration.
synthesized with model original/new speaker original/gender change original/younger original/older
mp3 mp3/ mp3 mp3/ mp3 mp3/ mp3 mp3/ mp3
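Both the emphasis-transfer and the accent-conversion entries above rely on learned mappings between paired source and target parameter frames (linear-regression HSMMs and DNNs, respectively). As a loose, generic illustration of that idea only (not the method of either paper), the sketch below fits a plain least-squares affine map between paired feature frames with numpy.

```python
# Generic least-squares mapping between paired source and target feature
# frames, as a loose stand-in for the regression-based adaptation idea;
# the actual systems use LR-HSMMs or DNNs on aligned speech parameters.
import numpy as np

rng = np.random.default_rng(1)
frames, dim = 500, 25                        # e.g. 25-dim spectral features (toy numbers)

source = rng.normal(size=(frames, dim))      # stand-in for source-speaker frames
true_map = rng.normal(size=(dim + 1, dim))   # hidden "true" affine map used to fake targets
X = np.hstack([source, np.ones((frames, 1))])
target = X @ true_map + 0.01 * rng.normal(size=(frames, dim))

# Fit an affine map  target ~ [source, 1] @ W  by least squares.
W, *_ = np.linalg.lstsq(X, target, rcond=None)

converted = X @ W                            # apply the learned map
print(np.mean((converted - target) ** 2))    # small residual on this toy data
```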



projects that deal with affective speech

name description timeframe partners
Semaine The Semaine project is an EU-FP7 1st call STREP project and aims to build a SAL, a Sensitive Artificial Listener: a multimodal dialogue system which can interact with humans via a virtual character, sustain an interaction with a user for some time, and react appropriately to the user's non-verbal behaviour. In the end, this SAL system will be released to a large extent as an open-source research tool to the community. 2008-2011 DFKI, CNRS, Imperial, Queen's Belfast, Uni. Twente, TU Munich
Speech and Emotion We study the effects of emotional state on speech, as well as the effects of emotional speech on listeners. This is achieved by recording skin conductance, goose bumps, blood pressure, and other peripheral measures of emotional state as well as vocal parameters. The outcome of these analyses is not only useful for basic research in emotion psychology, but also for clinical research and forensic studies, in the areas of developmental and pedagogical psychology, as well as in industrial and organizational psychology. 10/2007-7/2008 Univ. Kiel, LMU Munich
HUMAINE HUMAINE (Human-Machine Interaction Network on Emotion) is a Network of Excellence in the EU's Sixth Framework Programme, in the IST (Information Society Technologies) Thematic Priority. HUMAINE aims to lay the foundations for European development of systems that can register, model and/or influence human emotional and emotion-related states and processes - 'emotion-oriented systems'. 2004-2008 Queen's University, Belfast, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Institute of Communication and Computer Systems - National Technical University of Athens, Université de Genève, FPSE, University of Hertfordshire, Istituto Trentino Di Cultura, Université de Paris VIII, Österreichische Studiengesellschaft für Kybernetik, Kungliga Tekniska Högskolan, Stockholm, Universität Augsburg, Università Degli Studi di Bari, Ecole Polytechnique Fédérale de Lausanne, Friedrich-Alexander-Universität, Università Degli Studi di Genova, University of Haifa, Imperial College of Science, Technology and Medicine, Inesc Id - Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento em Lisboa, King's College, London, Centre National De La Recherche Scientifique, University of Oxford, University of Salford, Tel Aviv University, Trinity College, La Cantoche Production, France Télécom SA, T-Systems Nova GmbH, Instituto Superior Técnico, Lisbon, University of Southern California, University of Zagreb, University of Twente, University of Sheffield, University of Leeds, Institute of Cognitive Sciences and Technologies, Rome.
Ermis Emotionally Rich Man-machine Intelligent System, EU project. Scope: the development of a prototype system for human-computer interaction that can interpret its users' attitude or emotional state, e.g. activation/interest, boredom, and anger, in terms of their speech and/or their facial gestures and expressions. Adopted technologies: linguistic and paralinguistic speech analysis and robust speech recognition, facial expression analysis, and interpretation of the user's emotional state using hybrid neurofuzzy techniques, in accordance with the MPEG-4 standard. 2002-2005 ALTEC S.A., ILSP, ICCS-NTUA, EYETRONICS EYE, KUL Belgium, Queens University Belfast, KCL UK, MIT Germany, FRANCE TELECOM, BRITISH TELECOM
JST/CREST ESP (Japan Science and Technology) / (Core Research for Evolutional Science and Technology) Expressive Speech Processing. The goal of the five-year ESP Project is to produce a corpus of natural daily speech in order to design speech technology applications that are sensitive to the various ways in which people use changes in speaking style and voice quality to signal the intentions underlying each utterance, i.e., to add information to spoken utterances beyond that carried by the text or the words in the speech alone. The corpus is to include emotional speech, but also samples to illustrate attitudinal aspects of speech, such as politeness, hesitation, friendliness, anger, and social distance. The most obvious applications of the resulting technology will be in speech synthesis, but the research also involves speech-recognition technology for the labeling and annotation of the speech databases, and the development of a grammar of spoken language in order to take into account supra-linguistic (i.e., paralinguistic and extralinguistic) information. 2000-2005 ATR, NAIST Graduate Institute, and Kobe University
PF-Star Preparing future multisensorial interaction research, EU project. Scope: PF-STAR intends to contribute to establishing future activities in the field of Multisensorial and Multilingual communication (Interface Technologies) on firmer bases by providing technological baselines, comparative evaluations, and assessments of the prospects of core technologies, from which future research and development efforts can build. To this end, the project will address three crucial areas: technologies for speech-to-speech translation, the detection and expression of emotional states, and core speech technologies for children. For each of them, promising technologies/approaches will be selected, further developed and aligned towards common baselines. The results will be assessed and evaluated with respect to both their performance and future prospects. 2002-2004 ITC - irst, Italy, RWTH Computer Science Department, Germany, Institute for Pattern Recognition of Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, Interactive Systems Laboratories at Universitaet Karlsruhe, Germany, Kungl Tekniska Högskolan, Sweden, Department of Electronic, Electrical & Computing Engineering of the University of Birmingham, United Kingdom, Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova - "Fonetica e Dialettologia", Italy
Neca European Project: NECA promotes the concept of multi-modal communication with animated synthetic personalities. A particular focus in the project lies on communication between animated characters that exhibit credible personality traits and affective behavior. The key challenge of the project is the fruitful combination of different research strands, including situation-based generation of natural language and speech, semiotics of non-verbal expression in situated social communication, and the modelling of emotions and personality. 2001-2003 OFAI - Austrian Research Institute for Artificial Intelligence, DFKI German Research Center for Artificial Intelligence, FREESERVE UK internet portal, Information Technology Research Institute at the University of Brighton, Sysis Interactive Simulations AG, Institute of Phonetics at the University of the Saarland
Interface European Project to make man/machine interaction more natural. The objective of the project is to define new models and implement advanced tools for audio-video analysis, synthesis and representation in order to provide essential technologies for the implementation of large-scale virtual and augmented environments. 2000-2002 DIST - University of Genoa, L&H Belgium, Image Coding Group - Linköping University Sweden, Universitat Politecnica de Catalunya, Ecole Polytechnique Fédérale de Lausanne, Université de Genève, FPSE, Informatics and Telematics Institute Greece, Tecnologia Automazione Uomo, University of Maribor Slovenia, Curtin University of Technology Australia, Umea University Sweden, Centre National de la Recherche Scientifique France, W Interactive SA Fra
Mimic Voice, Accent and Emotion Adaptive Text to Speech Synthesis. The objective of this project is to develop a speaker-adaptive text to speech synthesis with applications to high quality automatic voice dialogue, personalised voice for the disabled, broadcast studio voice processing, interpreted telephony, very low bit rate phonetic speech coding, and multimedia communication. 2000 ? Brunel University, London
TU-Berlin DFG project: Phonetic reduction and elaboration in emotional speech ("Phonetische Reduktion und Elaboration bei emotionaler Sprechweise"). Recording and analysis of a German emotional database. 1998-2000 TU-Berlin
EMOVOX Voice variability related to speaker emotional state in Automatic Speaker Verification, extending the VeriVox approach. This project aims to systematically explore the effects of transient speaker-state changes on the acoustic speech signal, by plotting the space of speaker state-dependent variation in speech. Based upon the knowledge gained by analyzing the speech recorded from speakers in a number of induced cognitive and emotional states, new methods of structured training in ASV systems will be developed. To collect emotional speech data for the Emovox project, we will create an interactive computer program designed to induce various target emotions and stress in speakers. 1999-2000 ? Université de Genève, FPSE
SUSAS Speech Under Simulated and Actual Stress. This database was put together by Duke University and the Air Force Research Laboratory for researchers who are interested in the characteristics and effects of speech under stress on speech processing recognizers. SUSAS was created at the Robust Speech Processing Laboratory in the Department of Electrical and Computer Engineering at Duke University. 1997 ? Univ. of Colorado at Boulder, CSLR, RSPL
VeriVox Voice Variability in Speaker Verification. The main aim of VeriVox is to improve the reliability of automatic speaker verification (ASV), by developing novel, phonetically-informed methods for coping with the variation in a speaker's voice. 1996 ? Queen Margaret College, Edinburgh, IKP, Rheinische Friedrich-Wilhelms-Universität Bonn, Faculté de Psychologie et des Sciences de l'Education, University of Geneva, University of Cambridge, Trinity College, Dublin, University of Cambridge, Laboratoire Parole et Langage, CNRS Université de Provence, Ensigma,CNRS, Délégation Régionale Normandie
Univ. of Reading the Emotion in Speech Project. Of the many types of suprasegmental and affective information that have been found to occur in speech, relatively few have been coded in such a way as to permit inclusion of them in large-scale machine-readable speech databases. However, as the demand for more natural and unconstrained speech material grows, it becomes increasingly necessary to look at ways of doing this. This project brought together expertise in phonetics and phonology and in cognitive psychology in order to examine emotional speech and to produce a database of such speech to put alongside the emotionally neutral material found in most spoken language databases. 1995-2000 ? University of Reading, Department of Psychology at the University of Leeds
VAESS Voices, Attitudes and Emotions in Speech Synthesis. The aim of the VAESS project is to develop a fully portable (hand-held) communicator with versatile, high quality speech output. By combining the latest advances in speech technology with state-of-the-art hardware, the capabilities of current speech prostheses will be extended. A deliberate choice was made to base the communicator around a standard personal computer. The potential uses of the portable communicator are then increased to encompass those of any equivalent computer; there are significant benefits from this in the workplace where standard computer applications are needed. The VAESS Project is funded by the European Union Technology Initiative for the Disabled and Elderly Programme (TIDE). 1995 ? Sheffield University, Center for PersonKommunikation at the University of Aalborg, Department of Speech and Music Acoustics at KTH in Stockholm, Telia Promoter Infovox AB in Stockholm, BiDesign Ltd in Tamworth, Barnsley District General Hospital NHS Trust
VOX - 6298 The Analysis and Synthesis of Speaker Characteristics. The VOX Working Group is investigating speech databases with different types of speakers, different affective conditions of emotion and attitude, and different casual versus careful styles of speaking: each considered with reference to acoustic, perceptual and physiological representation. Speech synthesis can be used to empirically test such characterisations. 1992-1995 Université de Genève, IKP-Universität Bonn, CNRS-Institut de Phonetique, CNRS - LIMSI, Trinity College Dublin, KTH, University of Cambridge, University of Reading, University of Sheffield




comments: felixbur@gmx.de

