

Expressive Synthetic Speech

emotional faces (pictures taken from Paul Ekman)

last update: March 23rd 2023

This is a collection of examples of synthetic affective speech conveying an emotion or natural expression, maintained by Felix Burkhardt. Some of these samples are direct copies from natural data; others are generated by expert rules or derived from databases. The emotional labels "anger", "fear", "joy" and "sad" are my (short) designators for "the big four" basic emotions, not necessarily the authors' ones.
Examples of German actors simulating emotional arousal can be found here.
I recently gave a talk on emotional speech synthesis.
Examples of German text-to-speech synthesizers can be found here.
Please feel encouraged to let me know about your own or missing attempts to simulate emotional speech! (fxburk@gmail.com)


All audio files are in MP3 format (128 or 64 kbit/s)

author visual affil. year (approx) description neutral anger joy sad fear
N. Audibert, V. Aubergé, A. Rilliard ICP logo ICP 2006 Copy prosody and intensity from satisfied and sad speech to neutral speech using PSOLA technique. See the article "The Prosodic Dimensions of Emotion in Speech: the Relative Weights of Parameters", Interspeech 2005 (Lisbon), for details mp3 - mp3 mp3 -
Stephan Baldes DFKI logo DFKI 1999 Rule based emotion simulation with Entropic's formant TTS engine TrueTalk, based on Cahn's affect editor approach - mp3 - mp3 mp3
Roberto Barra Chicote gth logo Polytechnic Univ. of Madrid 2010 HMM based emotional speech synthesis, i.e. emotional prosodic and source models applied to HMM-coded speech data. See the article "Roberto Barra-Chicote, Junichi Yamagishi, Simon King, Juan Manuel Montero, Javier Macias-Guarasa, Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech, Speech Communication, Volume 52, Issue 5, 2010" for details. mp3 mp3 mp3 mp3 mp3
Murtaza Bulut, Carlos Busso, Serdar Yildirim, Abe Kazemzadeh, Chul Min Lee, Sungbok Lee, Shrikanth Narayanan emotion logo by M. Bulut USC/Sail 2005 Emotional voice-conversion by changing prosody (TD-PSOLA) and spectrum (LPC modification). The samples demonstrate neutral to target-emotion conversion. See the article "Investigating the role of phoneme-level modifications in emotional speech resynthesis", Proc Interspeech 2005 for details mp3 mp3 mp3 mp3 -
Murtaza Bulut, Shri Narayanan, Ann Syrdal emotion logo by M. Bulut USC/Sail, AT&T 2002 Diphone synthesis done with hand-crafted diphones and copy-prosody for the appropriate emotion. See the article "Expressive Speech Synthesis Using a Concatenative Synthesizer", Proc. ICSLP 2002, for details mp3 mp3 mp3 mp3 -
Felix Burkhardt emofilt logo Deutsche Telekom Laboratories 2005 emofilt: rule-based simulation (prosody only) with MBROLA (a minimal sketch of this kind of rule-based prosody manipulation appears below the table).
German male voice (de6) (neutral prosody txt2pho)
mp3 mp3 mp3 mp3 mp3
English male voice (en1) mp3 mp3 mp3 mp3 mp3
Spanish male voice (es2) mp3 mp3 mp3 mp3 mp3
Mandarin Chinese female voice (cn1) mp3 mp3 mp3 mp3 mp3
French male voice (fr1) mp3 mp3 mp3 mp3 mp3
Greek male voice (gr2) mp3 mp3 mp3 mp3 mp3
Dutch male voice (nl2) mp3 mp3 mp3 mp3 mp3
Hungarian male voice (hu1) mp3 mp3 mp3 mp3 mp3
Italian male voice (it3) mp3 mp3 mp3 mp3 mp3
Turkish male voice (tr1) mp3 mp3 mp3 mp3 mp3
1998 emofilt old version mp3 mp3 mp3 mp3 mp3
2000 emoSyn 1: rule based simulation with formant-synthesizer (Sensyn Version), the neutral sentence is copy-synthesis, more examples here mp3 mp3 mp3 mp3 mp3
2000 emoSyn 2: copy synthesis with formant-synthesizer (Iles & Simmons Version) orig. mp3 mp3 mp3 mp3
synth. mp3 mp3 mp3 mp3
1998 esps2mbrola: prosody-copy synthesis with MBROLA orig. mp3 mp3 mp3 mp3
synth. mp3 mp3 mp3 mp3
Joao P. Cabral INESC-ID logo L2F INESC-ID Lisboa 2005 Transformation of neutral speech using LP-PSOLA. Changes pitch, duration and energy as well as voice quality (by transforming the residual). Emotion rules were derived from the literature. See the article "Pitch-Synchronous Time-Scaling for Prosodic and Voice Quality Transformations", Proc. Interspeech 2005 for details.
male voice
mp3 mp3 mp3 mp3 mp3
female voice mp3 mp3 mp3 mp3 mp3
Janet Cahn Affect Editor logo MIT 1989 ? Affect Editor: rule based emotion simulation with DecTalk - mp3 mp3 mp3 mp3
Piero Cosi, Fabio Tesser, Roberto Gretter, Carlo Drioli, Graziano Tisato ISTC logo ISTC-SPFD 2004 Emotive Mbrola: Italian concatenation synthesis with the Festival speech synthesis framework and MBROLA voices. The prosody was learned from an emotional database (CART). An article appeared at Interspeech 2005.
male voice
mp3 mp3 mp3 mp3 mp3
female voice mp3 mp3 mp3 mp3 mp3
with manipulation of voice quality male voice mp3 mp3 mp3 mp3
female voice mp3 mp3 mp3 mp3
Coqui.ai cartoon pictures of emotions from KTH Coqui 2023 Ana Florence (one of many voices): Deep learning based latent space synthesis. mp3 mp3 mp3 mp3
Björn Granström, Rolf Carlson ? cartoon pictures of emotions from KTH KTH 1998 ? KTH Royal Institute of Technology (orig. link broken): Swedish copy formant synthesis. mp3 mp3 mp3 mp3 -
author visual affil. year (approx) description neutral anger joy sad fear
Akemi Iida Chatr logo ATR 2000 ? Chatr Emotion: Japanese concatenation synthesis using emotional databases with CHATR. See the article "A Speech Synthesis System with Emotion for Assisting Communication", Proc. ISCA Workshop on Speech and Emotion, Belfast, 2000. Since then CHATR has been expanded into NATR - mp3 mp3 mp3 -
new CHATR with Emotion - mp3 mp3 mp3 -
David from IRCAM logo IRCAM 2015 Not a speech synthesizer, but re-synthesis to "emotionalize" existing speech by pitch-shifting, filtering, adding vibrato and inflections (a crude sketch of this kind of signal-level transformation appears below the table). See the article "DAVID: An open-source platform for real-time transformation of infra-segmental emotional cues in running speech", Behaviour Research Methods (2017) for details mp3 - mp3 mp3 mp3
Gregor Hofer Univ. Edinburgh school of informatics logo Univ. Edinburgh 2004 Gregor Hofer's master's thesis, unit selection database recorded in neutral as well as happy and angry style. More samples. mp3 mp3 mp3 - -
half-emotional by mixing units from neutral and emotional data mp3 mp3 - -
Ignasi Iriondo, Francesc Alías, Javier Melenchón, M. Angeles Llorca La Salle logo Univ. Ramon Lull 2003 Catalan diphone synthesis with emotion rules, see article "Modeling and Synthesizing Emotional Speech for Catalan Text-to-Speech Synthesis" (Proc. ADS 2004) for details - mp3 mp3 mp3 mp3
Syaheerah L. Lutfi logo Polytechnic Univ. of Madrid 2006 Template/rule-based synthesis of prosody parameters using Mbrola synthesis, see the article "Syaheerah L. Lutfi, Raja N. Ainon and Zuraidah M. Don. eXpressive Text Reader Automation Layer (eXTRA): Template driven ‘Emotion Layer’ in Malay Concatenated Synthesized Speech, Proc. of International Conference on Speech Databases and Assessment (Oriental COCOSDA 06), Penang, Malaysia, Dec 2006" for details. mp3 mp3 - - -
Cynthia Breazeal picture of Kismet MIT 2000 Kismet: rule based emotion simulation with DecTalk (like Affect Editor) mp3 mp3 mp3 mp3 mp3
with words mp3 mp3 mp3 mp3 mp3
Keisuke Miyanaga, Makoto Tachibana, Junichi Yamagishi, Koji Onishi, Takashi Masuko, Takao Kobayashi kobayashi lab logo Tokyo Institute of Technology, Kobayashi Lab. 2004 HMM (data-based) modeling of emotional expression (spectral and prosodic), enables mixing: see the article HMM-Based Speech Synthesis with Various Speaking Styles Using Model Interpolation or further demos: each emotion individually modeled mp3 mp3 mp3 mp3 -
emotion as contextual factor (like phonetic/linguistic factors) mp3 mp3 mp3 mp3 -
Juan M. Montero Martínez speechgroup uni madrid logo Univ. Madrid 1998 ? montero 1: rule based emotion simulation with Spanish diphone-synthesizer mp3 mp3 mp3 mp3 -
montero 2: rule based emotion simulation with KTH-formant-synthesizer mp3 mp3 mp3 mp3 -
Shinya Mori, Tsuyoshi Moriyama, Shinji Ozawa t-kougei logo Dept. of Media and Image Technology, Tokyo Polytechnic University 2008 PSOLA transformation based on database-trained prosody modification rules. The algorithm includes the possibility of graded expressions. See the article "Emotional Speech Synthesis Using Subspace Constraints in Prosody", Proc. ICME 2006, or follow this link half mp3 mp3 mp3 -
full mp3 mp3 mp3 -
Iain Murray HAMLET logo Univ. Dundee 1989 ? HAMLET: rule-based simulated emotions with formant speech synthesis mp3 mp3 mp3 mp3 mp3
1997 Laureate: hand optimized concatenation tts mp3 mp3 mp3 mp3 mp3
1997 LAERTES: rule-based simulation using laureate mp3 mp3 mp3 mp3 mp3
Pierre-yves Oudeyer sony cartoon image Sony 2000 cartoon speech: nonsense speech for Sony pet robots based on concatenative synthesis. Article: Oudeyer P-Y. "The Synthesis of Cartoon Emotional Speech", Proc. of the 1st International Conference on Prosody, Aix-en-Provence, eds. B. Bel; I. Marlien. 2002 low intensity mp3 mp3 mp3 -
high intensity mp3 mp3 mp3 -
with Japanese words mp3 mp3 mp3 -
? protalker robot image IBM Tokyo Research Laboratory 1996 ProTalker, from the IBM Tokyo Research Laboratory. mp3 mp3 mp3 - -
S. R. M. Prasanna and D. Govind Öfai logo Indian Institute of Technology Guwahati 2010 Modeling pitch and excitation (voice source and EGG estimated with the zero frequency filtering approach) with linear prediction using the EMO-DB. See the article "Analysis of Excitation Source Information in Emotional Speech" from Interspeech 2010, Makuhari, for details. - mp3 mp3 - mp3
Erhard Rank/Hannes Pirker Öfai logo ÖFAI 1998 VieCtoS: demos of the demisyllable LPC-synthesizer VieCtoS of the Austrian Research Institute for Artificial Intelligence (ÖFAI) copying emotional speech mp3 mp3 mp3 mp3 mp3
Marc Schröder MARY logo DFKI 2002 While working on the NECA project and his Ph.D. thesis, Marc developed a system capable of producing emotional speech based on a description in emotional dimensions (arousal, valence, potency). I tried to map his results to basic emotions. The system is based on MBROLA as DSP and MARY as NLP. In order to control voice quality, six databases for MBROLA were developed: normal, tense and lax voice for a male and a female speaker each. - mp3 mp3 mp3 mp3
Ingmar Steiner and Marc Schröder, and Marcela Charfuelan MARY logo Mary DFKI Pavoque project 2011 Non-uniform unit selection with a database annotated with expressive styles, see the article "Steiner, Ingmar, Marc Schröder, Marcela Charfuelan (2010) "Symbolic vs. acoustics-based style control for expressive unit selection" 7th ISCA Workshop on Speech Synthesis, Kyoto, Japan", for details. mp3 mp3 mp3 mp3 -
Jun Sato ? 1998 ? Japanese demo of emotional synthetic speech generated by artificial neural networks. Article: J. Sato and S. Morishima, "Emotion modeling in speech production using emotion space", in Proc. IEEE Int. Workshop on Robot and Human Communication, Tsukuba, Japan, Nov. 1996 mp3 mp3 mp3 mp3 -
Oytun Türk and Marc Schröder dfki logo DFKI 2008 GMM based voice conversion, i.e. the spectral envelope of neutral speech gets transferred to the target emotional speech in order to simulate emotional voice quality without having to record a whole emotional database. See the article "A Comparison of Voice Conversion Methods for Transforming Voice Quality in Emotional Speech Synthesis", Proc. Interspeech 2008, Brisbane. orig. mp3 mp3 mp3 mp3 -
synth. (GMM method) mp3 mp3 mp3 -
E. Zovato, A. Pacchiotti, S. Quazza, S. Sandri Loquendo logo Loquendo 2004 From Loquendo. Rule-based prosody PSOLA-like manipulation of a non-uniform unit-selection engine, see Towards emotional speech synthesis: a rule based approach. Presented at the 5th ISCA Workshop on Speech Synthesis in Pittsburgh 2004. other examples from Loquendo mp3 mp3 mp3 mp3 -
Kun Zhou logo missing National University of Singapore 2023 Deep learning approach, actually featuring mixed emotions, see the paper Speech Synthesis with Mixed Emotions - mp3 mp3 mp3 -
Yuexin Cao logo missing Ping An Technology 2020 Deep learning approach, combining VAEs with GANs; the emotion labels used for the conversion are supervised. See the paper Nonparallel Emotional Speech Conversion Using VAE-GAN mp3 mp3 mp3 mp3 -
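To make the rule-based prosody approaches in the table above (e.g. emofilt on top of MBROLA) more concrete, here is a minimal Python sketch that rewrites an MBROLA .pho file by scaling phoneme durations and F0 targets. The scaling factors and the "joy-like" setting are illustrative assumptions, not the actual emofilt rule set.

```python
# Minimal sketch of rule-based prosody modification on an MBROLA .pho file.
# The factors below are illustrative assumptions, not the emofilt rules.

def modify_pho(lines, dur_factor=0.8, f0_factor=1.3):
    """Scale phoneme durations and F0 targets of MBROLA .pho lines.

    Each .pho line has the form:  phoneme duration [pos f0 pos f0 ...]
    where pos is a percentage of the phoneme duration and f0 is in Hz.
    dur_factor < 1 and f0_factor > 1 give a rough "joy-like" colouring.
    """
    out = []
    for line in lines:
        parts = line.split()
        if not parts or line.startswith(";"):      # keep comments and empty lines
            out.append(line)
            continue
        phoneme, duration = parts[0], float(parts[1])
        targets = [float(x) for x in parts[2:]]
        new = [phoneme, str(int(round(duration * dur_factor)))]
        for i, value in enumerate(targets):
            if i % 2 == 0:                          # position in % of duration: unchanged
                new.append(str(int(round(value))))
            else:                                   # F0 target in Hz: scaled
                new.append(str(int(round(value * f0_factor))))
        out.append(" ".join(new))
    return out

if __name__ == "__main__":
    neutral = ["; neutral input", "_ 100", "h 60 50 110", "a 120 20 115 80 120", "_ 100"]
    for line in modify_pho(neutral):
        print(line)
```

The modified .pho file would then be rendered with an MBROLA voice, e.g. `mbrola de6 joyful.pho joyful.wav` (file names assumed).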
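Several of the entries above (e.g. DAVID or the PSOLA-based transformations) modify existing recordings rather than synthesising speech from scratch. The sketch below shows the general idea with off-the-shelf global pitch shifting and time stretching from librosa; the parameter values and the input file name are illustrative assumptions, and the result is far cruder than the published systems, which vary the modifications over time and add effects such as vibrato and filtering.

```python
# Rough illustration of "emotionalising" an existing recording by global
# pitch and tempo changes; published systems are much more sophisticated.
import librosa
import soundfile as sf

y, sr = librosa.load("neutral.wav", sr=None)          # hypothetical input recording

# Crude global settings (assumed values, not taken from any of the papers).
settings = {
    "joy_like": {"n_steps": 2.0, "rate": 1.1},        # higher pitch, faster
    "sad_like": {"n_steps": -2.0, "rate": 0.85},      # lower pitch, slower
}

for name, p in settings.items():
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=p["n_steps"])
    stretched = librosa.effects.time_stretch(shifted, rate=p["rate"])
    sf.write(f"{name}.wav", stretched, sr)
```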

examples not simulating the "big four"

author visual affil. year (approx) description samples
Acapela Acapela Acapela 2013 Non-uniform unit selection synthesis with different emotional and style-varying voices.
UK English voices: Elizabeth; Peter neutral; Peter happy; Peter sad
mp3 mp3 mp3 mp3
2013 Non-uniform unit selection synthesis with different emotional and style-varying voices.
French voice Antoine: neutral; happy; sad; from afar; up close
mp3 mp3 mp3 mp3 mp3
2013 Non-uniform unit selection synthesis with different emotional and style-varying voices.
US English voice Will: neutral; happy; sad; bad guy; from afar; up close
mp3 mp3 mp3 mp3 mp3 mp3
Baird, Amiriparian and Schuller logo University of Augsburg 2020 WaveNet (deep learning) based approach to generate new emotional speech, trained on the Italian DEMoS corpus in order to enrich the training samples; see the article for details
Simulation of joy Simulation of sadness
mp3 mp3
Greg Beller Greg Beller IRCAM 2005 Transforms real voices via a content-based transformation with a phase-vocoder algorithm. Time-stretch and transposition coefficients change over the utterance depending on the expressivity and the context of the units (e.g. whether they are part of an accented syllable or whether they are consonants).
original; sad transformation; bored transformation; frightened transformation
mp3 mp3 mp3 mp3
2009 Synthesis of laughter based on a spring model and the repetition of specific phones. Examples found here
synthesized laughter
mp3
? Cepstral logo Cepstral 2004 Non-uniform unit selection. Databases recorded with a certain style.
Damian: dark personality Duchess: sensitive voice Shouty: non-sensitive voice
mp3 mp3 mp3
? Eloquent logo ETI Eloquence (as Eloquent belongs to Scansoft now, the link is broken) 1998 Rule-based formant synthesis. The emotional expression was hand-optimized (demonstration from Eloquent)
expressive male expressive female
mp3 mp3
Ellen Eide et al IBM logo IBM Watson research Center 2004 Non-uniform unit-selection trained with an expressive prosody model, paralinguistic events and expressive units. Described in the article A Corpus-Based Approach to <AHEM/> Expressive Speech Synthesis
good news bad news question other for comparison: IBM-CTTS taken from the website. Note that the research engine is more advanced technology than the product.
mp3 mp3 mp3 mp3 mp3
Chengyi Wang et al. Microsoft VALL-E 2023 Neural speech synthesis that takes a text and a short (3 sec) speech sample as input and generates audio with the textual content in the style and voice of the speech sample
Angry Sleepy Neutral Amused Disgusted
human: mp3 mp3 mp3 mp3 mp3
tts: mp3 mp3 mp3 mp3 mp3
IBM Watson IBM logo IBM Watson Developer Cloud 2016 Watson voice Allison: Non-uniform unit-selection in three styles: Good News, Apology, and Uncertain
apology uncertain good news
mp3 mp3 mp3
Eva Szekely et al UCD logo University College Dublin 2011 HMM synthesis based on expressive speech extracted automatically from audiobook readings by clustering glottal source parameters. Described in the article Clustering Expressive Speech Styles in Audiobooks Using Glottal Source Parameters
soft style tense style expressive style
mp3 mp3 mp3
Enrico Zovato et al Loquendo logo Loquendo 2004 Non-uniform unit-selection enriched by paralinguistic events and expressive units. Other examples from Loquendo.
German (Kathrin)/(Stefan) French (Juliette) English (Simon)
mp3 / mp3 mp3 mp3
F. Malfrere MBROLIGN logo MBROLIGN from TCTS Lab, Mons 1999 data-based prosody synthesis with MBROLA (the decision-tree algorithm was trained on a database spoken with the appropriate affect)
neutral nervous astonished shy
mp3 mp3 mp3 mp3
Google Deepmind WaveNet logo Google Deepmind WaveNet 2016 Synthesis by PCM value prediction from a deep neural net. See the paper "WaveNet: A Generative Model for Raw Audio" by van den Oord et al. for details (a small sketch of the mu-law companding used in the paper appears below the table). Samples for models trained on three different speakers. One sample is speech without a given text, generated sample by sample from the model's predictive distribution.
Speaker 1 Speaker 2 Speaker 3 No text given
mp3 mp3 mp3 mp3
Kaiser and Schallner logo Nuremberg Institute for Market Decisions 2020 The system uses a Tacotron (deep learning) based architecture utilizing Global Style Tokens (GST) and text prediction for style embeddings (TPSE) for an emotional speech synthesizer, built to investigate speech assistants in German; see the paper for a description.
Excited Happy Uninvolved
mp3 mp3 mp3
Ivo-Software logo Ivona 2011 Non-uniform unit selection synthesis, character voice
Chipmunk
mp3
John Stallo Metaface face Metaface from Curtin University 2000 Implementation of Virtual Human Markup Language, which includes emotional expression. TTS based on Festival and John Stallo's work on "Simulating Emotional Speech for a Talking Head" which implements prosody rules.
happy-cry happy-go-lucky so-afraid
mp3 mp3 mp3
Jürgen Trouvain, Sarah Schmidt, Marc Schröder, William Barry logo Personality modeling from Saarland University 2006 Modeling of different personality types with the DFKI Mary system. Modification was done with respect to pitch range and level, speech tempo and voice loudness. See the article "Modelling personality features by changing prosody in synthetic speech", Proc. Speech Prosody 2006, Dresden, for details.
neutral competent excited rugged sincere sophisticated
mp3 mp3 mp3 mp3 mp3 mp3
Meta (facebook) logo dGSLM 2022 Generating chit-chat, trained on human conversations with (textless) transformer models.
dialog generative spoken language modeling
mp3
? modeltalker logo ModelTalker from the University of Delaware 1998 Biphone synthesis (the biphone inventory is searched for the best diphones at synthesis time). The emotional expression was generated by prosody rules
neutral happy surprised frustrated sad contradictive assertive
mp3 mp3 mp3 mp3 mp3 mp3 mp3
Nick Campbell. further samples NATR logo NATR from ATR 2004 Non-uniform unit-selection working on a very large database (recording a woman for 3 years) including affective labeling and extralinguistic sounds. See the article "Extra-Semantic Protocols; Input Requirements for the Synthesis of Dialogue Speech" (Proc. ADS 2004) for details.
sample of a telephone conversation between NATR (female voice) and a young man (originally talking to his friend).
mp3
? image of eduardo Rhetorical, now owned by Scansoft. Eduardo 2002 non-uniform unit-selection synthesis done by Rhetorical, character for a Jim Beam marketing campaign, manually optimized
in conversation with Rhetorical's voice "American valley girl"
mp3
Marc Schröder DFKI logo DFKI 2002 more examples (see above)
excited scared sad angry 1 angry 2 angry 3 bored content 1 content 2 content 3 happy
mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3 mp3
MARY DFKI logo DFKI 2005 Limited domain unit-selection with emotional units.
standard EmotionML sample with Pavoque database from the Mary 5.1 version excited soccer reports
mp3 mp3
Mariet Theune human media interaction logo Human Media Interaction lab from Univ. Twente 2006 Extracting prosodic rules to enhance the speaking style of storytellers and applying them to TTS. See the article "Generating Expressive Speech for Storytelling Applications" by Mariet Theune, Koen Meijs, Dirk Heylen and Roeland Ordelman. IEEE Transactions on Audio, Speech and Language Processing 14(4)
no suspense sudden climax increasing climax
mp3 mp3 mp3
Olivier Rosec and Yannis Agiomyrgiannakis orange logo Voice transformation based on an ARX-LF vocoder from Orange Labs, Lannion 2009 Voice transformation based on parameter modification with a combination of an LF excitation model and a harmonic model. See the article: Yannis Agiomyrgiannakis, Olivier Rosec: "ARX-LF-based source-filter methods for voice modification and transformation", Proc. ICASSP 09
man original man whisper man breathy man tense woman original woman as man woman as child
mp3 mp3 mp3 mp3 mp3 mp3 mp3
Raul Fernandez and Bhuvana Ramabhadran, IBM TJ Watson research center ibm logo Emphatic Speech 2007 Classifying emphatic speech in a non-uniform unit selection database. See the article "Automatic Exploration of Corpus-Specific Properties for Expressive Text-to-Speech." by Raul Fernandez and Bhuvana Ramabhadran. Proc. 6th ISCA workshop on speech synthesis, Bonn, 2007
baseline neutral units with normal text baseline plus collected emphatic units with marked emphasis text mined emphasis corpus with marked emphasis text
mp3 mp3 mp3
Shiva Sundaram and Shrikanth Narayanan sail logo SAIL 2007 Synthesis of laughter based on a mass-spring model and LPC synthesis. See the article Sundaram, S. and Narayanan, S., "Automatic acoustic synthesis of human-like laughter", JASA, no. 1, pp. 527-535, 2007.
synthesized laughter
mp3
Speechconcept/CereProc logo Cerevoice 2009 Non-uniform unit selection synthesis with emotional expression, Courtesy of Speechconcept
Voice Alex Voice Gudrun Voice Nick Voice Saskia
mp3 mp3 mp3 mp3
Tacotron Google logo Tacotron with Global Style Tokens 2018 Deep artificial neural nets, described in the paper Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. In short: so-called style tokens are learned from reference samples for a specific speaking style in an unsupervised manner and can then be used as embeddings in a neural net that transforms text to spectrograms. The reference samples or text don't need to be part of the original training, and the styles can be applied to varying degrees (a toy sketch of the style-token combination step appears below the table).
Style 1 Style 2 Style 3 Style 4 Style 5
mp3 mp3 mp3 mp3 mp3
Talkback talkingtechnologies logo Talkback 2000 (original link broken) 1998 concatenation synthesis, demonstration from Talking Technologies
expressive male expressive female
mp3 mp3
Tuomo Raitio and Antti Suni hut.fi logo Glott-HMM synthesis 2013 Statistical parametric speech synthesis, the shouted voice was created by adapting the statistical speech models. Part of an analysis-by-synthesis experiment described in the article "Analysis and Synthesis of Shouted Speech" by Tuomo Raitio et al, at Interspeech 2013.
female normal female shouting male normal male shouting
mp3 mp3 mp3 mp3
Voxygen logo Voxygen 2012 French non-uniform unit-selection, different expressive voices
Voice "Papi" reading an arbitrary news item. Dark Vador Yeti Witch
mp3 mp3 mp3 mp3
Yossef Ben-Ezra ? logo Vivotext 2012 Technology unknown, demonstration from webpage
happy child sad child
mp3 mp3
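As a small technical aside to the WaveNet entry above: the WaveNet paper models 8-bit mu-law-companded sample values rather than raw 16-bit PCM. The numpy sketch below shows only that companding step (not the autoregressive network itself).

```python
# Mu-law companding as used to quantise audio into 256 classes in the
# WaveNet paper; the autoregressive network itself is not sketched here.
import numpy as np

def mu_law_encode(x, mu=255):
    """Map samples in [-1, 1] to integer classes 0..mu."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(classes, mu=255):
    """Invert mu_law_encode (up to quantisation error)."""
    compressed = 2.0 * classes.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * ((1.0 + mu) ** np.abs(compressed) - 1.0) / mu

if __name__ == "__main__":
    x = np.sin(np.linspace(0, 2 * np.pi, 16))
    codes = mu_law_encode(x)
    print(codes)                                        # integers in 0..255
    print(np.max(np.abs(x - mu_law_decode(codes))))     # small reconstruction error
```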

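The Global Style Token idea behind the Tacotron GST and the Kaiser/Schallner entries can be summarised in a few lines: a reference encoding attends over a small bank of learned token embeddings, and the attention-weighted sum serves as a style embedding that conditions the decoder. The numpy sketch below shows only that weighted-sum step with random stand-in vectors; the reference encoder, the multi-head attention and the training are omitted.

```python
# Toy illustration of the Global Style Token (GST) combination step:
# a style embedding is an attention-weighted sum over learned token vectors.
# Random vectors stand in for the learned tokens and the reference encoding.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 10, 256

style_tokens = rng.normal(size=(num_tokens, dim))   # learned token bank (stand-in)
reference = rng.normal(size=(dim,))                  # reference-encoder output (stand-in)

# Single-head dot-product attention over the token bank.
scores = style_tokens @ reference / np.sqrt(dim)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

style_embedding = weights @ style_tokens             # conditioning vector for the decoder
print(weights.round(3))                              # how much each token contributes
print(style_embedding.shape)                         # (256,)
```

At synthesis time a style can be selected either by feeding a reference utterance or by setting the token weights directly, which is what makes the degree of a style controllable.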
related examples

Lyrebird logo image Synthesis based on a deep artificial neural net adapted to a new speaker's voice characteristics (here: this page's author) from just about 100 short samples mp3
Preserving Word-level Emphasis in Speech-to-speech Translation logo image Preserving Word-level Emphasis in Speech-to-speech Translation example English to Japanese using Linear Regression HSMMs. As described here or see the Interspeech 2015 article Preserving Word-level Emphasis in Speech-to-speech Translation using Linear Regression HSMMs by Quoc Truong Do, Shinnosuke Takamichi, Sakriani Sakti, Graham Neubig, Tomoki Toda and Satoshi Nakamura
English native speaker Japanese translation without emphasis Japanese translation with emphasis
mp3 mp3 mp3
Accent cloning based on articulatory adaptation logo image Copying the accent of a native speaker to a learning non-native speaker by driving a (copy-synthesis) articulatory synthesizer for the learner with gestures from the native speaker (a toy sketch of the underlying regression-mapping idea appears at the end of this list). As described here or see the Interspeech 2015 article Articulatory-based conversion of foreign accents with deep neural networks by Sandesh Aryal and Ricardo Gutierrez-Osuna
MFCC resynthesized native speaker MFCC resynthesized learning speaker DNN based accent conversion
mp3 mp3 mp3
Voice cloning MSR image A multilingual speech synthesizer that learned one speaker's voice. From Microsoft Research (Frank Soong) article
English (source) Italian Mandarin Spanish
mp3 mp3 mp3 mp3
Vocal Tract Lab image of Vocal Tract Lab articulatory synthesizer singing "Dona nobis pacem" mp3
Glove Talk image of glovetalk formant synthesis controlled by a data glove; for further information see Sidney S. Fels and Geoffrey E. Hinton, "Glove-TalkII: A neural network interface which maps gestures to parallel formant speech synthesizer controls", IEEE Transactions on Neural Networks, vol. 9, no. 1, pp. 205-212, 1998, or see Sageev Oore demonstrate Glove Talk (43 MB) mp3
Pavarobotti image of pavarobotti the singing robot: articulatory synthesis mp3
HAL HAL image From the film "2001: A Space Odyssey", the ideal (NOT a synthesizer but an actor!)
content sad/threatening regretful
mp3 mp3 mp3
First song in synthetic speech image of pavarobotti "Bicycle Built for Two", Bell Laboratories, by Louis Gerstman and Max Mathews, 1961. This song was reprised by the `decorticated' Hal. mp3
ESPER image of pavarobotti Extracting Speaker Information From Children's Stories for Speech Synthesis: non-uniform unit selection mp3
Feelix from Lola Canamero and Jacob Fredslund image of Feelix Emotional music by Rasmus Lunding for a LEGO robot displaying emotions through facial expressions. mp3
Luong Hieu-Thi et al University of sciences Vietnam logo University of Science, Ho Chi Minh City, Vietnam, 2016.
An experiment to construct a DNN-based synthesizer that takes phonetic and linguistic features as input and outputs acoustic features usable by a STRAIGHT synthesizer, trained on limited data from about 100 speakers. Speaker ID, age and gender information is added to the input, and several configurations are tested with respect to speaker adaptation, morphing, and age and gender conversion by interchanging the relevant features while keeping the others fixed. See the paper "Adapting and Controlling DNN-based Speech Synthesis using Input Codes" for details. Synthesized samples are from the 112-dimension Discriminant Condition configuration.
synthesized with model original/new speaker original/gender change original/younger original/older
mp3 mp3/ mp3 mp3/ mp3 mp3/ mp3 mp3/ mp3
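Both the emphasis-transfer and the accent-conversion entries above rely on learned mappings between paired source and target parameter frames (linear-regression HSMMs and DNNs, respectively). As a loose, generic illustration of that idea only (not the method of either paper), the sketch below fits a plain least-squares affine map between paired feature frames with numpy.

```python
# Generic least-squares mapping between paired source and target feature
# frames, as a loose stand-in for the regression-based adaptation idea;
# the actual systems use LR-HSMMs or DNNs on aligned speech parameters.
import numpy as np

rng = np.random.default_rng(1)
frames, dim = 500, 25                        # e.g. 25-dim spectral features (toy numbers)

source = rng.normal(size=(frames, dim))      # stand-in for source-speaker frames
true_map = rng.normal(size=(dim + 1, dim))   # hidden "true" affine map used to fake targets
X = np.hstack([source, np.ones((frames, 1))])
target = X @ true_map + 0.01 * rng.normal(size=(frames, dim))

# Fit an affine map  target ~ [source, 1] @ W  by least squares.
W, *_ = np.linalg.lstsq(X, target, rcond=None)

converted = X @ W                            # apply the learned map
print(np.mean((converted - target) ** 2))    # small residual on this toy data
```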



projects that deal with affective speech

name description timeframe partners
Semaine The Semaine project is an EU-FP7 1st call STREP project and aims to build a SAL, a Sensitive Artificial Listener: a multimodal dialogue system which can interact with humans via a virtual character, sustain an interaction with a user for some time, and react appropriately to the user's non-verbal behaviour. In the end, this SAL system will be released to a large extent as an open-source research tool to the community. 2008-2011 DFKI, CNRS, Imperial, Queen's Belfast, Uni. Twente, TU Munich
Speech and Emotion We study the effects of emotional state on speech, as well as the effects of emotional speech on listeners. This is achieved by recording skin conductance, goose bumps, blood pressure, and other peripheral measures of emotional state as well as vocal parameters. The outcome of these analyses is not only useful for basic research in emotion psychology, but also for clinical research and forensic studies, in the areas of developmental and pedagogical psychology, as well as in industrial and organizational psychology. 10/2007-7/2008 Univ. Kiel, LMU Munich
HUMAINE HUMAINE (Human-Machine Interaction Network on Emotion) is a Network of Excellence in the EU's Sixth Framework Programme, in the IST (Information Society Technologies) Thematic Priority. HUMAINE aims to lay the foundations for European development of systems that can register, model and/or influence human emotional and emotion-related states and processes - 'emotion-oriented systems'. 2004-2008 Queen's University, Belfast, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Institute of Communication and Computer Systems - National Technical University of Athens, Université de Genève, FPSE, University of Hertfordshire, Istituto Trentino Di Cultura, Université de Paris VIII, Österreichische Studiengesellschaft für Kybernetik, Kungliga Tekniska Högskolan, Stockholm, Universität Augsburg, Università Degli Studi di Bari, Ecole Polytechnique Fédérale de Lausanne, Friedrich-Alexander-Universität, Università Degli Studi di Genova, University of Haifa, Imperial College of Science, Technology and Medicine, Inesc Id - Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento em Lisboa, King's College, London, Centre National De La Recherche Scientifique, University of Oxford, University of Salford, Tel Aviv University, Trinity College, La Cantoche Production, France Télécom SA, T-Systems Nova GmbH, Instituto Superior Técnico, Lisbon, University of Southern California, University of Zagreb, University of Twente, University of Sheffield, University of Leeds, Institute of Cognitive Sciences and Technologies, Rome.
Ermis Emotionally Rich Man-machine Intelligent System, EU project. Scope: the development of a prototype system for human-computer interaction that can interpret its users' attitude or emotional state, e.g. activation/interest, boredom, and anger, in terms of their speech and/or their facial gestures and expressions. Adopted technologies: linguistic and paralinguistic speech analysis and robust speech recognition, facial expression analysis, and interpretation of the user's emotional state using hybrid neurofuzzy techniques, in accordance with the MPEG-4 standard. 2002-2005 ALTEC S.A., ILSP, ICCS-NTUA, EYETRONICS EYE, KUL Belgium, Queens University Belfast, KCL UK, MIT Germany, FRANCE TELECOM, BRITISH TELECOM
JST/CREST ESP (Japan Science and Technology) / (Core Research for Evolutional Science and Technology) Expressive Speech Processing. The goal of the five-year ESP Project is to produce a corpus of natural daily speech in order to design speech technology applications that are sensitive to the various ways in which people use changes in speaking style and voice quality to signal the intentions underlying each utterance, i.e., to add information to spoken utterances beyond that carried by the text or the words in the speech alone. The corpus is to include emotional speech, but also samples to illustrate attitudinal aspects of speech, such as politeness, hesitation, friendliness, anger, and social distance. The most obvious applications of the resulting technology will be in speech synthesis, but the research also involves speech-recognition technology for the labeling and annotation of the speech databases, and the development of a grammar of spoken language in order to take into account supra-linguistic (i.e., paralinguistic and extralinguistic) information. 2000-2005 ATR, NAIST Graduate Institute, and Kobe University
PF-Star Preparing future multisensorial interaction research, EU project. Scope: PF-STAR intends to contribute to establishing future activities in the field of Multisensorial and Multilingual communication (Interface Technologies) on firmer bases by providing technological baselines, comparative evaluations, and assessments of the prospects of core technologies, from which future research and development efforts can build. To this end, the project will address three crucial areas: technologies for speech-to-speech translation, the detection and expression of emotional states, and core speech technologies for children. For each of them, promising technologies/approaches will be selected, further developed and aligned towards common baselines. The results will be assessed and evaluated with respect to both their performance and future prospects. 2002-2004 ITC - irst, Italy, RWTH Computer Science Department, Germany, Institute for Pattern Recognition of Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, Interactive Systems Laboratories at Universitaet Karlsruhe, Germany, Kungl Tekniska Högskolan, Sweden, Department of Electronic, Electrical & Computing Engineering of the University of Birmingham, United Kingdom, Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova - "Fonetica e Dialettologia", Italy
Neca European Project: NECA promotes the concept of multi-modal communication with animated synthetic personalities. A particular focus in the project lies on communication between animated characters that exhibit credible personality traits and affective behavior. The key challenge of the project is the fruitful combination of different research strands, including situation-based generation of natural language and speech, semiotics of non-verbal expression in situated social communication, and the modelling of emotions and personality. 2001-2003 OFAI - Austrian Research Institute for Artificial Intelligence, DFKI German Research Center for Artificial Intelligence, FREESERVE UK internet portal, Information Technology Research Institute at the University of Brighton, Sysis Interactive Simulations AG, Institute of Phonetics at the University of the Saarland
Interface European Project to make man/machine interaction more natural. The objective of the project is to define new models and implement advanced tools for audio-video analysis, synthesis and representation in order to provide essential technologies for the implementation of large-scale virtual and augmented environments. 2000-2002 DIST - University of Genoa, L&H Belgium, Image Coding Group - Linköping University Sweden, Universitat Politecnica de Catalunya, Ecole Polytechnique Fédérale de Lausanne, Université de Genève, FPSE, Informatics and Telematics Institute Greece, Tecnologia Automazione Uomo, University of Maribor Slovenia, Curtin University of Technology Australia, Umea University Sweden, Centre National de la Recherche Scientifique France, W Interactive SA Fra
Mimic Voice, Accent and Emotion Adaptive Text to Speech Synthesis. The objective of this project is to develop a speaker-adaptive text to speech synthesis with applications to high quality automatic voice dialogue, personalised voice for the disabled, broadcast studio voice processing, interpreted telephony, very low bit rate phonetic speech coding, and multimedia communication. 2000 ? Brunel University, London
TU-Berlin DFG project: Phonetic reduction and elaboration in emotional speech ("Phonetische Reduktion und Elaboration bei emotionaler Sprechweise"). Recording and analysis of a German emotional database. 1998-2000 TU-Berlin
EMOVOX Voice variability related to speaker emotional state in Automatic Speaker Verification, extending the VeriVox approach. This project aims to systematically explore the effects of transient speaker-state changes on the acoustic speech signal, by plotting the space of speaker state-dependent variation in speech. Based upon the knowledge gained by analyzing the speech recorded from speakers in a number of induced cognitive and emotional states, new methods of structured training in ASV systems will be developed. To collect emotional speech data for the Emovox project, we will create an interactive computer program designed to induce various target emotions and stress in speakers. 1999-2000 ? Université de Genève, FPSE
SUSAS Speech Under Simulated and Actual Stress. This database was put together by Duke University and the Air Force Research Laboratory for researchers who are interested in the characteristics and effects of speech under stress on speech processing recognizers. SUSAS was created at the Robust Speech Processing Laboratory in the Department of Electrical and Computer Engineering at Duke University. 1997 ? Univ. of Colorado at Boulder, CSLR, RSPL
VeriVox Voice Variability in Speaker Verification. The main aim of VeriVox is to improve the reliability of automatic speaker verification (ASV), by developing novel, phonetically-informed methods for coping with the variation in a speaker's voice. 1996 ? Queen Margaret College, Edinburgh, IKP, Rheinische Friedrich-Wilhelms-Universität Bonn, Faculté de Psychologie et des Sciences de l'Education, University of Geneva, University of Cambridge, Trinity College, Dublin, University of Cambridge, Laboratoire Parole et Langage, CNRS Université de Provence, Ensigma,CNRS, Délégation Régionale Normandie
Univ. of Reading the Emotion in Speech Project. Of the many types of suprasegmental and affective information that have been found to occur in speech, relatively few have been coded in such a way as to permit inclusion of them in large-scale machine-readable speech databases. However, as the demand for more natural and unconstrained speech material grows, it becomes increasingly necessary to look at ways of doing this. This project brought together expertise in phonetics and phonology and in cognitive psychology in order to examine emotional speech and to produce a database of such speech to put alongside the emotionally neutral material found in most spoken language databases. 1995-2000 ? University of Reading, Department of Psychology at the University of Leeds
VAESS Voices, Attitudes and Emotions in Speech Synthesis. The aim of the VAESS project is to develop a fully portable (hand-held) communicator with versatile, high quality speech output. By combining the latest advances in speech technology with state-of-the-art hardware, the capabilities of current speech prostheses will be extended. A deliberate choice was made to base the communicator around a standard personal computer. The potential uses of the portable communicator are then increased to encompass those of any equivalent computer; there are significant benefits from this in the workplace where standard computer applications are needed. The VAESS Project is funded by the European Union Technology Initiative for the Disabled and Elderly Programme (TIDE). 1995 ? Sheffield University, Center for PersonKommunikation at the University of Aalborg, Department of Speech and Music Acoustics at KTH in Stockholm, Telia Promoter Infovox AB in Stockholm, BiDesign Ltd in Tamworth, Barnsley District General Hospital NHS Trust
VOX - 6298 The Analysis and Synthesis of Speaker Characteristics. The VOX Working Group is investigating speech databases with different types of speakers, different affective conditions of emotion and attitude, and different casual versus careful styles of speaking: each considered with reference to acoustic, perceptual and physiological representation. Speech synthesis can be used to empirically test such characterisations. 1992-1995 Université de Genève, IKP-Universität Bonn, CNRS-Institut de Phonetique, CNRS - LIMSI, Trinity College Dublin, KTH, University of Cambridge, University of Reading, University of Sheffield




comments: felixbur@gmx.de

