T06: Speech-based Interaction: Myths, Challenges, and Opportunities

HCI International 2016
Toronto, Canada, 17 - 22 July 2016
The Westin Harbour Castle Hotel

Sunday, 17 July 2016, 14:00 - 17:30

Cosmin Munteanu (short bio)
Institute for Communication, Culture, Information, and Technology
University of Toronto Mississauga, Canada

Gerald Penn (short bio)
Department of Computer Science
University of Toronto, Canada

Objectives:

How Automatic Speech Recognition (ASR) and Speech Synthesis (or Text-To-Speech, aka TTS) systems work and why these are such computationally difficult problems;
Where speech recognition and text-to-speech synthesis are used in current commercial interactive applications;
What the usability issues surrounding speech-based interaction systems are, particularly in mobile and pervasive computing;
What the challenges in enabling speech as a modality for mobile interaction are;
What the current state-of-the-art in ASR and TTS research is;
What the differences are between the claimed accuracy of commercial speech recognition systems and the needs of mobile interactive applications;
What the difficulties are in evaluating the quality of speech synthesis systems, particularly from a usability and user perspective;
What opportunities exist for HCI researchers in terms of enhancing systems' interactivity by enabling speech.

Content and Benefits:

HCI research has long been dedicated to better and more naturally facilitating information transfer between humans and machines. Unfortunately, our most natural form of communication, speech, is also one of the most difficult modalities to be understood by machines. Despite significant recent advances towards understanding speech, HCI has been relatively timid in embracing this modality as a central research focus - partly due to the relatively discouraging accuracy of speech understanding in some genres (exaggerated claims from the industry notwithstanding), but also due to the intrinsic difficulty of designing and evaluating speech and natural language interfaces. On the engineering side, improving speech technology with respect to largely arbitrary measures of performance has led to systems that deviate from user-centered design principles, and that fail to consider usability or usefulness.

The goal of this course is to inform the HCI community of the current state of speech and natural language research, to dispel some of the myths surrounding speech-based interaction, as well as to provide an opportunity for researchers and practitioners to learn more about how speech recognition and speech synthesis work, their limitations, and how they could be used to enhance current interaction paradigms.

Our approach is two-fold: present new concepts to the audience, and foster discussions and exchange of ideas. Slides are used to introduce the main points, while videos and audio clips are played to illustrate examples. After each main concept is presented, time is allocated for interaction with the audience.

Target Audience:

The course will benefit HCI researchers and practitioners without strong expertise in speech recognition or text-to-speech synthesis who still believe in fulfilling HCI's goal of developing methods and systems that allow humans to interact naturally with increasingly ubiquitous mobile technology, but who are disappointed with the lack of success in using speech and natural language to achieve this goal.

No prior technical experience is required of participants. The classroom activities will be conducted using the participants' smartphones (Android or iPhone), but only the built-in phone functions will be used - no software download will be required. Participants will work in small groups, ensuring that even participants without a smartphone are able to fully contribute.

Brief Biographical Sketches:

Gerald Penn is a Professor of Computer Science at the University of Toronto, specializing in mathematical linguistics and spoken language processing. His lab played a pivotal role in the invention of neural-network-based acoustic models, which are now standard in speech recognition systems, and specializes in human-subject interaction with speech-enabled devices. He is a senior member of both the IEEE and AAAI, and a recipient of the Ontario Early Researcher Award.

Cosmin Munteanu is an Assistant Professor at the Institute for Communication, Culture, Information, and Technology (University of Toronto at Mississauga), and Associate Director of the Technologies for Aging Gracefully lab. Until 2014 he was a Research Officer with the National Research Council of Canada. His area of expertise is at the intersection of Human-Computer Interaction, Automatic Speech Recognition, Natural Language Processing, Mobile Computing, and Assistive Technologies. He has extensively studied the human factors of using imperfect speech recognition systems, and has designed and evaluated systems that improve humans' access to and interaction with information-rich media and technologies through natural language. Cosmin's multidisciplinary interests include speech and natural language interaction for mobile devices, mixed reality systems, learning technologies for marginalized users, assistive technologies for older adults, and ethics in human-computer interaction research.

Last revision date: August 22, 2025 by web@hcii2016.org