Evaluating the Impact of Voice Activity Detection on Speech Emotion Recognition for Autistic Children

Milling, Manuel and Baird, Alice and Bartl-Pokorny, Katrin D. and Liu, Shuo and Alcorn, Alyssa M. and Shen, Jie and Tavassoli, Teresa and Ainger, Eloise and Pellicano, Elizabeth and Pantic, Maja and Cummins, Nicholas and Schuller, Björn W. (2022) Evaluating the Impact of Voice Activity Detection on Speech Emotion Recognition for Autistic Children. Frontiers in Computer Science, 4. ISSN 2624-9898

Official URL: https://doi.org/10.3389/fcomp.2022.837269

Abstract

Individuals with autism are known to face challenges with emotion regulation, and express their affective states in a variety of ways. With this in mind, an increasing amount of research on automatic affect recognition from speech and other modalities has recently been presented to assist and provide support, as well as to improve understanding of autistic individuals' behaviours. Beyond the emotion expressed in the voice, the dynamics of verbal speech in autistic children can be inconsistent and vary greatly between individuals. The current contribution outlines a voice activity detection (VAD) system specifically adapted to autistic children's vocalisations. The presented VAD system is a recurrent neural network (RNN) with long short-term memory (LSTM) cells. It is trained on 130 acoustic Low-Level Descriptors (LLDs) extracted from more than 17 h of audio recordings, which were richly annotated by experts in terms of perceived emotion as well as occurrence and type of vocalisations. The data comprise recordings of 25 English-speaking autistic children undertaking a structured, partly robot-assisted emotion-training activity and were collected as part of the DE-ENIGMA project. The VAD system is further utilised as a preprocessing step for a continuous speech emotion recognition (SER) task, aiming to minimise the effects of potential confounding information, such as noise, silence, or non-child vocalisation. Its impact on the SER performance is compared to the impact of other VAD systems, including a general VAD system trained on the same data set, an out-of-the-box Web Real-Time Communication (WebRTC) VAD system, as well as the expert annotations. Our experiments show that the child VAD system achieves a lower performance than our general VAD system, trained under identical conditions, as we obtain receiver operating characteristic area under the curve (ROC-AUC) values of 0.662 and 0.850, respectively. The SER results show varying performances across valence and arousal depending on the utilised VAD system, with a maximum concordance correlation coefficient (CCC) of 0.263 and a minimum root mean square error (RMSE) of 0.107. Although the performance of the SER models is generally low, the child VAD system can lead to slightly improved results compared to other VAD systems and in particular the VAD-less baseline, supporting the hypothesised importance of child VAD systems in the discussed context.
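
For readers unfamiliar with the two SER metrics named in the abstract, the following is a minimal sketch of how CCC and RMSE are commonly computed for a sequence of continuous predictions against gold-standard annotations. It is not taken from the paper: the function names `ccc` and `rmse`, the use of population (1/n) statistics, and the choice to return 0.0 when the CCC denominator is zero are all assumptions made for illustration.

```python
import math


def ccc(pred, gold):
    """Concordance correlation coefficient (Lin's CCC), population statistics.

    Assumption: returns 0.0 when both sequences are constant and equal,
    where the coefficient is otherwise undefined.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("pred and gold must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # The squared mean difference penalises a systematic offset,
    # which plain Pearson correlation would ignore.
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    return 2 * cov / denom if denom > 0 else 0.0


def rmse(pred, gold):
    """Root mean square error between two equal-length sequences."""
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("pred and gold must be non-empty and of equal length")
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / n)


if __name__ == "__main__":
    # Hypothetical values: predictions track the shape of the annotations
    # but are shifted by a constant offset of 1.
    gold = [1.0, 2.0, 3.0]
    pred = [2.0, 3.0, 4.0]
    print(ccc(pred, gold), rmse(pred, gold))
```

In this toy case the Pearson correlation would be perfect, while the CCC is reduced by the constant offset. CCC rewards agreement in both shape and scale, and RMSE reports the average error magnitude, which is presumably why the two are reported together.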

Item Type:	Article
Subjects:	Universal Eprints > Computer Science
Depositing User:	Managing Editor
Date Deposited:	24 Jan 2023 04:53
Last Modified:	02 Jul 2024 12:28
URI:	http://journal.article2publish.com/id/eprint/734