CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Abstract

In this study, we present a deep learning-based speech signal-processing mobile application, known as CITISEN, which can perform three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC). For SE, CITISEN can effectively reduce noise components from speech signals and accordingly enhance their clarity and intelligibility. When it encounters noisy utterances with unknown speakers or noise types, the MA function allows CITISEN to effectively improve the SE performance by adapting an SE model with a few audio files. Finally, for the BNC, CITISEN converts the current background noise to different background noise. The novel BNC function can evaluate SE performance under specific conditions, cover up people’s tracks, and provide entertainment. The experimental results confirmed the effectiveness of performing SE, MA, and BNC functions through objective evaluation and subjective listening tests. The promising results revealed that the developed CITISEN mobile application can be potentially used as a front-end processor for various speech-related services, such as voice communication, assistive hearing devices, and virtual reality headsets. In addition, CITISEN can be used as a platform for utilizing and evaluating state-of-the-art deep-learning-SE models, and can flexibly extend the models to address various noise environments and users. Finally, the proposed BNC function can be used not only in CITISEN, but also for data augmentation and to promote the development of an acoustic scene classification (ASC) model.

SE, BNC, and MA functions in CITISEN.

The background nosie conversion (BNC) function

BNC is a new topic in the field of speech processing. This idea is similar to the changing background of an image or video. With the BNC, users can artificially convert the background noise of their speech signal to another specified noise. The applications of the BNC function include: (1) evaluating SE performance under practical conditions when multiple speakers discuss online with a personal SE device. (2) covering up people’s tracks by converting the original environment noises to noises from other places when a positioning system is unavailable or not being used because of limited access to the technology or the lack of intention. (3) entertainment purposes, such as adding background music or sound effects.

In this study, we present human and machine evaluations of the BNC function. Specifically, human evaluation was performed by conducting a listening test, whereas the machine evaluation was performed using an ASC model. The listening test results indicated that the BNC function can convert the background noise while maintaining the clarity and intelligibility of the converted speech signal. The machine evaluation experiments show that the ASC embeddings of clean speech signals with a new noise are close to those of enhanced speech signals with the same converted noise. Therefore, the BNC function can serve as a data augmentation method for the ASC model in the condi- tion that clean speech signals are unavailable.

See Paper for more detail.