Add/Update Thesis
Title*
Author's Name*
Supervisor's Name
Abstract
In audio source separation, the cocktail party problem is a classic example: a particular signal must be segregated from an audio mixture while the other signals are filtered out. This problem has been investigated for decades. In everyday life, a plethora of sources add their acoustic patterns to the environment, and segregating human speech from the resulting mixture is challenging. The challenge becomes harder in a monaural (single-microphone) setup, which essentially eliminates the spatiality of the target source. Human listeners possess a remarkable ability to segregate a specific sound source from a complex mixture; even a single ear suffices for a complex auditory scene. This process of separating a target sound source is called auditory scene analysis (ASA). ASA has recently received profound interest from audio researchers; however, emulating the same functionality on a computer is both imperative and challenging. A number of applications need an effective system with near human-like ability to segregate auditory signals, yet existing Computational Auditory Scene Analysis (CASA) systems leave many challenges in monaural speech segregation unhandled. This research work proposes a systematic, in-depth effort toward evolving a CASA framework for monaural speech segregation. In the first stage, peripheral analysis is performed to model an ASA-inspired time-frequency representation called the cochleagram. In the second stage, the system extracts ASA cues such as the fundamental frequency (F0), spectral peaks, and onsets/offsets. The cochleagram is further mapped to eight discrete clusters. These clusters become the foundation for producing morphed cochleagram versions, which are processed one by one with the rough F0 and spectral-peak estimates to iteratively stabilize and improve pitch estimation.
Based on the improved pitch estimate, the spectral peaks, and underlying ASA features (e.g. harmonicity and onsets/offsets), the system classifies speech versus non-speech interference and generates an ideal binary mask for target speech segregation. Finally, the target speech source is resynthesized from the masked time-frequency units of the cochleagram. Systematic evaluation shows that the proposed system for voiced speech segregation produces better results and is able to identify the majority of time-frequency (T-F) units in the cochleagram belonging to the target speech. The proposed system performs significantly better than existing standard voiced speech segregation techniques.
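The ideal binary mask mentioned in the abstract labels each time-frequency unit 1 when the target energy dominates the interference (local SNR above a criterion, commonly 0 dB) and 0 otherwise. The following is a minimal illustrative sketch of that idea, assuming per-unit target and interference energies are already available; the thesis derives these from the cochleagram, and the function and variable names here are hypothetical, not from the thesis itself.

```python
import numpy as np

def ideal_binary_mask(target_tf, noise_tf, lc_db=0.0):
    """Label each T-F unit 1 if its local SNR exceeds lc_db, else 0.

    target_tf / noise_tf: arrays of per-unit energies
    (frequency channels x time frames); lc_db is the local criterion.
    """
    eps = 1e-12  # avoid log of zero
    local_snr = 10.0 * np.log10((target_tf + eps) / (noise_tf + eps))
    return (local_snr > lc_db).astype(np.uint8)

# Toy 3x4 energy grids (channels x frames), purely illustrative.
target = np.array([[4.0, 1.0, 0.5, 2.0],
                   [0.2, 3.0, 3.0, 0.1],
                   [1.0, 1.0, 0.3, 5.0]])
noise = np.array([[1.0, 2.0, 1.0, 1.0],
                  [1.0, 1.0, 0.5, 1.0],
                  [1.0, 0.5, 1.0, 1.0]])

mask = ideal_binary_mask(target, noise)
# Units kept by the mask would feed resynthesis; the rest are zeroed.
segregated = target * mask
```

In a full CASA system the retained T-F units are passed back through the inverted gammatone filterbank to resynthesize the waveform; this sketch stops at the masking step.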
Subject/Specialization
Language
Program
Faculty/Department's Name
Institute Name
University Type
Public
Private
Campus (if any)
Institute Affiliation Information (if any)
City where institute is located
Province
Country
Degree Starting Year
Degree Completion Year
Year of Viva Voce Exam
Thesis Completion Year
Thesis Status
Completed
Incomplete
Number of Pages
Urdu Keywords
English Keywords
Link
Select Category
Religious Studies
Social Sciences & Humanities
Science
Technology
Any other information you want to share, such as the Table of Contents or Conclusion.
Your email address*