Asian Research Thesis Index

Abstract

Time-aligned and labeled speech at the sub-word level is required to develop spoken language technology components. Determining the time boundaries of sub-word units of speech and labeling them is the speech segmentation problem. Manual labeling by humans is considered the most accurate approach, but it requires a significant amount of time when a large amount of speech has to be processed. The evidence that human labelers rely on comes from knowledge of acoustic phonetics and, at the most basic level, from spectrogram-based techniques. The hypothesis that computers can also segment speech automatically if they use the same evidence as human experts leads toward time-efficient automatic speech segmentation.

In this thesis, unsupervised automatic time-alignment of speech at the sub-word level is carried out based on the information that spectrograms carry. The speech spectrogram engineered in this thesis carries no information about vocal excitation and captures only the dynamics of the vocal tract. This novel feature utilizes both forward and inverse characteristics of the vocal tract (FICV) and is found suitable for the segmentation problem. Additionally, a framework has been developed to evaluate the suitability of a feature extraction technique for the speech segmentation task.

Speech segmentation is carried out on an indigenously developed Classical Arabic (CA) dataset, making this the first scheme of its kind for CA, which is an under-resourced language in speech technology. The performance of the FICV-based speech segmentation scheme is compared with standard unsupervised and supervised techniques and shown to be significantly better, both in terms of error rates and alignment accuracies. A reduction of 12.29% in error rates is achieved with the FICV-based feature when compared with the standard unsupervised technique.

Carrying out supervised segmentation requires a basic sub-word-level recognizer, which labels and aligns speech.
For this purpose, a Hidden Markov Model (HMM) based speech recognizer is trained. The acoustic modeling is carried out using a discriminative technique, which achieves recognition accuracies up to 4% higher than the non-discriminative technique. The thesis also verifies that using manually labeled data for training acoustic models can further improve recognition accuracies by 3-4%. In this regard, the thesis details the experimental steps, which can also serve as a guideline for developing an automatic speech recognizer for CA.
