Abstract

Large Scale Speech Recognition for Low Resource Language Amharic, an End-to-End Approach

Yohannes Ayana Ejigu* and Tesfa Tegegne Asfaw

Speech recognition, or Automatic Speech Recognition (ASR), is a technology that converts spoken language into text using software. Conventional ASR methods, however, involve several distinct components, including language, acoustic, and pronunciation models built with dictionaries. This modular approach is time-consuming to build and can limit performance. In this study, we propose a method that streamlines the speech recognition pipeline into a single end-to-end architecture. Our model combines a Convolutional Neural Network (CNN) front end with a Recurrent Neural Network (RNN) and is trained with the Connectionist Temporal Classification (CTC) loss function.

Key experiments were carried out on a dataset of 576,656 valid sentences, using erosion techniques. Model performance, measured by the Word Error Rate (WER) metric, reached a WER of 2%. This approach has significant implications for speech recognition: it eliminates the need for labor-intensive dictionary creation, enhances the efficiency and accuracy of ASR systems, and makes them more applicable to real-world scenarios.

For future work, we recommend including dialectal and spontaneous speech in the dataset to broaden the model's adaptability. Additionally, fine-tuning the model for specific tasks can optimize its performance for targeted objectives or domains.

Published Date: 2024-03-21; Received Date: 2024-02-15