Research Article - (2024) Volume 13, Issue 2

Large Scale Speech Recognition for Low Resource Language Amharic, an End-to-End Approach
Yohannes Ayana Ejigu1* and Tesfa Tegegne Asfaw2
 
1Department of Artificial Intelligence and Data Science, Bahir Dar Institute of Technology, Bahir Dar University, Bahir Dar, Ethiopia
2Department of Computer Science, Bahir Dar Institute of Technology, Bahir Dar University, Bahir Dar, Ethiopia
 
*Correspondence: Yohannes Ayana Ejigu, Department of Artificial Intelligence and Data Science, Bahir Dar Institute of Technology, Bahir Dar University, Bahir Dar, Ethiopia, Email:

Received: 15-Feb-2024, Manuscript No. SIEC-24-24904; Editor assigned: 19-Feb-2024, Pre QC No. SIEC-24-24904 (PQ); Reviewed: 04-Mar-2024, QC No. SIEC-24-24904; Revised: 12-Mar-2024, Manuscript No. SIEC-24-24904 (R); Published: 21-Mar-2024, DOI: 10.35248/2090-4908.24.13.357

Abstract

Speech recognition, or Automatic Speech Recognition (ASR), is a technology designed to convert spoken language into text using software. However, conventional ASR methods involve several distinct components, including language, acoustic, and pronunciation models with dictionaries. This modular approach can be time-consuming and may influence performance. In this study, we propose a method that streamlines the speech recognition process by incorporating a unified Recurrent Neural Network (RNN) architecture. Our architecture integrates a Convolutional Neural Network (CNN) with an RNN and employs a Connectionist Temporal Classification (CTC) loss function.

Key experiments were carried out on a dataset comprising 576,656 valid sentences, with erosion techniques applied to the input spectrograms. Evaluation of model performance, measured by the Word Error Rate (WER) metric, demonstrated remarkable results, achieving a WER of 2%. This approach has significant implications for the field of speech recognition, as it alleviates the need for labor-intensive dictionary creation, enhancing the efficiency and accuracy of ASR systems and making them more applicable to real-world scenarios.

For future enhancements, we recommend the inclusion of dialectal and spontaneous data in the dataset to broaden the model's adaptability. Additionally, fine-tuning the model for specific tasks can optimize its performance for targeted objectives or domains, further enhancing its effectiveness in those areas.

Keywords

Automatic speech recognition; Convolutional neural network; Connectionist temporal classification; End-to-end; Neural network; Erosion; Recurrent neural network

Introduction

Speech recognition, often known as Automatic Speech Recognition (ASR), computer speech recognition, or speech-to-text, is a capability enabling software to transform spoken language into written text. Although it is occasionally mistaken for voice recognition, speech recognition specifically concentrates on converting speech from a spoken to a textual format, distinguishing it from voice recognition, which aims solely to identify the voice of a particular individual.

Different speech technology applications are being used by a wide range of industries today, which helps both businesses and consumers save time and even lives.

Various methods are used to create automatic speech recognition, including Dynamic Time Warping (DTW), the Hidden Markov Model (HMM), the Dynamic Bayesian Network (DBN), Artificial Neural Networks (ANN), and Deep Neural Networks (DNN) [1]. HMMs and neural networks are the most widely used methods for speech recognition in recent times. An HMM works by breaking down the speech signal into a sequence of states and then using the acoustic properties of the speech to determine the likelihood of a sequence of phones (Kebebew, 2010).

Although neural networks have significantly assisted automatic speech recognition, they make up only one portion of a complicated pipeline [2]. As in conventional computer vision, the first step of the pipeline is the extraction of input features. Common methods include vocal tract length normalization and Mel-scale filter banks (with or without a further transform into cepstral coefficients) [3,4]. The neural networks are then trained to classify individual audio input frames, and their output distributions are converted into emission probabilities for a Hidden Markov Model (HMM).

Consequently, the objective function used to train the networks differs considerably from the actual performance metric, sequence-level transcription accuracy. This is exactly the kind of inconsistency that end-to-end learning is designed to avoid. The fact that a considerable improvement in frame accuracy can yield only a small improvement, or even a deterioration, in transcription accuracy puzzles researchers. Another problem is that the frame-level training targets must be inferred from alignments obtained by the HMM. As a result, network retraining and HMM realignment are alternated in an uncomfortable iterative process to produce more exact targets. HMM-neural network hybrids have been trained directly with full-sequence training techniques such as maximum mutual information to increase the likelihood of accurate transcription [5]. However, these methods can only be used to retrain a system that has already been trained at the frame level, and they necessitate the careful adjustment of several hyperparameters, often many more than for deep neural networks. The targets provided to the networks are often phonetic, despite the fact that the transcriptions used to train speech recognition systems are lexical. To translate words into phoneme sequences, a pronunciation dictionary is required. Such dictionaries require a lot of human effort to create and often have a great impact on performance [6]. Another source of expert information, "state tying," is required to reduce the number of target classes once coarticulation effects are taken into account by multi-phone contextual models, which adds another layer of complexity.

Existing Amharic speech recognition systems are composed of several different components, including a feature extraction module, an acoustic model, a language model, and a decoder. These systems require significant amounts of data and manual feature engineering, which can be time-consuming and labor-intensive. On the contrary, end-to-end speech recognition systems use a single neural network to map an input speech signal directly to an output transcript. These systems require less manual feature engineering and can be trained on raw speech signals, making them more efficient and effective.

Speech recognition has been a challenging problem in the field of artificial intelligence for decades, and traditional systems rely on complex pipelines of feature extraction, acoustic modeling, and language decoding. However, recent advances in deep learning have allowed for the development of end-to-end speech recognition models that can directly transcribe speech to text without the need for intermediate steps.

The system proposed in this research replaces as much of the speech pipeline as possible with a single Recurrent Neural Network (RNN) architecture. While it is possible to directly transcribe unprocessed speech waveforms using RNNs, or features learned using restricted Boltzmann machines, the computational cost is significant and the performance typically lags behind conventional preprocessing. As a result, we have decided to use spectrograms as the minimum required preprocessing [7].

This research addresses practical issues in addition to scientific ones. The difficulty that hearing-impaired people have in understanding other people's speech makes it hard for them to interact with non-hearing-impaired people, which prevents them from learning about their surroundings. Our work also speeds up text entry by transcribing directly from the human voice, which helps people who struggle with precise word placement. We therefore came up with the notion of creating an end-to-end speech recognition model that transforms speech to text to overcome those difficulties and make life easier.

To the best of our knowledge, previous speech recognition models for the Amharic language were built using traditional speech recognition mechanisms based on acoustic, pronunciation, and language models, with relatively small amounts of data, without considering a single pipeline or automatic feature extraction [8]. Even currently available end-to-end trials in the Amharic language apply language and acoustic models separately, and their feature extraction methods do not utilize neural networks. We instead propose an end-to-end speech recognition mechanism that directly converts speech to text by replacing those traditional pipelines with a single RNN pipeline. Therefore, unlike traditional HMM-based speech recognition models, our model does not have those individual pipelines; for example, pronunciation dictionaries are not needed in our case. Our study thus saves the time spent preparing such dictionaries and sourcing domain expertise in certain areas. An end-to-end system directly maps a sequence of input acoustic features into a sequence of graphemes or words. We expect that our end-to-end speech recognition model will greatly simplify the complexity of traditional speech recognition. With the advances in neural networks, the need for manual labeling of language and pronunciation information is significantly reduced, as the network can autonomously learn and capture such information. According to the literature, there are two main structures for end-to-end speech recognition: attention-based models and CTC. We use CTC in our case, and it solves the alignment problem that occurs in traditional models [9-20].

Materials and Methods

We obtained 110 hours of audio from the Andreas Nürnberger Data and Knowledge Engineering Group and 20 hours from the ALFFA project. We also used 62 hours and 30 minutes of audio, clipped from VOA and DW radio broadcasts, from a previous project of our own. We augmented this noise-free read speech using time stretching, pitch shifting, speed perturbation, time and pitch scaling, dynamic range compression, filtering, time shifting, and amplitude scaling, bringing the total to 1,732 hours and 30 minutes. The audio data obtained from the Andreas Nürnberger Data and Knowledge Engineering Group includes transcriptions written in English (Latin) characters; to make them compatible with Amharic, we converted them into Amharic characters. Additionally, we utilized the transcriptions created previously for our own project, a process that involved carefully listening to the audio and converting it into written text. The text data serve as the ground truth for the deep learning model, enabling it to learn the relationship between the audio features and the corresponding text. We used a frame length of 256 samples and a frame step of 160 samples to extract audio features.
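As a concrete illustration of the augmentation step, the sketch below applies a few of the listed transformations with librosa. The librosa-based implementation and the parameter ranges are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np
import librosa

def augment_waveform(y, sr):
    """Generate a few augmented variants of a waveform (assumed parameter ranges)."""
    variants = []
    # Time stretching: speed up or slow down without changing pitch.
    variants.append(librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1)))
    # Pitch shifting: shift by up to two semitones in either direction.
    variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.randint(-2, 3)))
    # Time shifting: roll the waveform by up to 100 ms.
    variants.append(np.roll(y, int(np.random.uniform(0.0, 0.1) * sr)))
    # Amplitude scaling: rescale the overall signal level.
    variants.append(y * np.random.uniform(0.7, 1.3))
    return variants
```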

The data is converted to spectrograms using Short-Time Fourier Transform (STFT) for feature extraction. The resulting spectrograms are then used as input to the neural network. Overall, the research aims to contribute to the development of improved speech recognition and processing technologies.
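For illustration, the feature extraction can be written with tf.signal.stft using the frame settings reported above (frame length 256 samples, frame step 160 samples). The square-root compression and per-frame normalization shown here are common additional choices that we assume for the sketch rather than settings confirmed in this paper.

```python
import tensorflow as tf

def audio_to_spectrogram(waveform):
    """Convert a 1-D float waveform tensor into an STFT magnitude spectrogram
    using a frame length of 256 samples and a frame step of 160 samples."""
    stft = tf.signal.stft(waveform, frame_length=256, frame_step=160)
    spectrogram = tf.abs(stft)
    # Square-root compression and per-frame normalization (assumed choices).
    spectrogram = tf.math.pow(spectrogram, 0.5)
    mean = tf.math.reduce_mean(spectrogram, axis=1, keepdims=True)
    std = tf.math.reduce_std(spectrogram, axis=1, keepdims=True)
    return (spectrogram - mean) / (std + 1e-10)
```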

Training and validation phases

The training phase is a critical step in developing our end-to-end speech recognition model. It involves training a neural network model, specifically a combination of Recurrent Neural Network (RNN) variants and a Convolutional Neural Network (CNN), to recognize speech patterns and convert them into text. The training data, which consist of a mixture of noise-free and noisy speech, are used to adjust the model's parameters and improve its accuracy. The training process continues until the model achieves satisfactory accuracy on the training data [21-27].

Following the training phase, the validation phase is conducted to evaluate the system's performance on unseen data. For this purpose, a separate validation dataset, derived from the training data, is used. The goal of the validation phase is to monitor the system's performance and prevent overfitting, a situation where the model performs well on the training data but poorly on new, unseen data. The accuracy of the predicted transcriptions on the validation data is measured using the Word Error Rate (WER) metric. The model's hyperparameters, such as the learning rate, number of layers, and number of neurons, are adjusted during the validation phase to improve accuracy on the validation data.

Building the deep learning model

To build the Amharic speech recognition model, a deep learning algorithm is applied to the collected and processed data. The model is based on a hybrid approach that combines a CNN with an RNN and utilizes a Connectionist Temporal Classification (CTC) loss function.

The model architecture consists of several key components. The input to the model is a spectrogram representation of the audio data. The input is passed through a series of convolutional layers, which apply filters to extract relevant features from the spectrogram. Batch normalization and ReLU activation functions are used after the convolutional layers to enhance network performance [28-35].

After the convolutional layers, the output is fed into bidirectional GRU layers. These layers capture temporal dependencies by processing the sequence in both forward and backward directions. The outputs of the GRU layers are concatenated and passed through a fully connected layer. The model's training is guided by the CTC loss function, which aligns the predictions with the target labels without requiring a frame-level alignment of the input data.
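A CTC loss of this kind can be expressed with the standard Keras utility tf.keras.backend.ctc_batch_cost, as in the sketch below; this is an assumed but representative formulation, not necessarily the exact code used in our implementation.

```python
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    """CTC loss sketch. y_true: padded label index sequences, shape (batch, max_label_len).
    y_pred: per-frame softmax outputs, shape (batch, time_steps, vocab_size + 1 blank)."""
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")
    # Broadcast the sequence lengths to one entry per batch element.
    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```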

Model implementation

The implementation of the model uses the TensorFlow framework. The input to the model is a variable-length sequence of spectrogram frames, reshaped to include an additional channel dimension for the 2D convolutional layers. The customized CNN architecture includes two convolutional layers that extract useful features from the input audio spectrogram.

The reshaped output of the second layer is prepared for the recurrent layers by collapsing the height and width dimensions into a single dimension. This ensures proper processing of the features in the recurrent layers.

The recurrent layers consist of Bidirectional Gated Recurrent Units (BiGRUs) with tanh activation functions and sigmoid recurrent activations. The number of units in each GRU is specified by a configurable argument. Dropout is applied after each bidirectional layer, except for the last one.

The combination of convolutional layers, batch normalization, and ReLU activation functions in the CNN architecture helps the model to learn useful local features from the input spectrogram, which are then used by the recurrent layers to generate transcriptions of the input speech signal [36-42].
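A minimal TensorFlow/Keras sketch of this architecture is given below. Only the overall structure follows the description above (two convolutional blocks with batch normalization and ReLU, a reshape, stacked bidirectional GRUs with tanh and sigmoid activations, dropout after every BiGRU except the last, and a softmax output over the character set plus the CTC blank); the specific filter sizes, strides, number of recurrent layers, and unit counts are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_asr_model(input_dim, output_dim, rnn_layers=5, rnn_units=512, dropout_rate=0.5):
    """CNN + BiGRU acoustic model sketch (layer sizes are assumptions, not the paper's values)."""
    spectrogram = layers.Input(shape=(None, input_dim), name="spectrogram")
    # Add a channel dimension so 2-D convolutional layers can be applied.
    x = layers.Reshape((-1, input_dim, 1))(spectrogram)
    # Two convolutional blocks, each followed by batch normalization and ReLU.
    x = layers.Conv2D(32, kernel_size=(11, 41), strides=(2, 2), padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(32, kernel_size=(11, 21), strides=(1, 2), padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Collapse the frequency and channel dimensions into a single feature dimension.
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)
    # Stacked bidirectional GRUs; dropout after every layer except the last.
    for i in range(rnn_layers):
        x = layers.Bidirectional(
            layers.GRU(rnn_units, activation="tanh", recurrent_activation="sigmoid",
                       return_sequences=True),
            merge_mode="concat")(x)
        if i < rnn_layers - 1:
            x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(rnn_units * 2, activation="relu")(x)
    # Distribution over the Amharic character set plus the CTC blank symbol.
    outputs = layers.Dense(output_dim + 1, activation="softmax")(x)
    return Model(spectrogram, outputs, name="amharic_asr")
```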

Overall, the model undergoes training and validation phases, with adjustments made to hyperparameters during the validation phase to improve accuracy. The implementation of the model involves the use of convolutional and recurrent layers to process the input spectrograms, along with appropriate reshaping and activation functions to facilitate feature extraction and modeling of temporal dependencies.

The architecture of the proposed model is presented in Figure 1 below:


Figure 1: Proposed architecture of the model.

Evaluation metric

WER is a widely used metric for evaluating speech recognition systems. It measures the percentage of incorrectly recognized words compared to the reference transcription. Substitution, deletion, and insertion errors are considered, and WER is calculated by summing these errors and dividing by the total number of words in the reference. A lower WER indicates better performance and allows for model comparison and hyperparameter tuning.
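The metric can be computed with a word-level edit distance, as in the sketch below; our evaluation code may differ in details such as text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference and first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```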

Results and Discussion

We evaluated the performance of the speech recognition model using the Word Error Rate (WER), calculated as the percentage of words incorrectly recognized by the system. The WER on the full dataset was 2%, as presented in Figure 2 below:


Figure 2: Training and validation loss of the model.

In the context of our end-to-end speech recognition model, the x-axis of the plot in Figure 2 represents the number of training epochs, that is, the number of times the entire training dataset is fed to the model for learning. The y-axis represents the loss, a measure of how well the model predicts the correct output for a given input. The loss function used in speech recognition tasks typically quantifies the difference between the predicted output and the actual output for a given input.

At the beginning of the training process, the loss value is usually high as the model has not yet learned to accurately predict the output of the input data. As the model is trained on more data and the number of epochs increases, the loss gradually decreases, indicating that the model is becoming better at predicting the output. This decrease in loss can be attributed to the model learning patterns and features in the training data, which allows it to make better predictions.

In the plotted graph, we can observe that the loss value gradually stabilizes and converges to a low value as the training progresses. This indicates that the model has learned to accurately predict the output of the input data and has reached a state of convergence. The point at which the loss stabilizes and converges can vary depending on various factors, such as the size and complexity of the dataset, the architecture of the model, and the training hyperparameters [43-50].

Although training loss is a good measure of how well the model fits the training data, it is not always a good indicator of how well the model will perform on new and unseen data. Here the validation loss comes into play. The validation loss is calculated by evaluating the model's performance on a separate set of data that it has not seen during training. Typically, validation data is a subset of the entire dataset that is held out specifically for this purpose.

By comparing training and validation losses, we can gain valuable insight into the performance and generalizability of our speech recognition model. During training, if the model is overfitting to the training data, we might observe that the training loss continues to decrease while the validation loss starts to increase, indicating that the model is becoming less accurate in predicting the output for new data. However, if the model is underfitting the training data, we might observe that both the training and validation losses are high, indicating that the model is not learning the patterns and features in the data effectively. As we analyze our model, it becomes apparent that the loss metric consistently stabilizes and eventually reaches a low value as training progresses. This pattern of convergence demonstrates that the model has acquired the ability to make precise predictions for the given input data, and it has attained a state of convergence [51-58].

Assessing the performance and generalization ability of our speech recognition model relies heavily on examining the relationship between its training and validation losses. The training loss indicates how well the model fits the training data, whereas the validation loss signifies how well the model is likely to perform on fresh, previously unseen data. By keeping track of the training and validation losses, we can make informed choices regarding the model architecture and training hyperparameters, which in turn can enhance the model's performance and generalization ability.

Adam optimizer is preferred for end-to-end speech recognition due to its effectiveness in handling large-scale datasets and complex models. It combines the benefits of both AdaGrad and RMSProp algorithms by adapting the learning rate for each parameter individually. This adaptive learning rate adjustment helps in efficient optimization and convergence, making it suitable for speech recognition tasks [59-66].

A learning rate of 0.0001 is chosen for the end-to-end speech recognition to strike a balance between learning speed and accuracy. A lower learning rate allows for finer adjustments to the model's parameters, which can help in achieving better convergence and avoiding overshooting the optimal solution. It helps stabilize the training process and prevent drastic updates that may lead to suboptimal performance.

In the context of this end-to-end speech recognition, a drop rate of 0.5 typically refers to the dropout regularization technique. Dropout randomly sets a fraction of input units to 0 during training, which helps prevent overfitting and improves the model's generalization ability. A dropout rate of 0.5 means that, on average, half of the input units are dropped during training, providing regularization to the network. This helps prevent the model from relying too heavily on specific input features, leading to a more robust and accurate speech recognition system.
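The snippet below collects these training choices (Adam, learning rate 0.0001, dropout 0.5) in TensorFlow/Keras; the commented usage lines are hypothetical and assume the model, CTC loss, and tf.data pipelines from the earlier sketches, and the epoch count is an assumption.

```python
from tensorflow import keras

# Optimizer and regularization settings discussed above.
optimizer = keras.optimizers.Adam(learning_rate=1e-4)   # learning rate 0.0001
dropout = keras.layers.Dropout(rate=0.5)                # dropout rate 0.5 inside the network

# Hypothetical usage, assuming `model`, `ctc_loss`, `train_ds`, and `val_ds`
# are defined as in the earlier sketches:
# model.compile(optimizer=optimizer, loss=ctc_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=50)  # epoch count is an assumption
```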

Our work has demonstrated the effectiveness of our end-to-end speech recognition model on large amounts of data, with the model achieving exceptional accuracy. These results can have important implications for a variety of applications, such as improving accessibility for individuals with hearing impairments or enhancing the accuracy of voice-controlled devices in controlled environments [67-74].

We first evaluated the performance of our speech recognition model without erosion, obtaining a WER of 2.1%, which is already remarkably good. When we then applied the erosion technique to enhance the spectrogram representation of the input audio, we achieved an impressive WER of 1.9%. This indicates that our model excels at accurately recognizing speech in clean environments.
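For reference, grey-scale morphological erosion can be applied to a spectrogram with scipy.ndimage, as sketched below; the structuring-element size is an assumption, since it is not reported here.

```python
import numpy as np
from scipy.ndimage import grey_erosion

def erode_spectrogram(spectrogram: np.ndarray, size=(2, 2)) -> np.ndarray:
    """Apply grey-scale erosion to a (time x frequency) magnitude spectrogram.
    The structuring-element size is illustrative, not the value used in this work."""
    return grey_erosion(spectrogram, size=size)

# Hypothetical usage on a spectrogram array `spec`:
# eroded = erode_spectrogram(spec)
```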

The WER of this model and its predictions on unseen validation data, obtained on an RTX800 NVIDIA GPU, are presented in Figure 3 below:


Figure 3: WER and predictions for unseen validation data, obtained on an RTX800 NVIDIA GPU.

Error analysis

We analyzed the errors made by the model on the test sets to identify commonly occurring errors.

The model performed well, with a WER of 2%, and performed exceptionally well on the test split of the dataset. However, during our evaluation of the model's performance on newly recorded audio from a natural environment, we noticed that it had difficulty accurately predicting characters that possess similar spectrogram representations. During the error analysis, we observed that the model occasionally exhibited character swaps in its transcriptions. Specifically, certain characters were substituted with similar-looking characters, leading to errors in the output. One common swap observed was the substitution of the character "ከ" (ke) with "ቀ" (q'a).

These characters produce similar patterns in their spectrogram representations.

As a result, the model sometimes mistakenly replaced instances of "ከ" with "ቀ" in its transcriptions, introducing inaccuracies. Similarly, another swap involved the characters "ተ" (te) and "ጠ" (t'e), which share similar features in the spectrogram representation of their audio. Consequently, the model occasionally misinterpreted "ተ" as "ጠ" and vice versa, leading to incorrect transcriptions.

Another notable swap occurred between the characters "ች" (ch') and "ጭ" (tch'). These characters bear resemblance in their spectrogram structure, which posed a challenge for the CNN. As a result, the model occasionally confused "ች" with "ጭ," resulting in errors in the transcribed text.

Conclusion

This study addresses the challenges of traditional Automatic Speech Recognition (ASR) methods by proposing an approach that utilizes a single Recurrent Neural Network (RNN) architecture. The objective was to streamline the speech recognition pipeline and improve the efficiency and accuracy of the system.

The conventional ASR pipeline often requires multiple separate components, such as language, acoustic, and pronunciation models with dictionaries, resulting in time-consuming processes and performance limitations. Using the power of RNNs, our proposed end-to-end system significantly simplifies this pipeline. 

Our research findings indicate that applying erosion to the spectrograms has a positive effect on speech recognition and enhances model performance, although the improvement is not significant.

In building an end-to-end speech recognition model for Amharic, we selected BiGRU as the preferred deep learning algorithm. This decision was based on the observation that BiLSTM required approximately twice the processing time of BiGRU and involved a greater investment in computational resources.

Through rigorous evaluation using the Word Error Rate (WER) metric, our approach demonstrated impressive performance. We achieved a remarkable WER of 2%, showcasing the system's robustness in clean environments.

This research has important implications for the field of speech recognition. By reducing the need for manual efforts in creating dictionaries and integrating multiple models, our approach not only saves time but also enhances the practicality of ASR systems for real-world applications. The efficiency and accuracy improvements brought forth by our end-to-end RNN-based architecture pave the way for more accessible and effective speech recognition solutions.

This research successfully achieved the main objective of developing an end-to-end speech recognition model for the Amharic language using deep learning. The architecture of the model combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) and utilizes a Connectionist Temporal Classification (CTC) loss function.

Institutional Review Board (IRB) statement

Ethical review and approval were not required for this study as it did not involve human or animal subjects. This research was conducted as part of the thesis research at Bahir Dar Institute of Technology.

References

Citation: Ejigu YA, Asfaw TT (2024) Large Scale Speech Recognition for Low Resource Language Amharic, An End-to-End Approach. Int J Swarm Evol Comput. 13:357.

Copyright: © 2024 Ejigu YA. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.