Text Generation is a subfield of Natural Language Processing (NLP) which attempts to teach machines to automatically generate texts that are indistinguishable from texts written by humans.

Speech & Audio in NLP is a subfield of Multimodality that deals with the processing of spoken language and audio data in conjunction with textual data. It involves developing algorithms and models that can recognize and transcribe speech and audio, understand and interpret their meaning, and generate natural language responses.

Multilinguality is a subfield of Natural Language Processing (NLP) that refers to the ability of machines to understand, process, and generate natural language text in multiple languages. Multilinguality is concerned with addressing the challenges that arise due to the linguistic and structural variations that exist across different languages. This includes variations in word order, syntax, grammar, or vocabulary.

Self-supervised Learning is a type of machine learning where the model learns to predict a part of the input data from other parts of the same input data. It does not require explicit labels provided by humans. Instead, it uses the structure of the data itself to generate labels. For example, in NLP, a model might be trained to predict the next word in a sentence, using the previous words as input, thereby learning the syntax, semantics, and other language rules.
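
The next-word example above can be made concrete with a small sketch. It is illustrative only: the function name `next_word_examples` and the `context_size` parameter are assumptions introduced here, and real systems work on subword tokens or audio frames rather than whitespace-split words.

```python
def next_word_examples(sentence, context_size=3):
    """Turn one unlabeled sentence into (context, target) training pairs.

    The "labels" are simply the words that follow each context window,
    so no human annotation is needed.
    """
    words = sentence.split()
    examples = []
    for i in range(1, len(words)):
        # Use up to `context_size` preceding words as the model input.
        context = tuple(words[max(0, i - context_size):i])
        examples.append((context, words[i]))
    return examples


pairs = next_word_examples("the cat sat on the mat", context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

Every pair is derived from the raw text itself, which is what allows this kind of training to scale to data that has never been transcribed or annotated.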

Responsible & Trustworthy NLP is a subfield of Natural Language Processing (NLP) concerned with methods that put fairness, transparency, trustworthiness, explainability, accountability, and ethics at their core. It involves considering the societal impact of NLP applications and ensuring that they do not perpetuate or amplify biases or discrimination. Responsible NLP also involves developing methods to protect the privacy and security of user data and to mitigate the risks of unintended consequences or misuse of NLP systems.

Multimodality is a subfield of Natural Language Processing (NLP) that refers to the capability of a system or method to process input of different types or “modalities”, such as natural language text, speech, audio, images, video, and programming languages in NLP applications. It involves developing algorithms and models that can process and analyze information from multiple modalities, and integrate them to form a unified representation of the input.

Automatic Speech Recognition (ASR), or Speech-to-Text (STT), is a technology that converts spoken language into written text. It is a field that focuses on the development of systems and algorithms capable of transcribing spoken language accurately. ASR systems are designed to understand and interpret human speech, allowing computers to process and analyze spoken words.

Natural Language Processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Low-Resource NLP is a subfield of Responsible NLP concerned with developing algorithms and models that work under resource constraints, such as scarce training data or low-resource languages.

<p>arXiv:2208.03067v2 [cs.CL] 4 Oct 2022</p> <h1>Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning</h1> <h2>Abstract</h2> <p>Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models, and pre-training on unsupervised data can help improve speech recognition quality for many African languages. Index Terms: Africa, multilingual speech recognition, self-supervised learning, low-resource languages</p> <h2>1. Introduction</h2> <p>Africa is home to over 2,000 languages, but almost none of them have widely available automatic speech recognition (ASR) systems. The data required to train such systems is also only available for a few of the major languages, and not in the quantities typically available for high-resource languages. As more people in Africa come online in the 2020s and beyond, the need for high quality ASR systems for African languages becomes more pressing, and along with it the need to develop data sets, model types and training architectures which are optimized for these languages. Two strands in ASR research which may provide solutions are multilingual end-to-end (E2E) modeling [1, 2] and self-supervised learning (SSL) [3, 4]. These techniques have been applied in the low-resource scenario with some encouraging results [5, 6, 7, 8], and pre-training has also been shown to help with accented speech [9]. 
However, efforts to develop ASR for African languages have so far mostly focused on data collection and model training for single languages or a small number of related or similar languages, including Amharic [10, 11]; Dinka [11]; South African languages [12, 13, 14]; Nigerian English [15]; Hausa [16, 17]; Yoruba [18, 19, 20, 21]; Swahili [11, 22, 23]; Wolof [24]; Somali [25]; Igbo & Fon [26]; Bemba [27] and Akan [28]. Alongside the data collections initiated for these projects, there are also a few open source projects which aim to improve African language data coverage. Mozilla Common Voice (MCV) provides a web-based platform for donating audio data [29], and has released data sets for Hausa, Kinyarwanda, Kabyle and Luganda, with Igbo and Swahili coming soon. Notably, and in large part thanks to the efforts of Digital Umuganda, an artificial intelligence company based in Kigali, MCV provides over 2,000 hours of supervised data for Kinyarwanda, the second largest resource for a language on their platform after English. The South African Centre for Digital Language Resources (SADiLaR) also provides access to various resources, including the NCHLT speech corpus of the South African languages [30]. There are other efforts under the umbrella of the Lacuna Fund to create even more data sets, but there is still lots to be done in this space. In general, technology companies have not yet invested significantly in the development of ASR systems for African languages. For a number of years, Google has supported voice search and voice typing for four African languages: Afrikaans, Amharic, Swahili and Zulu, but many more languages are spoken in Africa. In this paper, we report on our work to develop or improve ASR systems for 15 African languages: Akan, Hausa, Igbo, Ndebele, Northern Sotho, Kinyarwanda, Swati, Southern Sotho, Swahili, Tswana, Tsonga, Venda, Xhosa, Yoruba and Zulu. This group of languages was selected for various reasons. 
Firstly, they are major languages in key regions and countries. Secondly, they share many features in common: they all use the Latin alphabet as their primary writing system, with the exception of Hausa they are all members of the Atlantic-Congo language family, and with the exception of Swahili they are all tonal languages. Finally, some kind of data is available for all of these languages, primarily from open source repositories or through data collections conducted by Google. Using the available open source and collected data, we trained classic hybrid ASR models, multilingual E2E models, and SSL models. The classic ASR models consist of dedicated connectionist temporal classification (CTC) acoustic models [31], pronunciation models and FST-based n-gram language models for each language. We also trained a multilingual Conformer model: a type of LSTM-based recurrent neural network transducer (RNN-T) [32] in which the recurrent encoder is replaced with convolution-augmented transformer layers [33]. For our first experiment with SSL, we pre-trained a model with unsupervised data using wav2vec 2.0 [34] and finetuned it for each of the languages, following [35]. We also experimented with pre-training on a larger high-resource unsupervised data set. Our results show that pooling the available data in multilingual models and pre-training on unsupervised data both yield improvements for African languages compared with classic ASR models. However, while much effort has been spent on creating open source data sets, we have found that the lack of high quality data, which is typically available for high-resource lan-</p> <h2>2. Languages</h2> <p>The 15 languages which are the focus of this paper are spoken across West, East and Southern Africa by an estimated 242 million people [36]. With the exception of Hausa, which is an Afroasiatic language, all the languages form part of the large Atlantic-Congo language family. 
Eleven of the 15 are also more closely related in the Bantu group. See Table 1 for the families and population estimates for each language. We include Hausa in our group despite its different lineage because it shares some common features with its Atlantic-Congo neighbours, most notably lexical and grammatical tone. In grammatical tone systems, each word has an inherent tonal pattern, and these patterns can also change to indicate grammatical features, for example tense and aspect on verbs [37]. All languages in the group except for Swahili exhibit lexical and/or grammatical tone. Another distinctive phonological feature common among several of the languages is the use of click consonants. These are especially prevalent in the Southern Bantu languages, probably as a result of contact with neighbouring Khoisan languages [38]. Xhosa has a particularly large set of 18 click consonants. All the languages use a form of the Latin script as their primary writing system. In some languages like Xhosa and Zulu, no special characters are used, and phonological features like clicks are represented using Latin letters like < c > and < x > . In other languages, the basic Latin set is supplemented with other characters and diacritics to represent tone and other phonological features. Commonly, high tone is marked with an acute accent, and low tone is marked with a grave accent on tone-bearing vowel and nasal graphemes. Less commonly, other characters like macron, circumflex or caron are used to mark mid tone [39]. Some languages like Hausa do not mark tone in the orthography. A summary of some phonological features and orthographic conventions used for the 15 languages is given in Table 2.</p> <h2>3. Data</h2> <p>Tables 3 and 4 give the volume of supervised and unsupervised data we used for each language. 
We have used available open source data where possible. For example, we have a million utterances and 1,406 hours of training data for Kinyarwanda thanks mostly to the open source repository provided by MCV, and for the South African languages, we rely almost entirely on the NCHLT speech corpus provided through SADiLaR; we only have supplementary data for Southern Sotho and Zulu. We split off approximately 20% of the original data set in each case and used that for evaluation of the various models. The supervised short form data was collected in two ways. The first is prompt-based collection, where contributors are shown written sentences (i.e. prompts) and record themselves reading them out loud. In the second type of collection, contributors are shown images of common objects and scenes and record themselves describing the image in a few words. These descriptions are then transcribed by other contributors. Supervised long form data was collected by identifying public videos on YouTube in the target language and transcribing them. The videos were manually identified by linguists at Google and verified as containing content in the target language by the transcribers. We applied the voice activity detection model used in [35] to segment the video. The unsupervised short form data was collected from voice search queries where users had opted in to help Google develop and improve its audio recognition technologies and the Google services that use them. The unsupervised long form data was collected by identifying public YouTube videos and extracting the audio from the video.</p> <h2>4. Experiments</h2> <p>We conducted our experiments with four settings. The details are as follows: (1) Classic ASR models. We developed grapheme-to-phoneme (G2P) rules to generate pronunciation lexicons, gathered text data and developed text normalization grammars to train n-gram language models, and trained custom CTC acoustic models for seven of the languages. 
We did not train classic models for Ndebele, Northern Sotho, Swati, Southern Sotho, Tswana, Tsonga, Venda, Xhosa or Yoruba, because at the time the NCHLT corpus was not available to us, and in the case of Yoruba because we had issues with data quality. (2) Multilingual Conformer. We trained a multilingual Conformer model – this is a modified LSTM-based RNN-T architecture in which the recurrent encoder is replaced with convolution-augmented transformer layers. The encoder is composed of 12 layers, with left 3 frames stacked 128-channel log-mel features for the 512 conformer model dimension. A streaming encoder is used, where each local self-attention layer looks at 23 left context and 0 right context. The model is trained with FastEmit [40] and Hybrid Autoregressive Transducer (HAT) factorization [41], where the latter technique enables better integration with an external language model. The 17417.53 prediction network uses a small embedding lookup architecture introduced in [42]. (3) Multilingual Pre-Training (MPT): Following Zhang et al.’s work on BigSSL [35], the unsupervised data listed in Table 4 are segmented for pre-training with the wav2vec 2.0 objective [34] and are finetuned on each of the 15 languages using the standard RNN-T loss as the downstream task. We used the 600M-parameter Conformer model, the same architecture reported in [35] for both pre-training and finetuning. (4) Larger Dataset Multilingual Pre-Training (LDMPT): This setting is similar to (3). The difference is that before we pre-trained using unsupervised data from the 15 languages, we first applied the same pre-training objective on a much larger high-resource language unsupervised data set. More specifically, we used 900,000 hours of segmented, unlabeled YT-U audio data from the set reported in [35]. 
The underlying assumption is that this would make the model generalize better, as the model sees more data with different phonetic distribution, even though there is a language mismatch between the high-resource languages and the African languages.</p> <h2>5. Evaluation</h2> <p>We evaluated the different model types on test sets composed of supervised short form utterances, which were split off from the acquired or collected data sets (see Section 3). Table 5 shows word error rates (WERs) for each type of model and setting.</p> <h2>6. Discussion</h2> <p>In general, the single-language classic ASR models have higher WERs compared with other techniques. This is likely due to the comparatively small amounts of training data available for each individual language. Issues with transcription quality, autogenerated G2P accuracy and scarce text data may also contribute to the higher WERs for classic models. Swahili and Zulu do not exhibit the same disparities as the other languages, likely because these languages have more training data and their lexicons and language models have been subject to greater attention, as Google has supported these two languages for a number of years. While we cannot draw direct comparisons between the Conformer and pre-training results, as the model types and sizes are quite different, there are some noteworthy patterns in the results. In general, the results for the pre-trained models are better than the Conformer model for the South African languages, where our only source of training data is the NCHLT corpus. Conversely, where our only source of short-form training data comes from image description tasks, as is the case for Akan, Hausa, Igbo and Yoruba, the Conformer model outperforms the pre-trained models, except in the case of Hausa. In cases where we have both an image description set and another train set, as is the case for Kinyarwanda and Southern Sotho, the pre-trained models show improvements over the Conformer model. 
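</p> <p>For readers unfamiliar with the metric compared throughout this section, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used for Table 5; the function name <code>word_error_rate</code> and the whitespace tokenization are assumptions, and any text normalization applied before scoring is not described in the paper.</p>

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (cat -> hat) and one deletion (down) over 4 words.
print(word_error_rate("the cat sat down", "the hat sat"))  # 0.5
```

<p>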
Comparing the two pre-training settings, we note that 7 languages show improvements if we first pre-train the model with high-resource language data rather than training solely on unsupervised data. This might be due to the fact that the unsupervised data is of inferior quality: many of the videos have background music and might also suffer from audio clipping issues. Potential solutions might be to filter out the speech-music mixed segments and apply some declipping preprocessing algorithms [43]. In all settings, WERs are higher for languages with only image description data. This suggests two things: (1) relying on this type of data alone is more likely to result in poorer quality models, and (2) the multilingual Conformer model seems to be more resilient to less reliable data. Looking into the issues with the image description data sets further, we see two types of problems. The first is that some of the utterances are one word in length, and this may present an issue when these utterances are seen in training. The second issue is that there are more discrepancies between the audio and transcriptions than we see with prompt-based corpora, which can be attributed to several factors: spontaneous speech contains more hesitations and other features which are hard to represent accurately in writing, and there is greater potential for transcribers to mishear or misunderstand the audio content. Issues with transcription quality are not necessarily limited to the image-based data sets. In all the data sets, there are issues with spelling variation and variation in the use of diacritics and special characters outlined in Table 2. 
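</p> <p>Part of the diacritic variation mentioned above is purely a matter of encoding: the same accented letter can be stored as one precomposed code point or as a base letter plus a combining mark. The sketch below, using Python's standard <code>unicodedata</code> module, shows how Unicode normalization makes such forms compare equal, and how tone-marking accents could be stripped to gauge their effect on scoring. Both helper names are hypothetical, the choice of grave and acute as the tone marks to remove is an assumption based on the conventions described in Section 2, and nothing here is reported as part of the authors' pipeline.</p>

```python
import unicodedata


def normalize_transcript(text):
    """Return the NFC form so equivalent accent encodings compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_tone_marks(text, tone_marks=("\u0300", "\u0301")):
    """Remove combining grave and acute accents, keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in tone_marks)
    return unicodedata.normalize("NFC", kept)


precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The raw strings differ, but their normalized forms are identical.
assert precomposed != decomposed
assert normalize_transcript(precomposed) == normalize_transcript(decomposed)

# The dot below (not a tone mark here) survives; the acute accent does not.
assert strip_tone_marks("\u1eb9\u0301") == "\u1eb9"
```

<p>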
We see the following kinds of variation: (1) alternative spellings for words; (2) alternative symbols instead of the standard ones, typically because the standard characters are not easily accessible on many keyboards or other input methods; (3) use of alternative diacritics, for example carons instead of circumflexes or vice versa; (4) variation in use of tone-marking diacritics: in some languages, while the standard orthography requires marking tone on every vowel and nasal, in practice tone marking is only used in cases where the meaning would otherwise be ambiguous, or tone is not marked at all. While the first three types of variation can be comparatively easily standardized with text normalization [44], deficient or non-existent tone marking is much harder to restore, as the tones associated with words can vary based on grammatical features, meaning that a contextual model is required for tone restoration, e.g. [45]. For languages where tone is marked inconsistently, this may play a significant role in the higher WERs we see for some of the 15 languages which are the subject of this paper.</p> <h2>7. Conclusions</h2> <p>We have shown that multilingual RNN-T models and self-supervised pre-training techniques can improve ASR quality for African languages. These are just two techniques among many that have been shown to be useful in the low-resource scenario. Other novel ASR modeling techniques which could help include federated learning and personalization, zero-shot learning, data augmentation using synthesized speech (though text-to-speech is generally not available for these languages either), and adding external LMs which would also enable techniques like second pass rescoring. We hope that this work stimulates further research on ASR for these and other languages of Africa.</p>
<h2>8. Acknowledgements</h2> <p>We would like to thank Parisa Hagani, Manasa Prasad, Isabel Leal, Neeraj Gaur, Brian Farris, Yun Zhu, Alëna Aksënova, Pierric Sans, Landis Baker, Mandy Jordan, Eoin Mahon, Clara Rivera and Wei Han for support and feedback on the work presented in this paper, and Françoise Beaufays and Pedro Moreno for executive support.</p> <h2>9. References</h2>
