Brazilian Portuguese Speech Recognition Using Wav2vec 2.0

Published in International Conference on Computational Processing of the Portuguese Language, 2021

Recommended citation: Stefanel Gris, L. R., Casanova, E., Oliveira, F. S. D., Silva Soares, A. D., & Candido Junior, A. (2022, March). "Brazilian Portuguese Speech Recognition Using Wav2vec 2.0". In International Conference on Computational Processing of the Portuguese Language (pp. 333-343). Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-030-98305-5_31

Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe a spoken sentence into a sequence of words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese. In this sense, this work presents the development of a public Automatic Speech Recognition system using only openly available audio data, obtained by fine-tuning the Wav2vec 2.0 XLSR-53 model, pre-trained on many languages, on Brazilian Portuguese data. The final model presents a Word Error Rate of 11.95% (Common Voice Dataset). To the best of our knowledge, this is 13% less than the best open Automatic Speech Recognition model available for Brazilian Portuguese, which is a promising result for the language. In general, this work validates the use of self-supervised learning techniques, in particular the Wav2vec 2.0 architecture, in the development of robust systems, even for languages with little available data.
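For readers unfamiliar with the metric reported above, Word Error Rate is conventionally the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function name `word_error_rate` and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper): splits on whitespace and
    applies no text normalization, which real evaluations usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example sentences: one substituted word out of five.
    print(word_error_rate("o gato comeu o peixe", "o gato come o peixe"))
```

Under this definition, a Word Error Rate of 11.95% means roughly 12 word-level errors (substitutions, deletions, or insertions) per 100 reference words.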

Download paper here

Bibtex:

@InProceedings{10.1007/978-3-030-98305-5_31,
  author="Stefanel Gris, Lucas Rafael and Casanova, Edresson and de Oliveira, Frederico Santos and da Silva Soares, Anderson and Candido Junior, Arnaldo",
  editor="Pinheiro, Vl{\'a}dia and Gamallo, Pablo and Amaro, Raquel and Scarton, Carolina and Batista, Fernando and Silva, Diego and Magro, Catarina and Pinto, Hugo",
  title="Brazilian Portuguese Speech Recognition Using Wav2vec 2.0",
  booktitle="Computational Processing of the Portuguese Language",
  year="2022",
  publisher="Springer International Publishing",
  address="Cham",
  pages="333--343",
  abstract="Deep learning techniques have been shown to be efficient in various tasks, especially in the development of speech recognition systems, that is, systems that aim to transcribe an audio sentence in a sequence of written words. Despite the progress in the area, speech recognition can still be considered difficult, especially for languages lacking available data, such as Brazilian Portuguese (BP). In this sense, this work presents the development of an public Automatic Speech Recognition (ASR) system using only open available audio data, from the fine-tuning of the Wav2vec 2.0 XLSR-53 model pre-trained in many languages, over BP data. The final model presents an average word error rate of 12.4{\%} over 7 different datasets (10.5{\%} when applying a language model). According to our knowledge, the obtained error is the lowest among open end-to-end (E2E) ASR models for BP.",
  isbn="978-3-030-98305-5"
}