Frederico S. Oliveira

Developer. AI Engineer

© 2020

BRSpeech TTS Dataset

The BRSpeech TTS Dataset is a Brazilian Portuguese dataset of audio and text files for training Text-to-Speech (TTS) models. It was developed at the Federal University of Goiás with the support of CyberLabs.Ai. The dataset is built from three sources: audiobooks available on the LibriVox website, recordings of the Brazilian constitution and several laws available on the website of the legislative chamber of Brazil, and speeches by former presidents available on the website of the library of the presidency of Brazil. It consists of 87,793 audio samples totaling 170 hours of audio. A transcription is provided for each audio sample. Audio samples vary in length from 0.2 to 30 seconds.

Words

There are a total of 79,027 words. There are 208 sentences consisting of a single word. The longest sentence has 51 words:

Atrelado à velocidade de avanço do rei inglês para Hastings é uma possibilidade que Haroldo pode não ter confiado nos condes Eduíno de Mércia e Morcar da Nortúmbria uma vez que o inimigo deles Tostig havia sido derrotado e ele recusou-se a levá-los junto com suas forças para o sul.

On average, there are 16 words per sentence. The following figure shows a histogram of the number of words per sentence.
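
Statistics like the ones in this section can be recomputed from the transcriptions. The following is a minimal sketch, assuming that words are separated by whitespace; the page does not state how words were counted, and the function name `word_count_stats` is hypothetical.

```python
from collections import Counter
from statistics import mean


def word_count_stats(sentences):
    """Return basic word-count statistics for a list of transcriptions.

    Assumption: a "word" is any whitespace-separated token.
    """
    counts = [len(sentence.split()) for sentence in sentences]
    return {
        "mean": mean(counts),                      # average words per sentence
        "max": max(counts),                        # length of the longest sentence
        "single_word": sum(1 for c in counts if c == 1),
        "histogram": Counter(counts),              # word count -> number of sentences
    }
```

The `histogram` entry maps each sentence length to the number of sentences with that length, which is the data a histogram plot would be drawn from.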

Duration

The files have an average duration of 7 seconds. The longest file is 30 seconds and the shortest is only 0.2 seconds. The following figure shows a histogram of file durations.

File Size

The files have an average size of 306.31 kbytes. The largest file is 1,702.04 kbytes and the smallest is 9.04 kbytes. The following figure shows a histogram of file sizes.
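
File size and duration are tied together by the audio format given in the Quality section (single channel, 16-bit PCM, 22050 Hz). The sketch below estimates the size of a WAV file from its duration. It assumes a plain 44-byte WAV header with no extra chunks, which the page does not confirm; the constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 22050    # from the Quality section
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel
HEADER_BYTES = 44         # assumption: canonical WAV header, no extra chunks


def expected_wav_bytes(duration_s):
    """Estimate the size in bytes of a WAV file with the given duration."""
    frames = round(duration_s * SAMPLE_RATE_HZ)
    return HEADER_BYTES + frames * CHANNELS * BYTES_PER_SAMPLE
```

For a 7-second file this gives 308,744 bytes, which is close to the reported average size if 1 kbyte is taken as 1,000 bytes.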

Number of Speakers

There are a total of 44 speakers, each identified by a two-character acronym. The dataset is highly unbalanced with respect to hours of speech per speaker. The speaker with the most hours is LN, and the speaker with the fewest hours is EF. The following figure shows the total hours for each speaker.

The following figure shows the percentage of the dataset's total hours contributed by each corpus.

Gender

The dataset consists of 44 speakers, 16 female and 28 male. The following figure shows the gender of each speaker. The same figure also shows the number of words in the vocabulary of each corpus.

The dataset is relatively balanced with respect to the gender of the speakers. The following figure shows the percentage by gender.

File Format

The transcripts are provided in a file named metadata.csv, which follows the LJSpeech Dataset standard. This file consists of one record per line, with fields delimited by the pipe character (0x7c). The fields are:

  1. File ID: this is the name of the corresponding .wav file
  2. Transcription: words spoken by the reader (UTF-8)
  3. Normalized Transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8).
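
A file in this layout can be read with Python's standard `csv` module. The following is a minimal sketch, assuming every record has exactly the three fields listed above; the `Record` type and the `read_metadata` function are hypothetical names, not part of the dataset.

```python
import csv
from typing import NamedTuple


class Record(NamedTuple):
    file_id: str        # field 1: File ID
    transcription: str  # field 2: Transcription
    normalized: str     # field 3: Normalized Transcription


def read_metadata(stream):
    """Parse pipe-delimited metadata records from an open text stream."""
    # QUOTE_NONE keeps any quotation marks in the transcriptions as literal text.
    reader = csv.reader(stream, delimiter="|", quoting=csv.QUOTE_NONE)
    return [Record(*row) for row in reader if row]
```

A typical call would be `read_metadata(open("metadata.csv", encoding="utf-8", newline=""))`. A line with more or fewer than three fields raises a `TypeError`, which makes malformed records easy to spot.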

Quality

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz. The corpora vary in quality. An evaluation was carried out and the corpora were rated from 1 (bad) to 5 (excellent). The following figure shows the quality of each corpus.

The following figure shows the percentage of the dataset at each quality rating.
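
The audio format stated at the start of this section can be checked per file with Python's standard `wave` module. This is a minimal sketch; the function name `check_wav` is hypothetical, and the check covers only the format, not the 1 to 5 quality rating.

```python
import wave


def check_wav(path):
    """Return (matches_format, duration_in_seconds) for a WAV file.

    The expected format is single-channel, 16-bit PCM at 22050 Hz.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getnchannels() == 1          # single channel
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16 bits
            and wav.getframerate() == 22050  # sample rate in Hz
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

The returned duration can also be used to confirm that a file falls within the 0.2 to 30 second range reported for the dataset.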

Download

The BRSpeech TTS Dataset is not yet available for download.