
FONTI 4.0: Evaluating Speech-to-Text Automatic Transcription of Digitized Historical Oral Sources


Abstract

Conducting “manual” transcription and analysis is unsustainable for most historical oral archives because it requires a remarkable amount of funds and time. The FONTI 4.0 project aims at exploring the suitability of automatic transcription and information extraction technologies for making historical oral sources available. In this work, we conducted an experiment to test the performance of two commercial speech-to-text services (Google Cloud Speech-to-Text and Amazon Transcribe) on digitized oral sources. We created an eight-hour corpus of manually transcribed and annotated historical speech recordings in TEI format. The results clearly show how audio quality and disturbing elements (e.g., overlaps, foreign words) affect the automatic transcription, indicating what needs to be improved to implement an unsupervised transcription chain.

Acknowledgments

This paper was produced under the FONTI 4.0 project, financed by resources from the Regional Operational Program co-financed with the European Social Fund 2014-2020 of the Veneto Region. We are grateful to Fondazione Giorgio Cini, Venice, for making its precious audio material available and for its help in analyzing the recordings, and to Matteo Pettenò for his contribution to the development of the Jupyter notebook.



1. Introduction

FONTI 4.0¹ is a project aiming at exploring the suitability of automatic transcription and analysis tools for the preservation of historical oral sources recorded on analog carriers, in particular magnetic tapes. The digitization of an audio archive is a long and expensive task that can require several years. Furthermore, the content of the audio recordings needs to be listened to and cataloged in order to make the recordings retrievable. For archives comprising hundreds or thousands of hours of audio, making the content accessible demands so much time, personnel and funding that their exploitation is often prevented. Automating the transcription and analysis tasks could therefore drastically reduce the time needed to make digitized audio recordings accessible.

The project is built around a transcription chain (T-chain), first defined by Hessen et al. (2020), from which it differs in two main aspects: (a) in FONTI 4.0, the transcription obtained with speech-to-text (STT) algorithms is not corrected by a human; (b) an additional restoration step may be required for digitized audio recordings. Furthermore, differently from the STT evaluation experiments conducted by Moore et al. (2019), Kostuchenko et al. (2019), and Filippidou and Moussiades (2020), we decided to employ two commercial services, namely Google Cloud Platform and Amazon Web Services, to test their ability to transcribe historical analog recordings and to eventually include them in our pipeline.

During the digitization process, speed and equalization errors can occur, especially when different speed and equalization configurations are used in different parts of the same tape (Pretto et al. 2020). This leads to distortions that make the recorded signal unlistenable. By using the correction workflow and digital filters described in (Pretto, Dalla Pozza, et al. 2021; Pretto, Micheloni, et al. 2021), these errors can be corrected and at least parts of the signal can be saved. This step is essential for making the speech signal suitable for STT algorithms. This paper aims at evaluating the transcription performance of two commercial services on a real use case and at identifying potential problems or limitations related to the peculiarities of analog audio recordings. Section 2 describes the corpus used as ground truth for the experiment. Section 3 outlines the methodology adopted for the experiment, whereas results are reported in Section 4. Finally, Section 5 presents the authors’ conclusions.

2. Corpus

The Cinema & Civiltà (C&C) corpus was conceived within the FONTI 4.0 project as ground truth for evaluating the performance of STT services on a real case study. To build the corpus, we transcribed the speech recorded on four magnetic tapes made available by the Giorgio Cini Foundation in Venice and digitized at the Centro di Sonologia Computazionale (CSC) (Canazza and De Poli 2020). The recordings are parts of the Cinema & Civiltà conference for the awarding of the San Giorgio prize, part of the Venice Film Festival, which took place between the 7th and 9th of September 1959 and was attended by important figures in the history of cinema, such as Roberto Rossellini, and representatives of Italian literary criticism, such as Vittore Branca. Each reel of magnetic tape has two sides, each containing 60 minutes of recorded speech, for a total of eight hours of recordings. The C&C corpus is multilingual, with 64,930 tokens divided into three sub-corpora: Italian (49,772 tokens), French (9,555 tokens, L1 and L2 speakers), and Spanish (5,603 tokens). The corpus was manually transcribed and annotated as described in the following subsections and is available online².

2.1 Transcription

Defining the methodology for the transcription is an important step for the preservation, analysis and access of oral sources. The main difficulty consists in deciding how to represent and convey both verbal and non-verbal elements in written form. Because of the absence of a universal transcription standard (Schmidt 2011), the methodology usually depends on the research aim.

In this research, we decided to produce a verbatim transcription, reporting every word spoken in the recordings, including errors, false starts, truncations, and overlaps, in Italian, French and Spanish. Using the software ELAN (Lausberg and Sloetjes 2016), we first segmented the audio files extracted from the digitized tapes, making the start and end of each segment coincide with the speakers’ turns of talk. We then transcribed each segment while listening to the corresponding portion of audio at reduced speed. Eventually, we opted for starting from the automatic transcriptions produced by Google Cloud Speech-to-Text (GCS) and Amazon Transcribe (AT)³, later used in the STT experiment, and correcting the text while playing the audio at normal speed. This halved the time needed for each transcription, which previously required a full day of work. Moreover, we were able to retrace and match the identities of the speakers to the voices in the recordings by consulting historical documentation on the conference and by comparing voices across the recordings.
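The segment boundaries and turn texts stored in the ELAN .eaf files can also be accessed programmatically. The following minimal Python sketch uses the third-party pympi library; the file name is hypothetical and the tier layout (one tier per speaker) is an assumption for illustration, not a description of our actual files.

    # Minimal sketch (assumptions: one ELAN tier per speaker; the file name is hypothetical).
    from pympi.Elan import Eaf

    eaf = Eaf("cinema_e_civilta_tape1_sideA.eaf")
    for tier in eaf.get_tier_names():
        # Each annotation is a (start_ms, end_ms, transcribed_text) tuple.
        for ann in eaf.get_annotation_data_for_tier(tier):
            start_ms, end_ms, text = ann[:3]
            print(f"{tier}\t{start_ms}\t{end_ms}\t{text}")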

2.2 Annotation

The annotation was employed to add important metadata to the C&C corpus regarding the different levels of audio quality and the presence of disturbing elements in the recordings. Our methodology complies with the Text Encoding Initiative (TEI) guidelines⁴ for transcribed spoken material (Burnard and Bauman 2007). To proceed with the annotation, we first converted the transcription files from the ELAN .eaf format into the TEI XML standard using the EXMARaLDA (Schmidt and Wörner 2014) tool TEI Drop (Schmidt 2011). Subsequently, we used the Oxygen XML editor⁵ to assign TEI tags to the relevant tokens. The list of tags, together with a brief description and examples, is given below (a small parsing sketch follows the list):

<pause>
marks a pause either between or within utterances in the same segment, e.g.: unica fisionomia. <pause/> Parte dell’architettura;

<unclear>
contains a word, phrase, or passage that could not be transcribed with certainty because it is illegible or inaudible in the source, e.g.: gli stessi <unclear reason="inaudible"> strumenti </unclear>, volti agli stessi fini;

<gap>
indicates a point where material has been omitted in the transcription because it is inaudible, e.g.: erba che sorgerà <gap reason="inaudible"/> quell’asfalto.;

<foreign>
identifies a word or phrase as belonging to some language other than that of the surrounding text, e.g.: <foreign xml:lang="fr-FR"> Mesdames, messieurs </foreign>;

<shift>
marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes, e.g.: Io credo che questo argomento sia <shift feature="tempo" new="a"/> particolarmente importante <shift feature="tempo" new="normal"/> per vedere;

<del>
contains a letter, word, or passage indicated as superfluous by the annotator; in this case it was used for false starts, repetitions and truncations, e.g.: in questo <del type="falseStart"> moden </del> momento (false start); momento di <del type="repetition"> di </del> crisi (repetition); suggestione di <del type="truncation"> spettaco </del> di spettacolo (truncation);

<anchor>
was used to mark overlaps by attaching an identifier to a point within a text, e.g.: a contatto di un <anchor synch="ovrl6" xml:id="S06"/> pensiero <anchor synch="ovrl6e" xml:id="S06e"/> lo inducono a (interrupted speaker) <anchor xml:id="ovrl6"/> Io non lo vedo. Chi è questo? Chi è questo? <anchor xml:id="ovrl6e"/> (interrupting speaker);

<distinct>
identifies any word or phrase which is regarded as linguistically distinct, as in the case of prosodically unified units, e.g.: staccarsi da <distinct type="pcu"> questa estetica </distinct> e dai pregiudizi;

<vocal>
marks any vocalized but not necessarily lexical phenomenon, e.g.: del nostro mondo <vocal> <desc>cough</desc> </vocal> che direi postmoderno.;

<incident>
marks any phenomenon or occurrence, not necessarily communicative, for example incidental noises or other events affecting communication, e.g.: è attività creatrice, <incident><desc>noise</desc></incident> ma non propriamente l’artista;

<note>
contains notes or citations; for the purpose of this research, it was used to annotate the audio quality at the beginning of each segment, e.g.: <note>good</note>.
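As a simple illustration of how this markup can be exploited, the following Python sketch (an illustration, not the project’s actual notebook) scans a TEI file and lists which of the above elements occur in each utterance; the file name is hypothetical, and we assume each speaker turn is encoded as a standard TEI <u> element.

    # Minimal sketch: list the disturbing elements occurring in each TEI utterance.
    # Assumptions: turns are encoded as <u> elements; the file name is hypothetical.
    import xml.etree.ElementTree as ET

    TEI_NS = "{http://www.tei-c.org/ns/1.0}"
    ELEMENTS = ["pause", "unclear", "gap", "foreign", "shift", "del",
                "anchor", "distinct", "vocal", "incident"]

    tree = ET.parse("cinema_e_civilta_tape1_sideA.xml")
    for u in tree.iter(TEI_NS + "u"):
        found = [tag for tag in ELEMENTS if u.find(".//" + TEI_NS + tag) is not None]
        text = " ".join("".join(u.itertext()).split())   # plain text of the turn
        print(u.get("who", "?"), found, text[:60])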

Audio quality annotations (<note>) were assigned to each segment using the following scale (Samar and Metz 1988):

excellent: speech is completely intelligible;

good: speech is intelligible with the exception of a few words or phrases;

fair: with difficulty, the listener can understand about half the content of the message;

poor: speech is very difficult to understand, only isolated words or phrases are intelligible;

bad: speech is completely unintelligible.

The distribution of words (punctuation and events excluded) for each audio quality annotation is reported in Table 1; a sketch of how such counts can be derived from the TEI files is given after the table.

Table 1: Number of words (no punctuation nor events) annotated with different audio quality tags

Scale  | it-IT  | fr-FR | es-ES | TOT
Excel. |  9,075 | 5,930 | 4,097 | 19,102
Good   | 30,571 | 2,514 |   800 | 33,885
Fair   |  2,919 |    83 |     0 |  3,002
Poor   |  1,417 |    23 |     0 |  1,440
TOT    | 43,984 | 8,550 | 4,897 | 57,431
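For illustration (again assuming that the quality label is stored in a <note> element at the beginning of each <u> turn, as described in Section 2.2, and using hypothetical file names), word counts per audio quality label such as those in Table 1 could be obtained roughly as follows:

    # Minimal sketch: count words per audio-quality label in the TEI transcriptions.
    # Assumptions: each <u> turn contains a <note> with the quality label;
    # file names are hypothetical; punctuation and event handling is simplified.
    import xml.etree.ElementTree as ET
    from collections import Counter

    TEI_NS = "{http://www.tei-c.org/ns/1.0}"
    counts = Counter()

    for path in ["tape1_sideA.xml", "tape1_sideB.xml"]:
        for u in ET.parse(path).iter(TEI_NS + "u"):
            note = u.find(TEI_NS + "note")
            quality = (note.text or "").strip() if note is not None else "unknown"
            note_words = (note.text or "").split() if note is not None else []
            words = "".join(u.itertext()).split()            # all words in the turn
            counts[quality] += len(words) - len(note_words)  # exclude the label itself

    print(counts)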

3. Experiments

The STT experiment consisted in testing the ability of GCS and AT to correctly transcribe historical recordings. Furthermore, we investigated the performance of the GCS and AT transcriptions at different levels of audio quality and in the presence of disturbing elements in the recordings, such as background noise, overlaps, and code-switching (see Section 2.2).

To analyze the performance of the two STT systems, we developed a Jupyter notebook able to filter the text by language, audio quality, disturbing elements, etc., and to select several options, such as tokenization rules. In this experiment, we decided to use only lower-case characters, to split apostrophes and to remove punctuation from both manual and automatic transcriptions; the ground truth and the transcriptions returned by the STT services were thus canonicalized in the same way. The alignment algorithm works on single utterances and minimizes the Levenshtein distance (Jurafsky and Martin 2008). The metrics obtained were the number of correct matches (COR) and of mismatches, i.e., deletions (DEL), substitutions (SUB) and insertions (INS), and the word error rate (WER), i.e., the ratio between the number of mismatches and the number of words in the reference text (Morris, Maier, and Green 2004). It is important to note that we did not employ this metric to tell how good a system is in absolute terms, but only whether one is better than the other (Errattahi, El Hannani, and Ouahmane 2018). A sketch of this normalization and scoring procedure is given below.
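The following minimal sketch illustrates the normalization and scoring procedure described above; it is an illustration, not the project’s actual notebook, and the example sentences are invented.

    # Minimal sketch of text normalization and WER computation (illustrative only).
    import re

    def normalize(text):
        """Lower-case, split apostrophes and strip punctuation."""
        text = text.lower().replace("'", "' ")          # "dell'arte" -> "dell' arte"
        text = re.sub(r"[^\w' ]+", " ", text)           # drop punctuation
        return text.split()

    def align(ref, hyp):
        """Levenshtein alignment of reference and hypothesis word lists.
        Returns (COR, DEL, SUB, INS)."""
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]       # d[i][j] = edit distance
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # match / substitution
        cor = dele = sub = ins = 0                      # backtrace to count operations
        i, j = n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
                cor, i, j = cor + 1, i - 1, j - 1
            elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
                sub, i, j = sub + 1, i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                dele, i = dele + 1, i - 1
            else:
                ins, j = ins + 1, j - 1
        return cor, dele, sub, ins

    ref = normalize("Mesdames, messieurs, io credo che questo argomento sia importante.")
    hyp = normalize("io credo che questo argomento sia molto importante")
    cor, dele, sub, ins = align(ref, hyp)
    wer = (dele + sub + ins) / len(ref)                 # WER = mismatches / reference words
    print(cor, dele, sub, ins, f"WER = {wer:.2%}")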

To avoid introducing errors not due to the transcription task itself, we decided not to use the automatic language recognition feature, because it could drastically affect the performance. Therefore, we cut the audio files into language-homogeneous parts and automatically transcribed each language separately, as sketched below.
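For illustration only, a language-homogeneous audio file could be submitted to the two services along the following lines, using the google-cloud-speech and boto3 Python clients; the bucket URIs, job name and sample rate are hypothetical, and the exact request options used in the project are not reproduced here.

    # Illustrative sketch: submit one language-homogeneous file to both STT services.
    # Bucket URIs, job name and sample rate are hypothetical.
    import boto3
    from google.cloud import speech

    # Google Cloud Speech-to-Text: long-running recognition for long audio in a bucket.
    gcs_client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="it-IT",
    )
    audio = speech.RecognitionAudio(uri="gs://fonti40-example/tape1_sideA_it.wav")
    response = gcs_client.long_running_recognize(config=config, audio=audio).result()
    gcs_text = " ".join(r.alternatives[0].transcript for r in response.results)

    # Amazon Transcribe: asynchronous job; the transcript JSON is fetched later
    # from the URI reported in the completed job status.
    at_client = boto3.client("transcribe")
    at_client.start_transcription_job(
        TranscriptionJobName="fonti40-tape1-sideA-it",
        Media={"MediaFileUri": "s3://fonti40-example/tape1_sideA_it.wav"},
        MediaFormat="wav",
        LanguageCode="it-IT",
    )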

4. Results

In this preliminary work we mainly illustrate and compare the WER trends of the two STT systems, calculated on the entire corpus as well as on each sub-corpus, in relation to audio quality levels and the presence of disturbing elements.

Figure 1 illustrates that the performance of AT is better than that of GCS in all corpora. The difference between the two systems is small in the Italian sub-corpus, but much wider in the French one. A possible explanation could be the presence of L2 speakers of French, whose pronunciation could have negatively affected the recognition performance. Nevertheless, it should also be considered that the Italian sub-corpus is more than five times bigger than the French and the Spanish ones.

Figure 1: WER of GCS and AT transcriptions on the whole corpus and sub-corpora

STT performance can be further observed in Table 2: on the transcription of the whole corpus, AT scores a lower WER and finds more correct matches than GCS. On the other hand, GCS produces more than twice as many deletions as AT, whereas substitutions and insertions are higher in AT than in GCS. The numbers of deletions and insertions of AT and GCS differ probably because the two services use different language model weights.

Table 2: Word error rate (WER), correct matches (COR), deletions (DEL), substitutions (SUB) and insertions (INS) of the Amazon Transcribe (AT) and Google Cloud Speech-to-Text (GCS) transcriptions of the overall C&C corpus.

STT | WER    | COR    | DEL   | SUB   | INS
AT  | 16.35% | 49,480 | 2,639 | 5,312 | 1,440
GCS | 20.92% | 46,510 | 5,837 | 5,084 | 1,094

Figure 2 shows that transcription performance is very similar for Italian and Spanish with “Excellent” quality, but not for French. For this reason, we cannot impute the poor GCS performance to audio quality. In the Italian sub-corpus, performance is also similar with “Good” quality, but not in the Spanish one, where both services performed badly. The negative impact of audio quality is also evident in the French sub-corpus, although WER values are much higher than in Italian.

Figure 2: WER of GCS and AT with different audio quality levels - whole corpus and sub-corpora

Figure 3 reports the WER computed on the portions of the C&C corpus containing the annotated disturbing elements that were assumed to negatively affect the performance of the STT software. The least disruptive element is shift, although the WER scored for this tag is still higher than the one calculated on the overall evaluation. The other disturbing elements have a greater impact on the transcriptions of both STT services. Overall, AT performs better in the presence of most disturbing elements; the only exception is represented by code-switching events (foreign), for which GCS performed better.

Figure 3: WER of GCS and AT in the presence of disturbing elements on the whole corpus and sub-corpora

5. Conclusion

In this article we presented a preliminary experiment testing the ability of STT software to correctly transcribe digitized historical oral sources originally recorded on magnetic tape. It should be noted that, since this preliminary work was conducted on a small sample of data, our results are only indicative of which elements represent the biggest obstacles to STT software performance.

In spite of the disturbing elements and of the variation of audio quality in the recordings, we showed that, with our dataset and in terms of WER, AT produced more accurate transcriptions than GCS. On the other hand, GCS was better at recognizing foreign words. Table 2 shows that AT misses fewer words (deletions) but produces more insertions and substitutions. This should be taken into consideration when automatic information extraction tools (e.g., Named Entity Recognition algorithms) are applied to the automatic transcriptions. Further analysis should investigate the cause of this trend, in order to verify whether this behavior is also due to alignment or tokenization errors.

With respect to the evaluation of software performance in relation to the variables characterizing analog speech recordings, we found evidence that audio quality drastically affects the number of mismatches. Observations about the incidence of disturbing elements, on the other hand, cannot be generalized, since the sub-corpora are in three different languages and have three different sizes. Throughout the analyses we noted that the most negative impact on transcription, in terms of increased WER, is caused by some specific recurring elements, i.e., code-switching (foreign), overlaps and probably also the production of L2 speakers (Figure 3). Nonetheless, given the necessity of preserving historical documents in a more time- and cost-effective way, we conclude that researchers working on the preservation of historical recordings will benefit from the use of the T-chain, because halving the time required by manual transcription at reduced speed compensates for the lack of accuracy. Researchers working on the collection and preservation of oral archives will thus be able to focus on filling the gap between human and machine output.

Further work will be necessary to conduct experiments on L1 and L2 data separately, to run cross-language tests reducing the Italian subset to the size of the French and Spanish sub-corpora, and to evaluate the impact of incorrect transcriptions on WER. Language identification in the presence of code-switching is another important problem for automatic transcription. Both services have recently introduced this functionality, but at the time of writing the Google Cloud implementation is still in preview. As soon as the feature is generally available, the performance of automatic language recognition algorithms should also be investigated, especially because this feature is essential for automating the transcription of entire archives.

Bibliography


Lou Burnard, and Syd Bauman, eds. 2007. “TEI P5: Guidelines for Electronic Text Encoding and Interchange.” Text Encoding Initiative Consortium. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html.

Sergio Canazza, and Giovanni De Poli. 2020. “Four Decades of Music Research, Creation, and Education at Padua’s Centro di Sonologia Computazionale.” Computer Music Journal 43 (4): 58–80. https://doi.org/10.1162/comj_a_00537.


Rahhal Errattahi, Asmaa El Hannani, and Hassan Ouahmane. 2018. “Automatic Speech Recognition Errors Detection and Correction: A Review.” Procedia Computer Science 128: 32–37.

Foteini Filippidou, and Lefteris Moussiades. 2020. “A Benchmarking of IBM, Google and Wit Automatic Speech Recognition Systems.” In IFIP International Conference on Artificial Intelligence Applications and Innovations, 73–82. Springer.

Arjan van Hessen, Silvia Calamai, Henk van den Heuvel, Stefania Scagliola, Norah Karrouche, Jeannine Beeken, Louise Corti, and Christoph Draxler. 2020. “Speech, Voice, Text, and Meaning: A Multidisciplinary Approach to Interview Data Through the Use of Digital Tools.” In Companion Publication of the 2020 International Conference on Multimodal Interaction, 454–55. ICMI ’20 Companion. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3395035.3425657.


Daniel Jurafsky, and James Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Vol. 2. Pearson, Prentice Hall.

Evgeny Kostuchenko, Dariya Novokhrestova, Marina Tirskaya, Alexander Shelupanov, Mikhail Nemirovich-Danchenko, Evgeny Choynzonov, and Lidiya Balatskaya. 2019. “The Evaluation Process Automation of Phrase and Word Intelligibility Using Speech Recognition Systems.” In International Conference on Speech and Computer, 237–46. Springer.

Hedda Lausberg, and Han Sloetjes. 2016. “The Revised NEUROGES–ELAN System: An Objective and Reliable Interdisciplinary Analysis Tool for Nonverbal Behavior and Gesture.” Behavior Research Methods 48 (3): 973–93.

Meredith Moore, Michael Saxon, Hemanth Venkateswara, Visar Berisha, and Sethuraman Panchanathan. 2019. “Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make.” In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 2528–32. https://doi.org/10.21437/Interspeech.2019-3096.


Andrew Cameron Morris, Viktoria Maier, and Phil Green. 2004. “From WER and RIL to MER and WIL: Improved Evaluation Measures for Connected Speech Recognition.” In Eighth International Conference on Spoken Language Processing.

Niccolò Pretto, Nadir Dalla Pozza, Alberto Padoan, Anthony Chmiel, Kurt James Werner, Alessandra Micalizzi, Emery Schubert, Antonio Rodà, Simone Milani, and Sergio Canazza. 2021. “A Workflow and Novel Digital Filters for Compensating Speed and Equalization Errors on Digitized Audio Open-Reel Tapes.” In Proceedings of Audio Mostly 2021 (AM ’21). Trento, Italy.


Niccolò Pretto, Edoardo Micheloni, Anthony Chmiel, Nadir Dalla Pozza, Dario Marinello, Emery Schubert, and Sergio Canazza. 2021. “Multimedia Archives: New Digital Filters to Correct Equalization Errors on Digitized Audio Tapes.” Edited by Patrick Seeling. Advances in Multimedia 2021: 5410218. https://doi.org/10.1155/2021/5410218.


Niccolò Pretto, Alessandro Russo, Federica Bressan, Valentina Burini, Antonio Rodà, and Sergio Canazza. 2020. “Active Preservation of Analogue Audio Documents: A Summary of the Last Seven Years of Digitization at CSC.” In Proceedings of the 17th Sound and Music Computing Conference (SMC 2020), 394–98. Torino, Italy. https://doi.org/10.5281/zenodo.3898905.


Vincent J. Samar, and Dale Evan Metz. 1988. “Criterion Validity of Speech Intelligibility Rating-Scale Procedures for the Hearing-Impaired Population.” Journal of Speech, Language, and Hearing Research 31 (3): 307–16.

Thomas Schmidt, and Kai Wörner. 2014. “EXMARaLDA.” In The Oxford Handbook of Corpus Phonology, edited by Jacques Durand, Ulrike Gut, and Gjert Kristoffersen. Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199571932.013.030.


Thomas Schmidt. 2011. “A TEI-Based Approach to Standardising Spoken Language Transcription.” Journal of the Text Encoding Initiative 1.

Notes

1 csc.dei.unipd.it/fonti40en/ (last accessed September 2nd, 2021)

2 DOI: 10.5281/zenodo.5645827

3 Automatic transcriptions were obtained on the 16th, 17th, 19th and 24th of March 2021.

4 tei-c.org/release/doc/tei-p5-doc/en/html/TS.html (last accessed September 3rd, 2021)

5 oxygenxml.com (last accessed September 3rd, 2021)

