Visualization: The Missing Factor in Simultaneous Speech Translation
Abstract
Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of multilingual application scenarios, like international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users’ access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their strengths and weaknesses. We then concentrate on the evaluation framework required to properly assess systems’ effectiveness. To this end, we raise the need for a broader performance analysis, which also includes the user experience standpoint. We argue that SimulST systems, indeed, should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics accounting, for instance, for the visualization strategy adopted. In light of this, we highlight the goals achieved by the community so far and what is still missing.1
Acknowledgements
This work has been carried out as part of the project Smarter Interpreting (https://kunveno.digital/) financed by CDTI Neotec funds.
1. Introduction
Simultaneous speech translation (SimulST) is the task in which the translation of source-language speech has to be performed on partial, incremental input. This is a key feature to achieve low latency in scenarios like streaming conferences and lectures, where the text has to be displayed keeping pace with the speech as closely as possible. SimulST is indeed a complex task, in which the difficulties of performing speech recognition from partial input are exacerbated by the need to project meaning across languages. Despite the increasing demand for such systems, the problem is still far from being solved.
So far, research efforts have mainly focused on the quality/latency trade-off, i.e. producing high-quality output in the shortest possible time, balancing the need for a good translation with the necessity of rapid text generation. Previous studies, however, disregard how the translation is displayed and, consequently, how it is actually perceived by the end users. After a concise survey of the state of the art in the field, in this paper we posit that, from the user experience standpoint, output visualization is at least as important as producing a good translation in a short time. This raises the need for a broader, task-oriented and human-centered analysis of SimulST systems’ performance that also accounts for this third crucial factor.
2. Background
As in the case of offline speech translation, the adoption of cascade architectures (Stentiford and Steer 1988; Waibel et al. 1991) was the SimulST community’s first attempt to tackle the problem of generating text from partial, incremental input. Cascade systems (Fügen 2009; Fujita et al. 2013; Niehues et al. 2018; Xiong et al. 2019; Arivazhagan, Cherry, Te, et al. 2020) involve a pipeline of two components. First, a streaming automatic speech recognition (ASR) module transcribes the input speech into the corresponding text (Wang et al. 2020; Moritz, Hori, and Le 2020). Then, a simultaneous text-to-text translation module translates the partial transcription into target-language text (Gu et al. 2017; Dalvi et al. 2018; Ma et al. 2019; Arivazhagan et al. 2019). This approach suffers from error propagation, a well-known problem even in the offline scenario: the transcription errors made by the ASR module are propagated to the MT module, which cannot recover from them as it has no direct access to the audio. Another strong limitation of cascade systems is the extra latency added by the two-step pipeline, since the MT module has to wait until the streaming ASR output is produced.
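To make the two-step loop concrete, the following minimal sketch chains a dummy streaming ASR component and a dummy simultaneous MT component. The class and method names are our own illustrative placeholders, not the API of any existing toolkit, and the wait-3 behaviour of the MT stage is just one example policy.

```python
class StreamingASR:
    """Dummy incremental recognizer: pretend each audio chunk yields one new word."""
    def __init__(self):
        self.words = []

    def update(self, chunk):
        self.words.append(f"src{len(self.words)}")
        return self.words


class SimultaneousMT:
    """Dummy simultaneous translator following a wait-k policy on the transcript."""
    def __init__(self, k=3):
        self.k = k

    def extend(self, source_words, target_words):
        # WRITE one target word whenever at least k more source words have been read
        while len(source_words) - len(target_words) >= self.k:
            target_words = target_words + [f"tgt({source_words[len(target_words)]})"]
        return target_words


def cascade_simulst(audio_chunks, asr, mt):
    target = []
    for chunk in audio_chunks:               # READ: a new chunk of speech arrives
        source = asr.update(chunk)           # incremental transcription
        target = mt.extend(source, target)   # incremental translation
        yield " ".join(target)


# Example: eight audio chunks translated with a wait-3 policy in the MT stage.
for partial in cascade_simulst(range(8), StreamingASR(), SimultaneousMT(k=3)):
    print(partial)
```

Note how the translation stage can only react after the transcription stage has produced new words, which is exactly the source of the extra latency discussed above.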
To overcome these issues, the direct models initially proposed for offline ST (Weiss et al. 2017) represent a valid alternative that is gaining increasing traction (Bentivogli et al. 2021). Direct ST models are composed of an encoder, usually bidirectional, and a decoder. The encoder starts from the audio features extracted from the input signal and computes a hidden representation; the decoder transforms this representation into target-language text. Direct modeling becomes crucial in the simultaneous scenario, as it reduces the overall system latency thanks to the absence of intermediate symbolic representation steps. Despite the data scarcity caused by the limited availability of speech-to-translation corpora, the adoption of direct architectures has proven promising (Weiss et al. 2017; Ren et al. 2020; Zeng, Li, and Liu 2021), driving recent efforts towards the development of increasingly powerful and efficient models.
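As a rough illustration of this encoder-decoder design, the sketch below (PyTorch, with made-up hyper-parameters and without causal masking, padding handling or pre-training) maps a sequence of filterbank features directly to target-language token logits, with no intermediate transcription step.

```python
import torch
import torch.nn as nn

class DirectST(nn.Module):
    """Minimal direct ST skeleton: audio features in, target token logits out."""
    def __init__(self, n_mel=80, d_model=256, vocab_size=8000):
        super().__init__()
        self.feat_proj = nn.Linear(n_mel, d_model)      # project filterbank features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, prev_tokens):
        enc = self.encoder(self.feat_proj(audio_feats))   # hidden speech representation
        dec = self.decoder(self.embed(prev_tokens), enc)  # autoregressive decoding
        return self.out(dec)                              # next-token logits

# audio_feats: (batch, frames, 80) filterbanks; prev_tokens: (batch, tgt_len) token ids
logits = DirectST()(torch.randn(1, 120, 80), torch.zeros(1, 5, dtype=torch.long))
```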
3. Architectural Challenges
This section surveys the direct SimulST models developed so far, highlighting strengths and weaknesses of the current architectures and decision policies – i.e. the strategies used by the system to decide whether to output a partial translation or to wait for more audio information. We discuss ongoing research on architectural improvements of encoder-decoder models, as well as popular approaches like offline training and re-translation. All these works concentrate on reducing system latency, targeting a better quality/latency trade-off.
Encoding Strategy
Few studies (Elbayad, Besacier, and Verbeek 2020; Nguyen, Estève, and Besacier 2021) have tried to improve the encoder of simultaneous systems. These works introduced the use of unidirectional encoders instead of standard bidirectional encoders (i.e. the encoder states are not updated after each read action) to speed up the decoding phase. An encoding strategy called Overlap-and-Compensate was also proposed, in which the encoder exploits extra frames from the past that were discarded during the previous encoding step. Segmentation is a crucial aspect of SimulST, since the system needs to split a long audio input into smaller chunks (speech frames) in order to process them. Different segmentation techniques can be adopted, ranging from the simplest one, based on fixed time windows (X. Ma, Pino, and Koehn 2020), to dynamic ones based on automatically detected word boundaries (Zeng, Li, and Liu 2021; Chen et al. 2021). Dynamic segmentation based on oracle boundaries has also been studied, but it turned out to perform worse than fixed segmentation in that scenario.
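To illustrate the difference between the two segmentation styles mentioned above, here is a minimal sketch. The function names and the window size are ours, purely for illustration: fixed segmentation cuts the incoming frames every T milliseconds, while dynamic segmentation cuts at word boundaries produced by some external detector (e.g. a CTC-based one).

```python
def fixed_segmentation(n_frames, window_ms=280, frame_ms=10):
    """Split a stream of n_frames (10 ms each) into fixed windows (e.g. 280 ms)."""
    step = window_ms // frame_ms
    return [(start, min(start + step, n_frames)) for start in range(0, n_frames, step)]

def dynamic_segmentation(boundary_frames, n_frames):
    """Split at automatically detected word boundaries (frame indices)."""
    starts = [0] + list(boundary_frames)
    ends = list(boundary_frames) + [n_frames]
    return list(zip(starts, ends))

print(fixed_segmentation(100))                  # [(0, 28), (28, 56), (56, 84), (84, 100)]
print(dynamic_segmentation([17, 42, 77], 100))  # segments ending at detected boundaries
```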
Decoding Strategy
Some efforts have been devoted to improving the decoding strategy, as it is strongly tied to the decision policy of simultaneous systems. Speculative beam search, or SBS (Zheng et al. 2019c), represents the first successful attempt to use beam search in SimulST. This technique consists in hallucinating several prediction steps into the future in order to make more accurate decisions based on the best “speculative” prediction obtained. Beam search has also been integrated into the decoding strategy through the wait-k-stride-N approach: output speculation is bypassed by directly applying beam search, after waiting for k words, on a word stride of size N (i.e., on N words at a time) instead of one single word as prescribed by the standard wait-k. Several decoding strategies relying on different output token granularities, such as characters and byte pair encoding (BPE), have also been analyzed, showing that the latter yields lower latency.
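The following sketch only illustrates the read/write schedule implied by wait-k-stride-N as described here (how many source units have been read before each target word is written); it does not implement the actual beam search, and the helper is our own simplification.

```python
def wait_k_stride_n_schedule(src_len, tgt_len, k, n):
    """Return, for each emitted target position, how many source units were read."""
    schedule = []
    for t in range(0, tgt_len, n):        # one WRITE step per stride of N target words
        read = min(k + t, src_len)        # source units available before this stride
        schedule.extend([read] * min(n, tgt_len - t))
    return schedule

print(wait_k_stride_n_schedule(src_len=10, tgt_len=8, k=3, n=1))  # [3, 4, 5, 6, 7, 8, 9, 10]
print(wait_k_stride_n_schedule(src_len=10, tgt_len=8, k=3, n=2))  # [3, 3, 5, 5, 7, 7, 9, 9]
```

With N=2, pairs of target words share the same delay, which is the small latency increase traded for the possibility of re-ranking the stride with beam search.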
Offline or Online training?
An alternative to simultaneous training is the offline (or full-sentence) training of the system and its subsequent use as a simultaneous one. This solution has been explored with an LSTM-based direct ST system, analyzing the effectiveness of different decoding strategies. Interestingly, the offline approach not only preserves overall performance despite the switch of modality, it also improves the system’s ability to generate well-formed sentences. These results were later confirmed by work that successfully exploits a direct ST system jointly trained in an offline fashion with an ASR one.
Another point of view: re-translation
Re-translation (Niehues et al. 2016, 2018; Arivazhagan, Cherry, Macherey, et al. 2020; Arivazhagan, Cherry, Te, et al. 2020) consists in re-generating the output from scratch (e.g. after a fixed amount of time) for as long as new information is received. This approach ensures high quality (the final output is produced with all the available context) and low latency (partial translations can be generated with a fixed, controllable delay). This, however, comes at the cost of strong output instability (the so-called flickering, due to continuous updates of the displayed translations), which is not optimal from the user experience standpoint. To this end, some metrics have been developed to measure the instability phenomenon, such as erasure (Arivazhagan, Cherry, Te, et al. 2020), which counts the number of tokens deleted from the previously emitted translation to produce the next one.
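A small sketch of this measure, following our reading of Arivazhagan, Cherry, Te, et al. (2020): erasure counts how many tokens of the previously displayed output do not survive (as a prefix) in the next revision. Details such as tokenization may differ in the original definition.

```python
def erasure(previous, current):
    """Number of trailing tokens of `previous` that are replaced in `current`."""
    prev_toks, curr_toks = previous.split(), current.split()
    common = 0
    for p, c in zip(prev_toks, curr_toks):
        if p != c:
            break
        common += 1
    return len(prev_toks) - common

# Two consecutive re-translations: the last two tokens are rewritten (flickering).
print(erasure("es geht um einen Weg", "es geht um eine Art"))  # 2
```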
Decision Policy
In simultaneous settings, the model has to decide, at each time step, whether the available information is enough to produce a partial translation – i.e. to perform a write action using the information received up to that step (audio chunks in the case of SimulST, or tokens in the case of simultaneous MT) – or whether it has to wait and perform a read action to receive new information from the input. Different decision policies result in different ways of balancing the quality/latency trade-off. On one side, more read actions provide the system with a larger context, useful to generate higher-quality translations. On the other side, this comes at the cost of an increased, sometimes unacceptable, latency. To address this problem, two types of policy have been proposed so far: fixed and adaptive. While fixed decision policies only look at the number of ingested tokens (or speech chunks, in the speech scenario), in adaptive ones the decision is also taken by looking at the contextual information extracted from the input.
While little research has focused on adaptive policies (Gu et al. 2017; Zheng et al. 2019b, 2020) due to their hard and time-consuming training (Zheng et al. 2019a; Arivazhagan et al. 2019), the adoption of easy-to-train fixed policies is the typical choice. Indeed, the most widely used policy is a fixed one, called wait-k (Ma et al. 2019). Simple yet effective, it consists in waiting for k source words before starting to generate the target sentence, as shown in Table 1 (a minimal sketch of the resulting emission schedule is given after the table).
Table 1: wait-k policy example with k={3,5}
source | It | was | a | way | that | parents | ... |
wait-3 | - | - | - | Es | ging | um | eine |
wait-5 | - | - | - | - | - | Es | ging |
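The helper below is our own formalization of the schedule illustrated in Table 1: under wait-k, the t-th target word (1-indexed) can be written once min(k + t - 1, |x|) source words have been read. The word counts, not the actual words, are what the policy operates on.

```python
def wait_k_emissions(src_len, tgt_len, k):
    """Map each target position t to the number of source words read before writing it."""
    return {t: min(k + t - 1, src_len) for t in range(1, tgt_len + 1)}

# With k=3, the first target word ("Es" in Table 1) is written after reading three
# source words ("It was a"); with k=5, after five ("It was a way that").
print(wait_k_emissions(src_len=7, tgt_len=4, k=3))  # {1: 3, 2: 4, 3: 5, 4: 6}
print(wait_k_emissions(src_len=7, tgt_len=4, k=5))  # {1: 5, 2: 6, 3: 7, 4: 7}
```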
As the original wait-k implementation is based on textual source data, it was first adapted to the audio domain by waiting for k fixed time frames (audio chunks or speech frames) rather than k words. However, this simplistic approach does not consider various aspects of human speech, such as different speaking rates, word durations, pauses, and silences. In (Ren et al. 2020), the adaptation was done differently, by including a Connectionist Temporal Classification (CTC)-based (Graves et al. 2006) segmentation module able to determine word boundaries. In this case, the wait-k strategy is applied by waiting for k pauses between words, automatically detected by the segmenter. The CTC-based segmentation method has also been combined with a wait-k-stride-N policy to allow re-ranking during the decoding phase. The wait-k-stride-N model emits more than one word at a time, slightly increasing the latency, since the output is prompted only after the whole stride has been processed. This small increase in latency, however, allows the model to perform beam search on the stride, which has been shown to be effective in improving translation quality (Sutskever, Vinyals, and Le 2014). Decoding more than one word at a time has also been shown to increase translation quality when two words are emitted at once, without any relevant impact on latency. Another way of applying the wait-k strategy is to use a streaming ASR system to guide the direct ST decoding: the ASR beam is inspected to determine how many tokens have been emitted within the partial audio segment, hence providing the information needed to apply the original wait-k policy in a straightforward way. Another interesting solution is to jointly train a direct model across multiple wait-k paths: once the sentence has been encoded, the system is optimized by uniformly sampling the value of k for the decoding step. Even though good performance can be reached with single-path training with k=7 and a different k value at test time, the multi-path approach proved to be effective. One of its advantages is that no k value has to be specified for training, which avoids training several models from scratch for different values of k.
Retrospective
All the aspects analyzed in this section highlight several research directions already taken by the simultaneous community, which have to be studied more in depth. Among them, the audio or text segmentation strategy clearly emerges as a fundamental factor of simultaneous systems, and the mixed results obtained in several studies indicate that this aspect still has to be clarified. Moreover, the extensive literature on the wait-k policy shows that it represents one of the topics of greatest interest to the community, which continues working to further improve its effectiveness, as it directly impacts systems’ performance, especially latency. Unfortunately, all these studies focus on architectural enhancements and decision policies despite the absence of a unique and clear evaluation framework enabling a correct and complete analysis of the systems.
4. Evaluation Challenges
A good simultaneous model should produce a high-quality translation with reasonable timing, as waiting too long negatively affects the user experience. The offline MT and ST communities commonly use the well-established BLEU metric (Papineni et al. 2002; Post 2018) to measure the quality of the output translation, but a simultaneous system also needs a metric that accounts for the time spent by the system to output the partial translations. Simultaneous MT (SimulMT) is the task in which a real-time translation is produced having only a partial source text at disposal. Since SimulMT was the first, and easier, simultaneous scenario studied by the community, a set of latency metrics was initially introduced for the text-to-text case.
Latency Metrics for SimulMT
The first metric, the Average Proportion (AP), measures the average proportion of source input read when generating a target prediction, that is, the normalized sum of the source tokens read when generating each target token. However, AP is not length-invariant, i.e. the value of the metric depends on the input and output lengths and is not evenly distributed over the [0, 1] interval (Ma et al. 2019), making this metric strongly unreliable.
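A hedged sketch of AP as we summarize it above: with g(t) the number of source units read before emitting target token t, AP = (1 / (|x| · |y|)) · Σ_t g(t).

```python
def average_proportion(delays, src_len):
    """delays[t] = number of source units read before writing target token t (1-indexed list)."""
    return sum(delays) / (src_len * len(delays))

# A wait-3 schedule on a 7-word source with a 6-word target:
print(round(average_proportion([3, 4, 5, 6, 7, 7], src_len=7), 3))  # ~0.762
```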
To overcome these problems, the Average Lagging (AL) was introduced, which directly describes the lagging behind an ideal policy, i.e. a policy that produces the output exactly at the same pace as the source speech. As a downside, Average Lagging is not differentiable, a property that would instead be useful, especially if the metric were to be included in the system’s loss computation. For this reason, the Differentiable Average Lagging (DAL) was proposed, which introduces a minimum delay after each operation.
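Hedged sketches of AL and DAL as we summarize them here: both compare the delays g(t) (source units read before writing target token t) with an ideal, evenly paced policy, where gamma = |y| / |x| is the target/source length ratio and tau is the first target position at which the whole source has been read.

```python
def average_lagging(delays, src_len):
    """AL: average lag (in source units) behind the ideal policy, up to position tau."""
    gamma = len(delays) / src_len
    tau = next(t for t, g in enumerate(delays, start=1) if g >= src_len)
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

def differentiable_average_lagging(delays, src_len, min_delay=1.0):
    """DAL: like AL, but each write incurs at least a minimum delay, summed over all tokens."""
    gamma = len(delays) / src_len
    dal, g_prev = 0.0, float("-inf")
    for t, g in enumerate(delays, start=1):
        g_prev = max(g, g_prev + min_delay / gamma)   # enforce the minimum per-token delay
        dal += g_prev - (t - 1) / gamma
    return dal / len(delays)

delays = [3, 4, 5, 6, 7, 7]      # wait-3 on a 7-word source, 6-word target
print(round(average_lagging(delays, src_len=7), 2))                  # ~2.67
print(round(differentiable_average_lagging(delays, src_len=7), 2))   # 3.0
```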
Another way of measuring the lagging is to compute the alignment difficulty of a source-target pair. To this end, the Lagging Difficulty (LD) metric was proposed, which exploits the fast-align tool (Dyer, Chahuneau, and Smith 2013) to estimate source-target alignments, infer the reference decoding path, and compute the AL metric on it. The authors claim LD to be a realistic measure of simultaneous translation, as it also evaluates how easy a translation is to align given the context available when decoding.
Latency Metrics for SimulST
The popular AP, AL and DAL metrics were subsequently adapted by the SimulST community to the speech scenario, for instance by converting the number of source words into the sum of the speech segment durations (X. Ma et al. 2020). Later, the issue of using computation-unaware metrics was raised, and computation-aware variants were proposed that also account for the time spent by the model to generate the output. Unfortunately, computing such metrics is far from easy in the absence of a unique and reproducible environment for evaluating a model’s performance. To this end, the SimulEval tool was proposed, which computes the metrics by simulating a real-time scenario through a server-client scheme. This toolkit automatically evaluates simultaneous translations (both text and speech) given a customizable agent that can be defined by the user and that depends on the adopted policy. Despite the progress in metrics for evaluating quality and latency, no studies have been conducted on their actual correlation with user experience. This represents a missing key point in the current evaluation landscape, giving rise to the need for a tool that combines quality and latency metrics with application-oriented metrics (e.g., reading speed), which are strongly correlated with the visualization strategy and, ultimately, with the user experience.
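A hedged sketch of the computation-aware idea discussed above: each token’s delay is measured as the elapsed wall-clock time when it is emitted (speech consumed plus computation time), instead of only the duration of the consumed speech. Variable names are ours, and the exact formulation used in SimulEval may differ in detail.

```python
def computation_aware_al(emit_times_ms, src_duration_ms, tgt_len):
    """AL over wall-clock emission times, compared with an ideal, evenly paced output."""
    rate = src_duration_ms / tgt_len                   # ideal ms of source per target token
    tau = next(t for t, e in enumerate(emit_times_ms, start=1) if e >= src_duration_ms)
    return sum(emit_times_ms[t - 1] - (t - 1) * rate for t in range(1, tau + 1)) / tau

# Emission timestamps (ms) that include model computation time, for a 2.8 s utterance:
print(round(computation_aware_al([900, 1450, 2100, 2700, 3050], 2800, tgt_len=5), 1))  # ~920 ms
```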
5. The Missing Factor: Visualization
In the previous section, we introduced the most popular metrics used to evaluate simultaneous systems’ performance. These metrics account for the quality and the latency of the system without capturing the users’ needs. Although many researchers acknowledge the importance of human evaluation, this partial view risks pushing the community in the wrong direction, with all efforts focused on the quality/latency factors while the problems experienced by users may be of another kind. Indeed, a third factor that matters, and strongly influences human understanding of a translation, even a very good one, is the visualization strategy adopted. The visualization problem, and the need to present the text in a readable fashion for the user, was only addressed in our previous work (Karakanta et al. 2021). In that paper, we raised the need for a clearer and less distracting visualization of the texts generated by SimulST systems by presenting them as subtitles (text segmented into lines preserving coherent information). We proposed different visualization strategies to better assess the online display problem, attempting to simulate a setting where human understanding is at the core of the analysis.
Visualization modalities
The standard word-for-word visualization method (Ma et al. 2019), in which words appear sequentially on the screen as they are generated, can be strongly sub-optimal for human understanding (Romero-Fresco 2011). In fact, the word-for-word approach has two main problems: i) the emission rate of words is irregular (some appear too fast, some too slow) and users waste time reading the text because their eyes have to make more movements, and ii) the emitted pieces of text do not correspond to linguistic units/chunks, requiring more cognitive effort. Moreover, when the maximum length of the subtitle (which depends on the dimensions of the screen) is reached, the subtitle disappears without giving the user enough time to read the last words emitted. As this negatively impacts the user experience, in (Karakanta et al. 2021) we proposed different visualization modes that better accommodate human reading requirements.

We first introduced the block visualization mode, in which an entire subtitle (usually one or two lines at most) is displayed at once, as soon as the system has finished generating it. This display mode is the easiest to read because it prevents re-reading phenomena (Rajendran et al. 2013) and unnecessary/excessive eye fixations (Romero-Fresco 2010), reducing the human effort. However, we found that the latency introduced by waiting for an entire subtitle is too high for this visualization mode to be used in many simultaneous scenarios. As a consequence, we proposed the scrolling lines visualization mode, which displays the subtitles line by line: every time a new line becomes available, it appears at the bottom of the screen, while the previous (older) line is scrolled to the upper position, so that two lines are always displayed on the screen.

To evaluate the performance of the system in the different visualization modes, we also proposed an ad-hoc calculation of the reading speed (characters per second, or CPS), which correlates with human judgment of subtitles (Perego et al. 2010). The reading speed indicates how fast a user needs to read in order not to miss any part of the subtitle: the lower the reading speed, the better the model’s output, since a fast reading speed increases the cognitive load and leaves less time to look at the image. The scrolling lines method offers the best balance between latency and a comfortable reading speed, making it the best choice for the simultaneous scenario. On the other hand, this approach requires segmented text (i.e. text divided into subtitles), so the system needs to be able to simultaneously generate transcripts or translations together with proper subtitle delimiters. Building a simultaneous subtitling system thus combines the difficulties of the simultaneous setting with the constraint of producing text formatted in proper subtitles. Since both these research directions are still evolving, a lot of work is required to achieve good results.
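An illustrative computation of the reading speed measure mentioned above: for each displayed subtitle, the number of characters divided by the time it stays on screen. The exact formulation in (Karakanta et al. 2021) accounts for the simultaneous display mode; this is only a simplified sketch with made-up example data.

```python
def reading_speed_cps(subtitles):
    """subtitles: list of (text, appear_time_s, disappear_time_s); returns average CPS."""
    speeds = [len(text) / (end - start) for text, start, end in subtitles]
    return sum(speeds) / len(speeds)

subs = [("Es ging um eine Art,", 0.0, 1.6),
        ("wie Eltern ihre Kinder erziehen.", 1.6, 3.9)]
print(round(reading_speed_cps(subs), 1))   # ~13.2 characters per second
```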
The lack of studies on these aspects highlights the shortcomings of current SimulST systems, pointing to possible improvements that would allow the systems to evolve in a more organic and complete way according to user needs. Moreover, to completely assess the subtitling scenario, a system has to be able to jointly produce timestamp metadata linked to the emitted words, a task that has not been addressed so far. The need for this kind of system represents an interesting direction for the simultaneous community. In light of this, researchers should also take the three factors, quality, latency and visualization, into account in their analyses. We are convinced that these are the most promising aspects to work on to build the best SimulST system for the audience, and that human evaluation has to play a crucial role in future studies. We also believe that interdisciplinary dialogue with other fields, such as cognitive studies, media accessibility and human-computer interaction, would be very insightful to evaluate SimulST outputs from communicative perspectives (Fantinuoli and Prandi 2021).
6. Conclusions and Future Directions
SimulST systems have become increasingly popular in recent years and many efforts have been made to build robust and efficient models. Despite the difficulties introduced by the online framework, these models have rapidly improved, achieving results comparable to those of offline systems. However, many research directions have not been explored enough (e.g., the adoption of dynamic or fixed segmentation, offline versus online training). First among them, the visualization strategy adopted to display the output of simultaneous systems is an important and largely under-analyzed aspect of the simultaneous experience. We posit that application-oriented metrics (e.g., reading speed), which are strongly related to the visualization and, as an ultimate goal, to the user experience, are the factor missing from the current evaluation environment. Indeed, this paper points out that BLEU and Average Lagging are not the only metrics that matter to effectively evaluate a SimulST model, even if they remain fundamental to judge the correctness and timeliness of a translation. We hope that this will inspire the community to work on this critical aspect in the future.
Bibliography
Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. “Monotonic Infinite Lookback Attention for Simultaneous Machine Translation.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1313–23. Florence, Italy: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/P19-1126.
Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster. 2020. “Re-Translation Versus Streaming for Simultaneous Translation.” In Proceedings of the 17th International Conference on Spoken Language Translation, 220–27. Online: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2020.iwslt-1.27.
Naveen Arivazhagan, Colin Cherry, Isabelle Te, Wolfgang Macherey, Pallavi Baljekar, and George Foster. 2020. “Re-Translation Strategies for Long Form, Simultaneous, Spoken Language Translation.” In ICASSP 2020-2020 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 7919–23. IEEE.
Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. “Cascade Versus Direct Speech Translation: Do the Differences Still Make a Difference?” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2873–87. Online: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2021.acl-long.224.
Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2021. “Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR.” In Findings of the Association for Computational Linguistics: ACL-Ijcnlp 2021, 4618–24. Online: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2021.findings-acl.406.
Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. “Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 493–99. New Orleans, Louisiana: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/N18-2079.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. “A Simple, Fast, and Effective Reparameterization of IBM Model 2.” In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 644–48. Atlanta, Georgia: Association for Computational Linguistics. https://aclanthology.org/N13-1073.
Maha Elbayad, Laurent Besacier, and Jakob Verbeek. 2020. “Efficient Wait-k Models for Simultaneous Machine Translation.” In Proc. Interspeech 2020, 1461–5. https://0-doi-org.catalogue.libraries.london.ac.uk/10.21437/Interspeech.2020-1241.
Claudio Fantinuoli and Bianca Prandi. 2021. “Towards the Evaluation of Automatic Simultaneous Speech Translation from a Communicative Perspective.” In Proceedings of the 18th International Conference on Spoken Language Translation (Iwslt 2021), 245–54. Bangkok, Thailand (online): Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2021.iwslt-1.29.
Tomoki Fujita, Graham Neubig, S. Sakti, T. Toda, and Satoshi Nakamura. 2013. “Simple, Lexicalized Choice of Translation Timing for Simultaneous Speech Translation.” In INTERSPEECH.
C. Fügen. 2009. “A System for Simultaneous Translation of Lectures and Speeches.”
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.” In Proceedings of the 23rd International Conference on Machine Learning, 369–76. ICML ’06. New York, NY, USA: Association for Computing Machinery. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1145/1143844.1143891.
Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li. 2017. “Learning to Translate in Real-Time with Neural Machine Translation.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 1053–62. Valencia, Spain: Association for Computational Linguistics. https://www.aclweb.org/anthology/E17-1099.
Alina Karakanta, Sara Papi, Matteo Negri, and Marco Turchi. 2021. “Simultaneous Speech Translation for Live Subtitling: From Delay to Display.” In Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (Asltrw), 35–48. Virtual: Association for Machine Translation in the Americas. https://aclanthology.org/2021.mtsummit-asltrw.4.
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, et al. 2019. “STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency Using Prefix-to-Prefix Framework.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3025–36. Florence, Italy: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/P19-1289.
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020. “SIMULEVAL: An Evaluation Toolkit for Simultaneous Translation.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 144–50. Online: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2020.emnlp-demos.19.
Xutai Ma, Juan Pino, and Philipp Koehn. 2020. “SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation.” In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 582–87. Suzhou, China: Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.aacl-main.58.
Niko Moritz, Takaaki Hori, and Jonathan Le. 2020. “Streaming Automatic Speech Recognition with the Transformer Model.” In ICASSP 2020-2020 Ieee International Conference on Acoustics, Speech and Signal Processing (Icassp), 6074–8. IEEE.
Niko Nguyen, Yannick Estève, and Laurent Besacier. 2021. “Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation.” In Proc. Interspeech 2021, 2371–5. https://0-doi-org.catalogue.libraries.london.ac.uk/10.21437/Interspeech.2021-608.
J. Niehues, T. Nguyen, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Müller, Matthias Sperber, S. Stüker, and A. Waibel. 2016. “Dynamic Transcription for Low-Latency Speech Translation.” In INTERSPEECH.
J. Niehues, Ngoc-Quan Pham, Thanh-Le Ha, Matthias Sperber, and A. Waibel. 2018. “Low-Latency Neural Speech Translation.” In INTERSPEECH.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “Bleu: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.3115/1073083.1073135.
Elisa Perego, F. Del Missier, M. Porta, and M. Mosconi. 2010. “The Cognitive Effectiveness of Subtitle Processing.” Media Psychology 13 (3): 243–72. https://doi.org/10.1080/15213269.2010.502873.
Matt Post. 2018. “A Call for Clarity in Reporting BLEU Scores.” In Proceedings of the Third Conference on Machine Translation: Research Papers, 186–91. Brussels, Belgium: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/W18-6319.
Dhevi J. Rajendran, Andrew T. Duchowski, Pilar Orero, Juan Martínez, and Pablo Romero-Fresco. 2013. “Effects of Text Chunking on Subtitling: A Quantitative and Qualitative Examination.” Perspectives 21 (1): 5–21. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1080/0907676X.2012.722651.
Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. “SimulSpeech: End-to-End Simultaneous Speech to Text Translation.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3787–96. Online: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2020.acl-main.350.
Pablo Romero-Fresco. 2010. “Standing on Quicksand: Hearing Viewers’ Comprehension and Reading Patterns of Respoken Subtitles for the News.” 175–94. Leiden, The Netherlands: Brill. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1163/9789042031814_014.
Pablo Romero-Fresco. 2011. Subtitling Through Speech Recognition: Respeaking. Manchester: St. Jerome.
Frederick W. M. Stentiford and Martin G. Steer. 1988. “Machine Translation of Speech.” British Telecom Technology Journal 6 (2): 116–22.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In Advances in Neural Information Processing Systems, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger. Vol. 27. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
Alex Waibel, Ajay N. Jain, Arthur E. McNair, Hiroaki Saito, Alexander G. Hauptmann, and Joe Tebelskis. 1991. “JANUS: A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies.” In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Icassp 1991, 793–96. Toronto, Canada. https://doi.org/10.1109/ICASSP.1991.150456.
Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, and Ming Zhou. 2020. “Low Latency End-to-End Streaming Speech Recognition with a Scout Network.” arXiv Preprint arXiv:2003.10369.
Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. “Sequence-to-Sequence Models Can Directly Translate Foreign Speech.” In Proceedings of Interspeech 2017, 2625–9. Stockholm, Sweden.
Hao Xiong, Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. “Dutongchuan: Context-Aware Translation Model for Simultaneous Interpreting.” arXiv Preprint arXiv:1907.12984.
Xingshan Zeng, Liangyou Li, and Qun Liu. 2021. “RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer.” In Findings of the Association for Computational Linguistics: ACL-Ijcnlp 2021, 2461–74. Online: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/2021.findings-acl.218.
Baigong Zheng, Kaibo Liu, Renjie Zheng, M. Ma, Hairong Liu, and L. Huang. 2020. “Simultaneous Translation Policies: From Fixed to Adaptive.” ArXiv abs/2004.13169.
Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. “Simultaneous Translation with Flexible Policy via Restricted Imitation Learning.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5816–22. Florence, Italy: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/P19-1582.
Baigong Zheng. 2019b. “Simpler and Faster Learning of Adaptive Policies for Simultaneous Translation.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp), 1349–54. Hong Kong, China: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/D19-1137.
Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019c. “Speculative Beam Search for Simultaneous Translation.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp), 1395–1402. Hong Kong, China: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/D19-1144.
Footnotes
1 Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Authors
Fondazione Bruno Kessler, Italy – University of Trento, Italy – spapi@fbk.eu
Fondazione Bruno Kessler, Italy – negri@fbk.eu
Fondazione Bruno Kessler, Italy – turchi@fbk.eu