CHANGE-IT @ EVALITA 2020: Change Headlines, Adapt News, GEnerate
p. 235-239
Abstract
We propose a generation task for Italian – more specifically, a style transfer task for headlines of Italian newspapers. This is the first shared task on generation included in the EVALITA evaluation framework. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this aim in mind, we release to the participating teams not only training data, but also a baseline sequence-to-sequence model that performs the task, in order to help everyone get started, even when not accustomed to Natural Language Generation (NLG) approaches. At the same time, we explore the complex issue of automatic evaluation of generated text, which is receiving particular attention in the NLG community.
Editor's note
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Task and Motivation
We propose a generation task for Italian in the context of the EVALITA 2020 campaign (Basile et al. 2020). More specifically, we design a style transfer task for headlines of Italian newspapers.
We believe this is the first time that a shared task on generation is offered in the context of EVALITA. Indeed, one of the reasons to have this task is to stimulate more research on generation within the Italian community. With this goal in mind, we release to the participating teams not only training data, but also a baseline sequence-to-sequence model that performs the task, in order to help everyone get started, even those not yet accustomed to generation models. This baseline casts the style transfer problem as an extreme summarisation task, showing how versatile the problem is in terms of possible approaches. At the same time, this task will help to further explore the complex issue of evaluating generated text, which is receiving particular attention in the international Natural Language Generation community (Gatt and Krahmer 2018; Lee et al. 2019).
Task
The task is cast as a “headline translation” problem, as follows. Given a collection of headlines from two Italian newspapers at opposite ends of the political spectrum, call them G and R, change all G-headlines into headlines in style R, and all R-headlines into headlines in style G.
In the context of this task we need to take care of two crucial aspects: data and evaluation. Details on the data are provided in Section 2, and on evaluation in Section 3.
2. Data
We have collected news from two of the most important Italian newspapers situated at opposite ends of the political spectrum, namely la Repubblica (left) and Il Giornale (right), totalling approximately 152,000 article-headline pairs, with the two newspapers equally represented. Although the task only concerns headline change, the teams receive both the headlines and their respective full articles.
Leveraging an alignment procedure described below (see Cafagna, De Mattei, and Nissim (2019) for further details), we account for potential topic biases in the two newspapers, and we split the dataset into strongly, weakly and non-aligned news. This information is useful in the creation of the datasets that we need to train our three evaluation classifiers (see Section 3). Additionally, it could help to better disentangle newspaper-specific style.
Alignment
We compute the tf-idf vectors of all the articles of both newspapers and create subsets of relevant news by filtering on date, i.e. considering only news published in approximately the same, short temporal range for the two sources. On the tf-idf vectors we then compute cosine similarities for all news in the resulting subset, rank them, and retain only the alignments above a certain threshold. The threshold is chosen as a trade-off between the number of documents and the quality of alignment. We choose two different thresholds: a stricter one, used to select the best alignments, and a looser one (< 0.5).
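The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tokenisation, the exact tf-idf weighting scheme, and the helper names (`tfidf_vectors`, `cosine`, `aligned_pairs`) are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple tf-idf vectors (sparse dicts) for tokenised documents.
    Weighting here is raw tf times log(N/df); the paper's exact scheme
    is not specified, so this is an illustrative choice."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def aligned_pairs(vecs_a, vecs_b, threshold):
    """Rank all cross-source pairs by similarity and keep those above
    the threshold, highest first."""
    pairs = [(i, j, cosine(u, v))
             for i, u in enumerate(vecs_a)
             for j, v in enumerate(vecs_b)]
    return sorted((p for p in pairs if p[2] >= threshold),
                  key=lambda p: -p[2])
```

In practice one would restrict `vecs_a`/`vecs_b` to articles published in the same short date range, as described above, before computing the similarities.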
Data splits
We split the dataset into strongly aligned news, selected using the stricter threshold (20K aligned pairs), and weakly aligned and non-aligned news (100K article-headline pairs equally distributed between the two newspapers).
The aligned data is further split as shown in Figure 1a. SA is left aside and used as the test set for the final style transfer task. The remaining three sets are used for training the evaluation classifiers and the system for the target task, as shown in Figure 1b. Note that all sets always also contain the headlines’ respective full articles, though these are not necessarily used.
Figure 1: Data splits and their use in the different training/test sets. b) Training/test sets:

| evaluation | main (train & test) | R+A3+A1 |
| | HH | A1 |
| | AH | R+A3+A1 |
| task | train | R+A3 |
| | test | A2 |
Format
The data is distributed in the form of one CSV file with the following fields:
id, headline, article, label [R,G]
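This format can be parsed with the standard library alone. In the sketch below, the field names come from the task description, while the two sample rows and the helper name `load_pairs` are invented for illustration.

```python
import csv
import io

# A hypothetical two-row miniature of the distributed file; the field
# names follow the task description, the row contents are invented.
sample = """id,headline,article,label
1,Prima headline,Testo dell'articolo...,R
2,Seconda headline,Altro testo...,G
"""

def load_pairs(fileobj):
    """Read the CSV into a list of dicts, one per article-headline pair,
    keeping the newspaper label (R or G)."""
    return list(csv.DictReader(fileobj))

rows = load_pairs(io.StringIO(sample))
```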
3. Evaluation
Human evaluation is generally viewed as the most desirable method to assess generated text (Novikova, Dušek, and Rieser 2018; Lee et al. 2019). However, human evaluation is not always a viable option, due to resource constraints, but also because humans might not be capable of reliably assessing the task at hand. Related to the current challenge, De Mattei et al. (2020a) have shown that people find it difficult to identify subtle stylistic differences between texts.
Reliable automatic metrics should therefore also be sought (Novikova et al. 2017). For our task, we propose a fully automatic strategy based on a series of classifiers to assess style strength and content preservation. For style, we train a single classifier (main). For content, we train two classifiers that perform two ‘sanity checks’: one ensures that the two headlines (original and transformed) are still compatible (HH classifier); the other ensures that the generated headline is still compatible with the original article (AH classifier). See also Figure 1.
In what follows we describe these classifiers in more detail. When discussing baseline results, we will show how the contribution of each classifier is crucial for a comprehensive evaluation.
Main classifier
The main classifier uses a pre-trained BERT (Devlin et al. 2019) encoder with a linear classifier on top, fine-tuned with a batch size of 256 and sequences truncated at 32 tokens, for 6 epochs with learning rate 1e-05. Given a headline, this classifier distinguishes the two sources with an f-score of approximately 80% (see Table 2). Since style transfer is deemed successful if the original style is lost in favour of the target style, we use this classifier to assess how often a style transfer system manages to reverse the main classifier's decision.
HH classifier
This classifier checks compatibility between the original and the generated headline. We use the same architecture as for the main classifier with a slightly different configuration: max. sequence length of 64 tokens, batch size of 128, for 2 epochs (early-stopped), with learning rate 1e-05. Trained on strongly aligned pairs as positive instances (A1), with a corresponding amount of random pairs as negative instances, it learns whether two headlines describe the same content or not. Performance on gold data is .96 (Table 2).
AH classifier
This classifier performs another content-related check: it takes a headline and its corresponding article, and tells whether the headline is appropriate for the article. The classifier is trained on article-headline pairs from both the strongly aligned and the weakly and non-aligned instances (R+A3+A1, Figure 1b). At test time, the generated headline is checked for compatibility against the source article. We use the same base model as for the main and HH classifiers, with a batch size of 8, the same learning rate, and 6 epochs. Performance on gold data is >.97 (Table 2).
Table 2: Performance of the evaluation classifiers on gold data

| | | prec | rec | f-score |
| main | rep | 0.77 | 0.83 | 0.80 |
| | gio | 0.84 | 0.78 | |
| HH | match | 0.98 | 0.95 | 0.96 |
| | no match | 0.95 | 0.98 | |
| AH | match | 0.96 | 0.99 | 0.98 |
| | no match | 0.99 | 0.96 | |
Overall compliancy
We calculate a compliancy score, which measures the proportion of times the following three outcomes are all successful: (i) the HH classifier predicts ‘match’; (ii) the AH classifier predicts ‘match’; (iii) the main classifier's decision is reversed. As an upper bound, the compliancy score on gold headlines is 74.3% for transfer from la Repubblica to Il Giornale (rep2gio), and 78.1% for the opposite direction (gio2rep).
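One way to read this definition is as a per-headline conjunction of the three checks. The sketch below makes that reading explicit; the function name and the assumption that all three outcomes must hold jointly for each generated headline are ours, not spelled out in the paper.

```python
def compliancy(hh_preds, ah_preds, main_reversed):
    """Proportion of generated headlines for which all three checks
    succeed: HH says 'match', AH says 'match', and the main classifier's
    source decision is reversed.
    Assumption: the three outcomes are required jointly, per headline."""
    assert len(hh_preds) == len(ah_preds) == len(main_reversed)
    ok = sum(
        hh == "match" and ah == "match" and rev
        for hh, ah, rev in zip(hh_preds, ah_preds, main_reversed)
    )
    return ok / len(hh_preds)
```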
Table 3: Examples of headlines generated by the baseline system (original | generated)

Il Giornale → La Repubblica

| E in Sicilia è scattata l’allerta rossa [en: And in Sicily it’s now red alert] | Migranti, la Protezione civile continua dimenticata [en: Migrants, the Civil Protection Department goes on forgotten] |
| Nozze gay, toghe contro i sindaci: "Le trascrizioni sono illegittime" [en: Gay marriages, gowns against mayors: “Transcriptions are not valid”] | Il Consiglio di Stato boccia le nozze gay all’estero [en: The State Council rejects gay marriages abroad] |

La Repubblica → Il Giornale

| Castelnuovo, lo sdegno di cittadini e associazioni: "Attacco all’integrazione che funziona" [en: Castelnuovo, the indignation of citizens and associations: “Attack on the integration that works”] | I migranti non sono più rifugiati [en: Migrants are not refugees anymore] |
| Da Renzi a Di Maio, ecco il reddito dichiarato dai politici italiani. Fedeli il ministro con l’imponibile più alto [en: From Renzi to Di Maio, here is the income declared by Italian politicians. Fedeli is the minister with the highest taxable income] | Grillo e Giggino italiani conquistano l’elenco dei redditi italiani [en: Grillo and Giggino Italians conquer the list of Italian incomes] |
4. Baseline System
We developed a baseline system using a summarisation approach, where headlines are viewed as an extreme case of summarisation and are generated from the article. We exploit article-headline generators trained on opposite sources to perform the transfer, as done by De Mattei et al. (2020b). The advantage of this approach is that in principle it does not require parallel data for training.
Specifically, we use two pointer-generator networks (See, Liu, and Manning 2017), which include a pointing mechanism that can copy words from the source as well as pick them from a fixed vocabulary, thereby allowing better handling of out-of-vocabulary words.
One model is trained on the la Repubblica portion of the training set, the other on the Il Giornale portion. In a style transfer setting we use these models as follows: given a headline from Il Giornale, for example, the model trained on la Repubblica is run over the corresponding Il Giornale article to generate a headline in the style of la Repubblica, and vice versa.
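The cross-model routing just described reduces to a small dispatch step. In this sketch, `gen_rep` and `gen_gio` are hypothetical callables standing in for the two trained pointer-generator models; only the routing logic reflects the procedure in the text.

```python
def transfer_headline(article, source_label, gen_rep, gen_gio):
    """Generate a headline in the opposite newspaper's style by running
    the generator trained on the *other* source over the article.
    gen_rep / gen_gio are placeholders for the two trained models."""
    if source_label == "G":   # Il Giornale article -> Repubblica-style headline
        return gen_rep(article)
    if source_label == "R":   # la Repubblica article -> Giornale-style headline
        return gen_gio(article)
    raise ValueError(f"unknown source label: {source_label!r}")
```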
The results of the baseline system, measured as the performance of each classifier as well as the overall compliancy score, are reported in Table 4.
Table 4: Baseline performance on test data
| | HH | AH | Main | compl. |
| rep2gio | .649 | .876 | .799 | .449 |
| gio2rep | .639 | .871 | .435 | .240 |
| avg | .644 | .874 | .616 | .345 |
References
Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. “EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian.” In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), edited by Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. Online: CEUR.org.
Michele Cafagna, Lorenzo De Mattei, and Malvina Nissim. 2019. “Embeddings Shifts as Proxies for Different Word Use in Italian Newspapers.” In Proceedings of the Sixth Italian Conference on Computational Linguistics (Clic-It 2019), Bari, Italy.
Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020a. “Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment.” In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France: European Language Resources Association (ELRA).
Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta, and Malvina Nissim. 2020b. “Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment.” In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020). Marseille, France: European Language Resources Association (ELRA).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of NAACL, 4171–86. https://doi.org/10.18653/v1/N19-1423.
Albert Gatt and Emiel Krahmer. 2018. “Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation.” Journal of Artificial Intelligence Research 61: 65–170.
Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. “Best Practices for the Human Evaluation of Automatically Generated Text.” In Proceedings of the 12th International Conference on Natural Language Generation, 355–68. Tokyo, Japan: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-8643.
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. “Why We Need New Evaluation Metrics for NLG.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2241–52. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1238.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. “RankME: Reliable Human Ratings for Natural Language Generation.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 72–78. New Orleans, Louisiana: Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2012.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. “Get to the Point: Summarization with Pointer-Generator Networks.” In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1073–83.
Authors
University of Pisa, CLCG, University of Groningen, ItaliaNLP Lab, ILC-CNR, Pisa, Italy – lorenzo.demattei@di.unipi.it
Aptus.AI, Pisa, Italy, University of Malta, Malta – michele@aptus.ai
ItaliaNLP Lab, ILC-CNR, Pisa, Italy – felice.dellorletta@ilc.cnr.it
CLCG, University of Groningen, The Netherlands – m.nissim@rug.nl
University of Malta, Malta – albert.gatt@um.edu.mt