Hate Speech Detection with Machine-Translated Data: The Role of Annotation Scheme, Class Imbalance and Undersampling
Abstract
While using machine-translated data for supervised training can alleviate data sparseness problems when dealing with less-resourced languages, it is important that the source data are not only correctly translated, but also follow the same annotation scheme and possibly class balance as the smaller dataset in the target language. We therefore present an evaluation of hate speech detection in Italian using machine-translated data from English and comparing three settings, in order to understand the impact of training size, class distribution and annotation scheme.
1. Introduction
The task of detecting hate speech on social media has been attracting increasing attention due to the negative effects this phenomenon can have on online communities and society as a whole. The development of systems which can effectively detect hate speech has therefore become increasingly important for academics and tech companies alike.
One of the difficulties of producing accurate hate speech detection systems is the need for large, high-quality datasets, whose creation is time- and resource-consuming. English has the largest number of hate speech detection datasets, as well as the largest ones, with up to 150k posts in a single dataset (Gomez et al. 2020). Other languages such as Italian, on the other hand, have fewer datasets, which also tend to be smaller (Vidgen and Derczynski 2020). Given that machine learning methods are typically used for this task, the use of small datasets can lead to overfitting problems due to the lack of linguistic variation (Vidgen and Derczynski 2020). One possible solution to alleviate data sparseness is to train classifiers on data machine-translated from English into the less-resourced language, exploiting the large amount of data available for English. This approach has already been used in the context of hate speech detection (Sohn and Lee 2019; Casula et al. 2020), but results have not been consistent across languages.
An additional issue is that the NLP community has no shared, fixed definition of what type of language constitutes hate speech. Indeed, there are typically large differences among hate speech and abusive language datasets in terms of annotation frameworks and how they are applied in practice (Caselli et al. 2020). In addition, datasets can vary greatly in size and class balance. Possible issues affecting the behaviour of classifiers trained on machine-translated data, such as differing class distributions or annotation schemes between the source and target language, have not yet been analysed.
In order to fill this gap, we explore the impact of these differences between datasets when performing hate speech detection in Italian using machine-translated data from English. Our goal is to address the following three questions:
What performance can we expect by using only machine-translated data, given that translation quality for social media language may be problematic?
Is it better to use a larger translated set for training, even by merging slightly different classes, or a smaller, more precise one?
What is the impact of class imbalance, and to what extent can undersampling be effective?
The above questions are addressed by comparing three experimental settings that are described in Section 4 and evaluated in Section 5.
2. Related Work
In recent years, the number of research works focused on the detection of hate speech on social media has increased remarkably, mostly due to the growing awareness regarding the societal impact these platforms can have.
Computational methods for detecting the presence of hate speech on the web have become necessary due to the extremely large amounts of user-generated content being posted each day. These methods typically rely on supervised learning, in the form of both traditional machine learning (e.g. support vector classifiers) and deep learning approaches (Schmidt and Wiegand 2017). Given the increased attention towards this topic, more and more shared tasks regarding hate speech and abusive language detection have emerged, such as the HaSpeeDe task at Evalita 2018 (Bosco et al. 2018), OffensEval (Zampieri et al. 2019) and HatEval (Basile et al. 2019) at SemEval 2019, and the multilingual OffensEval at SemEval 2020 (Zampieri et al. 2020).
Systems based on Transformer architectures such as BERT (Devlin et al. 2019) have proven effective for hate speech detection and classification in both English (Zampieri et al. 2019) and Italian (Polignano, Basile, Gemmis, and Semeraro 2019). These systems are generally pre-trained on large unlabeled corpora through two self-supervised tasks (next sentence prediction and masked language modeling) to create language models which can then be fine-tuned for a variety of downstream tasks using labeled data.
AlBERTo (Polignano, Basile, Gemmis, Semeraro, et al. 2019) is a BERT-based system which was pre-trained on Italian Twitter data, and it currently defines the state of the art for hate speech detection in Italian (Polignano, Basile, Gemmis, and Semeraro 2019).
Recently, more attention has been directed towards the quality of hate and abuse detection systems. Vidgen et al. (2019) investigate the flaws presented by most abusive language detection datasets in circulation: they can contain systematic biases towards certain types and targets of abuse, they are subject to degradation over time, they typically present very low inter-annotator agreement, and they can vary greatly with respect to quality, size, and class balance. Caselli et al. (2020) further analyse the role of datasets in the detection of abuse, addressing issues such as the use of different task descriptions and annotation schemes across corpora, as well as similar annotation schemes being applied in different ways.
3. Data
Since tweets containing hate speech or abusive language constitute a very small subset (between 0.1% and 3%, depending on the label used) of all tweets being posted (Founta et al. 2018), random samples are generally not used for annotation, because the resulting datasets would contain an extremely low number of positive class examples, which would make classification difficult. The typical solution is to preselect posts that are likely to contain hateful language by searching for specific hate-related keywords. While this method is effective for gathering more instances of hate speech, it can make datasets biased, a well-known issue in hate speech datasets (Wiegand, Ruppenhofer, and Kleinbauer 2019).
The dataset we chose for training our system is described in Founta et al. (2018). This dataset was not created starting from a set of predefined offensive terms or hashtags, in order to reduce bias, which was an important factor in our choice. The method used by Founta et al. (2018) to increase the percentage of hateful/abusive tweets is boosted random sampling, in which a portion of the dataset is “boosted” with tweets that are more likely to belong to the minority classes. The boosted set of tweets is created using text analysis and machine learning (Founta et al. 2018).
The dataset was annotated through crowdsourcing using the labels hateful, abusive, spam, and normal. The definition of hate speech given by Founta et al. (2018) to the annotators, based on existing literature on the topic, is:
Hate Speech: Language used to express hatred towards a targeted individual or group, or is intended to be derogatory, to humiliate, or to insult the members of the group, on the basis of attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender.
The abusive label, on the other hand, is the result of three separate labels (abusive, offensive, and aggressive) being combined. In preliminary annotation rounds, Founta et al. (2018) found that these three labels were significantly correlated, so they grouped them together. The definition of abusive language given to the annotators is:
Abusive Language: Any strongly impolite, rude or hurtful language using profanity, that can show a debasement of someone or something, or show intense emotion.
While the Founta et al. (2018) dataset originally comprised 80k tweets, Twitter datasets are often subject to degradation, as tweets are removed over time and are no longer accessible through their IDs (Vidgen et al. 2019). After retrieving all available tweets and removing those annotated as spam, the total number of tweets we use for training is 12,379, of which 727 are annotated as hateful and 1,792 as abusive. Before translating the data into Italian, we pre-process it using the Ekphrasis tool to tokenise the text and normalise user mentions and URLs (replaced by <user> and <url> respectively), as well as numbers, which are substituted with a number tag. We then use the Google Translate API to translate the data into Italian, in order to use it as training data for our classifier.
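As an illustration, a minimal sketch of such a pre-processing and translation pipeline is shown below, using Ekphrasis and the Google Cloud Translation client; the exact normalisation options, the tokenizer settings and the helper function are illustrative assumptions rather than settings specified above.

```python
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from google.cloud import translate_v2 as translate

# Normalise user mentions, URLs and numbers, replacing them with
# placeholder tags (<user>, <url>, <number>) before translation.
preprocessor = TextPreProcessor(
    normalize=["url", "user", "number"],
    tokenizer=SocialTokenizer(lowercase=False).tokenize,
)

client = translate.Client()  # requires Google Cloud credentials


def translate_tweet(text, target="it"):
    """Pre-process an English tweet and translate it into the target language."""
    tokens = preprocessor.pre_process_doc(text)
    normalised = " ".join(tokens)
    result = client.translate(normalised, target_language=target)
    return result["translatedText"]


print(translate_tweet("@someone check this out https://t.co/xyz 100 times"))
```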
For testing, we use the test portion of the Twitter dataset used in the Hate Speech Detection (HaSpeeDe) task at Evalita 2018 (Bosco et al. 2018), consisting of 1,000 Italian tweets manually annotated for hate speech against immigrants. This dataset is a simplified version of the dataset described in (Sanguinetti et al. 2018), in which more fine-grained labels are used.
4. Experimental Setup
We experiment with fine-tuning AlBERTo (Polignano, Basile, Gemmis, Semeraro, et al. 2019), a BERT-based language model pre-trained on Italian Twitter data, using data that was automatically translated from English. This model has achieved state-of-the-art results when fine-tuned on the training data from the HaSpeeDe task at Evalita 2018 (Polignano, Basile, Gemmis, and Semeraro 2019).
Our goal is to explore the impact of different annotation schemes and class balance when using machine-translated data for hate speech detection. Indeed, merging fine-grained classes into coarser ones has been a common and accepted practice when creating larger training sets from smaller ones (e.g. Founta et al. (2019)). This step has also been performed to compare classification in different languages (Corazza et al. 2020).
In order to investigate this, we compare three different experimental settings. In the first one, we fine-tune AlBERTo on the translated tweets in Founta et al. (2018) after merging the hateful and abusive classes together, mapping them to a single hateful class as required by the binary classification task at Evalita 2018. In a second setting, AlBERTo is fine-tuned on the hateful class alone, discarding all tweets annotated as abusive in Founta et al. (2018). We hypothesize this setting may perform better when tested on the HaSpeeDe data, given the higher similarity in annotation framework.
Simply removing tweets annotated as abusive, however, can throw off the balance between classes. More specifically, when training the system on both abusive and hateful tweets the hateful+abusive class constitutes about 20% of our data, while when we only use tweets annotated as hateful this percentage drops to 7%, potentially affecting classification results. In particular, the data we use for testing has a different class balance, with 30% of tweets marked as hateful. In order to assess the impact of class imbalance on our results, we further evaluate each setting using undersampling (Kubat 2000; Sun, Wong, and Kamel 2009), a technique typically used for imbalanced classification, in which we reduce the number of tweets belonging to the majority class, so that the overall percentage of tweets containing hate increases.
Given that undersampling our data reduces the total number of tweets available for training, the resulting datasets for each annotation scheme differ considerably in size. We therefore consider a third setting, in which we use further random undersampling (Kubat 2000; Sun, Wong, and Kamel 2009) to match the larger dataset (hateful+abusive) with the smaller one (hateful only), so that the two annotations can be effectively compared in a setting with equal class balance and sample size.
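A minimal sketch of how such random undersampling can be implemented is given below; the function name and the fixed random seed are illustrative choices, not details specified above.

```python
import random


def undersample_majority(minority, majority, target_minority_ratio, seed=42):
    """Randomly drop majority-class examples until the minority class makes up
    `target_minority_ratio` of the resulting dataset."""
    # |minority| / (|minority| + n_majority) = ratio  =>  solve for n_majority.
    n_majority = int(len(minority) * (1 - target_minority_ratio) / target_minority_ratio)
    n_majority = min(n_majority, len(majority))
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, n_majority)
    return minority + kept_majority


# e.g. hateful(+abusive) tweets as minority, non-hateful tweets as majority:
# train_30 = undersample_majority(hate_tweets, normal_tweets, 0.30)
```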
In summary, the three data settings we train our system on are:
Hateful and abusive tweets, using undersampling to progressively lower class imbalance;
Hateful only tweets, again using undersampling to progressively lower class imbalance;
Hateful and abusive tweets, both using undersampling to progressively lower class imbalance as in the previous settings, and using further random undersampling to match the low sample sizes of setting 2.
Our AlBERTo fine-tuning architecture consists of a pooling layer for extracting the AlBERTo hidden representation of each sequence, followed by a dropout layer (dropout rate 0.2), two dense layers of size 768 and 128 and, finally, a softmax layer. We use L2 regularization (λ=0.01), the Adam optimizer (learning rate 2e-5), and categorical cross-entropy loss. We train the system for 5 epochs with batch size 32.
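The following Keras sketch illustrates this classification head under stated assumptions: the checkpoint path standing in for the AlBERTo weights, the maximum sequence length, and the ReLU activations of the dense layers are not specified above and are chosen for illustration only.

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128  # assumed sequence length, not specified above
bert = TFBertModel.from_pretrained("path/to/alberto")  # placeholder for the AlBERTo checkpoint

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Pooled representation of each input sequence from the encoder
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output
x = tf.keras.layers.Dropout(0.2)(pooled)

l2 = tf.keras.regularizers.l2(0.01)
x = tf.keras.layers.Dense(768, activation="relu", kernel_regularizer=l2)(x)
x = tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=l2)(x)
outputs = tf.keras.layers.Dense(2, activation="softmax", kernel_regularizer=l2)(x)

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit([train_ids, train_mask], one_hot_labels, epochs=5, batch_size=32)
```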
5. Results and Discussion
We measure the classification results using both macro-F1 score and minority class F1 score. We repeat each run five times in order to compensate for random initialization, and we report the average scores of these runs.
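These scores could be computed, for example, with scikit-learn as sketched below; the averaging over five runs with different initializations is shown schematically, and positive labels are assumed to be encoded as 1.

```python
import numpy as np
from sklearn.metrics import f1_score


def evaluate(y_true, y_pred):
    """Macro-F1 over both classes and F1 of the minority (hate) class."""
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    hate_f1 = f1_score(y_true, y_pred, pos_label=1, average="binary")
    return macro_f1, hate_f1


# Average over five runs with different random initialisations:
# scores = [evaluate(y_true, run_predictions[i]) for i in range(5)]
# macro_f1, hate_f1 = np.mean(scores, axis=0)
```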
5.1 Setting 1: Hateful + Abusive Tweets
The classification results obtained when fine-tuning AlBERTo on both abusive and hateful tweets combined can be observed in Table 1.
Table 1: Scores obtained when fine-tuning AlBERTo on both hateful and abusive tweets
| % hate | Size (tweets) | Macro-F1 | Hate class F1 |
|--------|---------------|----------|---------------|
| 20%    | 12,379        | 0.40     | 0             |
| 30%    | 8,397         | 0.64     | 0.52          |
| 40%    | 6,298         | 0.63     | 0.57          |
The class balance of the dataset prior to undersampling is 20% hateful + abusive tweets and 80% non-hateful, which amounts to 12,379 tweets total. With this class balance, the system performs the worst, classifying every tweet as belonging to the majority non-hateful class. On the other hand, with a higher percentage of minority class instances, the classification results improve, in spite of the considerably smaller amount of training data available. These results suggest that consistency in class balance can play a bigger role than training data size in classification results in this context.
5.2 Setting 2: Hateful Only Tweets
The performance of the system when fine-tuned on tweets labeled as hateful only is reported in Table 2. As previously mentioned, only 7% of tweets in the dataset we use are labeled as hateful. The classes are therefore extremely imbalanced before undersampling. Predictably, with the classes being this imbalanced, the system identifies all test instances as belonging to the majority class. This again happens with the minority class comprising 20% of the training data.
Table 2: Scores obtained when fine-tuning AlBERTo on tweets labeled as hateful only
| % hate | Size (tweets) | Macro-F1 | Hate class F1 |
|--------|---------------|----------|---------------|
| 7%     | 10,587        | 0.40     | 0             |
| 20%    | 3,635         | 0.40     | 0             |
| 30%    | 2,423         | 0.65     | 0.54          |
| 40%    | 1,818         | 0.52     | 0.56          |
Similarly to Setting 1, the best classification performance in this case is achieved with 30% of minority class tweets. Interestingly, the best performance is comparable to the one obtained in Setting 1, even though the number of training samples available here is much lower, suggesting that more task-specific training instances can compensate for a smaller training set. A difference can be noted when the minority class reaches 40% of the total data: performance drops in terms of macro-F1 score, likely due to the very small number of samples available for training and the consequent lack of linguistic variation. The hate class F1 score, however, remains stable.
State-of-the-art results obtained by fine-tuning AlBERTo on the same Evalita dataset, as reported by Polignano, Basile, Gemmis, and Semeraro (2019), reach 0.80 macro-F1 and 0.73 F1 on the hate class, which we can consider an upper bound for our task, obtained in a fully-supervised monolingual setting. On the other hand, the most-frequent-label baseline is 0.40 macro-F1, which is clearly outperformed using only machine-translated data.
5.3 Setting 3: Hateful + Abusive Tweets (Random Undersampling)
Since there are large differences in size between the hateful+abusive annotation and the hateful-only annotation, we randomly undersample the hateful+abusive training data so that it matches the size of the hateful-only training data, in order to allow us to effectively compare the impact of each annotation framework on our results. The classification performance is reported in Table 3.
Table 3: Scores obtained when fine-tuning AlBERTo on tweets labeled as hateful and abusive, after random undersampling

| % hate | Size (tweets) | Macro-F1 | Hate class F1 |
|--------|---------------|----------|---------------|
| 30%    | 2,423         | 0.58     | 0.38          |
| 40%    | 1,818         | 0.59     | 0.51          |
If we compare the results of Setting 3 with those of Setting 2, it is clear that using more task-specific data, in this case hateful-only tweets, leads to better performance when the amount of training data is the same. This suggests that consistency in annotation between training and test data can have a positive impact on classification, although it is not indispensable for hate speech detection with machine-translated data: other aspects, such as class balance, also play an important role.
5.4 Qualitative Analysis
Another aspect affecting classification, which we have not considered so far, is the quality of machine translation, a particularly challenging task on social media data (Michel and Neubig 2018). In order to assess the impact of translation quality on our results, two annotators with a linguistics background manually analysed 500 samples from the training data, consisting of 300 tweets annotated as normal, 100 as hateful, and 100 as abusive. Each annotator manually checked 250 random tweets from this sample. Translation quality was evaluated using the semantic adequacy annotation scheme proposed in Dorr et al. (2011). Translations are judged on a scale from -3 to 3, with scores below 0 indicating inadequate translations and scores above 0 adequate ones. The averaged annotations for each class are reported in Table 4.
Table 4: Average translation quality scores
|         | Normal | Hateful | Abusive | Overall |
|---------|--------|---------|---------|---------|
| Average | 0.438  | 0.527   | -0.043  | 0.368   |
Overall, translations tend towards adequacy, but the average scores are below 1 for all classes. Interestingly, tweets annotated as abusive show poorer translation quality than other classes. This could help explain the small differences in classification performance between our experiments.
A major role is played in this context by profanities, which are often used to offend a target but can also appear in non derogatory messages exchanged among members of the same community (Pamungkas, Basile, and Patti 2020). In the case of abusive tweets, we observe that the offenses are less direct and therefore slurs tend to be translated poorly. See for example the following sentence, which is labeled as abusive in the dataset:
(1) use that ugly ass design [...]
utilizzare quel disegno asino brutto [...]
use that design donkey ugly [...]
Here, “ass” is translated with “asino” (“donkey”), effectively removing the profanity from the translated tweet and completely changing the meaning of the message.
On the other hand, when profanities are used in a more direct way, or when they are expressed through unambiguous words such as “idiot” and “stupid”, they tend to be translated correctly, contributing to a correct classification. Example 2 shows a hateful tweet which was translated almost correctly, retaining its offensiveness in the target language.
(2) what happens when you put idiots in charge
cosa succede quando si mette idioti in carica
6. Conclusions
In this paper we analysed the impact of machine-translated data on Italian hate speech detection in a zero-shot setting. Our experiments show that when using machine-translated data for training it is possible to learn a classification model that clearly outperforms the most-frequent baseline, even if translation quality is affected by the jargon used in social media data. We found that using more task-specific data can have a positive impact on classification performance even with lower sample sizes compared to larger, less targeted datasets.
Consistency in class distribution between training and test data can have a bigger impact than the size of the training set or the annotation scheme. Indeed, using only the original training set translated into Italian, without undersampling, would yield poor classification performance.
In the future, we plan to extend this kind of evaluation to new language pairs and new datasets, to check whether the findings obtained for the English–Italian pair also hold for other languages.
Bibliography
Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. “SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter.” In Proceedings of the 13th International Workshop on Semantic Evaluation, 54–63. Minneapolis, Minnesota, USA: Association for Computational Linguistics.
Cristina Bosco, Felice Dell’Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. “Overview of the Evalita 2018 Hate Speech Detection Task.” In EVALITA 2018 – Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, 2263:1–9. Turin, Italy: CEUR.
Tommaso Caselli, Valerio Basile, Jelena Mitrovic, Inga Kartoziya, and Michael Granitzer. 2020. “I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language.” In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, edited by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, et al., 6193–6202. European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.760/.
Camilla Casula, Alessio Palmero Aprosio, Stefano Menini, and Sara Tonelli. 2020. “FBK-DH at SemEval-2020 Task 12: Using Multi-Channel BERT for Multilingual Offensive Language Detection.” In Proceedings of OffensEval.
Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. 2020. “A Multilingual Evaluation for Online Hate Speech Detection.” ACM Trans. Internet Techn. 20 (2): 10:1–10:22. https://doi.org/10.1145/3377323.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics.
Bonnie J. Dorr, Joseph Olive, John McCary, and Caitlin Christianson. 2011. Machine Translation Evaluation and Optimization, 745–843. Springer New York.
Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. “Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior.” In Proceedings of the 12th International AAAI Conference on Web and Social Media.
Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimosthenis Karatzas. 2020. “Exploring Hate Speech Detection in Multimodal Publications.” In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 1459–67. https://doi.org/10.1109/WACV45572.2020.9093414.
M. Kubat. 2000. “Addressing the Curse of Imbalanced Training Sets: One-Sided Selection.” Fourteenth International Conference on Machine Learning, June.
Paul Michel, and Graham Neubig. 2018. “MTNT: A Testbed for Machine Translation of Noisy Text.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 543–53. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1050.
Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. “Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media.” In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, edited by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, et al., 6237–46. European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.765/.
Marco Polignano, Pierpaolo Basile, Marco de Gemmis, and Giovanni Semeraro. 2019. “Hate Speech Detection Through AlBERTo Italian Language Understanding Model.” In Proceedings of NL4AI@AI*IA.
Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. “AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets.” In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019). Vol. 2481. CEUR.
Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. “An Italian Twitter Corpus of Hate Speech Against Immigrants.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1443.
Anna Schmidt, and Michael Wiegand. 2017. “A Survey on Hate Speech Detection Using Natural Language Processing.” In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 1–10. Valencia, Spain: Association for Computational Linguistics.
Hajung Sohn, and Hyunju Lee. 2019. “MC-BERT4HATE: Hate Speech Detection Using Multi-Channel BERT for Different Languages and Translations.” In 2019 International Conference on Data Mining Workshops (ICDMW), 551–59. IEEE.
Y. Sun, A. Wong, and M. Kamel. 2009. “Classification of Imbalanced Data: A Review.” Int. J. Pattern Recognit. Artif. Intell. 23: 687–719.
Bertie Vidgen, and Leon Derczynski. 2020. “Directions in Abusive Language Training Data: Garbage in, Garbage Out.” arXiv abs/2004.01670.
Bertie Vidgen, Alex Harris, Dong Nguyen, Rebekah Tromble, Scott Hale, and Helen Margetts. 2019. “Challenges and Frontiers in Abusive Content Detection.” In Proceedings of the Third Workshop on Abusive Language Online, 80–93. Florence, Italy: Association for Computational Linguistics.
Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. “Detection of Abusive Language: The Problem of Biased Datasets.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 602–8. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1060.
Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. “SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval).” In Proceedings of the 13th International Workshop on Semantic Evaluation, 75–86. Minneapolis, Minnesota, USA: Association for Computational Linguistics.
Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. “SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020).” In Proceedings of the 14th International Workshop on Semantic Evaluation. Association for Computational Linguistics.
Authors
Fondazione Bruno Kessler, Trento, Italy – ccasula@fbk.eu
Fondazione Bruno Kessler, Trento, Italy – satonelli@fbk.eu