Returning the L in NLP: Why Language (Variety) Matters and How to Embrace it in Our Models
p. 3-4
Full text
NLP’s success today is driven by advances in modeling together with huge amounts of unlabeled data to train language models. However, for many application scenarios, such as low-resource languages, non-standard data, and dialects, we do not have access to labeled resources, and even unlabeled data might be scarce. Moreover, evaluation today largely focuses on standard splits, yet language varies along many dimensions (Plank 2016). What is more, for almost every NLP task, the existence of a single perceived gold answer is at best an idealization.
In this talk, I will emphasize the importance of language variation in inputs and outputs and its impact on NLP, and I will outline ways to address it. This includes recent work on how to transfer models to low-resource languages and language variants (Plank, Jensen, and Goot 2020; Goot et al. 2021), the use of incidental (or fortuitous) learning signals such as genre for dependency parsing (Müller-Eberstein, Goot, and Plank 2021), and learning beyond a single ground truth (Fornaciari et al. 2021; Plank 2016; Plank, Hovy, and Søgaard 2014).
Biography. Barbara Plank is Professor in the Computer Science Department at ITU (IT University of Copenhagen). She is also the Head of the Master in Data Science Program. She received her PhD in Computational Linguistics from the University of Groningen. Her research interests focus on Natural Language Processing, in particular transfer learning and adaptation, learning from beyond the text, and, more generally, learning under limited supervision and from fortuitous data sources. She has (co-)organised several workshops and international conferences, among them the PEOPLES workshop (since 2016) and the first European NLP Summit (EurNLP 2019). Barbara was general chair of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa 2019) and workshop chair for ACL in 2019. Barbara is a member of the advisory board of the European Association for Computational Linguistics (EACL) and vice-president of the Northern European Association for Language Technology (NEALT).
Bibliography
Tommaso Fornaciari, Alexandra Uma, Silviu Paun, Barbara Plank, Dirk Hovy, and Massimo Poesio. 2021. “Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2591–7.
Rob van der Goot, Ibrahim Sharaf, Aizhan Imankulova, Ahmet Üstün, Marija Stepanović, Alan Ramponi, Siti Oryza Khairunnisa, Mamoru Komachi, and Barbara Plank. 2021. “From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2479–97.
Max Müller-Eberstein, Rob van der Goot, and Barbara Plank. 2021. “Genre as Weak Supervision for Cross-lingual Dependency Parsing.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 4786–4802.
Barbara Plank. 2016. “What to do about non-standard (or non-canonical) language in NLP.” In Proceedings of Konvens 2016, Ruhr-University Bochum. Bochumer Linguistische Arbeitsberichte.
Barbara Plank, Dirk Hovy, and Anders Søgaard. 2014. “Learning part-of-speech taggers with inter-annotator agreement loss.” In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 742–51.
Barbara Plank, Kristian Nørgaard Jensen, and Rob van der Goot. 2020. “DaN+: Danish nested named entities and lexical normalization.” In Proceedings of the 28th International Conference on Computational Linguistics, 6649–62.
Author
Computer Science Department, IT University of Copenhagen
The text alone may be used under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0). Other elements (illustrations, imported attachments) are "All rights reserved" unless otherwise stated.