Italian Transformers Under the Linguistic Lens
p. 310-316
Abstract
In this paper we present an in-depth investigation of the linguistic knowledge encoded by the transformer models currently available for the Italian language. In particular, we investigate whether and how using different architectures of probing models affects the performance of Italian transformers in encoding a wide spectrum of linguistic features. Moreover, we explore how this implicit knowledge varies according to different textual genres.
1. Introduction and Background
In the last few years, the study of Neural Language Models (NLMs) and their representations has become a key research area in the NLP community. Several methods have been devised to obtain meaningful explanations of the linguistic information encoded in NLMs (Belinkov and Glass 2019). The most common approach is based on the development of probes, i.e. supervised models trained to predict a variety of language properties from the contextual word/sentence embeddings of a pre-trained model (Conneau et al. 2018; Zhang and Bowman 2018; Miaschi and Dell’Orletta 2020). This approach demonstrated that NLM representations encode linguistic knowledge in a hierarchical manner (Belinkov et al. 2017; Blevins, Levy, and Zettlemoyer 2018; Tenney et al. 2019) and can even support the extraction of dependency parse trees (Hewitt and Manning 2019). Further studies investigated the representations learned by BERT (Devlin et al. 2019), one of the most prominent NLMs, across its layers, showing that lower layers are usually better at capturing surface features, while embeddings from higher layers are better for syntactic and semantic properties. Using a suite of probing tasks, Tenney et al. (2019) explored this behavior in depth, showing that the linguistic knowledge encoded by BERT through its 12/24 layers follows the traditional NLP pipeline.
While the vast majority of this research has focused on English contextual representations, relatively little work has been done to understand the inner workings of non-English models. The study by de Vries et al. (2020) represents an exception in this context: the authors apply the probing-task approach to compare the linguistic competence encoded by a Dutch BERT-based model and multilingual BERT (mBERT), showing that earlier layers of mBERT are consistently more informative than earlier layers of the monolingual model. The survey by Nozza et al. (2020) also provides a comparative study of mBERT and language-specific BERT models, but focuses on the performance that each model obtains after training on several specific downstream tasks.
In this paper, we adopt a task-agnostic perspective to carry out an in-depth investigation of the linguistic knowledge implicitly encoded by 6 Italian monolingual models and multilingual BERT. We define a broad set of probing tasks, each corresponding to a specific property of sentence structure. We then compare the average performance reached by each model in predicting the feature values, evaluating the results obtained by the models using their layer-wise sentence-level representations. A further comparative perspective, which to our knowledge is still rather under-investigated, concerns how the architecture of the probing model itself influences probing scores. To address this point, for each model we perform the same suite of probing tasks using both a linear SVR and a multilayer perceptron (MLP), and compare whether and how the resolution of each probing task is affected by the two architectures.
Since all experiments were carried out on different sections of the Italian Universal Dependency Treebank (Nivre et al. 2016), we were also able to investigate how the linguistic knowledge of NLMs varies according to different textual genres.
Contributions
To the best of our knowledge, this is the first study aimed at comparing the linguistic knowledge encoded in the representations of multiple non-English pre-trained transformer models. In particular: (i) we compare the probing performances of 6 Italian NLMs spanning three architectures over multiple linguistic feature categories; (ii) we investigate whether and how using different architectures of probing models affects the performance of transformers in encoding specific features; and (iii) we show how the implicit knowledge learned by these models differs across textual genres.
2. Approach
To inspect the knowledge of language encoded by Italian Transformers, we relied on a suite of 82 probing tasks, each of which corresponds to predicting the value of a feature modeling a specific property of the sentence. We designed two sets of experiments. The first consists of comparing the linguistic knowledge encoded by the Italian Transformers and evaluating the best probing model for inferring such knowledge from the NLMs. We compared the results obtained with two simple probing models, a linear SVR and a multilayer perceptron (MLP), which take as input layer-wise sentence-level representations extracted from the Italian models. These representations are produced for each sentence of different sections of the Italian Universal Dependency Treebank (IUDT), version 2.5 (Zeman et al. 2019), and used to predict the actual value of each probing feature. In the second set of experiments, we evaluated how the Italian models’ linguistic knowledge differs across textual genres and varieties, considering different IUDT sections.
2.1 Models and Data
Table 1
Name | Training data |
BERT Architecture | |
Multilingual-BERT | Wikipedia |
BERT-base-italian1 | Wikipedia + OPUS (13GB) (Tiedemann and Nygaard 2004) |
AlBERTo2 | TWITA (191GB) (Basile, Lai, and Sanguinetti 2018) |
RoBERTa Architecture | |
GilBERTo3 | OSCAR (71GB) (Suárez, Sagot, and Romary 2019) |
UmBERTo-Commoncrawl | OSCAR (69GB) |
UmBERTo-Wikipedia4 | Wikipedia (7GB) |
GPT-2 Architecture | |
GePpeTto5 | Wikipedia + ItWAC (14GB) (Baroni et al. 2009) |
We relied on 7 pre-trained Italian Transformer models. Model statistics are reported in Table 1. Sentence-level representations were computed by performing a mean-pooling operation over the word embeddings provided by the models.
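The pooling step described above can be sketched as follows. All array shapes here are illustrative (a hypothetical 12-layer model plus its embedding layer, random values standing in for real hidden states), not tied to any specific checkpoint:

```python
import numpy as np

def layerwise_sentence_reprs(hidden_states):
    """Collapse per-token hidden states of shape (n_layers, n_tokens, hidden_size)
    into one sentence-level vector per layer by averaging over the token axis."""
    return hidden_states.mean(axis=1)

# Toy stand-in for a transformer's per-layer outputs on a 7-token sentence:
# 13 "layers" (embedding layer + 12 transformer layers), hidden size 16.
rng = np.random.default_rng(0)
states = rng.normal(size=(13, 7, 16))

sent_reprs = layerwise_sentence_reprs(states)
assert sent_reprs.shape == (13, 16)  # one pooled vector per layer
```

Each row of `sent_reprs` is then fed separately to a probing model, which is how the layer-wise analyses in Section 3 are obtained.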
The NLMs’ linguistic competence is probed against five IUDT sections including texts representative of different textual varieties and genres. As shown in the overview in Table 2, we also split the whole ISDT into different sub–corpora according to the specific language variety they represent, e.g. transcriptions of spontaneous speech (ISDT_europarl), questions (ISDT_quest) or simplified language (ISDT_2parole).
Table 2
Short Name | Types of texts | # sent |
ParTUT (Sanguinetti and Bosco 2015) | Multi-genre | 2,090 |
VIT (Delmonte, Bristot, and Tonelli 2007) | Multi-genre | 10,087 |
ISDT (Bosco, Montemagni, and Simi 2013) | Multi-genre | 14,167 |
ISDT_tanl | Newswire | 4,043 |
ISDT_tut | Legal/Newswire/Wiki | 3,802 |
ISDT_quest | Interrogative sentences | 2,162 |
ISDT_2parole | Simplified Italian news | 1,421 |
ISDT_europarl | EU Parliament acts | 497 |
PoSTWITA (Sanguinetti et al. 2018) | Tweets | 6,713 |
TWITTIRÒ (Cignarella, Bosco, and Rosso 2019) | Ironic Tweets | 1,424 |
Total | 35,481 |
2.2 Probing features
Table 3
Linguistic Feature |
Raw Text Properties |
Sentence Length |
Word Length |
Vocabulary Richness |
Type/Token Ratio for words and lemmas |
Morphosyntactic information |
Distribution of UD and language–specific POS
Lexical density |
Inflectional morphology |
Inflectional morphology of lexical verbs and auxiliaries |
Verbal Predicate Structure |
Distribution of verbal heads and verbal roots |
Verb arity and distribution of verbs by arity |
Global and Local Parsed Tree Structures |
Depth of the whole syntactic tree |
Average length of dependency links and of the longest link |
Average length of prepositional chains and distribution by depth |
Clause length |
Relative order of elements |
Order of subject and object |
Syntactic Relations |
Distribution of dependency relations |
Use of Subordination |
Distribution of subordinate and principal clauses |
Average length of subordination chains and distribution by depth |
Relative order of subordinate clauses |
The set of probing tasks consists of predicting the value of a specific linguistic feature automatically extracted from each POS-tagged and dependency-parsed sentence of the IUDT datasets.
The set of features is based on those described in previous work. They are acquired from raw, morpho-syntactic and syntactic levels of annotation and can be categorised into 9 groups corresponding to different linguistic phenomena. As shown in Table 3, these features model phenomena ranging from raw text properties, to morpho–syntactic information and inflectional properties of verbs, to more complex aspects of sentence structure capturing global and local properties of the whole parsed tree and of specific subtrees, such as the order of subjects and objects with respect to the verb and the distribution of UD syntactic relations, also including features referring to the use of subordination and to the structure of verbal predicates.
All these features have been shown to play a highly predictive role when leveraged by traditional learning models on a variety of classification problems, covering different aspects of stylometric and complexity analysis. In addition, recent work showed that these features can be effectively used to profile the knowledge encoded in the language representations of a pre-trained NLM, specifically English BERT, and how it changes across layers. Since these features are based on the UD formalism, which guarantees the comparative encoding of language phenomena across languages (Nivre 2015), we focused on the same set to investigate the linguistic knowledge of Italian transformers.
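For illustration only, a few of the simplest raw-text and morpho-syntactic features in Table 3 can be computed directly from a toy CoNLL-U-style fragment (the actual feature set is extracted by a full linguistic profiling pipeline over the real treebanks; the sentence and the reduced four-column layout below are hypothetical):

```python
from collections import Counter

# Toy fragment with only ID, FORM, LEMMA and UPOS columns, for illustration.
conllu = (
    "1\tIl\til\tDET\n"
    "2\tgatto\tgatto\tNOUN\n"
    "3\tdorme\tdormire\tVERB\n"
    "4\t.\t.\tPUNCT"
)

rows = [line.split("\t") for line in conllu.split("\n")]
forms = [r[1] for r in rows]
upos = [r[3] for r in rows]

features = {
    "sentence_length": len(forms),                                   # raw text
    "avg_word_length": sum(len(f) for f in forms) / len(forms),      # raw text
    "ttr_forms": len({f.lower() for f in forms}) / len(forms),       # vocabulary
    "upos_dist_NOUN": Counter(upos)["NOUN"] / len(upos),             # morphosyntax
}
```

Each such value becomes the regression target of one probing task, with the sentence embedding as input.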
3. Results
Table 4
Average R² scores for all the NLMs obtained with the LinearSVR and the MLP probing models. Baseline scores are also reported.
Groups | LinearSVR | MLP | Baseline |
RawText | 0.84 | 0.80 | 0.50 |
Vocabulary | 0.70 | 0.34 | 0.19 |
POS | 0.69 | 0.68 | 0.03 |
VerbInflection | 0.50 | 0.61 | 0.03 |
VerbPredicate | 0.32 | 0.43 | 0.08 |
TreeStructure | 0.61 | 0.64 | 0.40 |
Order | 0.46 | 0.55 | 0.06 |
SyntacticDep | 0.65 | 0.74 | 0.04 |
Subord | 0.49 | 0.60 | 0.16 |
AllFeatures | 0.60 | 0.64 | 0.10 |
We first investigate which architecture is best for probing the linguistic knowledge encoded by the Italian Transformers. Since many of our probing features are strongly related to sentence length, we compared the two probing models’ results with those obtained by a baseline corresponding to a LinearSVR model trained using only sentence length as input feature. Table 4 reports average R² results6 for all the 7 NLMs obtained with the LinearSVR and the MLP probing models, along with baseline scores. The MLP probe is a three-layer feedforward network with ReLU activations; it was selected to investigate the presence of nonlinear relations in the representations, which would hamper the probing performance of the LinearSVR probe and be highlighted by a sharp difference between the MLP and LinearSVR performances. As a first remark, we notice that both probing models outperform the baseline. This shows that all NLMs encode a spectrum of phenomena that, although related to sentence length, require more sophisticated linguistic knowledge to be accurately predicted. The best scores are obtained with the MLP model, which achieved higher R² scores especially for features grouping more complex syntactic phenomena (e.g. TreeStructure, SyntacticDep). Interestingly enough, the LinearSVR model outperforms the MLP by more than 0.30 R² points when predicting features related to vocabulary richness (Vocabulary).
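A minimal sketch of this probing setup, using scikit-learn regressors on synthetic stand-in data (in the real experiments the inputs are the layer-wise sentence representations of each NLM and the targets are the linguistic feature values; all sizes, seeds and hyperparameters below are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic "sentence embeddings" and a feature that depends linearly on them.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 32))
w = rng.normal(size=32)
y = X @ w + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear probe vs. nonlinear probe, both scored with R² on held-out sentences.
svr = LinearSVR(max_iter=10000).fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   random_state=0).fit(X_tr, y_tr)

r2_svr = r2_score(y_te, svr.predict(X_te))
r2_mlp = r2_score(y_te, mlp.predict(X_te))
```

On this deliberately linear toy target the two probes behave similarly; in the paper, a sharp MLP advantage on a feature group is read as evidence that the relevant information is encoded nonlinearly.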
In order to ensure that our probes are actually revealing the linguistic generalization abilities of the NLMs rather than learning the linguistic tasks themselves, we also tested the probing models using the control-task approach devised by Hewitt and Liang (2019). We produced a control version of the IUDT corpus by randomly shuffling the linguistic features assigned to each sentence and performed the same probing tasks with the two probing regressors for all NLM representations. The correlation and R² scores between the regressors’ predictions and the shuffled scores were low (< 0.05) and comparable for both the SVR and the MLP. These results support the claim that NLM representations encode information closely related to linguistic competence and that our probing models are not relying on spurious signals unrelated to our linguistic properties to solve the regression task.
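The control setup can be sketched the same way: shuffling the feature values across sentences severs the link between representation and target, so a faithful probe’s held-out R² should collapse towards (or below) zero. Again, the data and sizes are synthetic stand-ins:

```python
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))           # stand-in sentence embeddings
y = X @ rng.normal(size=16)              # stand-in linguistic feature

y_control = rng.permutation(y)           # control task: shuffled feature values

# Train on one half, evaluate on the other.
half = 200
probe = LinearSVR(max_iter=10000).fit(X[:half], y_control[:half])
r2_control = r2_score(y_control[half:], probe.predict(X[half:]))
```

Since the shuffled targets are independent of the inputs, `r2_control` stays near zero, mirroring the < 0.05 scores reported above.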
To investigate how each transformer encodes linguistic knowledge, we report in Figure 1 the average R² scores obtained with the two probing models for all the 7 NLMs. As we can notice, the seven transformers achieve quite similar results when considering all features as a whole, although BERT-base-italian has the best overall performance (0.65 for all features). The same does not hold when we analyze their performance in terms of R² scores for the different previously described groups of features. For instance, we can notice that, for both probing models, features related to the distribution of syntactic relations (SyntacticDep) are better predicted by GePpeTto, while GilBERTo and UmBERTo-Commoncrawl are the best at predicting tree structure properties. Differences also emerge for competences related to vocabulary richness (Vocabulary): while UmBERTo-Wikipedia extensively outperforms all the other transformers using the MLP model, the best transformer is BERT-base-italian when these competences are probed with the LinearSVR model.
Similar trends can be observed in Figure 2, where we report how the linguistic knowledge encoded by the 7 NLMs evolves across layers according to the two probing models. Regardless of the architecture, for all transformers raw text features (RawText) are mainly encoded in the first layers, while knowledge about the order of subject/object (Order) and the use of subordination (Subord) increases consistently across layers, especially in the first ones. Contrary to what was observed by de Vries et al. (2020), mBERT’s linguistic knowledge is not encoded systematically earlier than in monolingual transformers. This perspective of analysis also reveals other differences among the considered transformers: e.g. even though GePpeTto has a lower average competence on verb inflection (see Figure 1), it achieves the highest scores in the middle layers. Focusing instead on differences between the layer-wise scores obtained by the two probing models, we can clearly notice that the encoding of linguistic knowledge follows a rather irregular trend across layers in the MLP results. This is particularly the case for features belonging to the vocabulary, POS and tree structure groups.
Finally, we inspected whether the overall linguistic competence encoded in the contextual representations of each model changes according to the type of texts in the different IUDT sections we considered. As we could expect, the results reported in Figure 3 show that all transformers achieve lower performance when they have to predict the value of features extracted from treebanks representative of social media language (PoSTWITA and TWITTIRÒ). Quite surprisingly, this is also the case for AlBERTo, which is trained on Twitter data. A possible explanation is that, although PoSTWITA and TWITTIRÒ contain sentences representative of Twitter language, these sentences are still quite close to standard Italian, in order to be compliant with the UD morpho-syntactic and syntactic annotation schema. On the contrary, AlBERTo’s training set is derived from Twitter’s official streaming API, which included all possible typologies of sentences. However, BERT-base-italian is slightly less affected by the non-standard linguistic peculiarities of this genre. Similarly to what is observed for the whole Italian dataset (see Figure 1), this model also reaches the highest performance in almost all the IUDT sections, except for the one containing interrogative sentences (ISDT_quest). Interestingly, this type of sentence is hardly mastered by any model. This is possibly due to the fact that interrogative sentences are more likely to display a less canonical distribution of morphosyntactic and syntactic phenomena, hence being more difficult to encode effectively.
4. Conclusion
In this paper we presented an in-depth comparative investigation of the linguistic knowledge encoded in Italian transformer models. Relying on a suite of more than 80 probing features and testing our approach with two different probing models, we showed that the MLP is the best model for inferring the amount of information implicitly encoded in the NLMs’ representations. We also observed that BERT-base-italian achieved the best scores on average, but the linguistic generalization abilities of the examined transformers vary according to specific groups of linguistic phenomena and across layers. Finally, we examined how the linguistic knowledge learned by the NLMs is affected by the distinct textual varieties available in Italian treebanks, showing, for instance, that social media language represents a harder domain for all models.
We are currently investigating whether the linguistic knowledge encoded by an NLM positively affects the resolution of downstream tasks, as already suggested by the recent work by Miaschi et al. (2020) for English. This connection, which is still rather under-investigated, can improve our understanding of how such models make their decisions.
Bibliography
Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. “The Wacky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora.” Language Resources and Evaluation 43 (3): 209–26.
Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. “Long-Term Social Media Data Collection at the University of Turin.” In Fifth Italian Conference on Computational Linguistics (Clic-It 2018), 1–6. CEUR-WS.
Yonatan Belinkov and James Glass. 2019. “Analysis Methods in Neural Language Processing: A Survey.” Transactions of the Association for Computational Linguistics 7: 49–72.
Yonatan Belinkov, Lluı́s Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017. “Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks.” In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1–10.
Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. “Deep Rnns Encode Soft Hierarchical Syntax.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 14–19.
Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. “Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank.” In Proceedings of the Acl Linguistic Annotation Workshop & Interoperability with Discourse.
Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. “Presenting TWITTIRÒ-UD: An Italian Twitter Treebank in Universal Dependencies.” In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, Syntaxfest 2019). https://www.aclweb.org/anthology/W19-7723.pdf.
Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loı̈c Barrault, and Marco Baroni. 2018. “What You Can Cram into a Single $&!#Vector: Probing Sentence Embeddings for Linguistic Properties.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2126–36.
Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. “VIT - Venice Italian Treebank: Syntactic and Quantitative Features.” In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/N19-1423.
John Hewitt and Christopher D. Manning. 2019. “A Structural Probe for Finding Syntax in Word Representations.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4129–38.
Alessio Miaschi and Felice Dell’Orletta. 2020. “Contextual and Non-Contextual Word Embeddings: An in-Depth Linguistic Investigation.” In Proceedings of the 5th Workshop on Representation Learning for Nlp, 110–19. Online: Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.repl4nlp-1.15.
Joakim Nivre. 2015. “Towards a Universal Grammar for Natural Language Processing.” International Conference on Intelligent Text Processing and Computational Linguistics, 3–16.
Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, et al. 2016. “Universal Dependencies V1: A Multilingual Treebank Collection.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (Lrec’16), 1659–66.
Manuela Sanguinetti and Cristina Bosco. 2015. “PartTUT: The Turin University Parallel Treebank.” In Harmonization and Development of Re- Sources and Tools for Italian Natural Language Processing Within the PARLI Project, edited by Roberto Basili et al., 51–69. Springer. https://0-link-springer-com.catalogue.libraries.london.ac.uk/book/10.1007/978-3-319-14206-7.
Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and Fabio Tamburini. 2018. “PoSTWITA-UD: An Italian Twitter Treebank in Universal Dependencies.” In Proceedings of the Eleventh Language Resources and Evaluation Conference (LREC 2018). https://www.aclweb.org/anthology/L18-1279.pdf.
Pedro Javier Ortiz Suárez, Benoı̂t Sagot, and Laurent Romary. 2019. “Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures.” Challenges in the Management of Large Corpora (CMLC-7) 2019, 9.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, et al. 2019. “What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations.” arXiv Preprint arXiv:1905.06316.
Jörg Tiedemann and Lars Nygaard. 2004. “The OPUS Corpus - Parallel and Free.” In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). https://aclanthology.org/L04-1174/.
Daniel Zeman, Joakim Nivre, Mitchell Abrams, and et al. 2019. “Universal Dependencies 2.5.” In LINDAT/Clariah-Cz Digital Library at the Institute of Formal and Applied Linguistics (úfal). http://0-hdl-handle-net.catalogue.libraries.london.ac.uk/11234/1-3105.
Kelly Zhang and Samuel Bowman. 2018. “Language Modeling Teaches You More Than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis.” In Proceedings of the 2018 Emnlp Workshop Blackboxnlp: Analyzing and Interpreting Neural Networks for Nlp, 359–61.
Footnotes
1 https://github.com/dbmdz/berts
2 Polignano et al. (2019)
3 https://github.com/idb-ita/GilBERTo
4 https://github.com/musixmatchresearch/umberto
5 De Mattei et al. (2020)
6 The coefficient of determination (R²) is a statistical measure of how close the data are to the fitted regression line; it corresponds to the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
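In symbols, with $y_i$ the observed feature values, $\hat{y}_i$ the probe’s predictions and $\bar{y}$ the mean of the observed values, this definition reads:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

A probe that always predicts the mean gets $R^2 = 0$, and a probe worse than that can score below zero, which is why the shuffled control scores cluster near zero.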
Authors
Department of Computer Science, University of Pisa – Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab, Pisa – alessio.miaschi@phd.unipi.it
Department of Mathematics and Geosciences, University of Trieste – International School for Advanced Studies (SISSA), Trieste – Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab, Pisa – gsarti@sissa.it
Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab, Pisa – dominique.brunato@ilc.cnr.it
Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab, Pisa – felic.dellorletta@ilc.cnr.it
Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab, Pisa – giulia.venturi@ilc.cnr.it