How Contextualized Word Embeddings Represent Word Senses
p. 335-343
Abstracts
Contextualized embedding models, such as ELMo and BERT, allow the construction of vector representations of lexical items that adapt to the context in which words appear. It has been demonstrated that the upper layers of these models capture semantic information. This evidence paved the way for the development of sense representations based on words in context. In this paper, we analyze the vector spaces produced by 11 pre-trained models and evaluate these representations on two tasks. The analysis shows that all these representations contain redundant information. The results highlight the drawbacks of this aspect.
Models such as ELMo or BERT make it possible to obtain vector representations of words that adapt to the context in which they appear. The fact that the upper layers of these models store semantic information has led to the development of sense representations based on words in context. In this work we analyze the vector spaces produced with 11 pre-trained models and evaluate their performance in representing the different senses of words. The analyses show that these models contain redundant information. The results highlight the critical issues related to this aspect.
Full text
1. Introduction
The introduction of contextualized embedding models, such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019), allows the construction of vector representations of lexical items that adapt to the context in which words appear. It has been shown that the upper layers of these models contain semantic information (Jawahar et al. 2019) and are more diversified than the lower layers (Ethayarajh 2019). Thanks to this adaptation to the context of use, these word representations overcome the meaning conflation deficiency that affects static word embedding techniques (Camacho-Collados and Pilehvar 2018; Tripodi and Li Pira 2017), such as word2vec (Mikolov et al. 2013) or GloVe (Pennington et al. 2014).
The evaluation of these models has been conducted mainly on downstream tasks (Wang et al. 2018; Wang et al. 2019). In these extrinsic evaluations, the models are fine-tuned, adapting the vector representations to specific tasks, and the resulting vectors are then used as features in classification problems. This hinders a direct evaluation and analysis of the models, because the evaluation also takes into account the ability of the classifier to learn the task. A model trained in this way may learn only to discriminate among the features that belong to each class, with poor generalization.
The interpretability of neural networks is an emerging line of research in NLP that aims at analyzing the properties of pre-trained language models (Belinkov and Glass 2019). Different studies have been conducted in recent years to discover what kind of linguistic information is stored in large neural language models. Many of them focus on syntax (Hewitt and Manning 2019; Jawahar et al. 2019) and attention (Michel, Levy, and Neubig 2019; Kovaleva et al. 2019). As for semantics, the majority of the studies focus on common knowledge (Petroni et al. 2019) and on inference and role-based event prediction (Ettinger 2020). Only a few have been devoted to lexical semantics; for example, Reif et al. (2019) show how different representations of the same lexical form tend to cluster according to their sense.
In this work, we propose an in-depth analysis of the properties of the vector spaces induced by different embedding models and an evaluation of their word representations. We show how the properties of the vector space contribute to the success of the models in two tasks: sense induction and word sense disambiguation. In fact, even if contextualized models do not create one representation per word sense (Ethayarajh 2019), their contextualization creates similar representations for the same word sense, which can be easily clustered.
2. Related Work
Given the success (and the opacity) of contextualized embedding models, many works have been proposed to analyze their inner representations. These analyses are based on probing tasks (Conneau et al. 2018) that aim at measuring how useful the information extracted from a pre-trained model is for representing linguistic structures. Probing tasks involve training a diagnostic classifier to determine whether the representations encode the desired features. Tenney et al. (2019) discovered that specific BERT layers are more suited for representing information useful to solve specific tasks and that the ordering of its layers resembles the ordering of a traditional NLP pipeline: POS tagging, parsing, NER, semantic role labeling, and coreference resolution. Hewitt and Manning (2019) evaluated whether syntax trees are embedded in a linear transformation of a neural network’s word representation space. Hewitt and Liang (2019) raised the problem of interpreting the results derived from probing analyses: it is difficult to understand whether high accuracy values are due to the representation itself or are instead the result of the ability to learn a specific task during training.
Our work is more in line with works that try to find general properties of the representations generated by different contextualized models. For example, Mimno and Thompson (2017) demonstrated that the vector space produced by a static embedding model is concentrated in a narrow cone and that its concentration depends on the ratio of positive and negative examples. Mu and Viswanath (2018) explored this analysis further, demonstrating that the embedding vectors share the same common vector and have the same main direction. Ethayarajh (2019) demonstrated how the upper layers of a contextualizing model produce more contextualized representations. We build on top of these works by analyzing the vector spaces generated by contextualized models and evaluating them.
3. Construction of the Vector Spaces
We used SemCor (Miller et al. 1993) as the reference corpus for our work. This choice is motivated by the fact that it is the largest dataset manually annotated with sense information and is commonly used as a training set for word sense disambiguation. It contains 352 documents whose content words (about 226,000) have been annotated with WordNet (Miller 1995) senses. In total there are 33,341 unique senses distributed over 22,417 different words. The sense distribution in this corpus is very skewed and follows a power law (Kilgarriff 2004), which makes the identification of senses challenging. The dataset is also difficult due to the fine granularity of WordNet (Navigli 2006).
To construct the vector space A from SemCor, we collected all the senses S_i of a word w_i and, for each sense, recovered the sentences in which that particular sense occurs. These sentences are then fed into a pre-trained model and the token embedding representations of the word w_i are extracted from the last hidden layer. This operation is repeated for all the senses in S_i and for all the tagged words in the vocabulary V. The vector space A corresponds to all the representations of the words in V.
A t-SNE visualization of the different embeddings in SemCor for the word foot is presented in Figure 1. In this figure, we can see that the three main senses of foot (i.e., human foot, unit of length and lower part) occupy a definite position in the vector space, suggesting that the models are able to produce specific representations for the different senses of a word and that these representations lie on well-defined subspaces. In this work we want to test to what extent this feature is present in language models.
Implementation details
The pre-trained models1 used in this study are: two BERT (Devlin et al. 2019) models, base cased (12-layer, 768-hidden, 12-heads, 110M parameters) and large cased (24-layer, 1024-hidden, 16-heads, 340M parameters); three GPT-2 (Radford et al. 2019) models, base (12-layer, 768-hidden, 12-heads, 117M parameters), medium (24-layer, 1024-hidden, 16-heads, 345M parameters) and large (36-layer, 1280-hidden, 20-heads, 774M parameters); two RoBERTa (Liu et al. 2019) models, base (12-layer, 768-hidden, 12-heads, 125M parameters) and large (24-layer, 1024-hidden, 16-heads, 355M parameters); two XLNet (Yang et al. 2019) models, base (12-layer, 768-hidden, 12-heads, 110M parameters) and large (24-layer, 1024-hidden, 16-heads, 340M parameters); one XLM (Lample et al. 2019) model (12-layer, 2048-hidden, 16-heads) and one CTRL (Keskar et al. 2019) model (48-layer, 1280-hidden, 16-heads, 1.6B parameters). The main features of these models are summarized in Table 1. We averaged the embeddings of sub-tokens to obtain token-level representations.
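As an illustration of this extraction step, the sketch below shows one way to obtain such token-level vectors with the transformers library (see footnote 1): a sense-annotated sentence is encoded, the last hidden layer is read out, and the sub-token vectors of the target word are averaged. The checkpoint name, the helper function, and the example sentence are illustrative assumptions, not the exact code used for the paper.

```python
# Minimal sketch (not the authors' exact pipeline): extract the last-hidden-layer
# representation of one target word, averaging its sub-token (word-piece) vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
model.eval()

def token_embedding(sentence_words, target_index):
    """Return the last-hidden-layer vector of the word at `target_index`."""
    encoding = tokenizer(sentence_words,
                         is_split_into_words=True,
                         return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoding).last_hidden_state[0]   # (n_subtokens, dim)
    # map sub-tokens back to original word positions and average them
    word_ids = encoding.word_ids(batch_index=0)
    piece_positions = [i for i, w in enumerate(word_ids) if w == target_index]
    return hidden[piece_positions].mean(dim=0).numpy()

# e.g. one contextualized representation of "foot" in a SemCor-like context
vec = token_embedding(["The", "doctor", "examined", "her", "foot", "."], 4)
```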
Table 1: Statistics and hyperparameters of the models
Model | training data | vocab. size | n. param. | vec. dim. | objective |
BERTbase (Devlin et al., 2019) | 16GB | 30K | 110M | 768 | masked language model and next sentence prediction |
BERTlarge (Devlin et al., 2019) | 16GB | 30K | 340M | 1024 | masked language model and next sentence prediction |
GPT-2base (Radford et al., 2019) | 40GB | 50K | 117M | 768 | language model |
GPT-2medium (Radford et al., 2019) | 40GB | 50K | 345M | 1024 | language model |
GPT-2large (Radford et al., 2019) | 40GB | 50K | 774M | 1280 | language model |
RoBERTabase (Liu et al., 2019) | 160GB | 50K | 125M | 768 | masked language model |
RoBERTalarge (Liu et al., 2019) | 160GB | 50K | 355M | 1024 | masked language model |
XLNetbase (Yang et al., 2019) | 126GB | 32K | 110M | 768 | bidirectional language model |
XLNetlarge (Yang et al., 2019) | 126GB | 32K | 340M | 1024 | bidirectional language model |
XLMenglish | 16GB | 30K | 665M | 2048 | language model |
CTRL (Keskar et al., 2019) | 140GB | 250K | 1.63B | 1280 | conditional transformer language model |
3.1 Analysis
The first objective of this work is to analyze the vector spaces produced with the models. This analysis is aimed at investigating the properties of the contextualized vectors. A detailed description of the embedding spaces constructed with the pre-trained models is presented in Table 2. We computed the norm of all the vectors in the vector space A and averaged them:
$$\mathrm{AvgNorm}(A) = \frac{1}{|A|} \sum_{\mathbf{x} \in A} \lVert \mathbf{x} \rVert \qquad (1)$$
This measure gives us an intuition of how diverse the semantic spaces constructed with the different models are. In fact, we can see that the magnitude of the vectors constructed with BERT, RoBERTa, XLNet, and XLM is low, while that of the GPT-2 and CTRL vectors is very high.
Table 2: Detailed description of the embedding space produced with each model
Model | AvgNorm | MeanVecNorm(A) | MeanVecNorm(Â) | avg.MEV | avg.IntSim | avg.ExtSim |
BERTbase | 25.78 ± 1.28 | 17.94 | 17.84 | 0.43 ± 0.18 | 0.74 ± 0.05 | 0.69 ± 0.06 |
BERTlarge | 20.83 ± 2.51 | 12.43 | 11.58 | 0.38 ± 0.18 | 0.66 ± 0.08 | 0.59 ± 0.08 |
GPT-2base | 125.13 ± 10.25 | 91.46 | 90.99 | 0.46 ± 0.18 | 0.79 ± 0.05 | 0.76 ± 0.05 |
GPT-2medium | 427.45 ± 38.78 | 371.86 | 360.36 | 0.51 ± 0.18 | 0.85 ± 0.03 | 0.84 ± 0.03 |
GPT-2large | 290.29 ± 38.56 | 226.39 | 212.97 | 0.43 ± 0.18 | 0.75 ± 0.05 | 0.72 ± 0.05 |
RoBERTabase | 25.78 ± 0.56 | 22.17 | 22.25 | 0.51 ± 0.17 | 0.87 ± 0.02 | 0.85 ± 0.03 |
RoBERTalarge | 31.47 ± 0.65 | 26.99 | 27.04 | 0.52 ± 0.18 | 0.88 ± 0.02 | 0.84 ± 0.03 |
XLNetbase | 47.68 ± 0.66 | 43.28 | 43.26 | 0.53 ± 0.17 | 0.88 ± 0.01 | 0.87 ± 0.02 |
XLNetlarge | 28.27 ± 1.42 | 19.56 | 19.68 | 0.38 ± 0.17 | 0.66 ± 0.04 | 0.62 ± 0.05 |
XLMenglish | 44.92 ± 2.61 | 37.13 | 36.7 | 0.45 ± 0.18 | 0.79 ± 0.03 | 0.77 ± 0.03 |
CTRL | 4443.62 ± 351.98 | 3927.86 | 3879.56 | 0.49 ± 0.18 | 0.84 ± 0.02 | 0.83 ± 0.02 |
We also computed the norm of the vector resulting from averaging all the vectors in the semantic space A:
$$\mathrm{MeanVecNorm}(A) = \Big\lVert \frac{1}{|A|} \sum_{\mathbf{x} \in A} \mathbf{x} \Big\rVert \qquad (2)$$
All the semantic spaces have a non-zero mean and the mean vector norm is high. This result suggests that the vectors contain redundant information and share a common non-zero vector. This is not only because the vector space contains multiple representations of the same sense: in fact, if we create a new semantic space, Â, by averaging all the representations of the same word sense, the MeanVecNorm of this space is still high for all the models.
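The quantities discussed above are simple to compute once the space is materialized. The sketch below assumes, purely for illustration, that A is a dict mapping a word-sense key to the list of its contextualized vectors; it computes AvgNorm (Eq. 1), MeanVecNorm (Eq. 2), and the sense-centroid space Â.

```python
# Sketch of the space-level statistics, under an assumed data layout
# (dict: sense key -> list of numpy vectors); not the authors' code.
import numpy as np

def avg_norm(A):
    """Eq. (1): mean (and std) of the L2 norms of all vectors in the space."""
    norms = [np.linalg.norm(v) for vecs in A.values() for v in vecs]
    return np.mean(norms), np.std(norms)

def mean_vec_norm(A):
    """Eq. (2): norm of the mean vector of the space."""
    all_vecs = np.vstack([v for vecs in A.values() for v in vecs])
    return np.linalg.norm(all_vecs.mean(axis=0))

def sense_centroid_space(A):
    """Build the space Â by averaging all representations of the same sense."""
    return {sense: [np.mean(vecs, axis=0)] for sense, vecs in A.items()}

# A_hat = sense_centroid_space(A); mean_vec_norm(A_hat) remains high as well
```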
We used the Maximum Explainable Variance (MEV) for the representations of each word in V. This measure corresponds to the proportion of the variance in the embeddings of a word that can be explained by their first principal component and was computed as:
$$\mathrm{MEV}(w) = \frac{\sigma_1^2}{\sum_i \sigma_i^2} \qquad (3)$$
where σ_1 is the singular value associated with the first principal component of the matrix of contextualized representations of w. This measure gives an upper bound on how well the contextualized representations of a word could be replaced by a static embedding (Ethayarajh 2019). The models with the lowest MEV are BERTlarge and XLNetlarge.
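One possible way to compute MEV for a single word, under the assumption that its occurrences are stacked into a matrix, is the proportion of variance explained by the first principal component; the sketch below uses scikit-learn's PCA (which centers the data) and is not necessarily identical to the original computation.

```python
# Sketch of MEV (Eq. 3) for one word: share of variance explained by the
# first principal component of its contextualized representations.
import numpy as np
from sklearn.decomposition import PCA

def mev(occurrence_vectors):
    X = np.vstack(occurrence_vectors)          # (n_occurrences, dim)
    return PCA().fit(X).explained_variance_ratio_[0]
```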
The other measures that we used for the evaluation of the vector space are based on the very notion of a cluster, which imposes that the data points inside a cluster must satisfy two conditions: internal similarity and external dissimilarity (Pelillo 2009). To this end, we used the senses of each word in the vocabulary of SemCor as clusters and extracted the corresponding vectors from A. We then computed the internal similarity of a cluster c as:
$$\mathrm{IntSim}(c) = \frac{1}{n^2} \sum_{\mathbf{x}_i \in c} \sum_{\mathbf{x}_j \in c} \cos(\mathbf{x}_i, \mathbf{x}_j) \qquad (4)$$
where n is the number of data points in the cluster. We also computed the external similarity of a cluster c, as the cosine similarity between each point in c and all the points in the subspace S induced by the senses of the word that has c as one of its senses:
$$\mathrm{ExtSim}(c) = \frac{1}{n \cdot m} \sum_{\mathbf{x}_i \in c} \sum_{\mathbf{y}_j \in S \setminus c} \cos(\mathbf{x}_i, \mathbf{y}_j) \qquad (5)$$
where m is the total number of data points in the subspace S (excluding those in c) and n is the number of points in the cluster c. Our hypothesis is that good representations should have high internal similarity and low external similarity, and that the difference between the two should be large.
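The two measures can be sketched as follows, assuming the cluster and the rest of the word's subspace are given as NumPy matrices with one vector per row; the normalization follows the reconstruction in Eqs. (4)-(5), with self-similarities included in IntSim.

```python
# Sketch of the cluster-quality measures in Eqs. (4)-(5); data layout assumed.
import numpy as np

def _cos_matrix(X, Y):
    """Pairwise cosine similarities between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def int_sim(cluster):
    """Average pairwise cosine similarity inside a sense cluster (Eq. 4)."""
    return _cos_matrix(cluster, cluster).mean()

def ext_sim(cluster, rest_of_subspace):
    """Average cosine similarity between the cluster and the remaining
    representations of the same word (Eq. 5)."""
    return _cos_matrix(cluster, rest_of_subspace).mean()
```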
As can be seen from Table 2, the internal similarity is higher than the external similarity for all the models. Despite this, the scores span a wide range. The lowest IntSim is given by BERTlarge and the highest by RoBERTalarge and XLNetbase. The lowest ExtSim is given by BERTlarge and the highest by XLNetbase. The largest difference between the two measures is given by BERTlarge. RoBERTalarge also has a large gap between the two measures and, furthermore, their standard deviations are very low. As we will see in Section 4, these last two models perform better than the others in clustering and classification tasks.
4. Evaluation
Sense Induction
This task is aimed at understanding whether representations belonging to different senses can be separated using an unsupervised approach. We hypothesize that a good contextualization process should produce more discriminative representations that can be easily identified by a clustering algorithm.
We used the sense clusters extracted from SemCor as ground truth for this experiment (see Section 3) and grouped them when they are senses of the same word (with a given part of speech). We retained only the groups that have at least 20 data points, and we also discarded monosemous words for the evaluation with k-means. The resulting datasets consist of 1,871 (entire) and 1,499 (without monosemous words) sub-datasets, with 141,074 and 116,019 data points in total, respectively. We computed the accuracy on each sub-dataset by counting the number of data points that have been clustered correctly, and averaged the results to measure the performance of each model.
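The paper does not spell out how predicted cluster labels are mapped to gold senses before counting correct points; a common convention, assumed in the sketch below, is an optimal one-to-one assignment between clusters and senses.

```python
# Sketch of per-word clustering accuracy via an optimal cluster-to-sense
# assignment (one possible convention, not necessarily the authors' choice).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(pred_clusters, gold_senses):
    pred_ids = {c: i for i, c in enumerate(set(pred_clusters))}
    gold_ids = {s: i for i, s in enumerate(set(gold_senses))}
    counts = np.zeros((len(pred_ids), len(gold_ids)))
    for c, s in zip(pred_clusters, gold_senses):
        counts[pred_ids[c], gold_ids[s]] += 1
    rows, cols = linear_sum_assignment(-counts)   # maximize matched points
    return counts[rows, cols].sum() / len(gold_senses)
```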
The first algorithm is k-means (Lloyd 1982). It is a partitioning, iterative algorithm whose objective is to minimize the sum of point-to-centroid distances, summed over all k clusters. We used the k-means++ heuristic (Arthur and Vassilvitskii 2007) and the cosine distance metric to determine distances. We selected this algorithm because it is simple, non-parametric, and widely used. It is important to note that k-means requires the number of clusters to extract; for this reason, we restricted this evaluation to ambiguous words only.
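A minimal sketch of this step is given below; since scikit-learn's KMeans uses Euclidean distances, cosine distance is approximated here by L2-normalizing the vectors first, which is an assumption about the implementation rather than a description of the original one.

```python
# Sketch of the k-means step: k is set to the number of gold senses of the word;
# unit-normalizing the vectors makes Euclidean and cosine distances monotonically related.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_senses(vectors, n_senses, seed=0):
    X = np.vstack(vectors)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    km = KMeans(n_clusters=n_senses, init="k-means++",
                n_init=10, random_state=seed)
    return km.fit_predict(X)
```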
The second algorithm used is dominant-set (Pavan and Pelillo 2007). It is a graph-based algorithm that extracts compact structures from graphs, generalizing the notion of maximal clique defined on unweighted graphs to edge-weighted graphs. We selected this algorithm because it is non-parametric, requires only the adjacency matrix of a weighted graph as input, and, more importantly, does not require the number of clusters to extract. The clusters are extracted from the graph sequentially using a peel-off strategy. This feature allows us to also include unambiguous words in the evaluation and to see whether their representations are grouped into a single cluster or partitioned into different ones. We used cosine similarity to weigh the edges of the input graph.
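For completeness, a sketch of dominant-set extraction with a peel-off strategy is given below, using replicator dynamics on a non-negative cosine similarity matrix; the stopping criteria, the support threshold, and the clipping of negative similarities to zero are assumptions, not details reported in the paper.

```python
# Sketch of dominant-set clustering via replicator dynamics with peel-off.
# W: symmetric, non-negative similarity matrix with zero diagonal, e.g.
#   W = np.clip(cosine_sims, 0.0, None); np.fill_diagonal(W, 0.0)
import numpy as np

def dominant_set_clusters(W, max_iter=1000, tol=1e-6, min_support=1e-4):
    n = W.shape[0]
    labels = -np.ones(n, dtype=int)
    remaining = np.arange(n)
    cluster_id = 0
    while remaining.size > 0:
        A = W[np.ix_(remaining, remaining)]
        x = np.full(remaining.size, 1.0 / remaining.size)
        for _ in range(max_iter):               # replicator dynamics
            x_new = x * (A @ x)
            total = x_new.sum()
            if total <= 0:                      # no cohesive structure left
                break
            x_new /= total
            if np.abs(x_new - x).sum() < tol:
                x = x_new
                break
            x = x_new
        members = remaining[x > min_support]    # support of the equilibrium
        if members.size == 0:                   # safeguard: take all leftovers
            members = remaining
        labels[members] = cluster_id
        cluster_id += 1
        remaining = np.setdiff1d(remaining, members)
    return labels
```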
The results of this evaluation are presented in Table 3. RoBERTa and BERT have the overall best performances on this task with both algorithms. In particular, RoBERTalarge performs consistently well on all parts of speech and across algorithms, while other models perform well only in combination with one of the two algorithms. This is presumably due to the large gap between the internal and the external similarity produced by this model, as explained in Section 3.1.
Table 3: Results (as average accuracy) on clustering divided by algorithm and part of speech: nouns (N), verbs (V), adjectives (A), adverbs (R) and on the concatenations of all datasets (All)
model | k-means N | k-means V | k-means A | k-means R | k-means All | dominant-set N | dominant-set V | dominant-set A | dominant-set R | dominant-set All
BERTbase | 57.2 | 50.6 | 56.2 | 62.0 | 54.9 ± 14.8 | 55.7 | 45.3 | 51.7 | 45.8 | 51.0 ± 17.5 |
BERTlarge | 59.3 | 51.9 | 56.9 | 59.0 | 56.2 ± 15.3 | 53.4 | 42.6 | 46.8 | 39.9 | 47.8 ± 17.1 |
GPT-2base | 54.1 | 48.3 | 55.6 | 56.8 | 52.3±14.7 | 54.3 | 45.3 | 50.2 | 46.3 | 50.1 ± 17.2 |
GPT-2medium | 53.9 | 49.1 | 56.2 | 59.8 | 52.8 ± 14.5 | 59.7 | 49.8 | 58.7 | 54.8 | 56.0 ± 18.8 |
GPT-2large | 53.8 | 49.4 | 58.1 | 58.8 | 53.0 ± 14.8 | 50.2 | 44.1 | 46.1 | 44.1 | 47.1 ± 16.0 |
RoBERTabase | 56.4 | 51.4 | 56.7 | 59.7 | 54.8 ± 14.7 | 65.3 | 55.1 | 64.8 | 61.4 | 61.6 ± 19.2 |
RoBERTalarge | 58.5 | 53.0 | 58.6 | 62.7 | 56.7±14.9 | 66.7 | 56.6 | 66.3 | 64.2 | 63.2±19.3 |
XLNetbase | 54.2 | 49.1 | 53.8 | 56.8 | 52.2 ± 14.4 | 67.2 | 55.0 | 68.7 | 63.8 | 62.7±20.7 |
XLNetlarge | 57.6 | 52.5 | 57.9 | 60.8 | 55.9±14.4 | 51.0 | 44.8 | 47.5 | 40.9 | 47.6±15.0 |
XLMenglish | 56.3 | 50.1 | 56.5 | 62.1 | 54.3 ± 15.1 | 60.4 | 51.3 | 59.5 | 55.9 | 57.0 ± 18.1 |
CTRL | 53.8 | 47.0 | 56.5 | 57.4 | 51.9 ± 15.4 | 60.4 | 49.4 | 61.7 | 56.3 | 56.8 ± 19.2 |
This evaluation tends to confirm the claim that larger versions of the same model achieve better results. From Table 3, we can also see that the models have more difficulty identifying the different senses of verbs, while nouns and adverbs obtain higher results. This is probably due to the different distribution of these word classes in the training sets of the models and to WordNet’s fine granularity. The performances of the models with dominant-set are surprisingly high, considering that the setting of this experiment is completely unsupervised. Furthermore, this algorithm is conceived to extract compact clusters, and this feature could drive it to over-partition the vector space of monosemous words. Instead, the results suggest the opposite: the models are able to produce representations with high internal similarity, positioning them on a defined sub-space.
Word Sense Disambiguation
We created sense vectors from contextualized word vectors by averaging all the representations of a given sense. The resulting vector space corresponds to Â (see Section 3.1). We evaluated the generated vectors on a standard benchmark for WSD (Raganato, Camacho-Collados, and Navigli 2017). It consists of five datasets that were unified to the same WordNet version: Senseval-2 (S2), Senseval-3 (S3), SemEval-2007 (S7), SemEval-2013 and SemEval-2015, with 10,619 target words in total.
The identification of word senses is conducted by feeding the entire texts of the datasets into a pre-trained model and extracting, for each target word w_i, its embedding representation, as was done for the construction of the semantic space. Once these representations are available, we compute the cosine similarities between the representation of w_i and the embeddings in Â constructed with the same model, and select the sense with the highest similarity. We did not use more sophisticated models such as WSD-games (Tripodi and Navigli 2019; Tripodi, Vascon, and Pelillo 2016) because we wanted to keep the evaluation as simple as possible, so as not to influence the analysis of the results.
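A sketch of this nearest-sense assignment is given below; the dictionary layout of the sense centroids is a hypothetical convention. Senses that never occur in SemCor have no centroid and therefore receive no answer, which contributes to the precision/recall gap discussed next.

```python
# Sketch of the disambiguation step: pick the SemCor sense centroid in Â
# with the highest cosine similarity to the target word's contextualized vector.
import numpy as np

def disambiguate(target_vec, sense_vectors):
    """sense_vectors: dict mapping a WordNet sense key of the target lemma
    to its centroid vector (assumed layout)."""
    best_sense, best_sim = None, -np.inf
    for sense, vec in sense_vectors.items():
        sim = np.dot(target_vec, vec) / (
            np.linalg.norm(target_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense
```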
The results of this evaluation are presented in Table 4. The first trend that emerges from the results is the large gap between precision and recall. This is due to the absence of many senses from our training set. We did not want to use back-off strategies or other techniques usually employed in the WSD literature, so as not to influence the performances and the analysis of the results. Despite the simplicity of the approach, it performs surprisingly well. In particular, BERT, RoBERTa, and XLNet (three bidirectional models) achieve very high results. The low performances of CTRL are probably due to its large vocabulary and to its objective, which is designed to solve different tasks.
5. Conclusion and Future Work
We conducted an extensive analysis of the semantic capabilities of contextualized embedding models. We analyzed the vector spaces constructed using pre-trained models and found that their vectors contain redundant information and that their first two principal components are dominant.
The results on sense induction are promising. They demonstrate the effectiveness of contextualized embeddings in capturing semantic information. We did not find higher performances from more complex models; rather, we found that RoBERTa, a model that was developed by simplifying a more complex model, BERT, was one of the best performers. Neither the dimension of the hidden layers, nor the size of the training data, nor the size of the vocabulary seems to play a big role in modeling semantics. As stated in previous works, adding an anisotropy penalty to the objective function of the models could directly improve the representations. We also noticed that, even if BERT and XLNet have different objectives and are trained on different data, they have similar performances. It emerged that these models are less redundant than others.
The conclusion that we can draw from our analysis and evaluation is that pre-trained language models capture lexical-semantic information and that unsupervised models can be used to distinguish among their representations. On the other hand, these representations are redundant and anisotropic. We hypothesize that reducing these aspects can lead to better representations. This can be done post-hoc, but we think that training new models with this point in mind could lead to the development of better models.
Bibliography
David Arthur, and Sergei Vassilvitskii. 2007. “K-Means++: The Advantages of Careful Seeding.” In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, Usa, January 7-9, 2007, 1027–35. http://0-dl-acm-org.catalogue.libraries.london.ac.uk/citation.cfm?id=1283383.1283494.
Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. “Clueweb09 Data Set.”
José Camacho-Collados, and Mohammad Taher Pilehvar. 2018. “From Word to Sense Embeddings: A Survey on Vector Representations of Meaning.” J. Artif. Intell. Res. 63: 743–88. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1613/jair.1.11259.
Common Crawl. n.d. https://commoncrawl.org/.
Kawin Ethayarajh. 2019. “How Contextual Are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp), 55–65. Hong Kong, China: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/D19-1006.
Allyson Ettinger. 2020. “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models.” Transactions of the Association for Computational Linguistics 8: 34–48. https://arxiv.org/abs/1907.13528.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. “Ctrl: A Conditional Transformer Language Model for Controllable Generation.” arXiv Preprint arXiv:1909.05858.
Adam Kilgarriff. 2004. “How Dominant Is the Commonest Sense of a Word?” In Text, Speech and Dialogue, edited by Petr Sojka, Ivan Kopeček, and Karel Pala, 103–11. Berlin, Heidelberg: Springer Berlin Heidelberg.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. “Revealing the Dark Secrets of BERT.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp), 4365–74. Hong Kong, China: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/D19-1445.
Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. “Large Memory Layers with Product Keys.” arXiv Preprint arXiv:1907.05242.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized Bert Pretraining Approach.” arXiv Preprint arXiv:1907.11692.
Stuart P. Lloyd. 1982. “Least Squares Quantization in PCM.” IEEE Trans. Information Theory 28 (2): 129–36. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1109/TIT.1982.1056489.
Paul Michel, Omer Levy, and Graham Neubig. 2019. “Are Sixteen Heads Really Better Than One?” In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 14014–24. http://papers.nips.cc/paper/9551-are-sixteen-heads-really-better-than-one.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting Held December 5-8, 2013, Lake Tahoe, Nevada, United States, 3111–9. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.
George A. Miller. 1995. “WordNet: A Lexical Database for English.” Commun. ACM 38 (11): 39–41. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1145/219717.219748.
George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. “A Semantic Concordance.” In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993. https://www.aclweb.org/anthology/H93-1061.
Jiaqi Mu, and Pramod Viswanath. 2018. “All-but-the-Top: Simple and Effective Postprocessing for Word Representations.” In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, Bc, Canada, April 30 - May 3, 2018, Conference Track Proceedings. https://openreview.net/forum?id=HkuGJ3kCb.
Roberto Navigli. 2006. “Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance.” In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 105–12. ACL-44. Stroudsburg, PA, USA: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.3115/1220175.1220189.
Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. “English Gigaword Fifth Edition, June.” Linguistic Data Consortium, LDC2011T07 12.
Massimiliano Pavan, and Marcello Pelillo. 2007. “Dominant Sets and Pairwise Clustering.” IEEE Trans. Pattern Anal. Mach. Intell. 29 (1): 167–72. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1109/TPAMI.2007.250608.
Marcello Pelillo. 2009. “What Is a Cluster? Perspectives from Game Theory.” In Proc. of the NIPS Workshop on Clustering Theory.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–37. New Orleans, Louisiana: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/N18-1202.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. “Language Models as Knowledge Bases?” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2463–73. Hong Kong, China: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/D19-1250.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.”
Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. “Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 99–110. Valencia, Spain: Association for Computational Linguistics. https://www.aclweb.org/anthology/E17-1010.
Rocco Tripodi, and Roberto Navigli. 2019. “Game Theory Meets Embeddings: A Unified Framework for Word Sense Disambiguation.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (Emnlp-Ijcnlp), 88–99. Hong Kong, China: Association for Computational Linguistics. https://0-doi-org.catalogue.libraries.london.ac.uk/10.18653/v1/D19-1009.
Rocco Tripodi, and Stefano Li Pira. 2017. “Analysis of Italian Word Embeddings.” In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017. http://ceur-ws.org/Vol-2006/paper045.pdf.
Rocco Tripodi, Sebastiano Vascon, and Marcello Pelillo. 2016. “Context Aware Nonnegative Matrix Factorization Clustering.” In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, 1719–24. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1109/ICPR.2016.7899884.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.” In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neurips 2019, 8-14 December 2019, Vancouver, Bc, Canada, 3261–75. http://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, et al. 2019. “HuggingFace’s Transformers: State-of-the-Art Natural Language Processing.” ArXiv abs/1910.03771.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation.” arXiv Preprint arXiv:1609.08144.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” arXiv Preprint arXiv:1906.08237.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. “Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books.” In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 19–27. https://0-doi-org.catalogue.libraries.london.ac.uk/10.1109/ICCV.2015.11.
Footnotes
1 We used the transformers library (Wolf et al. 2019).
Author
University of Bologna – rocco.tripodi@unibo.it
The text alone may be used under the Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International license (CC BY-NC-ND 4.0). All other elements (illustrations, imported supplementary files) are “All rights reserved”, unless otherwise stated.