EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    rmassidda @ DaDoEval: Document Dating Using Sentence Embeddings at EVALITA 2020

    Riccardo Massidda

    p. 403-408

Abstract

This report describes an approach to the DaDoEval document dating subtasks of the EVALITA 2020 campaign. Dating is tackled as a classification problem, and the significant length of the documents in the provided dataset is addressed by using sentence embeddings in a hierarchical architecture. Three different pre-trained models for generating sentence embeddings have been evaluated and compared: USE, LaBSE and SBERT. Besides sentence embeddings, the classifier exploits a bag-of-entities representation of the document, generated using a pre-trained named entity recognizer. The final model is able to simultaneously produce the required date for each subtask.


    1. Introduction

To solve the DaDoEval task (Menini et al. 2020) of the EVALITA 2020 campaign (Basile et al. 2020), a model must assign to each document a temporal span drawn from a discrete set of candidates, i.e., recognize when the document was issued. Like many other NLP tasks, such as author identification or topic assignment, this task can be reduced to a classification problem.

The provided dataset contains documents written by the Italian statesman Alcide De Gasperi in the time span 1901-1954, labeled with the year in which they were issued. The dating task is divided into subtasks of increasing granularity. The first subtask requires classifying a document into one of five representative periods of De Gasperi's life, as identified by historians (Table 1). The second and third subtasks require dating a document more precisely, using a five-year span for the former and the exact year for the latter. These subtasks are referred to as the same-genre subtasks.

Besides the blind test set held out from the same-genre dataset, the model has also been evaluated on three additional cross-genre subtasks. In this case, documents from De Gasperi's epistolary archive were used to build an external blind test set. The cross-genre subtasks require classifying documents at the same increasing time granularities as the same-genre ones.

Table 1: Historical periods of De Gasperi's life

ID  Period description                     Time span
A   Habsburg years                         1901-1918
B   Beginning of political activity        1919-1926
C   Internal exile                         1927-1942
D   From fascism to the Italian Republic   1943-1947
E   Building the Italian Republic          1948-1954

The tasks are evaluated using macro-averaged F1. Baseline results obtained with logistic regression over a tf-idf bag-of-words representation are provided by the task proponents in Table 2.

Table 2: Proponents' baseline

Subtask       Macro-averaged F1
Historical    0.827
Five-years    0.485
Single-year   0.126

All of the described experiments have been implemented using TensorFlow and executed on Google Colab. The platform's limitations on continuous usage are not negligible and weighed on multiple design decisions.

Section 2 surveys different approaches to long-text classification and presents the sentence embedding models. Section 3 discusses the peculiarities of the dataset. Section 4 evaluates the different sentence embedding models and compares them with alternative approaches on a single subtask. Section 5 describes the architecture of the final model used to solve all the subtasks; its results are reported in Section 6 and discussed in Section 7.

    2. Methodological survey

The use of pre-trained transformers such as BERT (Devlin et al. 2019) has remarkably improved the state of the art in many NLP tasks, text classification included. Furthermore, the contextual word embeddings produced by pre-trained transformers are preferable when dealing with polysemy. Documents spanning a wide time period may manifest lexical change, so polysemy can emerge significantly (Blank 1999).

When classifying text with a transformer, the first architectural issue is the length of the documents. To classify a text, a special symbol (the [CLS] token in BERT) is usually inserted at the start of the input sequence, and the output corresponding to that symbol is fed into a neural network that predicts the class. Since the maximum input size of a BERT transformer is 512 tokens, it is unlikely that a whole document will fit. Different architectures are available to overcome this problem.
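As a rough illustration, the following sketch shows this [CLS]-based classification and the effect of the 512-token limit, assuming the Hugging Face transformers API; the model handle and the five-class setup are illustrative, not the paper's exact configuration.

```python
# Hedged sketch: [CLS]-based document classification with multilingual BERT.
# The model name and num_labels are illustrative assumptions.
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=5)  # e.g. 5 historical periods

document = "Il popolo trentino chiede oggi i suoi diritti ..."  # a long text
# Everything beyond max_length is silently discarded when truncating.
inputs = tokenizer(document, truncation=True, max_length=512,
                   return_tensors="tf")
logits = model(**inputs).logits  # shape (1, 5); argmax is the predicted class
```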

For certain domains, it has been shown that not all of the text is needed to achieve good classification accuracy. For instance, Sun et al. propose selecting only part of the text, such as the head, the tail, or both, reducing the text to fit the input layer of the transformer. Randomly selecting tokens inside a document has also proven effective for topic classification of academic papers (Liu et al. 2018).
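In that spirit, a minimal sketch of head-and-tail selection; the helper and the 128-token head are illustrative choices, not Sun et al.'s exact configuration.

```python
# Hedged sketch of head+tail truncation in the spirit of Sun et al.
def head_tail(tokens, max_len=510, head=128):
    """Keep the first `head` and the last `max_len - head` tokens,
    leaving room for the special [CLS]/[SEP] symbols."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[:head] + tokens[-(max_len - head):]
```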

More recently, different solutions have started to exploit hierarchical architectures, segmenting the text so that it can be analyzed in its entirety. Segmenting by sentence is intuitively more meaningful than using fixed-length segments. Accordingly, three sentence embedding solutions have been selected to be implemented and evaluated for the DaDoEval task. All of them provide pre-trained multilingual models, thus satisfying both the computational constraints and the task requirements.

Sentence-BERT, also known as SBERT, produces sentence embeddings by stacking a pooling layer on top of a BERT transformer. A pre-trained BERT model is fine-tuned using Siamese networks, back-propagating over the cosine similarity of supposedly semantically related sentences (Reimers and Gurevych 2019). A monolingual model can then be distilled and extended to other languages by training a student model to replicate the behavior of the teacher model, under the assumption that the vector representations of translated sentences should coincide (Reimers and Gurevych 2020). The authors of SBERT published distiluse-base-multilingual-cased, a distilled model pre-trained on many languages, including Italian.
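For reference, producing embeddings with that published model is a one-liner with the sentence-transformers library; the sample sentences are invented.

```python
# Minimal sketch using the SBERT model named above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased")
sentences = ["L'Italia deve ricostruire le sue istituzioni.",
             "Il governo presenta oggi il nuovo bilancio."]
embeddings = model.encode(sentences)  # numpy array of shape (2, 512)
```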

The Universal Sentence Encoder, or USE, comprises different architectures trained on the same set of tasks, enabling transfer learning for many NLP tasks with different requirements (Cer et al. 2018). The original USE has since been extended for multilingual applications with two pre-trained models, a transformer and a CNN, both available on TensorFlow Hub (Yang et al. 2019).
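A minimal sketch of loading the two multilingual USE variants from TensorFlow Hub; these are the published tfhub.dev handles, though the exact versions used here are an assumption.

```python
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops these models need

# CNN variant (smaller, faster) and transformer variant (larger, slower).
use_cnn = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
use_trans = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
embeddings = use_trans(["Una frase in italiano."])  # tensor of shape (1, 512)
```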

Lastly, the Language-agnostic BERT Sentence Embedding model, or LaBSE, produces sentence embeddings using a fine-tuned BERT model. LaBSE is designed similarly to SBERT, using two weight-sharing transformers initialized from a pre-trained BERT model; the main difference lies in the datasets and tasks used for fine-tuning. The authors report remarkable results for LaBSE on languages unseen in training but somehow related to those in the training set (Feng et al. 2020). This property may help bridge the gap between contemporary Italian and the twentieth-century Italian of the dataset.

    3. Data Analysis

The overall dataset contains 2759 manually labeled documents of variable length, written by Alcide De Gasperi during his political life. However, the development dataset provided by the proponents contains only 2210 of them, since the remaining ones are held out for the blind same-genre test set. The dataset is extremely unbalanced: the number of documents per time period varies considerably. For instance, figure 1 shows that some years contribute only a few documents to the dataset, and the lack of data for these periods remarkably impacts the overall accuracy of the learning process. The development set provided by the proponents has been split into a training set and a validation set to assess the capabilities of the different tested models. The training set was composed by sampling 80% of the development dataset, leaving the remaining 20% to the validation split. This choice reflects the proportion between the size of the provided development set and the overall dataset.
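A minimal sketch of such an 80/20 split with scikit-learn, on placeholder data; the stratification mirrors the sampling described later in Section 4.

```python
from sklearn.model_selection import train_test_split

documents = [f"documento {i}" for i in range(100)]  # placeholder texts
labels = [i % 5 for i in range(100)]                # placeholder period ids

train_docs, val_docs, train_y, val_y = train_test_split(
    documents, labels,
    test_size=0.2,    # 20% validation
    stratify=labels,  # preserve the class distribution in both splits
    random_state=0)
```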

    Figure 1: Number of documents per year from 1901 to 1954


Without altering the validation split used for the assessment, the training data can be augmented to counter the imbalance. The hierarchical solution greatly increases the number of tokens that can be used to classify a document; nonetheless, the number of sentences per document must be bounded by a fixed constant. When a document is truncated to limit its number of sentences, the remaining part is inserted into the dataset as a new document instead of being discarded. This data augmentation procedure was implemented under the assumption that the less represented years contain the longest documents. While this holds for some classes, the augmentation did not significantly change the overall distribution.
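A hedged sketch of this augmentation step; the bound on sentences per document is an illustrative constant, as the exact value is not reported.

```python
# A document exceeding MAX_SENTENCES is split, and each remainder re-enters
# the dataset as a new document carrying the same label.
MAX_SENTENCES = 64  # illustrative assumption

def augment(dataset):
    """dataset: list of (sentences, label) pairs; returns the augmented list."""
    out = []
    for sentences, label in dataset:
        while len(sentences) > MAX_SENTENCES:
            out.append((sentences[:MAX_SENTENCES], label))
            sentences = sentences[MAX_SENTENCES:]
        out.append((sentences, label))
    return out
```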

The Italian sentence tokenizer included in the NLTK library has been used to split each document into a list of sentences (Bird, Klein, and Loper 2009). The content of each sentence is instead tokenized with the tokenizer of each sentence embedding technique, since these may require different configurations and must use their own vocabularies. A common issue in this scenario is the rate of out-of-vocabulary tokens (Wang et al. 2019), but this has not been evaluated, since the interfaces of the selected models offer no insight into the OOV rate or other token-level statistics. The time required to produce the embeddings over the training set is reported in Table 3.
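The segmentation step itself, sketched with NLTK's pre-trained Punkt model for Italian; the sample text is invented.

```python
import nltk
nltk.download("punkt")  # pre-trained Punkt sentence tokenizer models

text = "L'Italia deve ricostruire. Il governo presenta oggi il bilancio."
sentences = nltk.sent_tokenize(text, language="italian")
# ["L'Italia deve ricostruire.", 'Il governo presenta oggi il bilancio.']
```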

Table 3: Time required by each sentence embedding technique to process the training set

Method             Time
SBERT              223.068 s
LaBSE              3364.272 s
USE (Transformer)  154.277 s
USE (CNN)          29.681 s

    4. Building blocks selection

Because of the computational limitations, many experiments have been conducted on one subtask only, relegating the others to a subsequent phase. The historical subtask has been chosen because of the better balance of its dataset and its foreseeable, more promising results. The provided dataset has been split, using stratified sampling and data augmentation, into a sizable training set and a smaller validation set. The training split covers 80% of the provided development set, leaving the remaining 20% to the validation split. All of the results are produced by averaging multiple runs, to smooth out the non-deterministic and unpredictable effects of the GPUs used for training.

    4.1 Truncation based classification

The first experiments used a pre-trained multilingual BERT model for text classification. To satisfy the constraint on the input size, the documents were truncated to their first 512 tokens. As expected, the truncation proved ineffective: even after fine-tuning, the model did not converge on the training set for any subtask. The results are not significant and are therefore not reported.

    4.2 Sentence embeddings

Once each document is represented as a sequence of sentence embeddings, two different classification models have been implemented and evaluated. The first is a recurrent neural network with two bidirectional LSTM layers followed by a combination of dropout and dense layers of decreasing width. The other classifier is based on the transformer architecture: a transformer block, composed of a multi-headed self-attention layer with 128 heads, dropout and layer normalization, is followed by a combination of dropout and dense layers as in the previous solution.
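A hedged Keras sketch of the two classifiers; only the overall structure follows the description above, while the embedding size, sequence length, layer widths and dropout rates are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_SENTS, EMB_DIM, NUM_CLASSES = 64, 512, 5  # illustrative assumptions

def rnn_classifier():
    # Two bidirectional LSTM layers, then dropout and narrowing dense layers.
    return tf.keras.Sequential([
        layers.Input(shape=(MAX_SENTS, EMB_DIM)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def transformer_classifier():
    # Transformer block: 128-head self-attention, dropout, layer norm.
    inp = layers.Input(shape=(MAX_SENTS, EMB_DIM))
    att = layers.MultiHeadAttention(num_heads=128, key_dim=4)(inp, inp)
    x = layers.Add()([layers.Dropout(0.3)(att), inp])
    x = layers.LayerNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```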

The results of the experiments over the combinations of sentence embeddings and the two classifiers are reported in Table 4, showing that the combination of SBERT and the transformer-based classifier is the most adequate. With the exception of LaBSE, all the sentence embedding models gave better results when coupled with a transformer block than with a recurrent neural network. Also, the two variants of USE show a wider gap when coupled with the RNN classifier than with the transformer-based one. Finally, the performance drop of the LaBSE model may reflect a condition also explored by Reimers and Gurevych (2020), where a comparable performance gap with SBERT occurs on semantic textual similarity tasks.

Table 4: Results for the historical periods subtask over the training (TR) and validation (VL) sets using different sentence embeddings

                                TR                     VL
Embedding          Top     Loss   Acc    F1      Loss   Acc    F1
LaBSE              RNN     0.356  0.875  0.884   0.663  0.778  0.781
LaBSE              Trans   0.559  0.771  0.697   0.960  0.713  0.616
SBERT              RNN     0.143  0.955  0.975   0.690  0.824  0.829
SBERT              Trans   0.060  0.982  0.987   1.235  0.850  0.851
USE (CNN)          RNN     0.193  0.937  0.959   0.780  0.775  0.780
USE (CNN)          Trans   0.217  0.920  0.937   0.850  0.821  0.819
USE (Transformer)  RNN     0.105  0.969  0.978   0.780  0.815  0.823
USE (Transformer)  Trans   0.192  0.923  0.972   0.773  0.822  0.830

    4.3 Bag-of-entities

Another approach to tackle the subtasks consists of exploiting the knowledge of a pre-trained named entity recognizer. It is reasonable to suppose that the entities extracted from a document produce a good representation of the document itself. In the context of document dating, this is meaningful under the assumption that the issues discussed by the author vary over the years, consequently influencing the entities mentioned. By building a vocabulary of unique entities, each document can be represented as a bag-of-entities; a multi-layer dense classifier with dropout can then be trained to predict the correct time span.
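A minimal sketch of the bag-of-entities representation, with hypothetical helper names; the entity extraction itself is sketched further below.

```python
import numpy as np

def build_vocabulary(entity_lists):
    """Map each unique entity seen in training to a vector index."""
    vocab = {}
    for entities in entity_lists:
        for ent in entities:
            vocab.setdefault(ent, len(vocab))
    return vocab

def bag_of_entities(entities, vocab):
    """Count vector over the entity vocabulary; unseen entities are ignored."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for ent in entities:
        if ent in vocab:
            vec[vocab[ent]] += 1.0
    return vec
```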

Named entity recognition is performed with a pre-trained CNN for the Italian language distributed by spaCy (Honnibal and Montani 2017). Three variants of the same model are provided but, since their differences mostly impact model size rather than performance (Table 5), the medium-sized model has been chosen without further validation. Because of this, it is not possible to assess how the performance of the NER alone influences the performance of the overall system.

Table 5: Model size and benchmarks as provided by spaCy for the Italian language pre-trained models (Explosion.ai 2020)

           Small  Medium  Large
F1         86.57  88.54   89.40
Precision  86.85  88.76   89.56
Recall     86.29  88.33   89.24
Size       13 MB  43 MB   544 MB

For each entity, the NER model returns a pair containing its textual content and a label describing its role. A member of the entity vocabulary can be either the textual content alone or the unique pair of text and label; both methods were implemented and compared, but in the end only the label was chosen as representative of the entity.
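A minimal sketch of the extraction step with the medium Italian model distributed by spaCy, showing the text/label pairs discussed above; the sample sentence is invented, and the model must first be downloaded with `python -m spacy download it_core_news_md`.

```python
import spacy

nlp = spacy.load("it_core_news_md")  # medium pre-trained Italian pipeline
doc = nlp("Alcide De Gasperi parlò a Trento nel 1919.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
# e.g. [('Alcide De Gasperi', 'PER'), ('Trento', 'LOC')]
```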

    4.4 Results

The transformer classifier using the sentence embeddings provided by SBERT is chosen as the final candidate, since it is the best-performing model on the validation set. As previously discussed, the model selection procedure considered only the first subtask, because of the size and balance of its dataset. To roughly estimate the behavior on all the subtasks, both the sentence embedding classifier and the bag-of-entities solution have been retrained from scratch on the labels of each specific subtask and evaluated on the validation set. The results are reported in Table 6.

Table 6: Macro-averaged F1 for all the subtasks

                        SBERT+Trans      Bag-of-entities
Task         Baseline   TR      VL       TR      VL
Historical   0.827      0.930   0.846    0.997   0.841
Five-years   0.485      0.482   0.354    0.996   0.563
Single-year  0.126      0.086   0.040    0.990   0.211

    5. Model Architecture

It is therefore clear that the two approaches have their advantages on different subtasks. More precisely, the sentence embedding approach has proven more effective on the historical periods subtask, while the bag-of-entities obtains better results on the finer-grained ones. The problem of combining these two solutions is tackled next.

The trivial solution would be to hardwire the different approaches into a single model, producing the output for the first subtask with the sentence embedding model and for the other subtasks with the bag-of-entities one. While this solution would be acceptable, and seemingly above the baseline according to the estimates on the validation set, it is reasonable to assume that the representations for these subtasks can be shared, improving performance. Different variations of the same architecture are therefore evaluated on the validation set to monitor such improvement.

In the final model, the sentence embeddings produced by SBERT are fed to a transformer block containing a multi-headed self-attention layer; its output is averaged and concatenated with the bag-of-entities representation of the document before being fed to a multi-layer neural network. The output of each layer of this network is also fed to a dedicated neural network that produces the output of one subtask. The selected order of the subtasks in the multi-layer dense classifier places the historical classification first, followed by the five-year and then the single-year classification. A graphical representation of the architecture is given in figure 2.
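A hedged functional-API sketch of this architecture; the layer widths, the entity vocabulary size and the per-subtask class counts (5 periods, 11 five-year spans and 54 years over 1901-1954) are illustrative reconstructions, not the reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_SENTS, EMB_DIM, VOCAB = 64, 512, 5000  # illustrative assumptions

sents = layers.Input(shape=(MAX_SENTS, EMB_DIM), name="sbert_embeddings")
boe = layers.Input(shape=(VOCAB,), name="bag_of_entities")

# Transformer block over the sentence embeddings, then average pooling.
att = layers.MultiHeadAttention(num_heads=128, key_dim=4)(sents, sents)
x = layers.LayerNormalization()(layers.Add()([layers.Dropout(0.3)(att), sents]))
x = layers.GlobalAveragePooling1D()(x)
x = layers.Concatenate()([x, boe])  # join with the bag-of-entities vector

# Stack of dense layers; each level feeds a dedicated subtask head,
# ordered historical -> five-year -> single-year.
h1 = layers.Dense(256, activation="relu")(x)
hist = layers.Dense(5, activation="softmax", name="historical")(h1)
h2 = layers.Dense(256, activation="relu")(h1)
five = layers.Dense(11, activation="softmax", name="five_year")(h2)
h3 = layers.Dense(256, activation="relu")(h2)
year = layers.Dense(54, activation="softmax", name="single_year")(h3)

model = tf.keras.Model([sents, boe], [hist, five, year])
```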

    Figure 2: Architecture of the final model


Both reversing the subtask order and removing the hierarchy, by connecting all the classification networks directly to the transformer block, have been tested. The supposed added value of concatenating the entity representation has also been experimentally evaluated. The results of these variations are reported in Table 7, where the final model selected for the competition is on the fourth row.

Table 7: Results for the different subtasks over the training and validation sets using different architectures. The BoE column indicates whether the bag-of-entities representation is used (Yes or No); the Order column gives the order of the subtasks (Forward, Backward or Absent).

             Historical      Five-years      Single-year
BoE  Order   TR     VL       TR     VL       TR     VL
N    F       0.987  0.828    0.961  0.554    0.577  0.144
N    B       0.988  0.828    0.930  0.566    0.871  0.204
N    A       0.983  0.813    0.973  0.560    0.920  0.228
Y    F       0.991  0.842    0.980  0.599    0.852  0.236
Y    B       0.993  0.842    0.988  0.578    0.897  0.247
Y    A       0.991  0.820    0.994  0.560    0.967  0.242

    6. Results

The model has been evaluated on two independent test sets: same-genre and cross-genre. The first is a blind test set containing documents from the same source as the provided development dataset. The cross-genre set is instead an external test set containing documents from a different source, specifically an archive of epistolary documents by the same author.

For each subtask, two runs per test set were submitted; for brevity, Table 8 reports only the average result of the submitted runs. The model performs above the baseline in the same-genre evaluation for each subtask, also improving on the validation-set results. In the cross-genre evaluation, instead, the model merely replicates the baseline results and shows a significant drop with respect to the validation set.

Table 8: Macro-averaged F1 results for the different subtasks over the validation set (VL), the test sets (TS) and the respective baselines (BL)

                     Same-genre       Cross-genre
             VL      BL      TS       BL      TS
Historical   0.842   0.827   0.857    0.368   0.379
Five-years   0.599   0.458   0.609    0.171   0.168
Single-year  0.236   0.126   0.265    0.020   0.055

    7. Conclusions

The contribution of the bag-of-entities representation was certainly helpful, but it should not overshadow the performance improvement given by the introduction of the hierarchical model. The first three rows of Table 7 report the results of the model without any contribution from the bag-of-entities representation. While none of these was selected as the best candidate, there is a remarkable improvement over the independent use of the very same building blocks of the final architecture on each subtask.

The described architecture admits multiple variations, and only some of them have been formally evaluated and compared. Nonetheless, the selected final model was able to surpass the same-genre baseline on all of the subtasks. The performance drop on the cross-genre test should, however, be interpreted as a limit of the generalization power of the chosen model. A wider exploration of the model space might increase the overall performance on both the same-genre and the cross-genre tasks.

Moreover, targeting multiple subtasks at the same time made the choice of a final model nontrivial; it was therefore carried out intuitively, by considering the validation-set results for each subtask. A more formal approach to this issue might yield a finer model selection.

Despite the discussed approximations, sentence embedding models have proven effective also on tasks different from those they were originally conceived for, and compatible with other representations such as the bag-of-entities.

Bibliography


    Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. “EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian.” In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (Evalita 2020), edited by Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. Online: CEUR.org.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media.

    Andreas Blank. 1999. “Why Do New Meanings Occur? A Cognitive Typology of the Motivations for Lexical Semantic Change.” Historical Semantics and Cognition 61.

    Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, et al. 2018. “Universal Sentence Encoder.” arXiv:1803.11175 [Cs], April. http://arxiv.org/abs/1803.11175.

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805 [Cs], May. http://arxiv.org/abs/1810.04805.

    Explosion.ai. 2020. “Italian · spaCy Models Documentation.” https://spacy.io/models/it.

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. “Language-Agnostic BERT Sentence Embedding.” arXiv:2007.01852 [Cs], July. http://arxiv.org/abs/2007.01852.

    Matthew Honnibal, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.”

    Liu Liu, Kaile Liu, Zhenghai Cong, Jiali Zhao, Yefei Ji, and Jun He. 2018. “Long Length Document Classification by Local Convolutional Feature Aggregation.” Algorithms 11 (8): 109. https://0-doi-org.catalogue.libraries.london.ac.uk/10.3390/a11080109.


    Stefano Menini, Giovanni Moretti, Rachele Sprugnoli, and Sara Tonelli. 2020. “DaDoEval @ EVALITA 2020: Same-Genre and Cross-Genre Dating of Historical Documents.” In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (Evalita 2020), edited by Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. Online: CEUR.org.

    Nils Reimers, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” arXiv:1908.10084 [Cs], August. http://arxiv.org/abs/1908.10084.


    Nils Reimers, and Iryna Gurevych. 2020. “Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.” arXiv:2004.09813 [Cs], April. http://arxiv.org/abs/2004.09813.


    Hai Wang, Dian Yu, Kai Sun, Janshu Chen, and Dong Yu. 2019. “Improving Pre-Trained Multilingual Models with Vocabulary Expansion.” arXiv:1909.12440 [Cs], September. http://arxiv.org/abs/1909.12440.

    Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, et al. 2019. “Multilingual Universal Sentence Encoder for Semantic Retrieval.” arXiv:1907.04307 [Cs], July. http://arxiv.org/abs/1907.04307.

Author

    Riccardo Massidda

    Università di Pisa – r.massidda@studenti.unipi.it
