Hybrid Language Segmentation for Historical Documents
Abstract
Language segmentation, i.e. the division of a multilingual text into monolingual fragments, has been addressed in the past, but its application to historical documents has been largely unexplored. We propose a method for language segmentation for multilingual historical documents. For documents that contain a mix of high- and low-resource languages, we leverage the high availability of high-resource language material and use unsupervised methods for the low-resource parts. We show that our method outperforms previous efforts in this field.
1 Introduction
The computational processing of historical documents presents challenges that modern documents do not; often there is no standard orthography, and the documents may interleave multiple languages (Garrette et al., 2015). Furthermore, the languages used in the documents may by now be considered dead languages.
This work addresses the issue of language segmentation, i.e. segmenting a multilingual text into monolingual fragments for further processing. While this task has been addressed in the past using supervised and weakly supervised methods such as trained language models (Řehůřek and Kolkus, 2009; King and Abney, 2013), unsupervised methods (Biemann and Teresniak, 2005; Yamaguchi and Tanaka-Ishii, 2012; Alfter, 2015a), methods for short messages (Porta, 2014; Alfter, 2015b) and methods for historical documents with regard to OCR tasks (Garrette et al., 2015), there is still room for improvement, especially concerning historical documents.
Due to the scarcity of multilingual corpora (Lui et al., 2014), a popular approach is to use monolingual training data. However, in the case of historical documents, the number of available texts in a given historical language might be too low to yield representative language models.
We propose a method that works on texts containing at least one high-resource language and at least one low-resource language. The intuition is to use supervised and weakly supervised methods for the high-resource languages and unsupervised methods for the low-resource languages to arrive at a better language segmentation: supervised methods derived from high-resource languages single out these languages, while unsupervised algorithms tackle the remaining unknown language(s) and cluster them by similarity.
The presented approach is extendable to more than one high-resource language, in which case a separate language model has to be trained for each language; the approach is also applicable to more than one low-resource language, where the unsupervised methods are expected to produce an accurate split of all languages present.
2 Hybrid language segmentation
Let D = w1…wn be a document consisting of the words w1 to wn. Let Lh be a character-level n-gram language model trained on data for a high-resource language which occurs in the document D. We first apply the language model Lh to the document D and assign each word wi the probability given by Lh (1).
The language model Lh is implemented as a trigram language model with non-linear back-off. For testing purposes, we trained a language model on a dump of the English Wikipedia (3 GB of compressed data).
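As an illustration, here is a minimal sketch of such a character-level trigram model. It is ours, not the authors' implementation: the paper does not specify the exact non-linear back-off scheme, so the sketch substitutes simple linear interpolation of trigram, bigram and unigram estimates, and the weights and the geometric-mean word scoring are assumptions.

```python
import math
from collections import Counter

class CharTrigramLM:
    """Character-level trigram language model (illustrative sketch)."""

    def __init__(self, weights=(0.7, 0.2, 0.1)):
        self.w3, self.w2, self.w1 = weights  # assumed interpolation weights
        self.tri = Counter()   # trigram counts
        self.ctx = Counter()   # trigram contexts (first two characters)
        self.bi = Counter()    # bigram counts
        self.uni = Counter()   # unigram counts
        self.total = 0

    def train(self, text: str) -> None:
        s = "##" + text.lower() + "#"  # boundary padding
        for i in range(2, len(s)):
            self.tri[s[i - 2:i + 1]] += 1
            self.ctx[s[i - 2:i]] += 1
            self.bi[s[i - 1:i + 1]] += 1
            self.uni[s[i - 1]] += 1
            self.total += 1

    def prob(self, word: str) -> float:
        """Word probability as the geometric mean of per-character
        probabilities, so that longer words are not penalized."""
        s = "##" + word.lower() + "#"
        logp, n = 0.0, 0
        for i in range(2, len(s)):
            c3, c2 = self.ctx[s[i - 2:i]], self.uni[s[i - 1]]
            p3 = self.tri[s[i - 2:i + 1]] / c3 if c3 else 0.0
            p2 = self.bi[s[i - 1:i + 1]] / c2 if c2 else 0.0
            p1 = self.uni[s[i]] / self.total if self.total else 0.0
            logp += math.log(self.w3 * p3 + self.w2 * p2 + self.w1 * p1 + 1e-12)
            n += 1
        return math.exp(logp / n)
```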
Under the assumption that the text contains at least two languages with at least one word from each language, we determine the minimum probability Pmin for a split (2). This probability corresponds to the lowest probability assigned by the language model Lh to any word in the text.
Next, we determine the maximum probability distance Pa between adjacent words (3) and the global maximum probability distance Pg between any two words (4).
We also calculate the mean probability Pmean of the two adjacent words which maximize Pa (5).
Finally, we calculate the sharpest drop in probabilities and define Pmindrop as the probability at the lowest point of the drop (6).
We then set a preliminary language split threshold Psplit based on Pmin, Pa, Pg, Pmean and Pmindrop (7).
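The equations referenced in the preceding paragraphs were rendered as images in the original and are missing here. The following reconstruction of (1) to (6) is ours, derived directly from the prose definitions (the reading of (6) is one plausible interpretation); the exact combination used for Psplit in (7) is not recoverable from the text, so it is shown only as an unspecified function f.

```latex
\begin{align}
  P(w_i)      &= P(w_i \mid L_h) \tag{1}\\
  P_{min}     &= \min_{1 \le i \le n} P(w_i) \tag{2}\\
  P_a         &= \max_{1 \le i < n} \bigl|\, P(w_i) - P(w_{i+1}) \,\bigr| \tag{3}\\
  P_g         &= \max_{1 \le i, j \le n} \bigl|\, P(w_i) - P(w_j) \,\bigr| \tag{4}\\
  P_{mean}    &= \tfrac{1}{2}\bigl( P(w_k) + P(w_{k+1}) \bigr),
                 \quad k = \arg\max_i \bigl|\, P(w_i) - P(w_{i+1}) \,\bigr| \tag{5}\\
  P_{mindrop} &= P(w_{m+1}),
                 \quad m = \arg\max_i \bigl( P(w_i) - P(w_{i+1}) \bigr) \tag{6}\\
  P_{split}   &= f\bigl( P_{min},\, P_a,\, P_g,\, P_{mean},\, P_{mindrop} \bigr) \tag{7}
\end{align}
```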
In a first step, every word wi with a probability P above the split threshold Psplit is considered to belong to the high-resource language modeled by Lh and is tagged as such, while every word wj with a probability P below the split threshold is considered to belong to an unknown language and is left untagged.
In a second step, all untagged words are clustered by similarity. This is done using language model induction (Alfter, 2015a). All words left untagged by the previous step are regarded as one text. From the first word w1, an initial language model Li is created. The next word w2 is tested against the initial model. If the probability P(w2|Li) exceeds a certain threshold value, the model is updated with w2; otherwise, a new model is created. In this way, we iterate through the text, creating language models as necessary.
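A sketch of the forward induction pass, where `CharTrigramLM` is the illustrative model class sketched above and the threshold value is an assumed free parameter, since the paper does not report the one used:

```python
def induce_models(words, threshold=0.01):
    """Forward language model induction: grow the current model while
    incoming words fit it, otherwise open a new model (= cluster)."""
    current = CharTrigramLM()
    current.train(words[0])
    models = [current]
    for word in words[1:]:
        if current.prob(word) >= threshold:
            current.train(word)           # word fits: update the model
        else:
            current = CharTrigramLM()     # word does not fit: new model
            current.train(word)
            models.append(current)
    return models

# untagged_words stands for the words left untagged by the thresholding step;
# the backward pass runs the same procedure over the reversed sequence.
forward = induce_models(untagged_words)
backward = induce_models(untagged_words[::-1])
```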
The same procedure is applied starting from the last word and moving towards the beginning of the text. From the two sets of induced language models (forward, backward), the most similar models according to their n-gram distributions are then merged. This process is repeated, keeping the previously merged models, until no more models are induced.
Each word is then tagged with the language model Lm (≈ cluster) which maximizes P(w|Lm).
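That is, with the merged models written as Lm, the tag of each word is:

```latex
t(w_i) = \arg\max_{m} P(w_i \mid L_m)
```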
Finally, all words are evaluated in a local context using variable-length Markov models (VMM). This step aims at eliminating inconsistencies, detecting other-language inclusions and merging same-language fragments back together. Řehůřek and Kolkus (2009) use a similar technique, but they use a fixed-width sliding window while we use a variable window size based on context.
For each word wi, we look at its tag ti. We then consider all the words immediately to the left of wi and all the words immediately to the right of wi that have a tag different from ti. From these words, we create local context language models left (Ll) and right (Lr). We calculate the similarity between Ll and Lr as well as the similarity of wi to Ll and Lr. There are different possible scenarios:
1. Ll is similar to Lr
   (a) wi is similar to Ll or Lr
   (b) wi is dissimilar to Ll and Lr
2. Ll is dissimilar to Lr
   (a) wi is similar to Ll
   (b) wi is similar to Lr
   (c) wi is dissimilar to Ll and Lr
In case 1a, we assimilate the tag of wi to the tag of either Ll or Lr; in that case, the labels for Ll and Lr are the same. In case 1b, wi is probably an other-language inclusion, since it is dissimilar to its context, while the left and right contexts are similar. In case 2a, we assimilate the tag of wi to the tag of Ll, and similarly in case 2b, we assimilate the tag of wi to Lr. In case 2c, wi is dissimilar to its context and the left and right contexts are also dissimilar. In this case, we leave the tag unchanged.
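The case analysis can be made concrete in a short sketch of ours; `similar` stands in for the similarity measure, which the paper leaves unspecified, and comparing a single word to a model is assumed to go through a small model built from that word.

```python
def smooth_tag(word, tag, l_model, l_tag, r_model, r_tag, similar):
    """Re-tag a single word from its local context, following cases
    1a/1b and 2a/2b/2c above. `similar` is an assumed predicate that
    accepts either two models or a word and a model."""
    if similar(l_model, r_model):                        # case 1
        if similar(word, l_model) or similar(word, r_model):
            return l_tag      # 1a: per the text, l_tag == r_tag here
        return tag            # 1b: likely other-language inclusion
    if similar(word, l_model):
        return l_tag          # 2a: assimilate to the left context
    if similar(word, r_model):
        return r_tag          # 2b: assimilate to the right context
    return tag                # 2c: dissimilar to both, leave unchanged
```

When the word resembles both dissimilar contexts, the sketch prefers the left one, consistent with the later finding that left-context assimilation works best.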
The following sections describe the data used for evaluation as well as the results.
3 Data and Evaluation
Pacati, [Ved. pacati, Idg. *peqǔō, Av. pac-; Obulg. peka to fry, roast, Lith, kepū bake, Gr. pέssw cook, pέpwn ripe] to cook, boil, roast Vin. IV, 264; fig. torment in purgatory (trs. and intrs.): Niraye pacitvā after roasting in N.S.II, 225, PvA. 10, 14. – ppr. pacanto tormenting, Gen. pacato (+Caus. pācayato) D. I, 52 (expld at DA. I, 159, where read pacato for paccato, by pare daṇḍena pīḷentassa). – pp. pakka (q.v.). < - >Caus. pacāpeti & pāceti (q. v.). – Pass. paccati to be roasted or tormented (q. v.). (Page 382)
In the absence of better comparable data, we re-use the Pali dictionary entries presented in Alfter (2015a) and compare our calculated language segmentation to the segmentation reported there.
The extract shown above corresponds to the fifth Pali text used in the experiments. It illustrates, among other things, some of the languages used, the unclear boundaries between languages, and the abbreviations, symbols and references. Monolingual stretches tend to be short, with interspersed language inclusions.
Based on the findings in Alfter (2015a) that neither a high Rand Index nor a high F-score alone yields a good segmentation, but a combination of the two does, we adopt a new goodness-of-segmentation measure Gs, defined as the arithmetic mean of the Rand Index and the F5 score (8).
Due to how precision and recall are calculated in the context of cluster evaluation, setting β>1, and thus placing more emphasis on recall, penalizes the algorithm for clustering together data points that are separated in the gold standard, and lowers the impact of splitting data points which are clustered together in the gold standard. Indeed, it is preferable to have multiple clusters of a certain language than to have clusters of mixed languages. Thus, we use F5 (β=5) instead of F1 scores.
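Equation (8) was likewise lost in the source rendering; from the definition above it reads, with Fβ written in its standard form:

```latex
F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}
               {\beta^2 \cdot \mathrm{precision} + \mathrm{recall}},
\qquad
G_s = \frac{\mathrm{RI} + F_5}{2} \tag{8}
```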
We have found left-context assimilation to work better than right-context or both-side context assimilation. We therefore use only left-context assimilation and leave out the other two options.
4 Results
The following table shows our results (Hybrid Language Segmentation, HLS) compared to the results given in Alfter (2015a) (Language Model Induction, LMI). We converted the scores given in Alfter (2015a) to the new compound score Gs. The baselines from Alfter (2015a) are also indicated: AIO denotes the baseline where every word is placed in the same cluster (all-in-one), and AID the baseline where every word is placed in its own cluster (all-in-different).
Table 1: Results

Text   | AIO    | AID    | LMI    | HLS
Pali 1 | 0.3174 | 0.4643 | 0.5296 | 0.6665
Pali 2 | 0.3635 | 0.5188 | 0.7662 | 0.5916
Pali 3 | 0.4996 | 0.3071 | 0.4700 | 0.6056
Pali 4 | 0.4047 | n/a    | n/a    | 0.4730
Pali 5 | 0.5848 | 0.2833 | 0.4402 | 0.5863
As can be seen from the results, our approach outperforms the baselines as well as the purely unsupervised language model induction approach, except for one data point (Pali 2), where the language model induction produced an almost perfect clustering whereas the hybrid language segmentation method did not.
A big problem with the dictionary data is that it is transcribed in a noisy manner. This is not immediately clear from looking at the data, but on closer inspection, it can be seen that some symbols like commas and full stops are rendered with non-standard Unicode characters (Unicode codepoint U+FF0C (FULLWIDTH COMMA) and Unicode codepoint U+FF0E (FULLWIDTH FULL STOP)) which break the chosen whitespace tokenization method. This results in chunks that are bigger than they should be, often containing multiple languages. We can also see that Greek characters were transcribed as lookalike characters that are not actually Greek (see the quote at the beginning of Section 3).
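As a small illustration (ours, not part of the paper), such fullwidth punctuation can be repaired before tokenization with Unicode NFKC compatibility normalization, which maps fullwidth forms to their ASCII counterparts; lookalike Latin/Greek substitutions, by contrast, are not compatibility-equivalent and would require a dedicated confusables mapping.

```python
import unicodedata

def normalize_fullwidth(text: str) -> str:
    # NFKC maps compatibility variants such as U+FF0C (FULLWIDTH COMMA)
    # and U+FF0E (FULLWIDTH FULL STOP) to , and . so that whitespace
    # tokenization no longer yields oversized multi-language chunks.
    return unicodedata.normalize("NFKC", text)

assert normalize_fullwidth("Vin\uff0e IV\uff0c 264") == "Vin. IV, 264"
```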
If we look more closely at the results, we can see that our approach tends to be overly confident when assigning words to the high-resource language, which in this case is English. This includes words that clearly are not English, such as ‘°itar’ and ‘°ātar’1. The following example (Pali 1) shows the full dictionary entry.
[n. ag. fr. abhijjhita in med. function] one who covets M i. 287 (T. abhijjhātar, v. l. °itar) = A v. 265 (T. °itar, v. l. °ātar).
The poor discriminatory power of the model is probably related to the training data. While the English Wikipedia offers a huge amount of training data, it also includes many non-English words, for example in explanations and on pages about untranslatable non-English terms. Thus, the resulting language model is noisy.
It might be possible to increase accuracy by changing the split threshold Psplit: while choosing a higher Psplit will effectively reduce the number of erroneous English tags, it will also decrease the number of correctly tagged words. It is possible that the unsupervised approach followed by the local context smoothing might re-assign the English words to the English model, or at least to a consistent second model. However, this remains to be tested. We think that simply using more ‘pure’ English training data would improve the language model's accuracy.
As for local context smoothing, we have not reached conclusive results. While in some cases it succeeds in re-assigning the correct tag to a previously incorrectly tagged word, it also induces errors by erroneously re-tagging previously correct tags. This is most probably due to the short monolingual fragments in our data; longer monolingual fragments would yield more reliable language models. Relatedly, calculating similarity based on small contexts seems problematic. Another problem is the treatment of non-words: we have chosen not to cross non-word boundaries when calculating local context, but doing so might improve the results.
Finally, we have only tested the approach with one high-resource language and a multitude of low-resource languages. It would be interesting to test the method more extensively using more high-resource language models (which in turn might interfere with each other).
5 Conclusion
We have introduced a hybrid language segmentation method which leverages the presence of high-resource language content in mixed-language historical documents, and the availability of the resources necessary to build language models, coupled with an unsupervised language model induction approach which covers the low-resource parts. We have shown that our method outperforms the previously introduced unsupervised language model induction approach.
We have also found that our method seems to work on both longer and shorter texts, whereas the approach described in Alfter (2015a) seems to work better on shorter texts such as Twitter messages.
The local context approach yields inconclusive results. This is most probably due to the similarity measure used and the small size of the context. We would need, if possible, a better similarity measure for small language models, or another method of evaluating a word with respect to its context.
Bibliography
Alfter, D. (2015a). Language Segmentation. Master’s thesis, Universität Trier.
Alfter, D. (2015b). Language segmentation of Twitter tweets using weakly supervised language model induction. TweetMT @ SEPLN.
Biemann, C. and Teresniak, S. (2005). Disentangling from Babylonian confusion – unsupervised language identification. In Computational Linguistics and Intelligent Text Processing, pages 773–784. Springer.
Garrette, D., Alpert-Abrams, H., Berg-Kirkpatrick, T., and Klein, D. (2015). Unsupervised code-switching for multilingual historical document transcription. In Proceedings of NAACL.
King, B. and Abney, S. P. (2013). Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies, pages 1110–1119.
Lui, M., Lau, J. H., and Baldwin, T. (2014). Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2:27–40.
Porta, J. (2014). Twitter Language Identification using Rational Kernels and its potential application to Sociolinguistics. TweetLID @ SEPLN.
Řehůřek, R. and Kolkus, M. (2009). Language identification on the web: Extending the dictionary method. In Computational Linguistics and Intelligent Text Processing, pages 357–368. Springer.
Yamaguchi, H. and Tanaka-Ishii, K. (2012). Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 969–978. Association for Computational Linguistics.
Notes
1 Here, ° stands for the root of the head word of the entry, so °itar should be read ‘abhijjhitar’ and °ātar should be read ‘abhijjhātar’.
Authors
David Alfter, University of Gothenburg - david.alfter@gu.se
Yuri Bizzoni, University of Gothenburg - yuri.bizzoni@gu.se