Dialog-based Help Desk through Automated Question Answering and Intent Detection
p. 443-449
Abstract
Modern personal assistants need to access unstructured information in order to successfully fulfill user requests. In this paper, we have studied the use of two machine learning components to design personal assistants: intent classification, to understand the user request, and answer sentence selection, to carry out question answering from unstructured text. The evaluation results, obtained on five different real-world datasets associated with different companies, show high accuracy for both tasks. This suggests that modern QA and dialog technology is effective for real-world tasks.
Modern personal assistants need to access unstructured information in order to successfully fulfill user requests. In this paper, we studied the use of machine learning to design two components of a personal assistant: intent classification, to understand the user request, and answer sentence selection, to answer questions from unstructured text. The evaluation results, derived from five different real-world datasets associated with different companies, show high accuracy for both models. This suggests that modern question answering and dialog technology is effective for real-world tasks.
1. Introduction
Help-desk applications use Machine Learning to classify user requests into intents. The information owned by companies is generally in free-text form, coming from company documents or websites. For example, corporate knowledge is typically encoded within documents in an unstructured way. This poses limitations on the effectiveness of standard information access: searching documents by keywords is not a viable solution for users, as they can seldom find an answer to their questions. The possibility of using QA systems to search for information in a corpus of documents, also through a dialog system, offers an attractive solution for extracting the best information from company knowledge bases.
The IMSL company offers virtual agents that can be retrained based on customer needs. The agent is composed of several Natural Language Understanding components, such as classifiers that map each input user utterance to its corresponding intent. However, since it is not possible to forecast all the intents corresponding to the questions that users are going to ask – which are potentially infinite – it is of paramount importance to have an automated QA system able to automatically provide the best answer (paragraph) extracted from a company-owned knowledge base.
Information access is becoming an increasingly critical issue. Traditional Information Retrieval systems used in industry help the user in accessing information, but are often imprecise and impractical. Current search engines are an example of this: searching for information on the web often requires a double effort from the user, who first needs to understand how to formulate a query in the most effective manner, and then has to filter the proposed results in order to find the most relevant information.
In this paper, we describe our QA system based on answer sentence selection and intent detection, and how we integrate these components into a conversational agent.
2. Related Work
As of today, most general-purpose QA services are provided by big tech companies such as Amazon Alexa, Google Home, Ask Yahoo!, Quora and many others. Unfortunately, these types of applications are not easily accessible for smaller companies, as the offered QA service cannot be easily adapted to handle corporate knowledge, which is in the form of unstructured text. To build their own solutions, SMEs can exploit QA components such as answer sentence selection.
In recent years, deep learning approaches have been successfully applied to automatically modeling text pairs, e.g., (Lu and Li 2013; Yu et al. 2014). Additionally, a number of deep learning models have been recently applied to QA, e.g., Yih et al. (2013) applied CNNs to open-domain QA; Bordes et al. (2014) proposed a neural embedding model combined with a knowledge base for open-domain QA. Iyyer et al. (2014) applied recursive neural networks to factoid QA over paragraphs. Miao et al. (2016) proposed a neural variational inference model and a Long Short-Term Memory network for the same task. Yin et al. (2016) proposed a siamese convolutional network for matching sentences that employs an attentive average pooling mechanism, obtaining state-of-the-art results on various tasks and datasets.
The work closest to this paper is by Yu et al. (2014) and Severyn and Moschitti (2015). The former presented a CNN architecture for answer sentence selection that uses bigram convolution and average pooling, whereas the latter used convolution with k-max pooling.
Nowadays, supporting customers in their activities across applications and websites is becoming ever more demanding, due to the large number of customers and the variety of topics that have to be covered.
New tools able to answer frequently asked questions (FAQs), such as chatbots, are emerging in response to these needs. Classifying the user need expressed in a natural-language question into a predefined set of categories allows conversational agents to recognize which users are asking which types of questions and to react accordingly.
Traditional approaches to this problem include the use of supervised methods such as Support Vector Machines (SVMs) (Cortes and Vapnik 1995), Boosting (Iyer et al. 2000; Schapire and Singer 2000), kernel machines operating on structured input objects (Moschitti 2006; Lodhi et al. 2002), and Maximum Entropy models (Yaman et al. 2008).
In recent years, new models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs) (Chung et al. 2014), and Convolutional Neural Networks (CNNs) (Lecun et al. 1998; Kim 2014) have been established as state-of-the-art approaches for text classification.
3. System Description
Our QA system allows for extracting portions of text from company documents or from websites. This information is organized into paragraphs, which are then used to provide answers to users' questions.
One practical problem is that not all PDF files encode text, and many fail to preserve the logical order of the text. Thus, in order to extract paragraphs, we used pdf2text.
Another practical problem we needed to solve was to keep together portions of text separated by punctuation, such as bullet lists or highly structured paragraphs. The tool we designed automatically assigns a reference index or summary to each paragraph to improve subsequent searches (see Figure 1).
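The paper does not include code for this step; the following is a minimal Python sketch, under the assumption that pdf2text has already produced plain text, of how paragraphs could be segmented while keeping bullet lists attached to the text that introduces them, and how each paragraph could receive a reference index and a crude summary. All names (Paragraph, split_paragraphs, index_paragraphs) are illustrative and not part of the authors' tool.

```python
import re
from dataclasses import dataclass

@dataclass
class Paragraph:
    doc_id: str   # source document the paragraph comes from
    index: int    # reference index used in later searches
    summary: str  # short summary, here approximated by the first sentence
    text: str

def split_paragraphs(raw_text: str) -> list:
    """Split extracted text on blank lines, but keep bullet lists
    attached to the paragraph that introduces them."""
    blocks, current = [], []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current paragraph.
            if current:
                blocks.append(" ".join(current))
                current = []
            continue
        if re.match(r"^[-*•]\s", stripped) and blocks and not current:
            # Bullet item right after a blank line: re-attach it to the
            # previous block so the list stays with its introduction.
            blocks[-1] += " " + stripped
            continue
        current.append(stripped)
    if current:
        blocks.append(" ".join(current))
    return blocks

def index_paragraphs(doc_id: str, raw_text: str) -> list:
    """Assign each paragraph a reference index and a crude summary."""
    paragraphs = []
    for i, block in enumerate(split_paragraphs(raw_text)):
        summary = block.split(". ")[0][:120]  # first sentence, truncated
        paragraphs.append(Paragraph(doc_id, i, summary, block))
    return paragraphs
```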
Subsequently, each question and answer pair must be annotated for correctness (label TRUE/FALSE). This allows us to create a training set to train the re-ranking network (see Figure 2).
The final system, shown in Figure 3, therefore allows for using the target company data, appropriately reorganized into paragraphs, to provide answers to the user's requests. On average, we provide from 3 to 5 answers for each question. In addition, we provide the reference to the document and the summary to which each paragraph refers.
4. Answer Sentence Selection (AS2)
The goal of AS2 is to rank a list of answer candidates by their similarity with respect to an input question. We designed a network that includes relational information between questions and answers. Our results show that CNNs reach better performance than traditional IR models based on bag-of-words representations.
4.1 Model
The architecture of the network used for mapping sentences into embedding vectors is shown in Figure 4 and is inspired by the CNNs employed by Severyn and Moschitti (2015) to perform several sentence classification tasks. It includes two main components:
(i) an encoder that maps an input sentence s into a vector, and (ii) a feed-forward network that computes the similarity between the input sentences.
Our network takes two sentences as input, i.e., a question and a text paragraph that may contain an answer, and represents each of them as a vector of fixed dimension.
The sentence model is composed of a sequence of convolutional maps followed by pooling operations. Such a model achieves state-of-the-art results in many NLP tasks (Kalchbrenner, Grefenstette, and Blunsom 2014; Kim 2014).
Then, the sentence vectors corresponding to the question and the answer are concatenated and passed to the following neural network layers, which are composed of a non-linear hidden layer and an output layer with a sigmoid activation unit. At the end, the network returns a value between 0 and 1 corresponding to the relevance of the answer with respect to the question.
Finally, we included word overlap embeddings encoding relational information between words in questions and answers (Severyn and Moschitti 2016).
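As an illustration of the architecture described above, the following is a minimal sketch of a simplified version of the model: a convolutional sentence encoder applied to both the question and the candidate answer, concatenation of the two sentence vectors, a non-linear hidden layer, and a sigmoid output. The paper does not specify the framework, so PyTorch is assumed here; the encoder is shared between question and answer for brevity, the word-overlap embeddings of Severyn and Moschitti (2016) are omitted, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Convolutional sentence model: embedding -> convolution -> max pooling."""
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=100, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, hidden_dim, kernel_size, padding=1)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))                   # (batch, hidden_dim, seq_len)
        return x.max(dim=2).values                     # (batch, hidden_dim)

class AS2Ranker(nn.Module):
    """Scores a (question, candidate answer) pair with a value in [0, 1]."""
    def __init__(self, vocab_size, hidden_dim=100):
        super().__init__()
        self.encoder = SentenceEncoder(vocab_size, hidden_dim=hidden_dim)
        self.hidden = nn.Linear(2 * hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, question_ids, answer_ids):
        q = self.encoder(question_ids)
        a = self.encoder(answer_ids)
        pair = torch.cat([q, a], dim=1)       # concatenated sentence vectors
        h = torch.tanh(self.hidden(pair))     # non-linear hidden layer
        return torch.sigmoid(self.output(h))  # relevance score in [0, 1]
```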
5. Intent Classification
We adopted advanced techniques, such as deep learning models, to classify the user need, which is semantically expressed by the user question, into a predefined set of categories, i.e., intents.
We used several common deep learning models for solving the intent detection task. The main point of our study is to test those models and observe how they perform on datasets containing real user questions addressed to a virtual agent operating in the banking/financial sector.
At this stage, we did not consider novel methods based on the transformer architecture, such as BERT (Devlin et al. 2019), which require a large amount of resources, typically not available to SMEs. Instead, we focused on lighter approaches that can run on small GPUs. We report our experiments and discuss the obtained results using such lighter models.
5.1 Models
SVM (baseline): an SVM fed with word features derived from the text of the utterances.
LSTM: uses recurrent units that take as input the embedding x_t of the current word at time step t and the hidden vector h_(t-1) encoding the sub-phrase at the previous step, and return the vector representation of the phrase at step t, i.e., h_t.
CNN: uses a set of convolutional filters of different sizes and max pooling operations to extract the most important features, e.g., bigrams, trigrams, etc., which represent the sentence meaning.
LSTM + CNN: an architecture composed of two layers: an LSTM layer that builds a fixed-size vector representation of the sentence at each word, and a convolutional layer that applies a set of convolutional operations on the representations returned by the first layer (see the sketch after this list).
CNN + CNN: composed of two CNN layers, where the second layer takes the representation from the previous layer as input and applies a set of convolutional filters and pooling operations to compute the final vector representation of the sentence.
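As an example of the above models, the following is a minimal sketch of a simplified LSTM + CNN intent classifier. The framework (PyTorch) and all hyperparameters are assumptions made for illustration and do not necessarily reflect the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class LSTMCNNIntentClassifier(nn.Module):
    """LSTM layer producing a representation at each word, followed by a
    convolutional layer with max pooling, then a linear layer over intents."""
    def __init__(self, vocab_size, num_intents, emb_dim=100,
                 lstm_dim=128, conv_dim=128, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        self.conv = nn.Conv1d(lstm_dim, conv_dim, kernel_size, padding=1)
        self.classifier = nn.Linear(conv_dim, num_intents)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                            # per-word representations
        h = torch.relu(self.conv(h.transpose(1, 2)))   # (batch, conv_dim, seq_len)
        sentence = h.max(dim=2).values                 # max pooling over time
        return self.classifier(sentence)               # intent logits
```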
6. Experiments
In this section, we first describe the datasets used in our experiments, then we provide the results on the answer sentence selection and intent classification tasks. Finally, we report an end-to-end evaluation of our system.
6.1 Data Description
We built our datasets by collecting samples of questions asked by users to conversational agents on either Credit Institution or Bank websites. We collected two intent corpora from each data provider, resulting in a total of four datasets.
Istituto Credito - synthetic (ICs): This corpus was created by expert dialog engineers. It contains a set of utterances annotated with their corresponding intents. The subjects of the questions are diverse and span many topics: for example, some questions seek information about bank branch locations, report problems regarding how to cash checks, or ask about the availability of financial products. It contains 2,305 training examples and 593 test examples, for a total of 2,898 examples.
Istituto Credito - full (ICf): This dataset is composed of synthetic questions generated by language engineers. Subsequently, it was augmented to also take into account real sentences, retrieved from the website chatbot of a well-known Credit Institution operating in Italy. It contains 2,898 training examples and 770 test examples, for a total of 3,668 examples.
Banca - Area Informativa (BancaAI): This dataset contains real questions asked by users about the Area Informativa of a bank. It includes 3,947 training examples and 987 test examples, for a total of 4,934 examples divided into 282 intents.
Banca - Internet Banking (BancaIB): This dataset includes real questions asked by users about the iBanking service offered by a well-known Italian bank. It includes 4,380 training instances and 1,906 test instances divided into 251 intents.
Answer Sentence Selection data: We used an in-house dataset called IMSL-WIKI, which contains question and answer pairs regarding some of the products and services sold by IM Service Lab. For each question, a paragraph list was collected using an off-the-shelf search engine, i.e., Lucene, and manually annotated as either relevant or irrelevant. The dataset is divided into two parts, i.e., a training set and a test set, which contain 5,190 and 1,240 QA pairs, respectively. For each question, we retrieved a list of 10 candidate answers.
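As an illustration of how the candidate lists could be built, the following sketch retrieves the top 10 paragraphs for a question with BM25. The paper uses Lucene for this step; here the rank_bm25 Python package is used as a lightweight stand-in, and all function and variable names are illustrative.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25; stand-in for Lucene

def build_candidates(question, paragraphs, n=10):
    """Retrieve the top-n candidate paragraphs for a question with BM25.
    The returned (question, paragraph) pairs are then manually labeled
    TRUE/FALSE to train the re-ranking network."""
    tokenized_corpus = [p.lower().split() for p in paragraphs]
    bm25 = BM25Okapi(tokenized_corpus)
    top = bm25.get_top_n(question.lower().split(), paragraphs, n=n)
    return [(question, paragraph) for paragraph in top]
```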
6.2 Model results
In this section, we report the performance of the two main machine learning components of our system: answer sentence selection and intent classification.
6.2.1 Answer Sentence Reranking
Table 1 reports the performance of the neural network and of the baseline system. The first row, i.e., BM25, shows the baseline system, while the second row shows the performance of the CNN. The systems are evaluated according to Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Precision at 1 (P@1). The reported results are obtained as the average of 5 different models trained and evaluated on the test set. For each measure in the table, we report both the mean and the standard deviation, computed on the dev. and test sets.
Table 1: Results of the QA model on the dev. and test sets of the IMSL-WIKI corpus
Models | Dev MAP | Dev MRR | Dev P@1 | Test MAP | Test MRR | Test P@1 |
BM25 | 64.20 ± 0.00 | 70.20 ± 0.00 | 57.60 ± 0.00 | 55.40 ± 0.00 | 62.40 ± 0.00 | 46.70 ± 0.00 |
CNN | 65.04 ± 1.10 | 69.34 ± 1.36 | 53.34 ± 2.66 | 68.38 ± 1.08 | 72.21 ± 1.33 | 57.42 ± 2.16 |
We used a small fraction of the training set, i.e., 15% of the data, for early stopping. As can be seen from the table, CNN performs about 1 point better than the baseline algorithm (BM25) in terms of MAP on the dev. set, and about 13 absolute points better in terms of MAP on the test set.
In addition, we observe an increase of 9.8 absolute points in terms of MRR and 10.65 absolute points in terms of P@1 on the test set. The difference between the results on the dev. and test sets can be explained by the fact that the dev. set is very small: only 124 question lists and 1,239 Q/A pairs, which made it difficult to optimize the three ranking metrics at the same time, so we focused on MAP.
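For reference, the following sketch shows how the three ranking metrics can be computed from the binary relevance labels of each ranked candidate list; it is a plain re-implementation of the standard definitions, not the evaluation script used in the paper.

```python
def ranking_metrics(ranked_lists):
    """Compute MAP, MRR and P@1 from ranked candidate lists, where each list
    holds binary relevance labels (1 = correct answer) in ranked order.
    Questions without any correct answer contribute 0 to every metric."""
    ap_scores, rr_scores, p1_scores = [], [], []
    for labels in ranked_lists:
        # Precision@1: is the top-ranked candidate correct?
        p1_scores.append(float(labels[0] == 1))
        # Reciprocal rank of the first correct answer.
        rr = 0.0
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                rr = 1.0 / rank
                break
        rr_scores.append(rr)
        # Average precision over all correct answers.
        hits, precisions = 0, []
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
        ap_scores.append(sum(precisions) / len(precisions) if precisions else 0.0)
    n = len(ranked_lists)
    return {"MAP": sum(ap_scores) / n,
            "MRR": sum(rr_scores) / n,
            "P@1": sum(p1_scores) / n}
```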
6.2.2 Intent Classification
We ran the state-of-the-art neural classifiers described in Section 5.1 on the Credit Institute and Bank datasets. To select the best hyperparameters, we used 30% of the training data as a validation set. We compare the performance of the neural models with that of a strong baseline classifier, i.e., SVM, and report the results in terms of Accuracy (Table 2) and F1 (Table 3). The tables show that the final performance heavily depends on the dataset and model used.
Table 2: Accuracy on the test sets of the four datasets
Models | ICs | ICf | BancaAI | BancaIB |
Baseline (SVM) | 0.7622 | 0.8065 | 0.8197 | 0.7235 |
CNN | 0.7718 | 0.8058 | 0.8241 | 0.7633 |
CNN + CNN | 0.7577 | 0.8094 | 0.8328 | 0.7663 |
LSTM | 0.7698 | 0.8131 | 0.8529 | 0.7843 |
LSTM + CNN | 0.7737 | 0.8231 | 0.8224 | 0.7479 |
Table 3: F1 score on the test sets of the four datasets
Models | ICs | ICf | BancaAI | BancaIB |
Baseline (SVM) | 0.7595 | 0.8151 | 0.8009 | 0.7108 |
CNN | 0.7722 | 0.8078 | 0.8064 | 0.7476 |
CNN + CNN | 0.7606 | 0.8117 | 0.8158 | 0.7499 |
LSTM | 0.7689 | 0.8163 | 0.8386 | 0.7691 |
LSTM + CNN | 0.7742 | 0.8252 | 0.8065 | 0.7344 |
Istituto Credito (IC) datasets. On the IC synthetic dataset, the best model, i.e., LSTM+CNN, obtains an accuracy of 77.37 and a micro-averaged F1 of 77.42. This is about one absolute point of accuracy higher than the SVM baseline (77.37 vs. 76.22) and 1.47 absolute points of F1 more than the baseline (77.42 vs. 75.95). Similarly, on the IC full dataset, the best model, i.e., LSTM+CNN, achieves an accuracy of 82.31, which is 1.66 absolute points better than the baseline (82.31 vs. 80.65), and a micro-averaged F1 of 82.52, which is about one point better than the SVM baseline (82.52 vs. 81.51).
Banca datasets. On the BancaAI dataset, the best model, i.e., LSTM, obtains an accuracy of 85.29, which is more than 3 absolute points better than the SVM baseline (85.29 vs. 81.97). In terms of F1, the best model obtains 3.77 absolute points more than the baseline (83.86 vs. 80.09). On the BancaIB dataset, the best model, i.e., LSTM, obtains around 6 points more both in terms of accuracy (78.43 vs. 72.35) and F1 (76.91 vs. 71.08).
6.3 End-to-End system evaluation
We trained and evaluated our system using samples of data collected from IMSL customers.
We noted that the accuracy of the system improved because several answers (from 3 to 5) are generally provided for each user question, thus making it very likely that the correct answer is among them.
The only point of attention is that there is not always a valid answer to the user's request in the company knowledge base. Indeed, questions related to the user's personal profile or data cannot be precisely answered from the company documentation.
Furthermore, it often happens that company policy prevents providing explicit answers to specific user problems. In all these cases, it is therefore necessary to support the QA system with human operators, who can provide personal answers or answers not encoded in the corporate knowledge base.
7. Conclusions
In this paper, we have presented a modern dialog system for real-world applications. We have tested advanced QA and intent classification technology on several datasets derived from company data, from banks and credit institutions. The results show a promising direction for SMEs to build their own effective access to unstructured data.
Bibliography
Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” CoRR abs/1412.3555. http://arxiv.org/abs/1412.3555.
Corinna Cortes and Vladimir Vapnik. 1995. “Support-Vector Networks.” In Machine Learning, 273–97. https://doi.org/10.1007/BF00994018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.
Raj D. Iyer, David D. Lewis, Robert E. Schapire, Yoram Singer, and Amit Singhal. 2000. “Boosting for Document Routing.” In Proceedings of the Ninth International Conference on Information and Knowledge Management, 70–77.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. “A Convolutional Neural Network for Modelling Sentences.” arXiv Preprint arXiv:1404.2188.
Yoon Kim. 2014. “Convolutional Neural Networks for Sentence Classification.” CoRR abs/1408.5882. http://arxiv.org/abs/1408.5882.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324. https://doi.org/10.1109/5.726791.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. “Text Classification Using String Kernels.” Journal of Machine Learning Research 2 (Feb): 419–44.
Zhengdong Lu and Hang Li. 2013. “A Deep Architecture for Matching Short Texts.” In Advances in Neural Information Processing Systems, 1367–75.
Alessandro Moschitti. 2006. “Making Tree Kernels Practical for Natural Language Learning.” In 11th Conference of the European Chapter of the Association for Computational Linguistics.
Robert E. Schapire and Yoram Singer. 2000. “BoosTexter: A Boosting-Based System for Text Categorization.” Machine Learning 39 (2-3): 135–68.
Aliaksei Severyn and Alessandro Moschitti. 2016. “Modeling Relational Information in Question-Answer Pairs with Convolutional Neural Networks.” arXiv Preprint arXiv:1604.01178.
Sibel Yaman, Li Deng, Dong Yu, Ye-Yi Wang, and Alex Acero. 2008. “An Integrative and Discriminative Technique for Spoken Utterance Classification.” IEEE Transactions on Audio, Speech, and Language Processing 16 (6): 1207–14.
Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. “Deep Learning for Answer Sentence Selection.” arXiv Preprint arXiv:1412.1632.
Authors
DISI, University of Trento, Italy
DISI, University of Trento, Italy
Im Service Lab Srl, Italy