Detecting Early Onset of Depression from Social Media Text using Learned Confidence Scores
Abstract
Computational research on mental health disorders from written texts covers an interdisciplinary area between natural language processing and psychology. A crucial aspect of this problem is prevention and early diagnosis, as suicide resulting from depression is the second leading cause of death among young adults. In this work, we focus on methods for detecting the early onset of depression from social media texts, in particular from Reddit. To that end, we explore the eRisk 2018 dataset and achieve results competitive with the state of the art by leveraging topic analysis and learned confidence scores to guide the decision process.1
Acknowledgements
We would like to thank our reviewers for their useful comments and suggestions, which helped us improve this paper, and the organizers of the eRisk Workshop for their efforts in encouraging research on detecting mental illnesses from social media.
1. Introduction
Mental illnesses are a common problem in our modern world. More than one in ten people were living with a mental health disorder in 2017 (Ritchie and Roser 2018), with women being the most affected. These disorders affect people’s way of thinking, mood, emotions, behaviour and their relationships with others. Most mental illnesses remain undiagnosed because of the social stigma around them.
Depression is one of the main causes of disability globally and affects people of all ages. Prevention aims to reduce depression and save the lives of people at risk of suicide, but it is currently limited to raising awareness, programs that cultivate positive thinking, and monitoring people who have attempted suicide or self-harm.
With the rise of social media use, increasing computational efforts are being made to detect mental illnesses such as depression (De Choudhury et al. 2013) and PTSD (Coppersmith et al. 2015), but also to detect misogyny (Anzovino, Fersini, and Rosso 2018) and irony and sarcasm (Khokhlova, Patti, and Rosso 2016) in users’ texts.
People tend to talk more about their emotions and mental health problems online and to seek support there. The sources of mental health cues used for detection are Twitter, Facebook, Reddit and forums (Calvo et al. 2017). Reddit is a social media site very similar to a forum. It is organized into subreddits with specific topics, some dedicated to mental health problems. The use of throwaway accounts to maintain anonymity promotes disclosure, and users are more likely to share problems they have not discussed with anyone before. However, these accounts make it difficult for users to receive sustained social support, because the majority of them are used for only one post (Calvo et al. 2017).
In this work, we choose to tackle the problem of detecting the early onset of depression from users’ posts on social media, specifically from Reddit. As such, we explore the eRisk 2018 dataset through topic analysis by means of Latent Semantic Indexing (Deerwester et al. 1990) and learned out-of-distribution confidence scores (DeVries and Taylor 2018). Due to the nature of the dataset, in which test chunks were progressively released every week, we repurpose the learned confidence score to decide whether to label a user as depressed or non-depressed or to wait for more data.
2. Related Work
Recent studies on depression detection from text are reviewed by Guntuku et al. (2017). In these datasets, people diagnosed with mental illnesses are identified using screening surveys, self-reported posts about a diagnosis on social media, or their membership in forums related to mental health. The most used features are topic models, n-grams, Linguistic Inquiry and Word Count (LIWC), emotion and metadata. The most used methods are Support Vector Machines (SVM), Logistic Regression, Random Forests and Neural Networks.
Coppersmith et al. (2015) show differences in emoticon use between suicidal users and controls: neurotypical users use emojis with a much higher probability than users before a suicide attempt. Prior to the attempt, users at risk tend to use a more self-focused language, as do people diagnosed with depression. The authors also highlight changes in post emotions before and after the suicide attempt, and users are more likely to talk about suicide after an attempt than before it.
Sekulić et al. (2018) indicate that users diagnosed with bipolar disorder use more first-person singular pronouns, as do depressed people. They also use more words associated with emotions, both positive and negative, which can be explained by alternating episodes of mania and depression.
Nalabandian et al. (2019) show that depressed persons tend to use more negative words and a more self-focused language when writing about their interactions with a close romantic partner than when writing about other people around them, as people experience different symptoms of mental illness depending on their interactions with others.
Loveys et al. (2018) examine differences in the language use of users with depression from different cultures in order to avoid cultural biases. Even if depression affects people all over the world, the way they experience and express it is shaped by their cultural context. Users from some ethnic groups do not address mental health issues online as much as others, which can make the depression detection task more difficult. Using topic modelling, the authors show that the words in each topic vary across ethnic groups, with people discussing themes relevant to their own culture.
For diagnosis before the onset of a mental health disorder, Eichstaedt et al. (2018) use users’ Facebook posts to predict a future depression diagnosis. De Choudhury et al. (2013) use a classifier to predict users’ depression likelihood ahead of the onset of illness, using various measures: language, linguistic style, emotion, ego-network, demographics and user engagement.
We chose to tackle the problem of detecting the early onset of depression from users’ Reddit posts. To that end, we focus our efforts on processing the eRisk 2018 dataset (Losada, Crestani, and Parapar 2018), given its success at the Workshop for Early Risk Detection on the Internet within The Conference and Labs of the Evaluation Forum (CLEF) and its fruitful submissions from participants.
The teams in this workshop used different detection systems, based on bag-of-words ensembles (Trotzek, Koitka, and Friedrich 2018), machine learning models with hand-crafted features (Trotzek, Koitka, and Friedrich 2018; Ramiandrisoa et al. 2018; Cacheda et al. 2018; Ramírez-Cifuentes and Freire 2018) or with different text embeddings (Trotzek, Koitka, and Friedrich 2018; Ramiandrisoa et al. 2018; Ragheb et al. 2018), on sentence-level analysis to detect self-references and extract different features (Ortega-Mendoza et al. 2018), on Latent Dirichlet Allocation (LDA) topic modelling (Maupomé and Meurs 2018), on models combining Term Frequency-Inverse Document Frequency (TF-IDF) with Convolutional Neural Networks (Wang, Huang, and Chen 2018), or on other machine learning models. Most systems made their decision after the last chunk; only a few were able to emit a decision within the first chunks.
Several works addressing depression (Schwartz et al. 2014; Resnik et al. 2015) and PTSD (Coppersmith et al. 2015; Preoţiuc-Pietro et al. 2015) use a topic modelling approach, showing that the topics encountered in texts have important discriminative power for distinguishing persons suffering from mental illnesses from healthy controls.
3. Dataset
The Early Risk Detection on the Internet (eRisk) workshops organized by CLEF explore technologies that can be used for people’s health and safety, as well as the issues related to building test collections (Losada, Crestani, and Parapar 2018). eRisk 2018 had two tasks, on the early detection of depression and of anorexia. We focus on the task of detecting the early onset of depression in social media users.
This task consists of sequentially processing chunks of Reddit posts from depressed users and controls. Submissions from each user are encoded in an XML file, one subject XML per chunk of data. Each XML file contains the id of the subject and their posts and comments, and each submission has its posting time and the actual text. If a submission does not have a title, it is considered a comment. The goal is to detect depression as early as possible, and the dataset has to be processed in chronological order. The test collection of posts from depressed and non-depressed users is split into 10 chunks, released one per week. As training data, the teams had access to the eRisk 2017 data, both train and test. Every week, the teams had to decide whether to label a user as depressed or non-depressed or to wait for the test data of the following week.
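As an illustration, one weekly chunk could be loaded as in the following sketch. It assumes the commonly reported eRisk tag names (INDIVIDUAL, ID, WRITING, TITLE, DATE, TEXT); the exact layout of the official release may differ.

```python
# A minimal sketch of reading one weekly chunk of subject XML files.
# Tag names are assumptions about the eRisk release format.
import xml.etree.ElementTree as ET
from pathlib import Path

def read_subject(xml_path):
    """Return (subject_id, list of submissions) for one subject/chunk file."""
    root = ET.parse(xml_path).getroot()
    subject_id = (root.findtext("ID") or "").strip()
    submissions = []
    for writing in root.iter("WRITING"):
        title = (writing.findtext("TITLE") or "").strip()
        submissions.append({
            "date": (writing.findtext("DATE") or "").strip(),
            "title": title,
            "text": (writing.findtext("TEXT") or "").strip(),
            # Per the dataset description, a submission without a title is a comment.
            "is_comment": title == "",
        })
    return subject_id, submissions

def read_chunk(chunk_dir):
    """Read all subject files of one weekly chunk into {subject_id: submissions}."""
    return dict(read_subject(p) for p in sorted(Path(chunk_dir).glob("*.xml")))
```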
The dataset contains 125 depressed users and 752 non-depressed users as training data, and 79 depressed users and 741 non-depressed users as test data. It has more posts and comments from people without depression than from users diagnosed with depression: of a total of 531,349 submissions, only 49,557 are from users diagnosed with depression. The average time from the first to the last submission is between 2 and 3 years, so the posts were collected over a long period of time (Losada, Crestani, and Parapar 2018).
4. Method
Our methodology for the early diagnosis of depression follows a classical Natural Language Processing pipeline. To clean the users’ texts, we lowercase them, remove punctuation and stopwords, replace numbers and URLs with specific tokens, and perform stemming with the Porter stemmer (Porter 1980). To reduce the size of the dictionary, we use collocations (Bouma 2009) to extract meaningful bigrams and trigrams.
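The following sketch illustrates such a cleaning pipeline with NLTK and gensim; the placeholder token names and the NPMI threshold are illustrative assumptions, not settings reported above.

```python
# A sketch of the text cleaning pipeline, assuming NLTK and gensim.
import re
import string

from gensim.models.phrases import Phrases, Phraser
from nltk.corpus import stopwords  # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean(text):
    """Lowercase, replace URLs/numbers with tokens, drop punctuation/stopwords, stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " urltoken ", text)  # URL placeholder token
    text = re.sub(r"\d+", " numtoken ", text)           # number placeholder token
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [STEMMER.stem(tok) for tok in text.split() if tok not in STOP]

def build_phraser(token_lists):
    """NPMI-scored collocations (Bouma 2009): merge frequent bigrams, then trigrams."""
    bigrams = Phraser(Phrases(token_lists, scoring="npmi", threshold=0.5))
    trigrams = Phraser(Phrases(bigrams[token_lists], scoring="npmi", threshold=0.5))
    return lambda tokens: trigrams[bigrams[tokens]]
```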
The number of posts and comments from non-depressed users is much higher than that from depressed users. To balance the two classes, we downsample the majority class to a ratio of 2:1.
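A minimal sketch of this balancing step; the random seed is an arbitrary choice.

```python
# Downsample the majority (non-depressed) class to a 2:1 ratio.
import random

def downsample(majority, minority, ratio=2, seed=13):
    """Keep at most ratio * len(minority) randomly chosen majority-class samples."""
    rng = random.Random(seed)
    kept = rng.sample(majority, min(len(majority), ratio * len(minority)))
    return kept, minority
```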
We train our Latent Semantic Indexing model with 128 topics on all users’ posts. We use this model to extract topic embeddings from users’ texts, which serve as input to our fully connected neural network. The network has three hidden layers of 512, 256 and 256 neurons respectively, with Leaky ReLU activations, and we use Dropout for regularization. We use a random sample of 20% of the training data provided by the competition organisers for validation.
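For illustration, the topic embeddings could be extracted with gensim as follows; the dictionary construction details are assumptions.

```python
# A sketch of LSI topic embeddings with gensim.
from gensim.corpora import Dictionary
from gensim.models import LsiModel

NUM_TOPICS = 128

def build_lsi(token_lists):
    """Fit a dictionary and a 128-topic LSI model on tokenized posts."""
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    return dictionary, LsiModel(corpus, id2word=dictionary, num_topics=NUM_TOPICS)

def embed(tokens, dictionary, lsi):
    """Dense 128-dimensional topic vector for one user's concatenated posts."""
    dense = [0.0] * NUM_TOPICS
    for topic_id, weight in lsi[dictionary.doc2bow(tokens)]:
        dense[topic_id] = weight
    return dense
```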
The network has two outputs, one for classifying whether the user is depressed and one for confidence estimation. The motivation for this architecture is to learn the confidence (DeVries and Taylor 2018) of our predictions and use it to decide whether to label a user or wait for the next chunk of data. The learned confidence, besides its use case in out-of-distribution detection, can be used as a measure of how much the model trusts its classification output to be correct. As such, we consider the classification output only if the confidence exceeds a certain threshold. As indicated by DeVries and Taylor (2018), the network loss is computed by interpolating the predicted probabilities p with the target y, using the computed confidence score c, as follows:
$p'_i = c \cdot p_i + (1 - c) \cdot y_i$ (1)
The final loss is then given by:
$\mathcal{L} = -\sum_{i=1}^{M} y_i \log(p'_i) - \lambda \log(c)$ (2)
where, in our case, M = 2 is the number of classes. The loss includes an additional term, $-\lambda \log(c)$, that forces the predicted confidence to be as high as possible. We performed an ablation study over the confidence penalty λ on the validation data.
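A minimal PyTorch sketch of the two-headed network and of Equations (1) and (2) follows. The layer sizes and activations match the description above; the dropout rate and other training details are assumptions, and refinements described by DeVries and Taylor (2018), such as interpolating only half of each batch, are omitted.

```python
# A sketch of the two-headed confidence network and its loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceNet(nn.Module):
    """Three hidden layers (512, 256, 256), Leaky ReLU, Dropout, two output heads."""
    def __init__(self, in_dim=128, num_classes=2, p_drop=0.3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 256), nn.LeakyReLU(), nn.Dropout(p_drop),
            nn.Linear(256, 256), nn.LeakyReLU(),
        )
        self.classifier = nn.Linear(256, num_classes)  # depressed / non-depressed logits
        self.confidence = nn.Linear(256, 1)            # confidence logit

    def forward(self, x):
        h = self.body(x)
        return self.classifier(h), torch.sigmoid(self.confidence(h))

def confidence_loss(logits, conf, target, lam=0.2, eps=1e-12):
    """Interpolate predictions with the target using c, per Equations (1) and (2)."""
    p = F.softmax(logits, dim=-1)
    y = F.one_hot(target, p.size(-1)).float()
    p_interp = conf * p + (1.0 - conf) * y           # Equation (1)
    task = -(y * torch.log(p_interp + eps)).sum(-1)  # NLL on interpolated probabilities
    penalty = -torch.log(conf + eps).squeeze(-1)     # keeps the predicted confidence high
    return (task + lam * penalty).mean()             # Equation (2)
```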
A recent study by Hein et al. (2019) shows that neural networks with ReLU activation functions tend to be overconfident on incorrectly classified samples; thus we cannot rely on the output probabilities alone, and the predicted confidence offers a more reliable measure of the uncertainty of the classification.
Since we want to make a decision as early as possible, as the number of submissions seen by the model increases we use a decaying function that progressively lowers the fixed confidence threshold. The decision function is defined as follows:
$d(x, w) = \begin{cases} \operatorname{classify}(x) & \text{if } c(x) \geq T \cdot s^{w} \\ \text{wait} & \text{otherwise} \end{cases}$ (3)
where x is the embedding of the current user’s posts, w is the week number (i.e., the current chunk), s is a scaling factor and T is the initial threshold. We choose T = 85% and progressively scale it down to 40%. The scaling factor is computed such that, at the final chunk, the threshold is lower than the smallest confidence encountered on the training data.
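A sketch of this schedule, assuming the geometric decay form of Equation (3) as reconstructed above:

```python
# Decaying confidence threshold: 85% at the first chunk, about 40% at the last.
T_INIT, T_FINAL, NUM_CHUNKS = 0.85, 0.40, 10
SCALE = (T_FINAL / T_INIT) ** (1.0 / NUM_CHUNKS)  # s in Equation (3)

def decide(confidence, predicted_label, week):
    """Return the label if the confidence clears this week's threshold, else wait."""
    threshold = T_INIT * SCALE ** week
    return predicted_label if confidence >= threshold else None  # None = wait
```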
At test time, the proposed model does not make an independent decision for each chunk of data in the test set. If, in the first chunk of data, the model is not confident enough to make a final decision regarding the depressed or non-depressed status of a user, then, starting with the second chunk, we concatenate the current chunk with the previously available chunks for that user. This way, the LSI model has more data for making better informed predictions.
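Putting the pieces together, a sketch of the test-time loop, reusing the hypothetical helpers clean, embed and decide from the previous snippets:

```python
# A sketch of the streaming evaluation: texts accumulate across chunks, and a
# user is labelled only once the confidence clears the decayed threshold.
import torch

def evaluate_stream(chunks, model, dictionary, lsi):
    """chunks: list of {subject_id: submissions} dicts, one per weekly release."""
    history, decisions = {}, {}
    model.eval()
    with torch.no_grad():
        for week, chunk in enumerate(chunks):
            for user, submissions in chunk.items():
                if user in decisions:                       # already labelled
                    continue
                history.setdefault(user, []).extend(
                    tok for s in submissions
                    for tok in clean(s["title"] + " " + s["text"]))
                x = torch.tensor([embed(history[user], dictionary, lsi)])
                logits, conf = model(x)
                label = decide(conf.item(), logits.argmax(-1).item(), week)
                if label is not None:
                    decisions[user] = label
        for user, tokens in history.items():                # forced decision at the end
            if user not in decisions:
                logits, _ = model(torch.tensor([embed(tokens, dictionary, lsi)]))
                decisions[user] = logits.argmax(-1).item()
    return decisions
```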
5. Results
Our results on the eRisk 2018 dataset are presented in Table 1. Although F1 is a standard evaluation measure for imbalanced classification, it does not capture the time component of the early detection task; Losada and Crestani therefore propose an evaluation metric better suited to this task, the Early Risk Detection Error (ERDE).
ERDE is defined as:
$\mathrm{ERDE}_o(d, k) = \begin{cases} c_{fp} & \text{if } d \text{ is a false positive} \\ c_{fn} & \text{if } d \text{ is a false negative} \\ lc_o(k) \cdot c_{tp} & \text{if } d \text{ is a true positive} \\ 0 & \text{if } d \text{ is a true negative} \end{cases}$ (4)
Distinguishing false positives (FP), false negatives (FN), true positives (TP) and true negatives (TN) in the prediction d penalizes classifiers that always predict the label of the majority class. The latency cost $lc_o(k)$ encodes a cost for the delay in detecting TPs. For the eRisk datasets, where negative labels outnumber positive ones, $c_{fn}$ is 1 and $c_{fp}$ is 0.1296, set according to the proportion of depressed users in the eRisk 2017 dataset (Losada, Crestani, and Parapar 2018). $c_{tp}$ is set to $c_{fn}$ because the late detection of people at risk of depression can have serious consequences: a late detection is considered equivalent to not detecting the depressed user at all. The late detection of TN cases does not affect the effectiveness of the system.
The goal of the system is to detect people at risk of depression as early as possible. For the detection of non-depressed users, the time of detection is not relevant. The latency cost function, which grows with k (the number of submissions seen by the algorithm), is defined as:
$lc_o(k) = 1 - \frac{1}{1 + e^{k - o}}$ (5)
where o represents the number of posts after which the cost grows more quickly.
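For concreteness, a sketch of the per-subject ERDE computation with the costs given above:

```python
# A sketch of ERDE_o, Equations (4) and (5), with the eRisk 2018 cost values.
import math

C_FP, C_FN, C_TP = 0.1296, 1.0, 1.0

def latency_cost(k, o):
    """lc_o(k) from Equation (5), written in the numerically stable logistic form."""
    return 1.0 / (1.0 + math.exp(-(k - o)))

def erde(decision, truth, k, o=5):
    """Per-subject error; decision/truth are 1 (depressed) or 0 (control),
    k is the number of submissions seen before the decision was made."""
    if decision == 1 and truth == 0:
        return C_FP                       # false positive
    if decision == 0 and truth == 1:
        return C_FN                       # false negative
    if decision == 1 and truth == 1:
        return latency_cost(k, o) * C_TP  # true positive, penalized by delay
    return 0.0                            # true negative
```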
Table 1. Classification results on the early detection of depression task from the eRisk 2018 dataset.
Method | ERDE5 | ERDE50 | F1 | Precision | Recall |
Baseline LSI | 9.98% | 8.29% | 0.25 | 0.22 | 0.29 |
LSIc | 14.19% | 11.41% | 0.25 | 0.15 | 0.87 |
LSIc | 11.12% | 9.09% | 0.28 | 0.20 | 0.48 |
LSIc | 10.24% | 7.74% | 0.30 | 0.25 | 0.38 |
LSIc | 11.15% | 8.53% | 0.25 | 0.17 | 0.47 |
LSIc | 12.67% | 10.17% | 0.25 | 0.15 | 0.71 |
LSIc | 10.53% | 8.08% | 0.30 | 0.21 | 0.56 |
Funez et al. | 8.78% | 7.39% | 0.38 | 0.48 | 0.32 |
Trotzek et al. | 9.50% | 6.44% | 0.64 | 0.64 | 0.65 |
The detection task is difficult, as seen in the low values of F1 and Precision. However, the task is to predict the early onset of depression, and for that the ERDE metrics are more appropriate, as they measure prediction delay. The ERDE5 metric is very sensitive to delays: after the first 5 submissions from a user, the penalties grow quickly. In contrast, for ERDE50 the penalties grow only after the first 50 submissions. The difference between ERDE5 and ERDE50 is very important in practice because of the consequences of a late detection of depression signs; as the task suggests, the detection should be made as early as possible.
To measure the impact of our learned out-of-distribution confidence, we also trained a plain ReLU network with cross-entropy loss. For this model, we employed a hard threshold on the output probabilities to decide whether to wait for more data or classify the sample. As shown by Hein et al. (2019), ReLU networks can be overly confident on misclassified examples. This is visible in Table 1: the model has a low ERDE5 score because the output probabilities mostly take extreme values, which means that for most users the model makes a decision from the first chunk of data.
We trained our model with different λ values in order to assess the impact of the confidence component on the results. Larger values of λ make the model overly confident, as expected from Equation (2), the best performing model being the one with λ = 0.2. Smaller values of λ generate a wider confidence distribution on the training examples, facilitating the decision process, since extreme values either make the model overly confident on every example or not confident at all. This is consistent with the findings of DeVries and Taylor (2018).
In Table 1 we also present the two best submissions from the eRisk 2018 Workshop: the one from Funez et al., which has the best result on the ERDE5 metric, and the one from Trotzek et al. (2018), which has the top ERDE50 score.
We can assume from these results that the topics encountered in user writings have important discriminatory power. Depressed users mostly write about different subjects than non-depressed users, consistent with the results of Resnik et al. (2015). The writings of users diagnosed with depression are more focused on their feelings and their life events. Topics related to those themes contain words such as someone kill, bad though, never able to get, forever alone, life save, stay sober, i am sad, still can’t, improve life, new hope, oneself, tell anything, happy sad, hope one day. Texts from non-depressed users are found in topics related to their hobbies, containing specific words: black mirror, first season, movie adaptation, hologram, nine inch nails, jimi hendrix, artist name, vlog, game, fallout, terra mistica, way to make money, paid time, really proud, amazon whishlist, food industry, white bread.
6. Conclusion
In this paper, we used the eRisk 2018 dataset on the Early Detection of Signs of Depression for depression classification from Reddit posts. Our method uses Latent Semantic Indexing for topic modelling and for generating the embeddings used as input to our neural network, but focuses on using a learned out-of-distribution confidence score alongside the classification output to decide whether to label a user or wait for more data. Besides its initial use case in out-of-distribution detection, we repurposed the confidence score as a measure of how much the model trusts its classification output to be correct. We showed that, in general, there is a significant difference in writing topics depending on the users’ mental health, to the extent that the topics contain enough information to be used for classification.
Liviu P. Dinu was supported by a grant of the Romanian Ministry of Education and Research, CCCDI—UEFISCDI, project number 411PED/2020, code PN-III-P2-2.1-PED-2019-2271, within PNCDI III.
Bibliography
Maria Anzovino, Elisabetta Fersini, and Paolo Rosso. 2018. “Automatic Identification and Classification of Misogynistic Language on Twitter.” In International Conference on Applications of Natural Language to Information Systems, 57–64. Springer.
Gerlof Bouma. 2009. “Normalized (Pointwise) Mutual Information in Collocation Extraction.” Proceedings of GSCL, 31–40.
Fidel Cacheda, Diego Fernández Iglesias, Francisco Javier Nóvoa, and Victor Carneiro. 2018. “Analysis and Experiments on Early Detection of Depression.” CLEF (Working Notes) 2125.
Rafael A. Calvo, David N. Milne, M. Sazzad Hussain, and Helen Christensen. 2017. “Natural Language Processing in Mental Health Applications Using Non-Clinical Texts.” Natural Language Engineering 23 (5): 649–85.
Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, and Margaret Mitchell. 2015. “CLPsych 2015 Shared Task: Depression and PTSD on Twitter.” In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, 31–39.
Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. “Predicting Depression via Social Media.” In Seventh International AAAI Conference on Weblogs and Social Media. https://doi.org/10.1609/icwsm.v7i1.14432.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41 (6): 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
Terrance DeVries, and Graham W. Taylor. 2018. “Learning Confidence for Out-of-Distribution Detection in Neural Networks.” arXiv preprint arXiv:1802.04865.
Sharath Chandra Guntuku, David Yaden, Margaret Kern, Lyle Ungar, and Johannes Eichstaedt. 2017. “Detecting Depression and Mental Illness on Social Media: An Integrative Review.” Current Opinion in Behavioral Sciences 18 (December): 43–49. https://doi.org/10.1016/j.cobeha.2017.07.005.
Maria Khokhlova, Viviana Patti, and Paolo Rosso. 2016. “Distinguishing Between Irony and Sarcasm in Social Media Texts: Linguistic Observations.” In 2016 International FRUCT Conference on Intelligence, Social Media and Web (ISMW FRUCT), 1–6. IEEE. https://doi.org/10.1109/FRUCT.2016.7584765.
David E. Losada, Fabio Crestani, and Javier Parapar. 2018. “Overview of eRisk 2018: Early Risk Prediction on the Internet (Extended Lab Overview).” In Proceedings of the 9th International Conference of the CLEF Association.
Kate Loveys, Jonathan Torrez, Alex Fine, Glen Moriarty, and Glen Coppersmith. 2018. “Cross-Cultural Differences in Language Markers of Depression Online.” In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 78–87.
Diego Maupomé, and Marie-Jean Meurs. 2018. “Using Topic Extraction on Social Media Content for the Early Detection of Depression.” CLEF (Working Notes) 2125.
Rosa Marı́a Ortega-Mendoza, Adrián Pastor López-Monroy, Anilu Franco-Arcega, and Manuel Montes-y-Gómez. 2018. “PEIMEX at eRisk2018: Emphasizing Personal Information for Depression and Anorexia Detection.” In CLEF (Working Notes).
Martin F. Porter. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37.
Daniel Preoţiuc-Pietro, Johannes Eichstaedt, Gregory Park, Maarten Sap, Laura Smith, Victoria Tobolsky, H Andrew Schwartz, and Lyle Ungar. 2015. “The Role of Personality, Age, and Gender in Tweeting About Mental Illness.” In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, 21–30.
Waleed Ragheb, Bilel Moulahi, Jérôme Azé, Sandra Bringay, and Maximilien Servajean. 2018. “Temporal Mood Variation: At the CLEF eRisk-2018 Tasks for Early Risk Detection on the Internet.” In Proceedings of the 9th International Conference of the CLEF Association.
Faneva Ramiandrisoa, Josiane Mothe, Farah Benamara, and Véronique Moriceau. 2018. “IRIT at e-Risk 2018.” In Proceedings of the 9th International Conference of the CLEF Association.
Diana Ramírez-Cifuentes, and Ana Freire. 2018. “UPF’s Participation at the CLEF eRisk 2018: Early Risk Prediction on the Internet.” In Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018, 1–12. CEUR Workshop Proceedings.
Philip Resnik, William Armstrong, Leonardo Claudino, Thang Nguyen, Viet-An Nguyen, and Jordan Boyd-Graber. 2015. “Beyond LDA: Exploring Supervised Topic Modeling for Depression-Related Language in Twitter.” In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, 99–107.
Hannah Ritchie, and Max Roser. 2018. “Mental Health.” Our World in Data.
H. Andrew Schwartz, Johannes Eichstaedt, Margaret Kern, Gregory Park, Maarten Sap, David Stillwell, Michal Kosinski, and Lyle Ungar. 2014. “Towards Assessing Changes in Degree of Depression Through Facebook.” In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, 118–25.
Marcel Trotzek, Sven Koitka, and Christoph M. Friedrich. 2018. “Word Embeddings and Linguistic Metadata at the CLEF 2018 Tasks for Early Detection of Depression and Anorexia.” In CLEF (Working Notes).
Yu-Tseng Wang, Hen-Hsen Huang, and Hsin-Hsi Chen. 2018. “A Neural Network Approach to Early Risk Detection of Depression and Anorexia on Social Media Text.” In CLEF (Working Notes).
Footnotes
1 Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Authors
University of Bucharest, Romania – ana-maria.bucur@drd.unibuc.ro
University of Bucharest, Romania – ldinu@fmi.unibuc.ro