Models and Methods for Automatic Processing of Unstructured Data in the Biomedical Domain: topic of the dissertation and author's abstract (VAK RF specialty 00.00.00), Doctor of Sciences Tutubalina Elena Viktorovna

  • Tutubalina Elena Viktorovna
  • Doctor of Sciences
  • 2023, National Research University Higher School of Economics
  • VAK RF specialty 00.00.00
  • 225 pages

Tutubalina Elena Viktorovna. Models and Methods for Automatic Processing of Unstructured Data in the Biomedical Domain: Doctor of Sciences dissertation: 00.00.00 (Other Specialties). National Research University Higher School of Economics, 2023. 225 pp.

Dissertation table of contents

Contents

1 Introduction

2 New models and methods for classification and information extraction

2.1 Cross-lingual and cross-domain NER with transfer learning

2.2 Multimodal model with text and drug embeddings

2.3 Information extraction pipeline for search

3 New annotated corpora for information extraction

3.1 RuDReC: drug reactions in health-related user reviews

3.2 RuCCoN: clinical concepts in medical histories of patients

3.3 NEREL-BIO: nested named entities in biomedical abstracts

4 New evaluation strategies

4.1 In-terminology and cross-terminology evaluation

5 Conclusion

Bibliography

Appendix A. Paper "On Biomedical Named Entity Recognition: Experiments in Interlingual Transfer for Clinical and Social Media Texts"

Appendix B. Paper "Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models"

Appendix C. Paper "Multimodal Model with Text and Drug Embeddings for Adverse Drug Reaction Classification"

Appendix D. Paper "Comprehensive Evaluation of Biomedical Entity-centric Search"

Appendix E. Paper "Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer"

Appendix F. Paper "Medical Concept Normalization in Clinical Trials with Drug and Disease Representation Learning"

Appendix G. Paper "The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews"

Appendix H. Paper "RuCCoN: Clinical Concept Normalization in Russian"

Appendix I. Paper "NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities"

Appendix J. Paper "Multiple features for clinical relation extraction: a machine learning approach"

Appendix K. Paper "Medical concept normalization in social media posts with recurrent neural networks"

Appendix L. Paper "Cross-Domain Limitations of Neural Models on Biomedical Relation Classification"

Appendix M. Paper "DeepADEMiner: a deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter"

Appendix N. Paper "Deep learning for ICD coding: Looking for medical concepts in clinical documents in English and in French"

Appendix O. Paper "Medical Crossing: a Cross-lingual Evaluation of Clinical Entity Linking"

Appendix P. Paper "Deep neural models for medical concept normalization in user-generated texts"


Introduction to the dissertation (part of the author's abstract) on the topic "Models and Methods for Automatic Processing of Unstructured Data in the Biomedical Domain"

1 Introduction

Topic of the thesis

This dissertation presents a comprehensive study that aims to improve the efficiency and effectiveness of processing unstructured data in the biomedical domain. The study introduces novel approaches for classification and information extraction, including the recognition of various entity mentions such as drugs, diseases, genes, and adverse drug reactions, as well as entity linking (also known as medical concept normalization) and relation extraction.

Natural language processing (NLP) in the biomedical domain poses several challenges due to the complexity of biomedical language and the vast amount of data generated in the field. Biomedical data comes from various sources, such as electronic health records, scientific publications, clinical trial records, and social media, which can differ in format, structure, and level of quality. The major challenges are as follows.

Firstly, biomedical language is often ambiguous, with many terms having multiple meanings. For example, "adenoid hypertrophy" ("гипертрофия аденоидов") may be linked to "nasopharyngeal tonsil hypertrophy (adenoids)" ("гипертрофия глоточных миндалин (аденоиды)") or "hypertrophy of adenoids exclusively" ("гипертрофия исключительно аденоидов"), two different concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS) [1]. Conversely, some concepts are synonymous in meaning yet have different CUIs; for example, "acholic stool" ("ахоличный стул") has the code C2675627, while "pale stool" ("светлый стул") has the code C0232720.

Secondly, medical language includes domain-specific terms and expressions that are not commonly used in everyday language. An NLP model therefore has to be capable of translating layperson language into formal medical language. For example, the phrase "I can't fall asleep all night" ("всю ночь не могу уснуть") should be mapped to "insomnia" ("бессонница"), and "head spinning a little" ("немного кружится голова") to "dizziness" ("головокружение"). This requires more than simple matching of natural language expressions against vocabulary entries: string matching approaches may fail to link social media language to medical concepts when the words do not overlap at all.

Thirdly, medical terminology is vast and continuously evolving, with different ontologies used across countries and even within different medical specialties.
Medical concepts may have different types (e.g., drugs, diseases, or genes/proteins) and may be retrieved from different single-typed ontologies. The holy grail of modern medical NLP is to effectively identify and map the same concepts across different ontologies without re-training models.

Fourthly, annotated data for biomedical NLP is often limited, making it difficult to train and evaluate models effectively. Despite the large number of resources in the general domain, many languages have made little progress in the biomedical field. Russian is one such example: it is among the top 10 languages in the world and has many NLP datasets and resources, but the biomedical part of Russian NLP is underdeveloped. The Russian UMLS includes translations of the Medical Dictionary for Regulatory Activities (MedDRA) [2], Logical Observation Identifiers Names and Codes (LOINC) [3], and Medical Subject Headings (MeSH) [4]. However, it amounts to only 1.8% of the English UMLS in vocabulary and 1.36% in source counts [5].

Addressing these challenges requires new annotated corpora, advanced techniques, and models that can handle the complexity and variation of medical language, as well as high-quality annotated data for evaluation. The dissertation addresses these challenges by introducing new annotated corpora of texts from various sources, proposing novel evaluation strategies, and employing advanced techniques such as Transformers [6], Bidirectional Encoder Representations from Transformers (BERT) [7], and metric learning [8-10] for optimizing the models. It demonstrates the effectiveness and robustness of the proposed approaches through extensive experiments and evaluations.
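The layperson-language challenge above, linking a phrase such as "I can't fall asleep all night" to "insomnia", shows why string matching alone is insufficient. The following toy sketch (hand-made stand-in vectors rather than a trained BERT encoder; not the dissertation's code) contrasts character-overlap matching with cosine similarity over dense embeddings:

```python
# Toy illustration: character overlap fails to normalize a layperson phrase,
# while cosine similarity over dense vectors can still link it.
# The 3-d "embeddings" below are hand-made stand-ins for BERT-style vectors.
import math

def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams: a simple string-matching baseline."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

emb = {
    "I can't fall asleep all night": [0.9, 0.1, 0.2],
    "insomnia":                      [0.8, 0.2, 0.1],
    "dizziness":                     [0.1, 0.9, 0.3],
}

mention = "I can't fall asleep all night"
concepts = ["insomnia", "dizziness"]

# String matching: the mention shares no trigrams with either concept name.
string_scores = {c: char_overlap(mention, c) for c in concepts}
# Embedding similarity: "insomnia" wins despite zero lexical overlap.
embed_scores = {c: cosine(emb[mention], emb[c]) for c in concepts}
best = max(embed_scores, key=embed_scores.get)
```

With a real encoder, the embedding table would be produced by a model such as BERT; the point is only that dense representations can bridge zero lexical overlap.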

Objectives and goals of the dissertation

The dissertation has three main objectives:

1. Development of NLP methods and models in the specialized domain based on deep neural networks, pre-trained models, and metric learning approaches.

2. Analysis of limitations and development of novel strategies for evaluating trained models in information retrieval and information extraction tasks.

3. Creation of new annotated corpora of texts from various sources, such as scientific abstracts, drug reviews, electronic health records, and clinical trials, in both English and Russian.

The ultimate objective is to improve the efficiency and effectiveness of biomedical search engines, pharmacovigilance systems, and medical records management and analysis in the healthcare field.

Main results

The following are the main results of this dissertation:

— New models and methods for classification and information extraction were developed:

1. Multilingual BERT-based models were analyzed for cross-domain drug and disease named entity recognition in two languages. The investigation of transfer learning strategies between four corpora demonstrated the effectiveness of pretraining on data with one or both types of transfer [11].

2. Classification-based methods were proposed with (i) a set of informative features at an entity level and a context level for relation extraction [12], and (ii) vectors of semantic similarity for entity linking [13; 14]. The effectiveness of these approaches was demonstrated in multiple shared tasks, ranking first in SMM4H 2019 Task 3, SMM4H 2020 Task 3, and SMM4H 2021 Task 1c [13; 14]. The semantic similarity vectors also proved effective with a proposed encoder-decoder architecture that ranked first in CLEF eHealth 2017 Task 1 [15].

3. DILBERT (Drug and disease Interpretation Learning with Biomedical Entity Representation Transformer) was introduced. The model optimizes the relative similarity of mentions and concept names from a terminology via metric learning. It was shown that the model is robust to vocabulary switches and can recognize concepts that were not present in the training set [16; 17].

4. A multimodal model combining BERT-based models for language understanding and molecular property prediction was proposed to improve the classification of tweets as potential sources of adverse drug events or drug reactions. The model achieved first and second place rankings on SMM4H 2021 Task 2 and Task 1a, respectively [18].

5. Two neural pipelines were developed: (i) a pipeline consisting of two models as a biomedical search engine that showed superior performance over a traditional search model on a manually annotated dataset of abstracts for disease and gene queries [19], and (ii) a pipeline for the classification, extraction and normalization of adverse drug events on realistic, imbalanced data. The identification of optimal training ratios and undersampling methods was also explored [20].

— New annotated corpora were developed for information extraction. The following are some of the new corpora developed:

6. The Russian Drug Reaction Corpus (RuDReC), a partially annotated corpus of consumer reviews in Russian about pharmaceutical products, and RuDR-BERT models for named entity recognition and sentence classification tasks were introduced [21].

7. Two annotated datasets were developed for clinical concept normalization: a dataset of clinical trials in English for drug and disease normalization [16; 17], and a RuCCoN corpus, a new dataset of electronic health records in Russian, with entities linked to the UMLS [22].

8. NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian and English with general-domain and biomedical entity types, was introduced. The corpus provides annotations of nested named entities [23].

— New evaluation strategies were proposed, as follows:

9. The limitations of existing benchmarks for biomedical entity linking were analyzed, and several novel evaluation strategies were proposed: (i) a novel stratified sampling split [13], and (ii) in-terminology and cross-terminology evaluation [24]. Additionally, benchmarks were established for the cross-lingual task using clinical reports, clinical guidelines, and medical research papers. A test set filtering procedure was designed to analyze the "hard cases" of entity linking, approaching zero-shot cross-lingual transfer learning [25].

10. The limitations of existing benchmarks of scientific abstracts and electronic health records for relation extraction were analyzed. To address performance differences between in-domain and out-of-domain setups, a cross-attention neural model was proposed that exhibits better cross-domain performance [26].
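The stratified evaluation idea in point 9 can be illustrated with a minimal sketch (an assumed simplified (mention, CUI) data format and a hypothetical helper name; not the dissertation's code): all mentions of a held-out concept go to the test set, so test concepts never appear during training, which makes the evaluation zero-shot with respect to concepts.

```python
# Minimal "zero-shot" split for entity linking: hold out whole concepts (CUIs),
# not individual mentions, so the test concepts are unseen at training time.
import random

def zero_shot_split(pairs, test_frac=0.3, seed=0):
    """pairs: list of (mention, cui). Split so train and test CUIs are disjoint."""
    cuis = sorted({cui for _, cui in pairs})
    rng = random.Random(seed)
    rng.shuffle(cuis)
    n_test = max(1, int(len(cuis) * test_frac))
    held_out = set(cuis[:n_test])
    train = [p for p in pairs if p[1] not in held_out]
    test = [p for p in pairs if p[1] in held_out]
    return train, test

# Illustrative (mention, CUI) pairs; the stool CUIs follow the examples
# in the introduction, the others are illustrative.
pairs = [
    ("can't sleep", "C0917801"), ("insomnia", "C0917801"),
    ("head spinning", "C0012833"), ("dizzy", "C0012833"),
    ("pale stool", "C0232720"), ("acholic stool", "C2675627"),
]
train, test = zero_shot_split(pairs)
train_cuis = {cui for _, cui in train}
test_cuis = {cui for _, cui in test}
```

The design choice is that leakage is defined at the concept level: a model that merely memorizes training CUIs scores zero on such a test set.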

Author's contribution includes the problem formulations, the development of the aforementioned methods and models for processing unstructured data, the design of annotation schemes for the aforementioned corpora and of the evaluation strategies, and the analysis of results. The first versions of programs implementing the proposed methods and models for classification and named entity recognition, and their evaluation, were developed personally by the author of the dissertation; the current versions of software modules implementing the methods proposed in the dissertation within various hardware and software architectures were written under the author's direct supervision.

The scientific novelty of the proposed research lies in the development of new annotated corpora of texts from various sources, novel deep learning architectures and models for biomedical information extraction and text classification in several languages, and novel evaluation strategies. The improvement in quality of the developed methods over existing methods has been confirmed experimentally using standard quality metrics for natural language text analysis systems. It is experimentally shown that the developed methods are applicable to texts from various sources. This work also presents the first studies on extracting mentions of drug effects and biomedical nested named entities for the Russian language.

The scope of the dissertation is covered in 42 publications [11-52]. In accordance with the regulations of the Dissertation Council in Computer Science of the Higher School of Economics, at least ten papers are listed below: [12; 13; 17; 18; 20; 21; 23; 26] in Q1 journals; [11; 16; 19; 22; 24] in proceedings of CORE A/A* conferences; and [14; 15; 25] in conference proceedings indexed in Scopus. The defense is based on at least seven of them (namely, the first nine from the list of first-tier publications).

Publications and approbation of the work

First-tier publications

1. Miftahutdinov Z., Alimova I., Tutubalina E. On biomedical named entity recognition: experiments in interlingual transfer for clinical and social media texts //European Conference on Information Retrieval (ECIR). - 12036 LNCS, Springer, Cham, 2020. - pages 281-288. [Scopus, WOS, CORE A conf.]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, developed neural models for named entity recognition, the first versions of programs that implement the proposed models, and performed an experimental evaluation.


2. Tutubalina E., Kadurin A., Miftahutdinov Z. Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models //Proceedings of the 28th International Conference on Computational Linguistics (COLING). - 2020. - pages 6710-6716. [Scopus, CORE A conf.]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, proposed novel evaluation strategies, the first versions of programs that implement the proposed evaluation, and performed an experimental evaluation of models.

3. Sakhovskiy A., Tutubalina E. Multimodal model with text and drug embeddings for adverse drug reaction classification //Journal of Biomedical Informatics. - 2022. - Vol. 135. - pages 104182. (Q1, Impact Factor 8.0) [Scopus, WOS]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, proposed a multimodal model in collaboration with S.A., and guided the research.

4. Tutubalina E., Miftahutdinov Z., Muravlev V., Shneyderman A. A Comprehensive Evaluation of Biomedical Entity-centric Search //Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track. - 2022. - pages 596-605. [Scopus, CORE A conf.]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, guided the annotation process, developed an information retrieval system, the first versions of programs that implement the proposed system, and conducted experiments.

5. Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer //European Conference on Information Retrieval (ECIR). - 12656 LNCS, Springer, Cham, 2021. [Scopus, Core A conf.].

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, proposed evaluation methodology, proposed a DILBERT model in collaboration with M.Z., designed experiments, and guided the research.

6. Miftahutdinov Z., Kadurin A., Kudrin R., Tutubalina E. Medical concept normalization in clinical trials with drug and disease representation learning //Bioinformatics. - 2021. - Vol. 37. - No. 21. - pages 3856-3864 (Q1, Impact Factor 6.931) [Scopus, WOS]

Contribution of the dissertation's author: main co-author; same as [16] (this is the journal paper based on the conference version [16]).

7. Tutubalina E., Alimova I., Miftahutdinov Z., Sakhovskiy A., Malykh V., and Nikolenko S. The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews //Bioinformatics. — 2020. — 07. DOI: 10.1093/bioinformatics/btaa675 (Q1, Impact Factor 6.931) [Scopus, WOS]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, wrote a program for data collection, developed neural models for classification and named entity recognition, the first versions of programs that implement the proposed models, proposed an annotation scheme, guided the annotation process, and partially conducted experiments.

8. Nesterov, A., Zubkova G., Miftahutdinov Z., Kokh, V., Tutubalina E., Shelmanov A., Alekseev A., Avetisian M., Chertok A., and Nikolenko S. RuCCoN: Clinical Concept Normalization in Russian //Proceedings of the Annual Meeting of the Association for Computational Linguistics. - 2022. - pages 239-245. [Scopus, Core A* conf.]

Contribution of the dissertation's author: the author of this thesis formulated the scientific problem, wrote a program for additional training data collection, proposed several types of test sets for various settings, and partially conducted experiments.

9. Loukachevitch N., Manandhar S., Baral E., Rozhkov I., Braslavski P., Ivanov V., Batura T., and Tutubalina E. NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities //Bioinformatics. — 2023. — 04. — btad161. (Q1, Impact Factor 6.931) [Scopus, WOS]

Contribution of the dissertation's author: main co-author; the author of this thesis designed an annotation scheme in collaboration with L.N., wrote a program for data collection, set up annotation tools, and developed models for initial data annotation as well as the program code that implements these models.

10. Alimova I., Tutubalina E. Multiple features for clinical relation extraction: a machine learning approach //Journal of Biomedical Informatics. - 2020. - Vol. 103. - pages 103382 (Q1, Impact Factor 8.0) [Scopus, WOS]


Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, proposed a feature-based model in collaboration with A.I., and guided the research.

11. Tutubalina E., Miftahutdinov Z., Nikolenko S., & Malykh V. Medical concept normalization in social media posts with recurrent neural networks //Journal of Biomedical Informatics. - 2018. - Vol. 84. - pages 93-102. (Q1, Impact Factor 8.0) [Scopus, WOS]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, proposed a classification model in collaboration with M.Z., designed evaluation, and guided the research.

12. Alimova I., Tutubalina E., Nikolenko S. I. Cross-Domain Limitations of Neural Models on Biomedical Relation Classification //IEEE Access. - 2021. - Vol. 10. - pages 1432-1439. (Q1, Impact Factor 3.476) [Scopus, WOS]

Contribution of the dissertation's author: the author of this thesis formulated the scientific problem, proposed a neural model in collaboration with A.I., designed evaluation, and guided the research.

13. Magge A., Tutubalina E., Miftahutdinov Z., Alimova I., Dirkson A., Verberne S., Weissenbacher D., Gonzalez-Hernandez G. DeepADEMiner: a deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter //Journal of the American Medical Informatics Association. - 2021. - Vol. 28. - No. 10. - pages 2184-2192. (Q1, Impact Factor 7.942) [Scopus, WOS]

Contribution of the dissertation's author: the author of this thesis proposed and developed two information extraction models, and the first versions of programs that implement the proposed models.

Second-tier publications:

14. Miftahutdinov Z., Tutubalina E. Deep learning for ICD coding: Looking for medical concepts in clinical documents in English and in French //International Conference of the Cross-Language Evaluation Forum for European Languages. - Springer, Cham, 2018. - pages 203-215. [Scopus, WOS]

Contribution of the dissertation's author: main co-author; the author of this thesis formulated the scientific problem, proposed an encoder-decoder architecture with semantic similarity features in collaboration with M.Z., designed experiments, and guided the research.

15. Alekseev A., Miftahutdinov Z., Tutubalina E., Shelmanov A., Ivanov V., Kokh V., Nesterov A., Avetisian M., Chertok A., Nikolenko S. Medical Crossing: a Cross-lingual Evaluation of Clinical Entity Linking // 2022 Language Resources and Evaluation Conference, LREC 2022. — 2022. — pages 4212-4220. [Scopus, CORE C conf.]

Contribution of the dissertation's author: the author of this thesis formulated the scientific problem, proposed novel evaluation strategies, the first versions of programs that implement the proposed evaluation, and guided the research.

16. Miftahutdinov Z., Tutubalina E. Deep neural models for medical concept normalization in user-generated texts // ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop. — 2019. — pages 393-399. [Scopus, WOS]

Contribution of the dissertation's author: main co-author; same as [13] (in this work, the experimental part from [13] is expanded).

Invited talks at conferences and seminars:

1. 27th Annual Conference on Intelligent Systems for Molecular Biology & 18th European Conference on Computational Biology ISMB/ECCB 2019, Basel, Switzerland, 21.07-25.07.2019, "Towards the Semantic Interpretation of User-Generated Texts about Drug Therapy";

2. Lecture from the cycle "On the edge of science", Moscow, Russia, 23.11.2021, "How to train artificial intelligence to identify adverse drug effects from social media posts";

3. International Scientific Conference "Machine Learning and Artificial Intelligence Technologies" (MLW 2021), Sochi, Russia, 25.11.2021, "Drug and Disease Interpretation Learning";

4. Open conference on artificial intelligence Opentalk.AI 2020, Moscow, Russia, 19.02-21.02.2020, "Processing messages from social media about side effects of drugs";

5. Educational Intensive "Archipelago 20.35", Innopolis, Russia, 11.11.2020, "Processing messages from social networks about side effects of drugs";

6. 4th Social Media Mining for Health Applications Workshop & Shared Task (SMM4H 2019), Florence, Italy, 28.07-03.08.2019, "KFU NLP Team at SMM4H 2019 Tasks: Want to Extract Adverse Drugs Reactions from Tweets? BERT to The Rescue";

7. Data Fest 2018, 28.04.2018, "What's hurting you? Application of NLP methods in drug discovery";

8. 3rd Kazan Summer School on Chemoinformatics, Kazan, Russia, 5.07-7.07.2017, "Text Mining in Biomedical Research".

Contributed reports at conferences and seminars:

9. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), Abu Dhabi, UAE, 7.12-11.12.2022, "A Comprehensive Evaluation of Biomedical Entity-centric Search";

10. 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Dublin, Ireland, 22.05-27.05.2022, "RuCCoN: Clinical Concept Normalization in Russian";

11. 13th Language Resources and Evaluation Conference (LREC 2022), Marseille, France, 21.06-23.06.2022, "Medical Crossing: a Cross-lingual Evaluation of Clinical Entity Linking";

12. 7th Social Media Mining for Health Applications Workshop & Shared Task (SMM4H 2022), online, 17.10.2022, "SMM4H 2022 Task 2: Dataset for stance and premise detection in tweets about health mandates related to COVID-19";

13. 43rd European Conference on Information Retrieval (ECIR 2021), online, 28.03-1.04.2021, "Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer";

14. 6th Social Media Mining for Health Applications Workshop & Shared Task (SMM4H 2021), online, 10.06.2021, "KFU NLP Team at SMM4H 2021 Tasks: Cross-lingual and Cross-modal BERT-based Models for Adverse Drug Effects";

15. Widening Natural Language Processing Workshop (WiNLP 2021), online, 21.11.2021, "Adverse Drug Reaction Classification of Tweets with Fusion of Text and Drug Representations";

16. Ivannikov ISP RAS Open Conference 2021, 02.12-03.12.2021, Moscow, Russia, "Cross-Lingual Transfer in Drug-Related Information Extraction from User-Generated Texts";

17. 28th International Conference on Computational Linguistics (COLING 2020), online, 8.12-12.12.2020, "Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models";

18. 5th Social Media Mining for Health Applications Workshop & Shared Task (SMM4H 2020), online, 12.12.2020, "KFU NLP Team at SMM4H 2020 Tasks: Cross-lingual Transfer Learning with Pretrained Language Models for Drug Reactions";

19. 42nd European Conference on Information Retrieval (ECIR 2020), online, 14.04-17.04.2020, "On Biomedical Named Entity Recognition: Experiments in Interlingual Transfer for Clinical and Social Media Texts";

20. 8th International Conference on Analysis of Images, Social networks and Texts (AIST 2019), Kazan, Russia, 17.07-19.07.2019, "Biomedical Entities Impact on Rating Prediction for Psychiatric Drugs";

21. Google NLP Summit 2019, 24.06-26.06.2019, "Towards the Semantic Interpretation of User-Generated Texts about Drug Therapy";

22. 57th Conference of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28.07-03.08.2019, "Deep Neural Models for Medical Concept Normalization in User-Generated Texts";

23. 57th Conference of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28.07-03.08.2019, "Detecting Adverse Drug Reactions from Biomedical Texts With Neural Networks";

24. 21st International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID 2019), Kazan, Russia, 15.10-18.10.2019, "A comparative study on feature selection in relation extraction from electronic health records";

25. VI International Conference "Information technologies, telecommunications and control systems" (ITTCS 2019), Innopolis, Russia, 6.12.2019, "Comparative Analysis of Context Representation Models in the Relation Extraction Task from Biomedical Texts";

26. Ivannikov ISP RAS Open Conference 2018, 21.11-22.11.2018, Moscow, Russia, "Comparative analysis of neural networks in the problem of classification of side effects at the level of entities in English texts";

27. 9th International Conference and Labs of the Evaluation Forum (CLEF 2018), Avignon, France, 10.09-14.09.2018, "Deep Learning for ICD Coding: Looking for Medical Concepts in Clinical Documents in English and in French";

28. Machine Learning for Health Workshop (ML4H 2018), Montreal, Canada, 2.12-08.12.2018, "Sequence Learning with RNNs for Medical Concept Normalization in User-Generated Texts";

29. Artificial Intelligence and Natural Language Conference (AINL 2018), St. Petersburg, Russia, 17.10-19.10.2018, "Interactive Attention Network for Adverse Drug Reaction Classification";

30. Russian Summer School in Information Retrieval (RuSSIR 2018), Kazan, Russia, 27.08-31.08.2018, "Using semantic analysis of texts for the identification of drugs with similar therapeutic effect";

31. International Conference on Computational Linguistics and Intellectual Technologies "Dialog", Moscow, Russia, 30.05-02.06.2018, "Leveraging Deep Neural Networks and Semantic Similarity Measures for Medical Concept Normalization in User Reviews";

32. Ivannikov ISP RAS Open Conference 2017, 30.11-1.12.2017, Moscow, Russia, "A machine learning approach to classification of drug reviews in Russian";

33. 8th International Conference and Labs of the Evaluation Forum (CLEF 2017), Dublin, Ireland, 11.09-14.09.2017, "KFU at CLEF eHealth 2017 Task 1: ICD-10 coding of English death certificates with recurrent neural networks";

34. IEEE 30th Neumann Colloquium (NC 2017), Budapest, Hungary, 24-25.11.2017, "End-to-end deep framework for disease named entity recognition using social media data";

35. International Conference on Computational Linguistics and Intellectual Technologies "Dialog", Moscow, Russia, 31.05-3.06.2017, "Identifying disease-related expressions in reviews using conditional random fields".

2 New models and methods for classification and information extraction

New models and methods for classification and information extraction (IE) were proposed and developed by the author of this dissertation [11-20]. Consolidating knowledge about drugs and diseases across different sub-domains is crucial for effective biomedical applications, especially considering the vast amount of biomedical texts that require analysis. Therefore, the use of automated NLP methods is imperative for efficient information retrieval (IR) or IE. In particular, the following key scientific problems, addressed in this chapter, are discussed:

— The first problem, as highlighted in [11], is the significant human effort required to annotate sufficient training examples for each language or subdomain in modern supervised models. Furthermore, Named Entity Recognition (NER) models may exhibit exceptionally poor performance when faced with domain shift or language shift, which is another major challenge in biomedical NLP.

— Current neural network-based approaches for detecting adverse drug events (ADEs) from texts, as discussed in [18], are limited in their ability to leverage drug structure and mainly rely on capturing textual information from user posts about drugs.

— The third problem, as studied in [16; 17], is the cross-terminology mapping of entity mentions to a given lexicon without additional re-training. This is a common challenge in the biomedical domain, where different terminologies and ontologies are used to represent biomedical concepts.

— Another challenge, discussed in [19], is the effective retrieval and analysis of biomedical texts that focus on specific entities such as diseases, genes, and chemicals. With the overwhelming amount of text data produced in the biomedical field, coupled with the limitations of state-of-the-art IR approaches based on dense or sparse embeddings, there is a need for an entity-centric search engine design and evaluation.

To address the problems highlighted above, several new models and methods are developed:

— To address the first problem, multilingual transfer learning between electronic health records (EHRs) and user-generated texts (UGTs) in different languages is explored, with the goal of investigating whether knowledge can be transferred from a high-resource language, such as English, to a low-resource language, such as Russian, to perform NER of biomedical entities [11]. This approach leverages the multilingual capabilities of pretrained models and incorporates transfer learning.

— To address the second problem, a novel method to utilize both textual and molecular information for ADE classification is proposed [18]. To fuse the drug and tweet representations, two strategies are explored, including using a co-attention mechanism to integrate features of different modalities.

— To address the third problem, a Drug and disease Interpretation Learning with Biomedical Entity Representation Transformer (DILBERT) model is proposed that uses metric learning and negative sampling to obtain entity and concept embeddings. The DILBERT model enables the creation of a shared semantic vector space for entities and concepts from the knowledge base, allowing for cross-terminology entity linking (EL) without the need for re-training [16; 17].

— To address the fourth problem, a BERT-based IE system is designed as an entity-centric search engine [19]. The system consists of two sub-modules, namely the NER sub-module and the EL sub-module, which are applied successively. The NER sub-module is responsible for identifying entities of interest, while the EL sub-module links the extracted entities with concepts from relevant knowledge bases using the DILBERT model. To evaluate the approach, a novel search collection of PubMed abstracts for disease and gene queries is developed, along with corresponding relevance judgments.
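As an illustration of this two-stage design, the sketch below composes a toy NER sub-module with an embedding-based EL sub-module applied successively. The gazetteer, concept IDs, and 3-dimensional vectors are hypothetical stand-ins for the BERT-based components described above, not the actual system.

```python
from math import sqrt

# Toy concept inventory: MeSH-like IDs with hypothetical 3-d concept embeddings.
CONCEPTS = {
    "D003920": ("diabetes mellitus", [0.9, 0.1, 0.0]),
    "D006973": ("hypertension",      [0.1, 0.9, 0.2]),
}

# Hypothetical mention embeddings (in practice produced by a BERT encoder).
MENTION_VECS = {"diabetes": [0.8, 0.2, 0.1], "high blood pressure": [0.2, 0.8, 0.3]}

GAZETTEER = {"diabetes", "high blood pressure"}  # stand-in for the NER sub-module

def ner(text):
    """Return entity mentions found in the text (dictionary-lookup stand-in)."""
    return [m for m in GAZETTEER if m in text.lower()]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def link(mention):
    """EL sub-module: nearest concept by cosine similarity in a shared space."""
    vec = MENTION_VECS[mention]
    return max(CONCEPTS, key=lambda cid: cosine(vec, CONCEPTS[cid][1]))

def pipeline(text):
    """Apply NER and EL successively, as in the entity-centric search design."""
    return {m: link(m) for m in ner(text)}

print(pipeline("Patient with diabetes and high blood pressure."))
```

Each extracted mention ends up linked to the concept whose embedding lies closest in the shared vector space.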

The key results and conclusions on transfer learning research [11], which explores multilingual transfer learning for NER in the biomedical domain, are as follows:

— Based on the evaluation results, it can be concluded that the multi-BERT approach exhibits the best transfer capabilities in the zero-shot setting when the training and test sets are either in the same language or in the same domain.

— Transfer learning is shown to effectively reduce the amount of labeled data required to achieve high performance. Specifically, trained models were able to achieve 98-99% of the full dataset performance on both types of entities after training on only 10-25% of sentences.
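Results of this kind are typically reported as entity-level F1. A minimal scorer over (start, end, type) spans, as a sketch of the evaluation protocol (the span representation is an assumption, not the paper's exact tooling):

```python
def entity_f1(gold, pred):
    """Entity-level precision/recall/F1 over sets of (start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One gold span matched, one missed, one spurious prediction:
gold = [(0, 2, "DISEASE"), (5, 7, "DRUG")]
pred = [(0, 2, "DISEASE"), (9, 10, "DRUG")]
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```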

The key results and conclusions on the multimodal research [18] are as follows:

— The proposed approach is effective in utilizing both textual and molecular information for ADE classification, and achieves state-of-the-art performance on several benchmark datasets in English, French, and Russian.

— Experiments show that the molecular information obtained from neural networks is more beneficial for ADE classification than traditional molecular descriptors.
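A co-attention fusion of the two modalities, as mentioned above, can be sketched in a parameter-free form; the actual model in [18] learns projection matrices, and the dimensions and inputs below are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(text, drug):
    """Fuse token features (n x d) with drug substructure features (m x d).

    Each modality attends over the other via the cross-modal affinity matrix,
    and the attended summaries are concatenated into one fused vector.
    """
    affinity = text @ drug.T                       # (n, m) cross-modal scores
    text_ctx = softmax(affinity, axis=1) @ drug    # drug-aware token features
    drug_ctx = softmax(affinity.T, axis=1) @ text  # text-aware drug features
    return np.concatenate([text_ctx.mean(axis=0), drug_ctx.mean(axis=0)])

rng = np.random.default_rng(0)
fused = co_attention(rng.normal(size=(6, 8)), rng.normal(size=(4, 8)))
print(fused.shape)  # (16,)
```

The fused vector would then feed a classification head that predicts whether the tweet reports an ADE.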

The key results and conclusions of studies on an information extraction system [16; 17; 19] are as follows:

— Experiments show that the DILBERT model significantly outperforms baseline and state-of-the-art architectures for biomedical EL. Moreover, this model is effective in knowledge transfer from the scientific literature to clinical trial data using a novel annotated dataset for drug and disease linking for evaluation.

— The neural IE architecture shows superior performance in a zero-shot setup for search with both disease and gene concept queries. Furthermore, the IE system can effectively handle out-of-domain abstracts, indicating its potential to be applied to a wide range of biomedical entities.
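The metric-learning objective with negative sampling behind such a model can be illustrated with a minimal triplet-margin loss over cosine similarities; the 2-dimensional embeddings and margin value are hypothetical:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(mention, positive, negatives, margin=0.2):
    """Push a mention closer to its gold concept than to sampled negatives.

    loss = sum over negatives of max(0, margin - sim(m, pos) + sim(m, neg)).
    """
    pos_sim = cosine(mention, positive)
    return sum(max(0.0, margin - pos_sim + cosine(mention, n)) for n in negatives)

mention  = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])                        # gold concept embedding
negatives = [np.array([0.8, 0.6]), np.array([0.0, 1.0])]  # sampled other concepts
print(round(triplet_loss(mention, positive, negatives), 3))  # 0.006
```

Minimizing this loss pulls entity and concept embeddings into a shared space in which nearest-neighbor search performs the linking.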


Conclusion of the dissertation on the topic "Other specialties", Tutubalina Elena Viktorovna

5. Conclusion

We have presented the first cross-lingual benchmark for clinical entity linking in English, Spanish, French, German, and Dutch. We perform an extensive evaluation of BERT-based models with state-of-the-art biomedical representations in two setups: with official train/test splits and with filtered test sets. Our filtering strategy keeps only entity mentions that are dissimilar to entries from the reference set; as the reference set, we adopt either the training set or the target entity dictionary. Our evaluation shows a great divergence in performance between the official and proposed test sets for all languages and models, answering RQ1 in the affirmative and supporting the claim that fair evaluation requires the proposed dataset filtering (the answer to RQ2). Our experiments with SapBERT show that cross-lingual training on the English MCN corpus substantially helps to improve performance on clinical datasets in other languages, which answers RQ3. Finally, answering RQ4, we note that general-purpose models without domain knowledge and fine-tuning are almost useless for the considered task, falling behind even the simplistic tf-idf baseline. Our fair evaluation shows that clinical entity linking requires pre-training at least on related biomedical corpora. The constructed benchmark for cross-lingual clinical entity linking is available at https://github.com/AIRI-Institute/medical_crossing.

Our study opens up new avenues for further work. First, we plan to extend this evaluation to more languages, more corpora, and other types of entities (not only diseases but also, e.g., medical procedures or drugs). Second, SapBERT receives a significant boost in performance from using synonymy relations, but the concepts in fact form a tree-like hierarchy, and taking it into account may improve the results further. Third, since our method of evaluation moves towards zero-shot territory, we plan to apply other recently developed approaches in zero-shot learning to the entity linking problem.
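The filtering strategy above (keeping only test mentions dissimilar to every entry of the reference set) can be sketched with a standard-library string-similarity measure; the actual similarity function and threshold used in the benchmark may differ, so treat both as illustrative:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in for the real measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_test_set(test_mentions, reference, threshold=0.8):
    """Keep only mentions dissimilar to every entry of the reference set."""
    return [m for m in test_mentions
            if all(similarity(m, r) < threshold for r in reference)]

train = ["severe headache", "nausea"]
test = ["severe headaches", "cant sleep at night", "nausea"]
print(filter_test_set(test, train))  # ['cant sleep at night']
```

Near-duplicates of training mentions are dropped, so the filtered test set rewards generalization rather than memorization.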

Bibliography of the dissertation research of Doctor of Sciences Tutubalina Elena Viktorovna, 2023

7. Bibliographical References

Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.

Basaldella, M., Liu, F., Shareghi, E., and Collier, N. (2020). Cometa: A corpus for medical entity linking in the social media. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3122-3137.

Canete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., and Perez, J. (2020). Spanish pre-trained bert model and evaluation data. In PML4DC at ICLR 2020.

Combi, C., Zorzi, M., Pozzani, G., Moretti, U., and Arzenton, E. (2018). From narrative descriptions to meddra: automagically encoding adverse drug reactions. Journal of Biomedical Informatics, 84:184-199.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzman, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.

Dermouche, M., Looten, V., Flicoteaux, R., Chevret, S., Velcin, J., and Taright, N. (2016). ECSTRA-INSERM@ CLEF eHealth2016-task 2: ICD10 code extraction from death certificates. CLEF.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.

Duarte, F., Martins, B., Pinto, C. S., and Silva, M. J. (2018). Deep neural models for icd-10 coding of death certificates and autopsy reports in free-text. Journal of Biomedical Informatics, 80:64-77.

Ghiasvand, O. and Kate, R. J. (2014). UWM: Disorder mention extraction from clinical text using crfs and normalization using learned edit distance patterns. In SemEval@ COLING, pages 828-832.

Kang, B.-Y., Kim, D.-W., and Kim, H.-G. (2008). Two-phase chief complaint mapping to the umls metathesaurus in korean electronic medical records. IEEE Transactions on Information Technology in Biomedicine, 13(1):78-86.

Kersloot, M. G., van Putten, F. J., Abu-Hanna, A., Cornet, R., and Arts, D. L. (2020). Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies. Journal of Biomedical Semantics, 11(1):1-21.

Leaman, R. and Lu, Z. (2016). Taggerone: joint named entity recognition and normalization with semi-markov models. Bioinformatics, 32(18):2839-2846.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. (2020). Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234-1240.

Liu, F., Shareghi, E., Meng, Z., Basaldella, M., and Collier, N. (2021a). Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4228-4238, June.

Liu, F., Vulic, I., Korhonen, A., and Collier, N. (2021b). Learning domain-specialised representations for cross-lingual biomedical entity linking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 565-574.

Logeswaran, L., Chang, M.-W., Lee, K., Toutanova, K., Devlin, J., and Lee, H. (2019). Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449-3460.

Lou, Y., Qian, T., Li, F., Zhou, J., Ji, D., and Cheng, M. (2020). Investigating of disease name normalization using neural network and pre-training. IEEE Access, 8:85729-85739.

Miftahutdinov, Z. and Tutubalina, E. (2019). Deep neural models for medical concept normalization in user-generated texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 393-399.

Mohan, S., Angell, R., Monath, N., and McCallum, A. (2021). Low resource recognition and linking of biomedical concepts from a large ontology. arXiv preprint arXiv:2101.10587.

Phan, M. C., Sun, A., and Tay, Y. (2019). Robust representation learning of biomedical names. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3275-3285.

Pradhan, S., Elhadad, N., Chapman, W. W., Manand-har, S., and Savova, G. (2014). Semeval-2014 task 7: Analysis of clinical text. In SemEval@ COLING, pages 54-62.

Rios, A. and Kavuluru, R. (2018). Emr coding with semi-parametric multi-head matching networks. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, volume 2018, page 2081. NIH Public Access.

Sevgili, O., Shelmanov, A., Arkhipov, M., Panchenko, A., and Biemann, C. (2020). Neural entity linking: A survey of models based on deep learning. arXiv preprint arXiv:2006.00575.

Shen, W., Wang, J., and Han, J. (2014). Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443-460.

Sung, M., Jeon, H., Lee, J., and Kang, J. (2020). Biomedical entity representations with synonym marginalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3641-3650.

Suominen, H., Salantera, S., Velupillai, S., Chapman, W. W., Savova, G., Elhadad, N., Pradhan, S., South, B. R., Mowery, D. L., Jones, G. J., et al. (2013). Overview of the share/clef ehealth evaluation lab 2013. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 212-231. Springer.

Tutubalina, E., Kadurin, A., and Miftahutdinov, Z. (2020). Fair evaluation in concept normalization: a large-scale comparative analysis for bert-based models. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6710-6716.

Usui, M., Aramaki, E., Iwao, T., Wakamiya, S., Sakamoto, T., and Mochizuki, M. (2018). Extraction and standardization of patient complaints from electronic medication histories for pharma-covigilance: Natural language processing analysis in japanese. JMIR medical informatics, 6(3):e11021.

Van Mulligen, E., Afzal, Z., Akhondi, S. A., Vo, D., and Kors, J. A. (2016). Erasmus MC at CLEF eHealth 2016: Concept recognition and coding in French texts. CLEF.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Villena, F. (2021). Spanish biobert embeddings.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online, October. Association for Computational Linguistics.

Wright, D., Katsis, Y., Mehta, R., and Hsu, C.-N. (2019). Normco: Deep disease normalization for biomedical knowledge base construction. In Automated Knowledge Base Construction.

Zhu, M., Celikkaya, B., Bhatia, P., and Reddy, C. K. (2019). Latte: Latent type modeling for biomedical entity linking. arXiv preprint arXiv:1911.09787.

8. Language Resource References

Bodenreider, Olivier and McCray, Alexa T. (2003). Exploring semantic groups through visual approaches. Elsevier.


Bodenreider, Olivier. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Oxford University Press.

Borchert, Florian and Lohr, Christina and Modersohn, Luise and Langer, Thomas and Follmann, Markus and Sachs, Jan Philipp and Hahn, Udo and Schapranow, Matthieu-P. (2020). GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines.

Brown, Elliot G and Wood, Louise and Wood, Sue. (1999). The medical dictionary for regulatory activities (MedDRA). Springer.

CodeBooks, Medical. (2016). ICD-10-CM Complete Code Set 2016. Medical Code Books.

Coletti, Margaret H and Bleich, Howard L. (2001). Medical subject headings used to search the biomedical literature. BMJ Group BMA House, Tavistock Square, London, WC1H 9JR.

Jan A. Kors and S. Clematide and Saber Ahmad Akhondi and Erik M. van Mulligen and Dietrich Rebholz-Schuhmann. (2015). A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC.

Fangyu Liu and Ivan Vulic and Anna Korhonen and Nigel Collier. (2021). Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking.

Lopez-Ubeda, Pilar and Diaz-Galiano, Manuel Carlos and Martín-Valdivia, María Teresa and Lopez, Luis Alfonso Urena. (2020). Extracting Neoplasms Morphology Mentions in Spanish Clinical Cases through Word Embeddings.

Yen-Fu Luo and Weiyi Sun and Anna Rumshisky. (2019). MCN: A comprehensive corpus for medical concept normalization.

Miranda-Escalada, Antonio and Farre, Eulalia and Krallinger, Martin. (2020a). Named Entity Recognition, Concept Normalization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results.

Miranda-Escalada, Antonio and Gonzalez-Agirre, Aitor and Armengol-Estape, Jordi and Krallinger, Martin. (2020b). Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of CLEF eHealth 2020.

Peters, Ana Carolina and da Silva, Adalniza Moura Pucca and Gebeluca, Caroline P and Gumiel, Yohan Bonescki and Cintho, Lilian Mie Mukai and Car-valho, Deborah Ribeiro and Hasan, Sadid A and Moro, Claudia Maria Cabral and others. (2020). SemClinBr-a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks.

Pedro Ruas and Andre Lamurias and Francisco M. Couto. (2020). Towards a Multilingual Corpus for Named Entity Linking Evaluation in the Clinical Domain. CEUR-WS.org.

Spackman, Kent A and Campbell, Keith E and Cote, Roger A. (1997). SNOMED RT: a reference terminology for health care.

Uzuner, Ozlem and South, Brett and Shen, Shuying and DuVall, Scott. (2011). 2010 i2B2/VA challenge on concepts, assertions, and relations in clinical text.

Vashishth, Shikhar and Joshi, Rishabh and Newman-Griffis, Denis and Dutt, Ritam and Rose, Carolyn. (2020). MedType: Improving Medical Entity Linking with Semantic Type Prediction.

World Health Organization. (2013). International classification of diseases for oncology (ICD-O). 3rd edition, 1st revision.


Appendix P. Paper "Deep neural models for medical concept normalization in user-generated texts"

Miftahutdinov Z., Tutubalina E. Deep neural models for medical concept normalization in user-generated texts // ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop. — 2019. — Pp. 393-399.

DOI: 10.18653/v1/P19-2055

Permission to copy: The article is available in the public domain on the ACL Anthology's website by the link: https://aclanthology.org/P19-2055/

Deep Neural Models for Medical Concept Normalization in User-Generated Texts

Zulfat Miftahutdinov

Kazan Federal University, Kazan, Russia

zulfatmi@gmail.com

Elena Tutubalina

Kazan Federal University, Kazan, Russia Samsung-PDMI Joint AI Center, PDMI RAS, St. Petersburg, Russia

elvtutubalina@kpfu.ru

Abstract

In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a health-related entity mention in a free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS). This is a challenging task since medical terminology is very different when coming from health care professionals or from the general public in the form of social media texts. We approach it as a sequence learning problem with powerful neural networks such as recurrent neural networks and contextualized word representation models trained to obtain semantic representations of social media expressions. Our experimental evaluation over three different benchmarks shows that neural architectures leverage the semantic meaning of the entity mention and significantly outperform existing state-of-the-art models.

1 Introduction

User-generated texts (UGT) on social media present a wide variety of facts, experiences, and opinions on numerous topics, and this treasure trove of information is currently severely under-explored. We consider the problem of discovering medical concepts in UGTs with the ultimate goal of mining new symptoms, adverse drug reactions (ADR), and other information about a disorder or a drug.

An important part of this problem is to translate a text from "social media language" (e.g., "can't fall asleep all night" or "head spinning a little") to "formal medical language" (e.g., "insomnia" and "dizziness", respectively). This is necessary to match user-generated descriptions with medical concepts, but it is more than just a simple matching of UGTs against a vocabulary. We call the task of mapping the language of UGTs to medical terminology medical concept normalization. It is especially difficult since in social media, patients discuss different concepts of illness and a wide array of drug reactions. Moreover, UGTs from social networks are typically ambiguous and very noisy, containing misspelled words, incorrect grammar, hashtags, abbreviations, smileys, different variations of the same word, and so on.

Traditional approaches for concept normalization utilized lexicons and knowledge bases with string matching. The most popular knowledge-based system for mapping texts to UMLS identifiers is MetaMap (Aronson, 2001). This linguistic-based system uses lexical lookup and variants by associating a score with phrases in a sentence. The state-of-the-art baseline for clinical and scientific texts is DNorm (Leaman et al., 2013). DNorm adopts a pairwise learning-to-rank technique using vectors of query mentions and candidate concept terms. This model outperforms MetaMap significantly, increasing the macro-averaged F-measure by 25% on an NCBI disease dataset. However, while these tools have proven to be effective for patient records and research papers, they achieve moderate results on social media texts (Nikfarjam et al., 2015; Limsopatham and Collier, 2016).

Recent works go beyond string matching: these works have tried to view the problem of matching a one- or multi-word expression against a knowledge base as a supervised sequence labeling problem. Limsopatham and Collier (2016) utilized convolutional neural networks (CNNs) for phrase normalization in user reviews, while Tutubalina et al. (2018), Han et al. (2017), and Belousov et al. (2017) applied recurrent neural networks (RNNs) to UGTs, achieving similar results. These works were among the first applications of deep learning techniques to medical concept normalization.

The goal of this work is to study the use of deep neural models, i.e., the contextualized word representation model BERT (Devlin et al., 2018) and Gated Recurrent Units (GRU) (Cho et al., 2014) with an attention mechanism, paired with word2vec word embeddings and contextualized ELMo embeddings (Peters et al., 2018). We investigate if a joint architecture with special provisions for domain knowledge can further improve the mapping of entity mentions from UGTs to medical concepts. We combine the representation of an entity mention constructed by a neural model with distance-like similarity features computed over vectors of the entity mention and concepts from the UMLS. We experimentally demonstrate the effectiveness of the neural models for medical concept normalization on three real-life datasets of tweets and user reviews about medications with two evaluation procedures.

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 393-399, Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics

Entity from UGTs                                  Medical Concept
no sexual interest                                Lack of libido
nonsexual being                                   Lack of libido
couldnt remember long periods of time or things   Poor long-term memory
loss of memory                                    Amnesia
bit of lower back pain                            Low Back Pain
pains                                             Pain
like i went downhill                              Depressed mood
just lived day by day                             Apathy
dry mouth                                         Xerostomia

Table 1: Examples of extracted social media entities and their associated medical concepts.

2 Problem Statement

Our main research problem is to investigate the content of UGTs with the aim to learn the transition between a layperson's language and formal medical language. Examples from Table 1 show that an automated model has to account for the semantics of an entity mention. For example, it has to be able to map not only phrases with shared n-grams, such as "no sexual interest" and "nonsexual being", to the concept "Lack of libido", but also to distinguish the phrase "bit of lower back pain" from the broader concept "Pain" and map it to a narrower concept.

While focusing on user-generated texts on social media, in this work we seek to answer the following research questions.

RQ1: Do distributed representations reveal important features for medication use in usergenerated texts?

RQ2: Can we exploit the semantic similarity between entity mentions from user comments and medical concepts? Do the neural models produce better results than the existing effective baselines? [current research]

RQ3: How to integrate linguistic knowledge about concepts into the models? [current research]

RQ4: How to jointly learn concept embeddings from UMLS and representations of health-related entities from UGTs? [future research]

RQ5: How to effectively use contextual information to map entity mentions to medical concepts? [future research]

To answer RQ1, we began by collecting UGTs from popular medical web portals and investigating distributed word representations trained on 2.6 million health-related user comments. In particular, we analyze drug name representations using clustering and chemoinformatics approaches. The analysis demonstrated that similar word vectors correspond to either drugs with the same active compound or to drugs with close therapeutic effects that belong to the same therapeutic group. It is worth noting that chemical similarity in such drug pairs was found to be low. Hence, these representations can help in the search for compounds with potentially similar biological effects among drugs of different therapeutic groups (Tutubalina et al., 2017).

To answer RQ2 and RQ3, we develop several models and conduct a set of experiments on three benchmark datasets where social media texts are extracted from user reviews and Twitter. We present this work in Sections 3 and 4. We discuss RQ4 and RQ5 with research plans in Section 5.

3 Methods

Following state-of-the-art research (Limsopatham and Collier, 2016; Sarker et al., 2018), we view concept normalization as a classification problem.

To answer RQ2, we investigate the use of neural networks to learn the semantic representation of an entity before mapping its representation to a medical concept. First, we convert each mention into a vector representation using one of the following (well-known) neural models:

(1) bidirectional LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014) with an attention mechanism and a hyperbolic tangent activation function on top of 200-dimensional word embeddings obtained to answer RQ1;

(2) a bidirectional layer with attention on top of deep contextualized word representations ELMo (Peters et al., 2018);

(3) a contextualized word representation model BERT (Devlin et al., 2018), which is a multilayer bidirectional Transformer encoder.

We omit technical explanations of the neural network architectures due to space constraints and refer to the studies above.

Next, the learned representation is concatenated with a number of semantic similarity features based on prior knowledge from the UMLS Metathesaurus. Lastly, we add a softmax layer to convert values to conditional probabilities.
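A minimal sketch of attention pooling over token vectors, as used to build the mention representation before the concatenation and softmax steps; the parameters here are random and untrained, so this only illustrates the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(42)

def attention_pool(hidden, w, v):
    """Weight token vectors (n x d) by softmax(tanh(h W) v) and sum them."""
    scores = np.tanh(hidden @ w) @ v   # (n,) unnormalized attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over tokens
    return weights @ hidden            # (d,) pooled mention representation

hidden = rng.normal(size=(5, 8))  # e.g., BiGRU outputs for a 5-token mention
w = rng.normal(size=(8, 8))
v = rng.normal(size=8)
mention_vec = attention_pool(hidden, w, v)
print(mention_vec.shape)  # (8,)
```

The pooled vector is then concatenated with the UMLS similarity features and fed to the softmax layer over concept classes.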

A distinctive advantage of the biomedical domain is the abundance of curated domain knowledge for dozens of languages. In particular, UMLS is undoubtedly the largest lexico-semantic resource for medicine, containing more than 150 lexicons with terms from 25 languages. To answer RQ3, we extract a set of features to enhance the representation of phrases. These features are cosine similarities between the vectors of an input phrase and a concept in a medical terminology dictionary. We use the following strategy, which we call TF-IDF (MAX), to construct representations of a concept and a mention: represent a medical code as a set of terms; for each term, compute the cosine similarity between its TF-IDF representation and that of the entity mention; then choose the term with the largest similarity.
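The TF-IDF (MAX) strategy can be sketched as follows; the toy terminology and the IDF weighting scheme are illustrative, not the exact setup used in the experiments:

```python
from collections import Counter
from math import log, sqrt

def tfidf_vec(text, idf):
    """Sparse TF-IDF vector of a phrase over a given IDF table."""
    tf = Counter(text.lower().split())
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_max(mention, concept_terms, idf):
    """Mention-to-concept similarity = max over the concept's synonym terms."""
    mv = tfidf_vec(mention, idf)
    return max(cosine(mv, tfidf_vec(term, idf)) for term in concept_terms)

# Toy document frequencies over a hypothetical 4-term terminology.
docs = ["lack of libido", "loss of libido", "low back pain", "pain"]
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: log(len(docs) / c) for t, c in df.items()}

score = tfidf_max("no sexual interest or libido",
                  ["lack of libido", "loss of libido"], idf)
print(round(score, 3))  # 0.408, driven by the shared token "libido"
```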

4 Experiments

We perform an extensive evaluation of neural models on three datasets of UGTs, namely CADEC (Karimi et al., 2015), PsyTAR (Zolnoori et al., 2019), and SMM4H 2017 (Sarker et al., 2018). The basic task is to map a social media phrase to a relevant medical concept.

4.1 Data

CADEC. CSIRO Adverse Drug Event Corpus (CADEC) (Karimi et al., 2015) is the first richly annotated and publicly available corpus of medical forum posts taken from AskaPatient (https://www.askapatient.com). This dataset contains 1253 UGTs about 12 drugs divided into two categories: Diclofenac and Lipitor. All posts were annotated manually for 5 types of entities: ADR, Drug, Disease, Symptom, and Finding. The annotators performed terminology association using the Systematized Nomenclature Of Medicine Clinical Terms (SNOMED CT). We removed "conceptless" or ambiguous mentions for the purposes of evaluation. There were 6,754 entities and 1,029 unique codes in total.

PsyTAR. Psychiatric Treatment Adverse Reactions (PsyTAR) corpus (Zolnoori et al., 2019) is the second open-source corpus of user-generated posts taken from AskaPatient. This dataset includes 887 posts about four psychiatric medications from two classes: (i) Zoloft and Lexapro from the Selective Serotonin Reuptake Inhibitor (SSRI) class and (ii) Effexor and Cymbalta from the Serotonin Norepinephrine Reuptake Inhibitor (SNRI) class. All posts were annotated manually for 4 types of entities: ADR, withdrawal symptoms, drug indications, and sign/symptoms/illness. The corpus consists of 6556 phrases mapped to 618 SNOMED codes.

SMM4H 2017. In 2017, Sarker et al. (2018) organized the Social Media Mining for Health (SMM4H) shared task which introduced a dataset with annotated ADR expressions from Twitter. Tweets were collected using 250 keywords such as generic and trade names for medications along with misspellings. Manually extracted ADR expressions were mapped to Preferred Terms (PTs) of the Medical Dictionary for Regulatory Activities (MedDRA). The training set consists of 6650 phrases mapped to 472 PTs. The test set consists of 2500 mentions mapped to 254 PTs.

4.2 Evaluation Details

We evaluate our models based on classification accuracy, averaged across randomly divided five folds of the CADEC and PsyTAR corpora. For SMM4H 2017 data, we adopted the official training and test sets (Sarker et al., 2018). Analysis of randomly split folds shows that Random KFolds create a high overlap of expressions in exact matching between subsets (see the baseline results in Table 2). Therefore, we set up a specific train/test split procedure for 5-fold cross-validation on the CADEC and PsyTAR corpora: we removed duplicates of mentions and grouped the medical records into sets related to specific medical codes. Then, each set was split independently into k folds, and all folds were merged into the final k folds, named Custom KFolds. Random folds of CADEC are adopted from (Limsopatham and Collier, 2016) for a fair comparison. Custom folds of CADEC are adopted from our previous work (Tutubalina et al., 2018). PsyTAR folds are available on Zenodo.org (https://doi.org/10.5281/zenodo.3236318). We have also implemented a simple baseline approach that uses exact lexical matching with lowercased annotations from the training set.
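The Custom KFolds construction (dedupe mentions, group by medical code, split each group independently, merge the per-group folds) can be sketched as follows; the round-robin assignment within a group and the toy CUI codes are illustrative:

```python
from collections import defaultdict

def custom_kfolds(pairs, k=5):
    """Split (mention, code) pairs so each code's mentions spread across folds.

    Duplicate mention strings are removed first; then each code group is
    split independently into k folds, and the per-group folds are merged.
    """
    by_code = defaultdict(list)
    for mention, code in dict(pairs).items():  # dedupe mentions (last code wins)
        by_code[code].append((mention, code))
    folds = [[] for _ in range(k)]
    for group in by_code.values():
        for i, item in enumerate(group):
            folds[i % k].append(item)          # round-robin within the group
    return folds

data = [("head spinning", "C0012833"), ("dizzy", "C0012833"),
        ("dizzy", "C0012833"), ("cant sleep", "C0917801")]
folds = custom_kfolds(data, k=2)
print(sum(len(f) for f in folds))  # 3 unique mentions across the folds
```

This guarantees that no mention string appears in both a training and a test fold, which is exactly what the baseline-overlap analysis motivates.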

4.3 Results

Table 2 shows our results for the concept normalization task on the Random and Custom KFolds of the CADEC, PsyTAR, and SMM4H 2017 corpora.

To answer RQ2, we compare the performance of the examined neural models with the baseline and state-of-the-art methods in terms of accuracy. Attention-based GRU with ELMo embeddings showed improvement over GRU with word2vec embeddings, increasing the average accuracy to 77.85 (+3.65). The semantic information of an entity mention learned by BERT helps to improve the mapping abilities, outperforming other models (avg. accuracy 83.67). Our experiments with recurrent units showed that GRU consistently outperformed LSTM on all subsets, and the attention mechanism provided further quality improvements for GRU. From the difference in accuracy on the Random and Custom KFolds, we conclude that future research should focus on developing extrinsic test sets for medical concept normalization. In particular, the BERT model's accuracy on the CADEC Custom KFolds decreased by 9.23% compared to the CADEC Random KFolds.

To answer RQ3, we compare the performance of models with additional similarity features (marked by "w/") against the others. Indeed, joint models based on GRU and similarity features gain a 2-5% improvement on sets with Custom KFolds. The joint model based on BERT and similarity features stays roughly on par with BERT on all sets. We also tested different strategies for constructing representations using word embeddings and TF-IDF over all synonyms' tokens, which led to similar improvements for GRU.

5 Future Directions

RQ4. Future research might focus on developing an embedding method that jointly maps extracted entity mentions and UMLS concepts into the same continuous vector space. Such a method would make it easy to measure the similarity between words and concepts in the same space. Recently, Yamada et al. (2016) demonstrated that co-trained vectors improve the quality of both word and entity representations in entity linking (EL), a task closely related to concept normalization. We note that most of the recent EL methods focus on the disambiguation sub-task, applying simple heuristics for candidate generation. The latter is especially challenging in medical concept normalization due to a significant language difference between medical terminology and patient vocabulary.

RQ5. Error analysis has confirmed that models often misclassify closely related concepts (e.g., "Emotionally detached" and "Apathy") and antonymous concepts (e.g., "Hypertension" and "Hypotension"). We suggest taking into account not only the distance-like similarity between entity mentions and concepts but also the mention's context, which recent studies on concept normalization do not use directly. The context can be represented by the set of adjacent words or entities. As an alternative, one can use a conditional random field (CRF) to output the most likely sequence of medical concepts discussed in a review.
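A minimal sketch of one such context representation, concatenating the mention vector with the mean over a window of adjacent words (the placeholder vectors, window size, and mean pooling are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Placeholder word vectors for a review sentence; a real model would use
# pre-trained embeddings or a contextual encoder.
tokens = ["this", "drug", "made", "me", "feel", "detached", "all", "day"]
vectors = {t: rng.normal(size=dim) for t in tokens}

def contextual_mention_vector(tokens, start, end, window=2):
    """Mention vector concatenated with the mean of a +/-window context."""
    mention = np.mean([vectors[t] for t in tokens[start:end]], axis=0)
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    context = [vectors[t] for t in left + right]
    ctx = np.mean(context, axis=0) if context else np.zeros(dim)
    return np.concatenate([mention, ctx])

vec = contextual_mention_vector(tokens, start=5, end=6)  # mention: "detached"
```

Context words such as "feel" could help separate "Emotionally detached" from "Apathy" where the mention string alone is ambiguous.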

6 Related Work

In 2004, the research community started to address the need to automatically detect biomedical entities in free text through shared tasks. Huang and Lu (2015) survey the work done in organizing biomedical NLP (BioNLP) challenge evaluations up to 2014. These tasks are devoted to the normalization of (1) genes from scientific articles (BioCreative I-III in 2005-2011); (2) chemical entity mentions (BioCreative IV CHEMDNER in 2014); (3) disorders from abstracts (BioCreative V CDR Task in 2015); (4) diseases from clinical reports (ShARe/CLEF eHealth 2013; SemEval 2014 Task 7). Similarly, the CLEF eHealth 2016 and 2017 labs addressed the problem of ICD coding of free-form death certificates (without specified entity mentions). Traditionally, linguistic approaches based on dictionaries, association measures, and syntactic properties have been used to map texts to a concept from a controlled vocabulary (Aronson, 2001; Van Mulligen et al., 2016; Mottin et al., 2016; Ghiasvand and Kate, 2014; Tang et al., 2014). Leaman et al. (2013) proposed the DNorm system, based on a pairwise learning-to-rank technique over vectors of query mentions and candidate concept terms; these vectors are obtained from a TF-IDF representation of all tokens from training mentions and concept terms. Zweigenbaum and Lavergne (2016) utilized a hybrid method combining simple dictionary projection and mono-label supervised classification for ICD coding. Nevertheless, the majority of biomedical research on medical concept extraction has focused primarily on scientific literature and clinical records (Huang and Lu, 2015). Zolnoori et al. (2019) applied cTAKES, a popular dictionary look-up system, to user reviews; augmented with PsyTAR's dictionaries, cTAKES achieved roughly twice better results (0.49 F1 score on exact matching). Thus, dictionaries gathered from layperson language can efficiently improve automatic performance.

Method | CADEC Random | CADEC Custom | PsyTAR Random | PsyTAR Custom | SMM4H Official
------ | ------------ | ------------ | ------------- | ------------- | --------------
Baseline: match with training set annotation | 66.09 | 0.0 | 56.04 | 2.63 | 67.12
DNorm (Limsopatham and Collier, 2016) | 73.39 | - | - | - | -
CNN (Limsopatham and Collier, 2016) | 81.41 | - | - | - | -
RNN (Limsopatham and Collier, 2016) | 79.98 | - | - | - | -
Attentional Char-CNN (Niu et al., 2018) | 84.65 | - | - | - | -
Hierarchical Char-CNN (Han et al., 2017) | - | - | - | - | 87.7
Ensemble (Sarker et al., 2018) | - | - | - | - | 88.7
GRU+Attention | 82.19 | 66.56 | 73.12 | 65.98 | 83.16
GRU+Attention w/ TF-IDF (MAX) | 84.23 | 70.05 | 75.53 | 68.59 | 86.28
ELMo+GRU+Attention | 85.06 | 71.68 | 77.58 | 68.34 | 86.60
ELMo+GRU+Attention w/ TF-IDF (MAX) | 85.71 | 74.70 | 79.52 | 70.05 | 87.52
BERT | 88.69 | 79.83 | 83.07 | 77.52 | 89.28
BERT w/ TF-IDF (MAX) | 88.84 | 79.25 | 82.37 | 77.33 | 89.64

Table 2: The performance of the proposed models and the state-of-the-art methods in terms of accuracy.
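The pairwise learning-to-rank idea behind DNorm can be sketched as a bilinear score over TF-IDF vectors with a margin-based update (a simplified illustration with toy data, not the exact training procedure of Leaman et al.):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training data: mentions with a correct and an incorrect concept term each.
mentions = ["head pain", "feeling weak"]
correct = ["headache", "asthenia"]
wrong = ["nausea", "dizziness"]

vectorizer = TfidfVectorizer().fit(mentions + correct + wrong)

def vec(text):
    return vectorizer.transform([text]).toarray()[0]

dim = len(vectorizer.vocabulary_)
W = np.eye(dim)  # similarity matrix, initialized to the identity as in DNorm

def score(mention, term):
    """Bilinear similarity between a mention and a candidate concept term."""
    return vec(mention) @ W @ vec(term)

# One margin-based pass: if a wrong term is not outscored by a clear margin,
# nudge W toward the correct pair and away from the wrong pair.
lr = 0.1
for m, pos, neg in zip(mentions, correct, wrong):
    if score(m, pos) < score(m, neg) + 1.0:  # margin of 1
        W += lr * (np.outer(vec(m), vec(pos)) - np.outer(vec(m), vec(neg)))
```

After the update, the correct term for each toy mention outscores the incorrect one, even though the mention and concept share no tokens.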

The 2017 SMM4H shared task (Sarker et al., 2018) was the first effort to evaluate NLP methods for the normalization of health-related text from social media on publicly released data. Recent advances in neural networks have been utilized for concept normalization: recent studies have employed convolutional neural networks (Limsopatham and Collier, 2016; Niu et al., 2018) and recurrent neural networks (Belousov et al., 2017; Han et al., 2017). These works trained neural networks from scratch using only entity mentions from the training data and pre-trained word embeddings. To sum up, most methods have dealt with encoding an entity mention itself, ignoring the broader context in which it occurred. Moreover, these studies did not examine an evaluation methodology tailored to the task.

7 Conclusion

In this work, we have performed a fine-grained evaluation of neural models for medical concept normalization. We employed several powerful models, such as BERT and RNNs paired with pre-trained word embeddings and ELMo embeddings. We also developed a joint model that combines (i) semantic similarity features based on prior knowledge from the UMLS and (ii) a learned representation that captures semantic information of an entity mention. We have carried out experiments on three datasets using 5-fold cross-validation in two setups; each dataset contains phrases and their corresponding SNOMED or MedDRA concepts. Analyzing the results, we have found that similarity features help to improve the mapping abilities of joint models based on recurrent neural networks paired with pre-trained word embeddings or ELMo embeddings, while staying roughly on par with the advanced language representation model BERT in terms of accuracy. Different evaluation setups affect the performance of models significantly: the accuracy of BERT is 7.25% higher on test sets with a simple random split than on test sets with the proposed custom split. Moreover, we have discussed some interesting future research directions and challenges to be overcome.

Acknowledgments

We thank Sergey Nikolenko for helpful discussions. This research was supported by the Russian Science Foundation grant no. 18-11-00284.

References

Alan R Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.

M. Belousov, W. Dixon, and G. Nenadic. 2017. Using an ensemble of generalised linear and deep learning models in the SMM4H 2017 medical concept normalisation task. CEUR Workshop Proceedings, 1996:54-58.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Omid Ghiasvand and Rohit J Kate. 2014. UWM: Disorder mention extraction from clinical text using CRFs and normalization using learned edit distance patterns. In SemEval@COLING, pages 828-832.

S. Han, T. Tran, A. Rios, and R. Kavuluru. 2017. Team UKNLP: Detecting ADRs, classifying medication intake messages, and normalizing ADR mentions on Twitter. CEUR Workshop Proceedings, 1996:49-53.

S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735-1780. Based on TR FKI-207-95, TUM (1995).

Chung-Chi Huang and Zhiyong Lu. 2015. Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in bioinformatics, 17(1):132-144.

Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of biomedical informatics, 55:73-81.

Robert Leaman, Rezarta Islamaj Dogan, and Zhiyong Lu. 2013. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909-2917.

Nut Limsopatham and Nigel Collier. 2016. Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation. In ACL.

Luc Mottin, Julien Gobeill, Anais Mottaz, Emilie Pasche, Arnaud Gaudinat, and Patrick Ruch. 2016. BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual information extraction. In CLEF (Working Notes), pages 94-102.

Azadeh Nikfarjam, Abeed Sarker, Karen O'Connor, Rachel Ginn, and Graciela Gonzalez. 2015. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22(3):671-681.

Jinghao Niu, Yehui Yang, Siheng Zhang, Zhengya Sun, and Wensheng Zhang. 2018. Multi-task character-level attentional networks for medical concept normalization. Neural Processing Letters, pages 1-18.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Abeed Sarker, Maksim Belousov, Jasper Friedrichs, Kai Hakala, Svetlana Kiritchenko, Farrokh Mehryary, Sifei Han, Tung Tran, Anthony Rios, Ramakanth Kavuluru, et al. 2018. Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H) 2017 shared task. Journal of the American Medical Informatics Association, 25(10):1274-1283.

Yaoyun Zhang, Jingqi Wang, Buzhou Tang, Yonghui Wu, Min Jiang, Yukun Chen, and Hua Xu. 2014. UTH_CCB: A report for SemEval-2014 Task 7 analysis of clinical text. SemEval 2014, page 802.

Elena Tutubalina, Zulfat Miftahutdinov, Sergey Nikolenko, and Valentin Malykh. 2018. Medical concept normalization in social media posts with recurrent neural networks. Journal of biomedical informatics, 84:93-102.

EV Tutubalina, Z Sh Miftahutdinov, RI Nugmanov, TI Madzhidov, SI Nikolenko, IS Alimova, and AE Tropsha. 2017. Using semantic analysis of texts for the identification of drugs with similar therapeutic effects. Russian Chemical Bulletin, 66(11):2180-2189.

E Van Mulligen, Zubair Afzal, Saber A Akhondi, Dang Vo, and Jan A Kors. 2016. Erasmus MC at CLEF eHealth 2016: Concept recognition and coding in French texts. CLEF.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 250-259.

Maryam Zolnoori, Kin Wah Fung, Timothy B Patrick, Paul Fontelo, Hadi Kharrazi, Anthony Faiola, Yi Shuan Shirley Wu, Christina E Eldredge, Jake Luo, Mike Conway, et al. 2019. A systematic approach for developing a corpus of patient reported adverse drug events: A case study for ssri and snri medications. Journal of biomedical informatics, 90:103091.

Pierre Zweigenbaum and Thomas Lavergne. 2016. Hybrid methods for ICD-10 coding of death certificates. EMNLP 2016, page 96.
