Automatic methods of metaphor recognition in Russian-language texts: dissertation and abstract topic, VAK RF specialty 10.02.21, Candidate of Sciences Yulia Gennadyevna Badryzlova
- VAK RF specialty: 10.02.21
- Number of pages: 206
Dissertation table of contents, Candidate of Sciences Yulia Gennadyevna Badryzlova
Table of Contents
Chapter I. Metaphor as a computational problem
1. Annotated corpora and databases of metaphor
1.1. Top-down and bottom-up approaches to metaphor identification in discourse
1.2. MIPVU: a procedure for linguistic metaphor identification
1.3. VUAMC: the VU Amsterdam Metaphor Corpus
2. Computational approaches to metaphor identification: state-of-the-art
2.1. Klebanov, Leong, Gutierrez, Shutova, and Flor (2016)
2.2. Klebanov, Leong, Heilman, and Flor (2014)
2.3. Mu, Yannakoudakis, and Shutova (2019)
2.4. Bulat, Clark, and Shutova (2017)
2.5. Shutova, Kiela, and Maillard (2016)
2.6. Stemle and Onysko (2018)
2.7. Wu et al. (2018)
2.8. Turney, Neuman, Assaf, and Cohen (2011)
2.9. Hovy et al. (2013)
3. Computational metaphor identification systems for Russian
3.1. Strzalkowski et al. (2013)
3.2. Tsvetkov, Mukomel, and Gershman (2013)
3.3. Tsvetkov et al. (2014)
Summary of Chapter I
Chapter II. Experimental corpus
4. Corpus design
4.1. Selection of data
4.2. Selection of target verbs
5. Corpus annotation
5.1. Non-metaphoric class
5.2. Metaphoric class
5.3. Distribution of metaphoric subclasses in the corpus
6. Annotation reliability test
6.1. Selection of sentences
6.2. Annotator instructions
6.3. Binarization of categorical annotation
6.4. Annotation results and analysis
Summary of Chapter II
Chapter III. Automated metaphor identification experiment
7. Motivation behind the choice of features
7.1. Motivation behind the use of distributional semantic feature
7.2. Motivation behind the use of lexical co-occurrence feature
7.3. Motivation behind the use of morphosyntactic co-occurrence feature
8. Data preprocessing and the context windows
9. The feature set
9.1. Distributional semantic features
9.2. Lexical co-occurrence features
9.3. Morphosyntactic co-occurrence feature
9.4. Concreteness / abstractness feature
9.5. Flag words and quotation marks features
10. Experimental setup
11. Results
11.1. Evaluation of alternative parameters of the features
11.2. Window sensitivity
11.3. Inefficient features
11.4. Classification results
Summary of Chapter III
Chapter IV. Linguistic analysis of experimental results
12. Discussion: results of the lexical classifier and their implications
12.1. Correlation between lexical diversity of MET and NONMET subcorpora and performance of the lexical classifier
12.2. Feature importance
12.3. Detecting possible lexical predictors
12.4. Correlation between metaphor association and concreteness
13. Discussion: results of the distributional semantic classifier
13.1. Linguistic interpretation of the performance across datasets
13.2. Correlation between metaphoricity, semantic similarity, and accuracy
14. Discussion: results of the morphosyntactic classifier
14.1. Correlation between metaphor association of grammatical categories and the performance of the morphological classifier
14.2. Feature importance
Summary of Chapter IV
Thesis summary
List of References
List of tables
List of figures
Appendix 1. Annotator guidelines for the inter-annotator reliability test (Chapter II. Section 3.2)
Appendix 2. Concrete ('thingness') paradigm words (Chapter III. Section
Appendix 3. Abstract paradigm words (Chapter III. Section 9.4)
Introduction to the dissertation (part of the abstract) on the topic "Automatic methods of metaphor recognition in Russian-language texts"
Metaphor occupies a prominent place in contemporary linguistic theory: it is recognized to be one of the most powerful cognitive tools with which humans conceptualize (Lakoff & Johnson, 1980a). Evidence from psycholinguistic research demonstrates that metaphor guides reasoning and decision-making in societal (Thibodeau & Boroditsky, 2011), economic (L. Jia & Smith, 2013; Landau, Keefer, & Rothschild, 2014; Morris, Sheldon, Ames, & Young, 2007; Robins & Mayer, 2000), health-related (Gallagher, McAuley, & Moseley, 2013; Hauser & Schwarz, 2015; Hendricks & Boroditsky, 2016; Scherer, Scherer, & Fagerlin, 2015), educational (Landau, Oyserman, Keefer, & Smith, 2014), and environmental (Flusberg, Matlock, & Thibodeau, 2017; Mio, Thompson, & Givens, 1993) issues.
Metaphor is truly ubiquitous in everyday discourse and forms a fundamental part of the language system. Estimates of its pervasiveness are invariably high: in a multi-domain corpus, a sentence contains on average 0.3 (single-word) metaphors (Shutova & Teufel, 2010); in genre-specific corpora, metaphors account for 5-18% of the total number of words (G. J. Steen et al., 2010). Sardinha (2008) estimated the statistical probability for a word form to occur metaphorically in a general-domain corpus as 0.7.
Not surprisingly, metaphor identification and interpretation pose a serious challenge to a wide range of real-world NLP applications, such as information retrieval, machine translation, question answering, information extraction, opinion mining, and others. Computational work on metaphor identification and interpretation began in the early-mid 1990s (Fass, 1991; Martin, 1990, 1994); the latest advances in corpus linguistics and machine learning sparked a large-scale wave of computational metaphor projects. A series of Workshops on Metaphor in NLP was held for several successive years as a part of the NAACL-HLT conference (Klebanov, Shutova, & Lichtenstein, 2014, 2016; Shutova, Klebanov, & Lichtenstein, 2015; Shutova, Klebanov, Tetreault, & Kozareva, 2013). The first competition of NLP systems in a shared metaphor detection task was held in 2018 (Leong, Klebanov, & Shutova, 2018). A comprehensive overview of state-of-the-art approaches to automated metaphor identification is available in (Veale, Shutova, & Klebanov, 2016).
Metaphor identification systems and metaphor-annotated datasets may be divided into two major groups: those that operate within the theoretical paradigm of conceptual metaphor (CM) (Lakoff & Johnson, 1980a), on the one hand, and, on the other, those that do not make any a priori assumptions about the underlying conceptual mechanisms of metaphor and focus on linguistic metaphor (LM).
A conceptual metaphor "consists of two conceptual domains, where one domain is understood in terms of another" (Kovecses, 2010). The domain which lends its conceptual structure to another domain is referred to as source domain; the domain which is conceptualized in terms of the source domain is called target domain. There is "a set of systematic correspondences between the source and the target in the sense that constituent conceptual elements of [the target] correspond to constituent elements of [the source]. Technically, these conceptual correspondences are often referred to as mappings" (ibid). Among the projects for conceptual metaphor identification are: (Dodge, Hong, & Stickles, 2015; Gandy et al., 2013; Gedigian, Bryant, Narayanan, & Ciric, 2006; Heintz, Gabbard, Srivastava, et al., 2013; Mohler, Bracewell, Hinote, & Tomlinson, 2013; Mohler, Rink, Bracewell, & Tomlinson, 2014; Ovchinnikova et al., 2014; Rosen, 2018; Shutova & Sun, 2013; Shutova, Sun, Gutiérrez, Lichtenstein, & Narayanan, 2016; Stowe & Palmer, 2018; Strzalkowski et al., 2013).
A linguistic metaphor is "a stretch of language that creates the possibility of activating two distinct domains" (Cameron, 2003). Systems designed within the LM paradigm aim to identify any stretches of text that contain indirectly used lexical units: (Badryzlova & Panicheva, 2018; Hovy, Srivastava, et al., 2013; Klebanov, Leong, Heilman, & Flor, 2014; Klebanov, Leong, Gutierrez, Shutova, & Flor, 2016; Krishnakumaran & Zhu, 2007; Neuman et al., 2013; Panicheva & Badryzlova, 2017; Shutova, Kiela, & Maillard, 2016; Tsvetkov, Boytsov, Gershman, Nyberg, & Dyer, 2014; Tsvetkov, Mukomel, & Gershman, 2013; Turney, Neuman, Assaf, & Cohen, 2011).
The first competition of metaphor identification systems, the VUA metaphor identification shared task (Leong et al., 2018), was held in 2018: the participating systems competed in identification of linguistic metaphor.
Besides the differences in the paradigm (CM vs. LM), experiments in computational identification of metaphor are differentiated by the settings in which they are designed - supervised, unsupervised, or deep learning.
Computational systems for metaphor identification also differ in the types of features exploited in them:
- Lexical features (e.g. Klebanov, Leong, et al., 2014);
- Morphological and syntactic features (e.g. Hovy, Srivastava, et al., 2013; Ovchinnikova et al., 2014);
- Distributional semantic features (e.g. Shutova, Kiela, et al., 2016);
- Topic modelling (e.g. Heintz, Gabbard, Srivastava, et al., 2013);
- Features from lexical thesauri and ontologies: WordNet (e.g. Gandy et al., 2013), FrameNet (e.g. Gedigian et al., 2006), VerbNet (e.g. Klebanov, Leong, et al., 2016), ConceptNet (Ovchinnikova et al., 2014), and the SUMO ontology (J. Dunn, 2013a, 2013b);
- Psycholinguistic features: concreteness / abstractness, imageability, affect, and force (e.g. Neuman et al., 2013; Strzalkowski et al., 2013; Turney et al., 2011).
Two more characteristics by which metaphor identification systems are differentiated are the type of analysis (binary classification or sequential labelling) and the unit of analysis (e.g. pairs or triples of syntactically related words, or all content words in a sentence).
As the majority of the state-of-the-art metaphor detection systems operate within the supervised setting, annotated datasets for their training and testing become paramount. Like metaphor identification systems, annotated datasets of metaphor can follow either of the two paradigms - the conceptual metaphor or the linguistic metaphor approach. The largest repositories of conceptual metaphor are MetaNet (Dodge et al., 2015) and the LCC Metaphor Dataset (Mohler, Brunson, Rink, & Tomlinson, 2016), both of which are multilingual. By far the largest corpus of linguistic metaphor is the VU Amsterdam Metaphor Corpus (G. J. Steen et al., 2010), which is available for English.
The central goal of this thesis is to provide in-depth linguistic analysis of context features which can be utilized in order to automatically differentiate utterances which contain linguistic metaphor from non-metaphoric ones. We address this goal by running several machine learning experiments for metaphor identification in Russian and by evaluating the importance of each of the proposed features; the latter evaluation is also performed using machine learning algorithms. It should be emphasized that the present work does not aim to engineer an algorithm which would maximize the performance on the metaphor identification task; rather, we intend to suggest feature extraction methods and to assess the efficiency of the extracted features.
The main goal of the thesis is accomplished via the following series of tasks:
- to develop a customized scheme for annotation of linguistic metaphor at the sentence level;
- to collect and annotate a corpus of contexts containing linguistic metaphor, as well as non-metaphoric ones;
- to evaluate the quality of metaphor annotation;
- to suggest methods of feature engineering for identification of linguistic metaphor at the sentence level;
- to implement machine learning experiments for linguistic metaphor identification using models based on the suggested features and their combination;
- to evaluate the performance of the models and their generalizability for experiments on new datasets;
- to provide an in-depth linguistic analysis of the contextual factors that promote the success or failure of features.
The types of features to be explored in this research are:
1. Semantic similarity;
2. Lexical co-occurrence;
3. Morphosyntactic co-occurrence;
4. Concreteness indexes;
5. Occurrences of flag words (lexical signals of metaphoricity) and quotation marks.
The following methods and algorithms are used in the present research:
- a customized version of the MIPVU procedure for annotation of linguistic metaphor (G. J. Steen et al., 2010);
- distributional semantic models (Baroni, Dinu, & Kruszewski, 2014; Kutuzov & Kuzmenko, 2016);
- ΔP metric, a statistical measure of association (Ellis, 2006);
- Support Vector Machine algorithm;
- Random Forest algorithm;
- Logistic Regression algorithm;
- K-means clustering algorithm;
- Boruta algorithm (Kursa, Jankowski, & Rudnicki, 2010).
Relevance of the thesis. The bulk of the effort on metaphor annotation and computational metaphor identification has focused on English. Most of the metaphor annotation projects for Russian that are known to us adhere to the conceptual metaphor paradigm, such as the Russian sections of the multilingual resources (Dodge et al., 2015; Mohler et al., 2016). The only Russian dataset of linguistic metaphor known to us is the corpus compiled by Tsvetkov et al. (2014). However, this dataset differs in several regards from the corpus which was collected and annotated in the present study:
- the corpus by Tsvetkov and colleagues is smaller: its size amounts to a total of 240 sentences, while our corpus comprises more than 7,000 sentences; to the best of our knowledge, this is the largest currently existing Russian corpus annotated for linguistic metaphor;
- the corpus by Tsvetkov and coauthors does not concentrate on any specific set of target lexemes and covers a range of most frequent Russian verbs and adjectives; our corpus is designed around twenty target verbs: this allows us to explore the impact of the linguistic characteristics of verbs on the performance of classification features;
- Tsvetkov et al. report that metaphoric sentences in their corpus were selected so as to contain only one metaphor, that is, the metaphoric occurrence of the target verb or adjective; the corpus presented in this thesis was compiled with the aim of approximating the experimental task to the demands of real-world NLP applications: therefore, it contains sentences which may feature multiple instances of figurative language as well as language errors and inaccuracies.
Next, most of the computational metaphor identification work for Russian that we are aware of also follows the top-down design, i.e. is aimed at identifying conceptual metaphors (Dodge et al., 2015; J. Dunn et al., 2014; Mohler et al., 2014; Strzalkowski et al., 2013). There are two experiments for identification of linguistic metaphor in Russian that are known to us: (1) Tsvetkov et al. (2013) and (2) Tsvetkov et al. (2014). However, the design of their experiments is substantially different from the experiments conducted in this thesis:
- the experiments by Tsvetkov and coauthors are based on cross-lingual model transfer: classification features in non-English languages are translated into English with an electronic bilingual dictionary and then vectorized using English lexical resources (such as WordNet, the MRC Psycholinguistic Database, or distributional semantic models); our experiments are monolingual: they neither depend on the quality of bilingual translation, which may become problematic in cases of polysemy, nor require data from other languages, relying solely on resources that are currently available for Russian NLP;
- the experiments by Tsvetkov and colleagues operate on syntactically related pairs (Adjective-Noun) and triples (Subject-Verb-Object): as a consequence, they are dependent on the quality of syntactic parsing which is not always reliable in real-life tasks; our experiments take full sentential context as input: this enables us to explore the impact of contextual and discourse factors on identification of metaphor.
Scientific novelty of the thesis. As pointed out above, we see the main goal of this thesis in suggesting a linguistic explanation and interpretation of the language and discourse-based factors which promote the success of some computational models of linguistic metaphor identification and cause other models to falter on the task. The output of a machine learning classifier is analyzed by means of statistical methods and other ML algorithms in order to arrive at empirical, data-driven conclusions about the linguistic mechanisms contributing to metaphor identification. To the best of our knowledge, this is the first attempt at such research.
Theoretical significance of the thesis. The findings of the present research may be of value for psycholinguistic and broader cognitive studies. The results presented in the thesis can shed light on the cognitive factors that make processing of metaphor by humans possible, since we explore the lexico-semantic and morphosyntactic cues which are deployed in carrying the signals of metaphoricity across from the speaker to the recipient. The results of this research can help to outline the inventory of metaphor cues and to evaluate their salience. Because we look at metaphor in context and apply a nonlinear (bag-of-items) representation, we can draw conclusions as to whether metaphor can be modelled as a holistic mental process in which the information carried by a verbalized message is a non-compositional unity of its constituent cues. Eventually, the present research may have implications for efforts aimed at providing a computational model of the metaphor decoding and encoding process.
Practical significance of the thesis. The major contributions of this thesis can be summarized as follows:
- The research re-implemented the approaches to corpus annotation which had been suggested in earlier work on metaphor annotation in English. We introduced minor modifications and applied the previously suggested protocols to Russian data.
- We compiled a relatively large dataset of metaphorical and non-metaphorical usages of 20 Russian verbs, which is made available for public use. To the best of our knowledge, this is the first public resource of this kind.
- An annotation validation experiment in a setting with multiple annotators was conducted.
- We release a ranking of concreteness indexes for approximately 17K Russian words.
- The study tested a number of earlier methodologies of feature extraction for metaphor identification in application to Russian (lexical and morphological frequencies, distributional semantic vectors, and concreteness scores).
- We developed a classifier for sentence-level binary-class identification of metaphoric occurrences in raw running Russian text.
- The thesis provides linguistic evaluation of the quality of classification and compares the efficiency of models based on different features.
- We also suggest a data-driven linguistic interpretation of the performance of the features and identify the features which hold potential for generalizability.
- The thesis provides analysis aimed at an empirical verification of the theoretical claims that formed the basis of the computational models.
Public demonstrations of the results. The major results of the research were presented at the following events:
• The 2017 Spring Symposium Series of the Association for the Advancement of Artificial Intelligence (Stanford University, Computer Science Department; Palo Alto, USA, 2017);
• The 2nd Kolmogorov Seminar on Computational Linguistics and Language Studies (National Research University Higher School of Economics, Moscow, Russia, 2017);
• Dialogue-2017, the 23rd International Conference on Computational Linguistics and Intellectual Technologies (Russian State University for the Humanities, Moscow, Russia, 2017);
• RuSSIR-2017, Russian Summer School in Information Retrieval (Ural State University, Yekaterinburg, Russia, 2017);
• The 3rd Kolmogorov Seminar on Computational Linguistics and Language Studies (National Research University Higher School of Economics, Moscow, Russia, 2018);
• AINL-2018, Artificial Intelligence and Natural Language Conference (ITMO University, Saint Petersburg, 2018);
• The 9th International Cognitive Linguistics Congress (National Research University Higher School of Economics, Nizhny Novgorod, 2019).
Note on collaboration
The initial experiments on Russian verbal metaphor identification with distributional semantic features (Panicheva & Badryzlova, 2017b) were led by Polina Panicheva in collaboration with the author of the thesis. All the other theoretical, experimental and composition work involved in the production of the thesis was carried out by the author alone.
Organization of the thesis. The thesis consists of Introduction, four Chapters, Summary, and List of References comprising 206 titles.
Chapter I provides an overview of the state-of-the-art approaches to annotation of metaphor in corpora and to engineering computational systems for automated metaphor identification.
Chapter II is devoted to the experimental corpus - the principles of selecting data and target verbs, and annotating the corpus; the chapter also gives an outline of the metaphoric and non-metaphoric classes and describes the inter-annotator reliability test - the annotator instructions, annotation binarization schemes, and the obtained measure of agreement between the annotators; the last subsection of the chapter looks at the cases of inter-annotator disagreement.
Chapter III details the metaphor identification experiment. It introduces the set of chosen features and explains the theoretical background which motivated the choice. The chapter goes on to describe the statistical approaches and computational resources which were applied in order to convert the input data into vectors, as well as the design of the machine learning experiment. The second half of the chapter discusses the results of the classification experiment: we compare the performance of models and evaluate the utility of increasing the model complexity.
Chapter IV presents an in-depth analysis of the linguistic factors determining the performance of the models. We identify the linguistic units which are most likely to carry the signal of metaphoricity and make predictions about their generalizability.
Finally, we present the Conclusions of the thesis and make suggestions for future research in the area of computational identification of metaphor.
Dissertation conclusion, specialty "Applied and Mathematical Linguistics", Yulia Gennadyevna Badryzlova
Thesis summary
We have presented an attempt to conduct statistical modelling of metaphoric occurrences of Russian verbs in context. The analysis of the obtained models enabled us to make non-trivial observations about the conceptual nature and the linguistic structure of metaphor and metaphoric discourse.
The contributions of the research can be summarized as follows:
We release a new lexical resource - an annotated Russian corpus of linguistic metaphor. The corpus contains approximately 7,000 occurrences of verbal metaphor in context; to the best of our knowledge, this is the largest currently existing Russian corpus of this kind. The corpus is built around 20 polysemous target verbs: each sentence contains an occurrence of the target verb; the sentence is tagged as metaphoric when the target verb is used metaphorically, and as non-metaphoric when the target verb is used in its non-metaphoric sense. This approach is similar to the design of the TroFi (Trope Finder) dataset by Birke and Sarkar (2006), which was built for English. The sentential contexts in our corpus are represented by unedited free text which was not controlled for conceptual complexity, i.e. each sentence can contain multiple occurrences of metaphor and metonymy outside of the target verb, which allows us to look at the natural behavior of the target metaphor in discourse and to explore the role of the context. In this regard our corpus differs from the dataset of Tsvetkov et al. (2014), in which sentences were selected so that each of them contained only one metaphorical occurrence located in the target verb. The metaphoric class in our corpus contains three types of metaphor: conventionalized and creative usages, and idiomatic expressions with the target verb. The annotation reliability test with three annotators yielded a high degree of inter-annotator agreement (0.83 and 0.9, under various conditions).
We suggest a statistical approach to quantifying the degree of association between metaphor and the constituent elements of the discourse - the index of metaphor association, which is based on the ΔP metric (Ellis, 2006). In our experiments, the index of metaphor association is computed for lexical (lemma) unigrams and full morphosyntactic tags.
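A minimal sketch of how such an association index could be computed, assuming the metric in question is the ΔP contingency measure, ΔP = P(outcome | cue) - P(outcome | no cue), where the cue is the presence of an item (a lemma or a morphosyntactic tag) in a sentence and the outcome is the metaphoric label. The function names and the toy sentences are illustrative assumptions and are not taken from the thesis, which may define the index differently in detail.

```python
from collections import Counter


def delta_p(a, b, c, d):
    """Delta P = P(metaphoric | item present) - P(metaphoric | item absent).

    a: metaphoric sentences containing the item
    b: non-metaphoric sentences containing the item
    c: metaphoric sentences without the item
    d: non-metaphoric sentences without the item
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("the item must occur in some sentences and be absent from others")
    return a / (a + b) - c / (c + d)


def metaphor_association(sentences):
    """Score every item; `sentences` is a list of (items, is_metaphoric) pairs."""
    data = [(set(items), bool(label)) for items, label in sentences]
    n_met = sum(1 for _, label in data if label)
    n_non = len(data) - n_met
    met_counts, non_counts = Counter(), Counter()
    for items, label in data:
        (met_counts if label else non_counts).update(items)
    scores = {}
    for item in set(met_counts) | set(non_counts):
        a, b = met_counts[item], non_counts[item]
        c, d = n_met - a, n_non - b
        if c + d == 0:  # item occurs in every sentence, so Delta P is undefined
            continue
        scores[item] = delta_p(a, b, c, d)
    return scores


# Hypothetical toy data: transliterated lemma sets with a metaphoric label.
toy_sentences = [
    ({"ideya", "golova"}, True),
    ({"mysl", "golova"}, True),
    ({"myach", "ruka"}, False),
    ({"myach", "golova"}, False),
]
toy_scores = metaphor_association(toy_sentences)
```

Under this definition the scores range from -1 to 1: positive values mark items that lean towards metaphoric sentences, negative values mark items that lean towards non-metaphoric ones.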
We provide a method for computing concreteness indexes of lexemes. The computation is based on a seed set of approximately 500 'thingness' paradigm words; for each word of the corpus, we use a pre-trained distributional semantic model to find its ten nearest neighbors among the paradigm words, and take the average semantic distance as the concreteness index. The concreteness ranking of about 17,000 Russian lexemes is made publicly available.
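The procedure can be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: a plain dictionary of vectors stands in for the pre-trained distributional model, cosine similarity stands in for the "semantic distance", and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def concreteness_index(word, vectors, paradigm_words, k=10):
    """Mean similarity between `word` and its k nearest paradigm words."""
    sims = sorted(
        (cosine(vectors[word], vectors[p])
         for p in paradigm_words
         if p != word and p in vectors),
        reverse=True,
    )
    top = sims[:k]
    if not top:
        raise ValueError("no paradigm word has a vector")
    return sum(top) / len(top)


# Hypothetical two-dimensional vectors; a real model has hundreds of dimensions.
toy_vectors = {
    "stone": (1.0, 0.0),
    "table": (0.9, 0.1),
    "hammer": (0.8, 0.2),
    "idea": (0.0, 1.0),
}
toy_paradigm = ["table", "hammer"]
stone_score = concreteness_index("stone", toy_vectors, toy_paradigm, k=2)
idea_score = concreteness_index("idea", toy_vectors, toy_paradigm, k=2)
```

With similarity as the underlying measure, a higher index means that the word sits closer to the 'thingness' seed set, so `stone_score` comes out above `idea_score` in the toy data.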
Theoretical literature on metaphor studies (e.g. Goatly, 1997) has placed substantial emphasis on the role of special lexical markers of metaphoricity (known as 'flag words') in creating and indicating metaphoricity. We show that flag words (such as буквально 'literally', как будто, словно 'as if', т.е. 'i.e.', and подобно 'like') as well as quotation marks cannot be efficiently used as features for classification due to sparse data.
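For illustration, a sentence-level extractor for these two signals might look like the sketch below. Only the five flag words quoted above are used; the full list in the thesis is presumably longer, and the feature names, the set of quotation characters, and the matching strategy are assumptions.

```python
import re

# The five flag words quoted in the text; the complete list is not given here.
FLAG_WORDS = ["буквально", "как будто", "словно", "т.е.", "подобно"]
QUOTE_CHARS = "«»\"„“”"

# (?<!\w) and (?!\w) stop "подобно" from matching inside "подобного".
FLAG_PATTERN = re.compile(
    "|".join(r"(?<!\w)" + re.escape(word) + r"(?!\w)" for word in FLAG_WORDS)
)


def flag_features(sentence):
    """Count flag-word occurrences and record whether any quotation mark is present."""
    text = sentence.lower()
    return {
        "flag_words": len(FLAG_PATTERN.findall(text)),
        "has_quotes": any(ch in QUOTE_CHARS for ch in sentence),
    }


example = flag_features("Он словно окаменел.")
```

Because most sentences contain neither a flag word nor a quotation mark, both features are zero almost everywhere, which is the sparsity problem the paragraph describes.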
We show that the features used in our classification experiments are sensitive to the size of the context window; in aggregate terms, the highest accuracy of classification is achieved on the context of full sentences. This observation accords with the study by Mu et al. (2019) which demonstrates that contextual features significantly enhance the quality of classification.
We conduct several series of classification experiments with models of different complexity (uni-, bi-, tri-, and four-feature models) based on the four types of features: (1) lexical co-occurrence, (2) morphosyntactic co-occurrence, (3) distributional semantic similarity, and (4) concreteness. We evaluate the accuracy of each model on each of the 20 datasets of the individual target verbs, and on the combined dataset of the 20 target verbs. The mean accuracy across the 20 datasets yielded by the distributional semantic model is 0.67; the mean accuracy of the lexical co-occurrence model is 0.82; the mean accuracy of the morphosyntactic model is 0.74; the mean accuracy of the concreteness model is 0.74. The accuracy achieved on the combined dataset of the 20 verbs by the distributional semantic model is 0.65; by the lexical co-occurrence model - 0.82; by the morphosyntactic model - 0.67; by the concreteness model - 0.63. Combining several features in one classifier increases the accuracy of classification by 1-3 accuracy points. We claim that among the uni-feature models, the lexical co-occurrence model may hold the greatest promise for generalizability, since it achieves the highest result both on the combined dataset and across the 20 datasets, while maintaining stable performance. This result is in line with the findings of Klebanov, Leong, Heilman, and Flor (2014), who show that lexical unigrams prove very successful in linguistic metaphor classification.
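The space of uni-, bi-, tri-, and four-feature models can be enumerated with a few lines of code. The sketch assumes that every non-empty subset of the four feature types is a candidate model; the text does not state whether all of them were actually trained, and the feature labels are shorthand introduced here.

```python
from itertools import combinations

# Shorthand labels for the four feature types named in the text.
FEATURES = ("lexical", "morphosyntactic", "distributional", "concreteness")


def feature_subsets(features=FEATURES):
    """Return every non-empty combination of features, smallest first."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


models = feature_subsets()
# Number of candidate models per complexity level (1 to 4 features).
models_by_size = {
    size: sum(1 for m in models if len(m) == size)
    for size in range(1, len(FEATURES) + 1)
}
```

Under that assumption there are 4 uni-feature, 6 bi-feature, 4 tri-feature and 1 four-feature candidates, 15 in total.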
We offer empirical support to the hypothesis that the efficiency of the lexical co-occurrence classifier may be related to the lexical homogeneity of one of the subcorpora (the metaphoric or the non-metaphoric ones). We measure the index of lexical diversity in the metaphoric and the non-metaphoric parts of each of the 20 individual target verbs datasets and find a strong negative correlation between the lexical diversity of the non-metaphoric subcorpus and the accuracy of classification. Thus, datasets with more lexically homogeneous non-metaphoric subcorpora are more likely to be classified with greater accuracy.
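A sketch of this kind of analysis is given below. The summary does not name the lexical diversity measure, so the type-token ratio is used as a stand-in, and the Pearson coefficient is assumed as the correlation measure; the function names are hypothetical and the numbers at the bottom are synthetic values that only check the arithmetic.

```python
import math


def type_token_ratio(lemmas):
    """Lexical diversity as distinct lemmas divided by total lemmas."""
    lemmas = list(lemmas)
    if not lemmas:
        raise ValueError("empty subcorpus")
    return len(set(lemmas)) / len(lemmas)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Synthetic check values, not results from the thesis.
toy_ttr = type_token_ratio(["a", "b", "a", "c"])
toy_r = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In the setting described above, `xs` would hold the diversity of the non-metaphoric subcorpus for each of the 20 verbs and `ys` the corresponding classification accuracies, so a strongly negative coefficient would reproduce the reported pattern.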
We attempt to induce the set of lexemes that are likely to serve as lexical cues of metaphoricity: we use an algorithm for evaluating the importance of lexical data points for classification and filter them by the variance of their distribution in the corpus. The resulting list contains 230 lexemes; we show that nouns are likely to bear greater importance as lexical cues of metaphoricity, since they are overrepresented in the list (as compared to the entire corpus).
We demonstrate that metaphor association of lexemes correlates with their concreteness indexes: lexemes with stronger metaphor association tend to be more abstract, while words that are associated with non-metaphoric contexts appear to be more concrete.
We outline the direction for future research in which combination of lexical metaphor association indexes and concreteness indexes can potentially be used for induction of conceptual metaphor mappings from the corpus. Sequential clustering of metaphor association and concreteness indexes yields clusters which vary in metaphor association, within which we separate abstract and concrete vocabulary. By adding statistical co-occurrence indexes it may be possible to construct a weighted directed graph in which nodes (lexemes) are connected by weighted edges (where weight is defined as the index of co-occurrence of the lexemes), and the edges are directed from lexemes with high metaphor association to lexemes with low metaphor association, and from abstract words to concrete ones.
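One possible reading of the proposed graph is sketched below. The text gives two direction rules (high to low metaphor association, abstract to concrete) without saying how to resolve conflicts between them, so this sketch keeps an edge only when both rules agree; that choice, the data layout, and all names and values are assumptions made for illustration.

```python
def build_mapping_graph(cooccurrence, association, concreteness):
    """Return {(source, target): weight} for pairs where both direction rules agree.

    cooccurrence: {(word1, word2): co-occurrence index}, one entry per unordered pair
    association:  {word: metaphor association index}
    concreteness: {word: concreteness index, higher = more concrete}
    """
    edges = {}
    for (w1, w2), weight in cooccurrence.items():
        for src, dst in ((w1, w2), (w2, w1)):
            higher_association = association[src] > association[dst]
            more_abstract = concreteness[src] < concreteness[dst]
            if higher_association and more_abstract:
                edges[(src, dst)] = weight
    return edges


# Hypothetical indexes for four lexemes.
toy_association = {"idea": 0.5, "stone": -0.4, "dream": 0.3, "hope": 0.1}
toy_concreteness = {"idea": 0.1, "stone": 0.9, "dream": 0.2, "hope": 0.1}
toy_cooccurrence = {("stone", "idea"): 0.4, ("dream", "hope"): 0.7}
toy_graph = build_mapping_graph(toy_cooccurrence, toy_association, toy_concreteness)
```

In the toy data the pair idea/stone yields an edge from "idea" to "stone", while dream/hope is dropped because the two rules point in opposite directions.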
We offer the hypothesis that the efficiency of the classifier based on distributional semantic similarity is likely to be related to the semantic homogeneity of one or both of the subcorpora (the metaphoric and the non-metaphoric).
We show that the efficiency of the morphosyntactic classifier on the datasets of the individual target verbs correlates with the degree to which the grammatical categories with high and low metaphor association are juxtaposed to each other in the corpus. We measure this degree of juxtaposition as the variance and slope of the curve formed by the metaphor association indexes of grammatical categories.
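The summary does not spell out how the curve is built, so the sketch below rests on an assumption: the association indexes of the grammatical categories are sorted in descending order, and the slope is that of a least-squares line fitted against rank, alongside the population variance of the values. The function name and the sample values are hypothetical.

```python
from statistics import pvariance


def juxtaposition(indexes):
    """Return (variance, slope) for association indexes sorted in descending order."""
    values = sorted(indexes, reverse=True)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two categories")
    mean_rank = (n - 1) / 2
    mean_value = sum(values) / n
    # Least-squares slope of value against rank 0, 1, ..., n-1.
    numerator = sum((r - mean_rank) * (v - mean_value) for r, v in enumerate(values))
    denominator = sum((r - mean_rank) ** 2 for r in range(n))
    return pvariance(values), numerator / denominator


# Hypothetical indexes for four grammatical categories.
toy_variance, toy_slope = juxtaposition([0.6, -0.6, 0.2, -0.2])
```

A larger variance and a steeper (more negative) slope would both indicate that the categories are more sharply divided into metaphor-associated and non-metaphor-associated ones.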
We show that the classifier operating on morphosyntactic features relies on idiosyncratic patterns of morphosyntactic combinability licensed by individual target verbs; therefore, the morphosyntactic feature does not generalize well across the datasets.
List of references of the dissertation research, Candidate of Sciences Yulia Gennadyevna Badryzlova, 2019
The following papers have been published on the topic of the present thesis; three papers are devoted to computational metaphor identification and two to annotation of metaphor in corpus:
1. Badryzlova, Y. (2017). Opy't korpusnogo modelirovaniya faktorov metaforichnosti na primere russkix glagolov [A corpus-based study of factors and models of metaphoricity: evidence from Russian verbs]. Computational Linguistics and Intellectual Technologies, 2, 30-44. Moscow.
2. Badryzlova, Y., & Lyashevskaya, O. (2017). Metaphor Shifts in Constructions: The Russian Metaphor Corpus. The 2017 AAAI Spring Symposium Series: Technical Reports, 127-130.
3. Badryzlova, Y., Lyashevskaya, O., & Panicheva, P. (2019). Computer and metaphor: when lexicon, morphology, punctuation, and other beasts fail to predict sentence metaphoricity. Cognitive Studies of Language. Integrative Processes in Cognitive Linguistics, 37, 609-615. Nizhny Novgorod.
4. Badryzlova, Y., & Panicheva, P. (2018). A Multi-feature Classifier for Verbal Metaphor Identification in Russian Texts. Conference on Artificial Intelligence and Natural Language, 23-34. Springer.
5. Panicheva, P., & Badryzlova, Y. (2017). Distributional semantic features in Russian verbal metaphor identification. Computational Linguistics and Intellectual Technologies, 1, 179-190. Moscow.
