Interestingness indices of closed descriptions in data analysis and knowledge discovery — topic of the dissertation and author's abstract, VAK RF specialty 05.13.17, Candidate of Sciences Tatiana Pavlovna Makhalova

  • Tatiana Pavlovna Makhalova
  • Candidate of Sciences
  • 2020, National Research University Higher School of Economics
  • VAK RF specialty: 05.13.17
  • Number of pages: 92
Makhalova, Tatiana Pavlovna. Interestingness indices of closed descriptions in data analysis and knowledge discovery: Candidate of Sciences dissertation: 05.13.17 — Theoretical Foundations of Computer Science. National Research University Higher School of Economics, 2020. 92 p.

Table of contents of the dissertation — Candidate of Sciences Tatiana Pavlovna Makhalova

Contents

Introduction

1 Basic Definitions

1.1 Formal Concept Analysis

1.2 Pattern Structures and Projections

1.2.1 Pattern Structures on Graphs

1.2.2 Interval Pattern Structures

1.2.3 Projections

1.3 Basics of Information Theory. Minimum Description Length Principle in Pattern Mining

2 Main results

2.1 Static Approaches

2.1.1 Systematisation of Interestingness Measures

2.1.2 Generic Approach to Combination (Adapting Arbitrary Indices for Closed Itemset Assessment)

2.1.3 Experimental Evaluation of Interestingness Measures

2.2 Interestingness Measures for Information Retrieval

2.3 Dynamic Approaches

2.3.1 Combining Static and Dynamic Approaches to Pattern Mining

2.3.2 Numerical Pattern Mining Through Compression

3 Conclusion

References

Appendices

A Article "On Interestingness Measures of Formal Concepts"

B Article "Information Retrieval Chatbots Based on Conceptual Models"

C Article "MDL for FCA: is there a Place for Background Knowledge?"

D Article "Numerical Pattern Mining Through Compression"

Introduction of the dissertation (part of the author's abstract) on the topic «Interestingness indices of closed descriptions in data analysis and knowledge discovery»

Introduction

Dissertation relevance. Over the past decades, the models built by methods of Data Mining (DM) and Knowledge Discovery (KD) have become ubiquitous in our everyday life. Nowadays, these models are successfully applied in a wide variety of areas: from routine tasks to highly advanced science. Modern technologies allow us to easily collect massive arrays of data of different kinds.

Under these conditions, the most pressing problem is data summarisation, i.e., computing a succinct and lossless data description. The objects that provide this description are called patterns. In a broad sense, patterns are used for both data description and prediction (Figure 1). In a narrow sense, a pattern refers to a concept that describes "interesting" repetitive fragments in data. The corresponding field of Data Science is called Pattern Mining (PM).

PM takes an important place in KD and DM. Its objective is to compute a small set of non-redundant interesting patterns that (almost) entirely describe the dataset. The main challenges are the following: (i) ensuring mutual diversity of patterns in a pattern set, (ii) improving descriptiveness of pattern sets (reducing the number of patterns while retaining their descriptiveness), (iii) improving interpretability of the existing methods (the more advanced a method is, the harder the interpretation of its results), (iv) reducing the computational complexity of algorithms and methods.

Related work. The central notion of PM is "interestingness". Interestingness is a rather subjective notion. There are two main approaches to formalising it, namely static and dynamic [2].

In static approaches, patterns are mined under fixed assumptions about interestingness. The large majority of static approaches use interestingness measures [3] to evaluate patterns. For example, in frequency-based PM, one assumes that patterns whose frequency exceeds a minimum threshold are interesting. Other popular measures rely on independence assumptions [4], randomisation models [5] or other constraints [6, 7]. Usually, mined patterns are easy to interpret, since the assumptions on interestingness are embedded into the measure. The main drawbacks are the following: (i) instead of computing an "interesting" set of most interesting patterns, these approaches return a set of individually interesting patterns, which are often very similar (lack of diversity); (ii) a single interestingness measure provides only a narrow insight, so several measures should be combined to get a comprehensive description of the data.

Fig. 1: Patterns in Data Mining [1]
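To illustrate the static setting, here is a minimal Python sketch on a made-up toy dataset (not data from the dissertation) that evaluates candidate itemsets with two classical measures, support and an independence-based lift, and shows that different measures rank the same patterns differently — which is exactly why combinations of measures are of interest.

```python
from itertools import combinations

# Toy transactional dataset: each row is a set of items (purely illustrative).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def support(itemset, data):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in data) / len(data)

def lift(itemset, data):
    """Joint support divided by the product of single-item supports
    (an independence-based interestingness measure)."""
    independent = 1.0
    for item in itemset:
        independent *= support({item}, data)
    return support(itemset, data) / independent if independent else 0.0

items = sorted(set().union(*transactions))
candidates = [frozenset(c) for k in (2, 3) for c in combinations(items, k)]

# Static approach: fix a measure and a threshold, then keep what passes.
min_support = 0.4
for itemset in candidates:
    s = support(itemset, transactions)
    if s >= min_support:
        print(sorted(itemset), f"support={s:.2f}", f"lift={lift(itemset, transactions):.2f}")
```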

Dynamic approaches were proposed to solve, or at least to mitigate, these issues. The main idea is to fix initial assumptions, e.g., row or column marginals, and then gradually (progressively) extend a pattern set by selecting the most "promising" patterns w.r.t. the current pattern set and dynamic constraints. Once a pattern has been discovered, the constraints (interestingness criteria) are updated. A wide range of existing dynamic approaches is based on the Minimum Description Length principle (MDL) [8-10]. MDL-based approaches allow for computing a succinct, non-redundant set of patterns. Since description length minimisation is at least as hard as NP-complete problems, the application of MDL requires efficient heuristics. As a consequence, it is not always easy to explain why particular patterns are chosen as interesting.
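For intuition, the following sketch computes a Krimp-style two-part description length L(model) + L(data | model): transactions are covered greedily by code-table patterns, and adding a useful pattern decreases the total length, which is the quantity such dynamic approaches try to minimise. The encoding is a crude, hypothetical approximation for illustration only, not the actual encodings used in [8-10].

```python
import math

def code_length(count, total):
    """Shannon-optimal code length (in bits) for a code used `count` times."""
    return -math.log2(count / total)

def two_part_mdl(transactions, patterns):
    """Crude two-part description length: L(model) + L(data | model).

    Every transaction is covered greedily, largest applicable pattern first;
    leftover items are covered by singleton patterns.
    """
    singletons = [frozenset({i}) for i in set().union(*transactions)]
    code_table = sorted(set(patterns), key=len, reverse=True) + singletons

    usages = {p: 0 for p in code_table}
    for t in transactions:
        rest = set(t)
        for p in code_table:
            if p and p <= rest:
                usages[p] += 1
                rest -= p

    total_usage = sum(usages.values())
    # L(data | model): length of the code stream describing all transactions.
    data_cost = sum(u * code_length(u, total_usage) for u in usages.values() if u > 0)
    # L(model): naively, one bit per item occurrence in the used part of the code table.
    model_cost = sum(len(p) for p, u in usages.items() if u > 0)
    return model_cost + data_cost

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
print(two_part_mdl(transactions, []))                        # baseline: singletons only
print(two_part_mdl(transactions, [frozenset({"a", "b"})]))   # one extra pattern compresses better
```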

Thus, static approaches are easily adaptable to an expert's understanding of interestingness and provide good interpretability, while dynamic approaches allow for computing a small set of diverse and non-redundant patterns.

The vast majority of the existing static and dynamic approaches deal with binary (or categorical) data. Patterns in binary data are also called itemsets. Over the past decades, a large number of approaches to itemset mining have been proposed. However, the problem of mining complex structured data, such as graphs, sequences, and syntax trees, remains much less explored. One way to deal with complex data is to binarise it. Binarisation can cause information loss and an explosion of the attribute space. As a consequence, when dealing with binarised data, excessive computation and pattern redundancy (i.e., semantically similar patterns described by different attributes) are typical problems of PM. To tackle these issues, Pattern Structures [11] have been proposed. Pattern Structures allow for mining patterns directly from complex data, without prior binarisation (Section 1.2).
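As a small illustration of how Pattern Structures avoid binarisation, the sketch below (with made-up numbers) computes the similarity of two numerical descriptions in an interval pattern structure: the meet of two tuples of intervals is their componentwise convex hull, i.e., the most specific interval description covering both objects.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

def meet(d1: List[Interval], d2: List[Interval]) -> List[Interval]:
    """Similarity of two interval descriptions: the componentwise convex hull.

    In an interval pattern structure the meet of [a, b] and [c, d] is
    [min(a, c), max(b, d)] -- the smallest interval covering both.
    """
    return [(min(a, c), max(b, d)) for (a, b), (c, d) in zip(d1, d2)]

# Two objects described by two numerical attributes (hypothetical values).
g1 = [(5.0, 5.0), (7.5, 7.5)]   # object 1: attribute values 5.0 and 7.5
g2 = [(6.0, 6.0), (7.0, 7.0)]   # object 2: attribute values 6.0 and 7.0

print(meet(g1, g2))  # [(5.0, 6.0), (7.0, 7.5)] -- a pattern covering both objects
```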

From the computational point of view, the main challenge of PM is the exponential explosion of patterns. For a dataset composed of objects G and attributes M, the pattern search space has size 2^(|G|+|M|). For binary datasets the size is much smaller, but still exponential: it is bounded by 2^min(|G|,|M|).

The common approach to reducing the pattern search space is to compute only frequent itemsets [12]. An itemset is called δ-frequent in a dataset D if its support (i.e., the proportion of objects that contain the itemset) is greater than or equal to a given minimum support δ. However, the set of frequent patterns is redundant (it contains a lot of similar patterns).
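A minimal sketch of this definition on a hypothetical binary dataset: a simplified level-wise enumeration (in the spirit of Apriori, without the full candidate-pruning step) keeps only itemsets with support at least δ, so supersets of infrequent itemsets are never generated.

```python
# Hypothetical binary dataset: each object is described by the set of attributes it has.
data = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c", "d"},
    {"b", "c"},
]
delta = 0.5  # minimum support threshold

def support(itemset):
    """Proportion of objects containing the itemset."""
    return sum(itemset <= row for row in data) / len(data)

# Level-wise enumeration: only frequent k-itemsets are joined to build (k+1)-candidates,
# exploiting the anti-monotonicity of support.
items = sorted(set().union(*data))
frequent = {1: {frozenset({i}) for i in items if support({i}) >= delta}}
k = 1
while frequent[k]:
    candidates = {a | b for a in frequent[k] for b in frequent[k] if len(a | b) == k + 1}
    frequent[k + 1] = {c for c in candidates if support(c) >= delta}
    k += 1

for level, itemsets in frequent.items():
    for s in itemsets:
        print(level, sorted(s), support(s))
```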

To reduce the redundancy of frequent patterns, many different pattern types have been proposed [13], e.g., closed patterns [14-19], itemsets induced by disjunction-free generators [20,21], margin-closed [22], δ-free [23], non-derivable [24], etc.

Among them, closed itemsets are of particular importance in DM and KD, since they can be used both for concise representation of association rules and for construction of domain taxonomies. They rest on a solid algebraic foundation and ensure a lossless data description [8,25-27]. In this study, closed itemsets are introduced within the framework of Formal Concept Analysis (FCA, Section 1.1). The aforementioned Pattern Structures are an extension of FCA for dealing with more complex data, e.g., numerical intervals, graphs, and sequences. We sketch the foregoing in Figure 2.

Fig. 2: A brief summary of Pattern Mining.
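To make the notion of a closed itemset concrete, here is a minimal FCA-style sketch on a made-up formal context: the closure of an itemset is obtained by double derivation (itemset → objects containing it → attributes common to those objects), and an itemset is closed iff it equals its own closure.

```python
# A small formal context: objects and the attributes they have (illustrative only).
context = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b"},
    "g3": {"a", "c"},
}

def extent(itemset):
    """Objects whose descriptions contain every attribute of the itemset."""
    return {g for g, attrs in context.items() if itemset <= attrs}

def intent(objects):
    """Attributes shared by all given objects (all attributes if the object set is empty)."""
    attrs = [context[g] for g in objects]
    return set.intersection(*attrs) if attrs else set.union(*context.values())

def closure(itemset):
    """Double derivation: the largest itemset with the same supporting objects."""
    return intent(extent(set(itemset)))

print(closure({"b"}))        # {'a', 'b'} -- {'b'} is not closed
print(closure({"a", "b"}))   # {'a', 'b'} -- closed
```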

Object and goals of the dissertation. The goal of this study is to develop approaches to PM that ensure

- minimal redundancy (diversity) of patterns;

- interpretability (allow for explaining why a particular pattern set has been obtained);

- informativeness (patterns carry valuable, non-obvious information for an expert);

- succinctness (a small number of patterns is used to describe the important information);

- generalisability (the ability to adapt a PM approach to data of a more complex structure).

To achieve this goal, the stages listed below have to be completed.

- Studying the existing static approaches to PM, as they allow a natural embedding of the understanding of interestingness (background knowledge) into the PM process; this, in turn, improves the interpretability of the results.

- Conducting a comparative study of the existing static approaches in order to identify the best combinations of interestingness measures that ensure an exhaustive data description, i.e., allow for computing a small set of easily computable measures that characterise different aspects of data and return an informative data description.

- Performing an experimental study to identify the approaches that are tolerant to poor-quality data, i.e., are able to discover patterns in noisy data.

- Developing a PM method that combines the advantages of static and dynamic approaches, i.e., the method should support embedding of background knowledge and return a succinct and non-redundant set of patterns.

- Exploiting the framework of FCA and PS to ensure not only a compact and lossless pattern description, but also a natural generalisation of the developed methods to deal with complex data.

Efficiency improvement by means of parallelisation techniques or by adapting existing approaches to use them in the distributed computing environment remains beyond the scope of this study.

Scientific novelty. The main contributions of the work are the following.

- The comparative analysis of the existing static approaches allowed us to define the main characteristics of interestingness measures and to propose an FCA-based systematisation of them.

- By means of experimental evaluation, groups of similar and noise-tolerant measures have been identified, and the best combinations (of uncorrelated measures with low computational complexity) have been proposed.

- A new approach that incorporates interestingness measures into MDL-based PM has been proposed.

- A generalisation of an MDL-based method for mining numerical patterns has been proposed.

In this work, we also demonstrate how interestingness measures can be used in Information Retrieval. We propose an IR-chatbot model that uses an interestingness measure to traverse a concept-based knowledge model in order to ensure discovery of information that is relevant to users and consistent with their preferences.
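The sketch below is only a hypothetical illustration of the general idea (not the model proposed in the appended article): starting from a concept node, a traversal step moves to the neighbouring concept with the highest value of a chosen interestingness measure among those consistent with the user's query terms. All node names, scores, and the greedy policy are assumptions made for illustration.

```python
# Hypothetical concept-based knowledge model: each node has an intent (set of terms),
# a precomputed interestingness score, and links to neighbouring nodes.
knowledge_model = {
    "root":           {"intent": set(),                    "score": 0.0, "next": ["laptops", "phones"]},
    "laptops":        {"intent": {"laptop"},               "score": 0.4, "next": ["gaming laptops"]},
    "phones":         {"intent": {"phone"},                "score": 0.3, "next": []},
    "gaming laptops": {"intent": {"laptop", "gaming"},     "score": 0.7, "next": []},
}

def traverse(query_terms, start="root"):
    """Greedy walk: at each step follow the most 'interesting' neighbour
    whose intent is consistent with the user's query terms."""
    path, node = [start], start
    while True:
        matching = [n for n in knowledge_model[node]["next"]
                    if knowledge_model[n]["intent"] <= query_terms]
        if not matching:
            return path
        node = max(matching, key=lambda n: knowledge_model[n]["score"])
        path.append(node)

print(traverse({"laptop", "gaming", "cheap"}))  # ['root', 'laptops', 'gaming laptops']
```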

The structure of the dissertation is outlined in Figure 3.

Practical value. The results of this study allow for (i) improving qualitative characteristics of pattern sets computed by static and dynamic PM approaches; (ii) mining patterns directly from numerical data; (iii) extending existing NLP tools [28] with the proposed algorithm for adaptive search in a concept-based knowledge model.

Conclusion of the dissertation on the speciality «Theoretical Foundations of Computer Science», Tatiana Pavlovna Makhalova

5 Conclusion

In this paper we presented a new approach to numerical Pattern Mining that returns a small subset of patterns with short descriptions, thus providing a compression technique. The obtained pattern sets are non-redundant, well interpretable (i.e., small and short) and can be used to identify, in an unsupervised manner, the true classes of objects.

The proposed approach can be adapted to handle more complex data, such as sequences, trees, graphs, etc., by changing the Pattern Structure descriptions and the respective similarity (⊓) operators. This extension, as well as the relationship of the proposed approach to R-trees [18], will be the subject of further research.

Bibliography of the dissertation research — Candidate of Sciences Tatiana Pavlovna Makhalova, 2020

References

[1] Heikki Mannila, "Theoretical frameworks for data mining," ACM SIGKDD Explorations Newsletter, vol. 1, no. 2, pp. 30-32, 2000.

[2] Sergei O. Kuznetsov and Tatiana Makhalova, "On interestingness measures of formal concepts," Information Sciences, vol. 442, pp. 202-219, 2018.

[3] Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana, Li Wei, Sang-Hee Lee, and John Handley, "Compression-based data mining of sequential data," Data Mining and Knowledge Discovery, vol. 14, no. 1, pp. 99-129, 2007.

[4] Eamonn Keogh, Li Keogh, and John C Handley, "Compression-based data mining," in Encyclopedia of Data Warehousing and Mining, Second Edition, pp. 278-285. IGI Global, 2009.

[5] Li Wei, John Handley, Nathaniel Martin, Tong Sun, and Eamonn Keogh, "Clustering workflow requirements using compression dissimilarity measure," in Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on. IEEE, 2006, pp. 50-54.

[6] Jilles Vreeken, Matthijs Van Leeuwen, and Arno Siebes, "Krimp: mining itemsets that compress," Data Mining and Knowledge Discovery, vol. 23, no. 1, pp. 169-214, 2011.

[7] Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli, "A first study on what MDL can do for FCA," in CLA, 2018, pp. 25-36.

[8] Tatiana Makhalova, Sergei O. Kuznetsov, and Amedeo Napoli, "MDL for FCA: is there a place for background knowledge?," in CEUR, 2018, pp. 45-56.

[9] Bernhard Ganter and Sergei O. Kuznetsov, "Pattern structures and their projections," in International Conference on Conceptual Structures. Springer, 2001, pp. 129-142.

[10] B. Ganter and R. Wille, "Formal concept analysis: Logical foundations," 1999.

[11] Mehdi Kaytoue, Sergei O. Kuznetsov, and Amedeo Napoli, "Revisiting numerical pattern mining with formal concept analysis," in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, 2011, vol. 22, p. 1342.

[12] Sergei O. Kuznetsov, "On stability of a formal concept," Annals of Mathematics and Artificial Intelligence, vol. 49, no. 1-4, pp. 101-115, 2007.

[13] Andrew Barron, Jorma Rissanen, and Bin Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743-2760, 1998.

[14] Paul M. B. Vitanyi and Ming Li, An introduction to Kolmogorov complexity and its applications, vol. 34, Springer Heidelberg, 1997.

[15] Peter D. Grunwald, In Jae Myung, and Mark A. Pitt, Advances in minimum description length: Theory and applications, MIT Press, 2005.

[16] Thomas M. Cover and Joy A. Thomas, Elements of information theory, John Wiley & Sons, 2012.

[17] Dua Dheeru and Efi Karra Taniskidou, "UCI machine learning repository," 2017.

[18] Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos, and Yannis Theodoridis, R-trees: Theory and Applications, Springer Science & Business Media, 2010.
