Randomized algorithms based on interval pattern structures for classification and regression problems in credit risk management: dissertation and abstract topic under VAK RF specialty 05.13.18, Candidate of Sciences Masyutin, Alexey Alexandrovich
- VAK RF specialty: 05.13.18
- Number of pages: 102
Table of contents of the dissertation, Candidate of Sciences Masyutin, Alexey Alexandrovich
1 Introduction
1.1 Ph.D. Thesis Relevance
2 Overview of Data Analysis Methods in Commercial Banks
2.1 Mathematical Modeling in Commercial Banks
2.2 Neural Networks in Credit Scoring
2.3 Classification Tasks in Marketing Campaign Management
2.4 Loan Default Prediction in Banking: Scorecards
3 Formal Concept Analysis in Classification Problem
3.1 Formal Concept Analysis
3.2 Lazy Classification with Pattern Structures
3.3 Query-Based Classification Algorithm
3.4 Voting Schemes
3.5 Experiments with Top-10 Bank Data
3.6 Experiments with open data
3.7 QBCA. Alternative approaches
3.8 Interpretability: visualization of premises
3.9 Computational time analysis
4 FCA in regression problem
4.1 Problem description
4.2 Augmented interval pattern structures
4.3 Query-based regression algorithm with continuous target attribute
4.4 Data and experiments
5 Conclusion
6 Acknowledgements
1.1 Ph.D. Thesis Relevance
Although the largest recent bank failures are mostly attributed to the inability to predict key market factors and to a lack of banking regulation (the financial crisis of 2007-2008), history knows a number of failures driven purely by credit risk mismanagement. For instance, the Long-Term Credit Bank of Japan (LTCB) was one of the top three banks in Japan responsible for postwar economic growth, and in 1989 it was considered the 9th largest company in the world by asset value. At the time LTCB had more than $19.2 billion in bad debt. In 1998, the Japanese government nationalized LTCB and then restructured it as a commercial bank named Shinsei Bank. The Bank of New England (BNE), along with its two sister banks, Maine National Bank and Connecticut Bank and Trust, failed on January 6, 1991. BNE was the largest bank in the New England area. Together with its sister banks, it had assets totaling $21.8 billion and deposits totaling $19 billion. Bad loans led to its downfall. Los Angeles-based IndyMac used to be the largest loan originator in the USA. Founded in 1985 as Countrywide Mortgage Investment, IndyMac fueled its aggressive growth through risky loan products such as Alt-A mortgages, concentrating on inflated real estate markets such as California and Florida and relying heavily on borrowed funds, especially from the Federal Home Loan Bank (FHLB). As of July 2008, IndyMac had total assets of $32.01 billion. Moreover, the financial crisis of 2007-2008 was itself rooted in inadequate credit risk assessment of mortgage loans.
As far as the Russian financial market is concerned, in 2016-2017 several banks proved unable to control the quality of their loan portfolios [86], [87], [88]. The Central Bank of Russia strengthened its focus on bank asset quality control and installed its own management teams on the executive boards of problem banks. To a considerable extent, the instability is caused by inconsistent risk management, which leads to significant losses. The greatest part of losses in Russian banks (70%) is due to credit risk. Credit risk is the risk that borrowers will not repay the granted loan amount on time. The first step in managing credit risk is the ability to assess it. In the context of credit risk assessment there are three key parameters: probability of default (PD), loss given default (LGD), and exposure at default (EAD) [89]. Multiplied together, they provide an estimate of expected loss (EL). The majority of decisions in the credit process, such as whether to grant a loan, sell the loan, or initiate a legal bankruptcy procedure, are based on expected loss estimates.
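The expected loss relation EL = PD × LGD × EAD can be illustrated with a minimal Python sketch. The function names, the input checks, and the sample figures below are illustrative assumptions and are not taken from the thesis; in particular, LGD is assumed here to be expressed as a share between 0 and 1.

```python
def expected_loss(pd: float, lgd: float, ead: float) -> float:
    """Expected loss of a single exposure: EL = PD * LGD * EAD."""
    if not 0.0 <= pd <= 1.0:
        raise ValueError("PD must be a probability in [0, 1]")
    if not 0.0 <= lgd <= 1.0:
        # Assumption of this sketch: LGD is a share of the exposure.
        raise ValueError("LGD is assumed to be a share in [0, 1]")
    if ead < 0.0:
        raise ValueError("EAD must be non-negative")
    return pd * lgd * ead


def portfolio_expected_loss(loans):
    """Sum of expected losses over (PD, LGD, EAD) triples."""
    return sum(expected_loss(pd, lgd, ead) for pd, lgd, ead in loans)


# Hypothetical two-loan portfolio (made-up numbers).
loans = [
    (0.02, 0.45, 100_000.0),  # EL of about 900
    (0.10, 0.60, 20_000.0),   # EL of about 1200
]
total = portfolio_expected_loss(loans)  # about 2100
```

A decision rule such as "grant the loan only if the expected loss is below the expected margin" would then compare `expected_loss(...)` with a threshold chosen by the bank.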
Mathematical models have been widely used to make precise predictions of the level of PD, EAD, and LGD [1]. Models are usually calibrated on historical data on borrower performance. From a data science standpoint, PD estimation is a binary classification problem, while EAD and LGD estimation are regression problems. As the banking industry has become more and more regulated, the requirements for the development and validation of mathematical models have become stricter and more detailed [89].
One of the serious trade-offs in credit risk modeling is prediction accuracy versus model interpretability. As will be shown later, some regulators require banks to be able to provide reject reasons to borrowers; in addition, when central banks examine a bank's models, they want to understand the economic intuition behind them in order to be assured that the models will show the expected, stable performance. These requirements can typically be met if the model is interpretable. At the same time, interpretable data analysis algorithms usually belong to the simplest class, such as logistic regression or decision trees, which cannot always provide the desired accuracy. We will give an overview of applications of more complex algorithms in credit scoring, such as neural networks, which are capable of describing non-linear interdependencies within the data but cannot provide the bank with a reason for rejecting or accepting a loan application.
As stated above, accurate credit risk estimation is the key tool of risk management, and banks are naturally eager to increase the accuracy of their algorithms while keeping them interpretable.
The relevance of this Ph.D. thesis is that it offers data analysis algorithms whose accuracy is superior to that of the simple algorithms widely adopted in banks (such as logistic regression, decision trees, and scorecards) and which still remain interpretable, in the sense that they provide a decision maker with a set of rules applicable to the assessment of borrower creditworthiness.
To achieve this goal, several novelties were introduced into the methods of formal concept analysis (FCA) and interval pattern structures [51]. The reasons why FCA methods are suitable for credit risk assessment under interpretability requirements are explained in the following sections.
The novelty brought to the well-developed tools of FCA consists of two parts. The first is that FCA is adapted to classification problems on numerical data, with the step of concept lattice construction omitted (query-based, or "lazy", classification). This allows one to work with datasets containing an arbitrary number of observations, which is vital for banks since historical data is typically large.
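The idea of lazy classification without building a concept lattice can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of the thesis: the interval pattern of a set of numeric rows is taken as the per-feature [min, max] box, random subsamples of one class are intersected with the query object, and a pattern counts as a premise if it covers no more than `max_counterexamples` objects of the opposite class. All function names, parameters, the plain majority vote, and the toy data are hypothetical.

```python
import random


def meet(rows):
    """Interval pattern of numeric rows: per-feature (min, max)."""
    return [(min(col), max(col)) for col in zip(*rows)]


def covers(pattern, row):
    """True if every value of `row` lies inside its interval."""
    return all(lo <= x <= hi for (lo, hi), x in zip(pattern, row))


def count_premises(query, same_class, other_class, n_iter=50,
                   sample_size=2, max_counterexamples=0, seed=0):
    """Count random patterns that contain `query` and are not falsified
    by more than `max_counterexamples` rows of the other class."""
    rng = random.Random(seed)
    premises = 0
    for _ in range(n_iter):
        k = min(sample_size, len(same_class))
        sample = rng.sample(same_class, k)
        pattern = meet(sample + [query])
        falsified = sum(covers(pattern, r) for r in other_class)
        if falsified <= max_counterexamples:
            premises += 1
    return premises


def classify(query, positive, negative, **kwargs):
    """Simple vote: the class with more surviving premises wins."""
    pos_votes = count_premises(query, positive, negative, **kwargs)
    neg_votes = count_premises(query, negative, positive, **kwargs)
    label = "positive" if pos_votes > neg_votes else "negative"
    return label, pos_votes, neg_votes


# Toy data with two numeric features (made-up values).
positive = [(1.0, 0.9), (1.2, 1.1), (0.9, 1.0), (1.3, 1.2)]
negative = [(5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
label, pos_votes, neg_votes = classify((1.1, 1.0), positive, negative)
```

Only the patterns generated for the query are ever computed, so the cost depends on the number of iterations and the size of the training data rather than on the size of a full concept lattice. Each surviving pattern is a box of feature intervals, which can be read as a rule.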
The second is that we introduce a modification of the FCA method based on interval pattern structures which allows one to solve regression problems. To our knowledge, FCA methods had not previously been applied to this type of data analysis problem. The crucial difference in a regression problem is that the target variable is continuously distributed.
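One way to picture how interval patterns could be used with a continuous target is the following Python sketch. It is an illustrative assumption and does not reproduce the augmented interval pattern structures of the thesis: for each random subsample, the per-feature [min, max] box spanned by the subsample and the query is built, the target values of all training objects inside the box are averaged, and the final prediction is the mean of these local averages. The function name, parameters, and toy data are hypothetical.

```python
import random
from statistics import mean


def predict(query, X, y, n_iter=100, sample_size=2, seed=0):
    """Average the target over random interval patterns around `query`."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iter):
        k = min(sample_size, len(X))
        chosen = rng.sample(range(len(X)), k)
        rows = [X[i] for i in chosen] + [query]
        # Interval pattern: per-feature (min, max) of the rows.
        pattern = [(min(col), max(col)) for col in zip(*rows)]
        # Targets of all training objects covered by the pattern;
        # the sampled objects are always covered, so this is non-empty.
        inside = [
            y[i] for i, row in enumerate(X)
            if all(lo <= v <= hi for (lo, hi), v in zip(pattern, row))
        ]
        estimates.append(mean(inside))
    return mean(estimates)


# Toy data: one numeric feature and a made-up LGD-like target.
X = [(1.0,), (2.0,), (3.0,), (10.0,)]
y = [0.1, 0.2, 0.3, 0.9]
lgd_hat = predict((2.0,), X, y)
```

Because each local average is taken over a box of feature intervals, every contribution to the prediction can again be read as an interval rule.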
The goal of this thesis is to provide PD and LGD estimation algorithms that remain interpretable. At the same time, the methods should provide higher accuracy than the basic algorithms widespread in the banking industry (such as logistic regression, decision trees, and scorecards). One should also note that PD and LGD are the main drivers of EL, since EAD is modeled only for revolving loans such as credit cards and credit tranches [1].
The work consists of a general overview of data analysis algorithms and mathematical models in banking; definitions of FCA terms; a detailed description of the modifications of FCA and interval pattern structures; data, benchmarks, and experimental results; and an appendix with an overview of the software implementation of the discussed algorithms.
Conclusion of the dissertation on the topic "Mathematical modeling, numerical methods and software packages", Masyutin, Alexey Alexandrovich
• First and foremost, I want to thank my scientific advisor, Doctor of Science, Prof. Sergei O. Kuznetsov. It has been an honor to be his Ph.D. student. He has taught me, both consciously and unconsciously, entirely new ways to look at common data analysis problems. I appreciate all his contributions of time and ideas that made my Ph.D. experience productive;
• This work would not have been possible without Yury Kashnitsky, who gave me a sound introduction to the formal concept analysis tool set and who has become one of my co-authors and a contributor to my work;
• I also thank Alexander Ageev, currently a master's student at HSE, who has helped me a lot with additional data experiments and with his own Python implementation of the algorithms;
• I thank Ivan Medvedev and Evgeny Zinchenko, who became my first boss and mentor in the area of risk management at RCI Banque and who sparked my interest in risk modeling;
• I give special thanks to my current boss, Roman Tikhonov, Head of the Validation department at Sberbank, who has given me the opportunity to devote considerable time to the Ph.D. thesis, including academic work and conference attendance;
• Last but not least, I would like to thank my family: my parents, my brother and my sweetheart, for their unconditional support and their understanding of the lack of attention on my part.
