Research on Developing Methods for Countering Fraud in Financial Organizations Using Machine Learning. Dissertation and abstract, VAK RF specialty 00.00.00, Candidate of Sciences Ivan Aleksandrovich Vorobyev
- VAK RF specialty: 00.00.00
- Number of pages: 103
Table of contents of the dissertation
Glossary
1. Introduction
1.1. Problem statement and relevance of the research
1.2. Aim and objectives of the dissertation
1.3. State of research on the topic
1.4. Scientific novelty of the research
2. Main results
2.1. Main results submitted for defense
2.2. Personal contribution of the author
3. Publications and presentation of the work
3.1. High-level publications
3.2. Standard-level publications
3.3. Conference and seminar presentations
4. Contents of the work
4.1. The use of machine learning methods in fraud detection and approaches to assessing their effectiveness
4.2. Architecture of fraud-monitoring systems and potential areas for their improvement
4.3. Preparing data for training classifiers of fraudulent operations and claims
4.4. Generating new algorithms to improve the quality of the fraud-monitoring system
4.5. Experimental methodology and results
4.5.1. Correcting the class labels of transactions
4.5.2. Using data of a different nature in the feature space
4.5.3. Using data extracted from a graph in the feature space
4.5.4. Tuning algorithms in the decision-making system
5. Conclusion
References
Appendix A. Article "Reducing false positives in bank anti-fraud systems based on rule induction in distributed tree-based models"
Appendix B. Article "Fraud risk assessment in car insurance using claims graph features in machine learning" (under review)
Appendix C. Article "Machine learning methods in the problem of fraud risk assessment in motor insurance" (accepted for publication)
Appendix D. Article "A Hybrid Machine Learning Framework for E-commerce Fraud Detection"
Glossary
Fraud monitoring - an automated system for analyzing transactions, aimed at preventing fraudulent activity and protecting users' funds and personal data.
Effectiveness of a fraud detection method - a set of special characteristics for assessing the method's ability to prevent fraudulent actions. In this study the term may be applied to a specific algorithm or to a fraud-monitoring system.
Machine learning - methods based on discovering empirical regularities in data. Such methods are developed using mathematical statistics, numerical methods, mathematical analysis, optimization methods, probability theory, and graph theory.
Feature - a characteristic or attribute that describes an object or data. Features represent information about objects and serve as the basis for training machine learning models.
Data classification - the process of assigning objects to categories or classes on the basis of certain features.
Data enrichment - the process of adding further information or attributes to existing data. It may involve the use of external data sources and the analysis and transformation of data.
Banking operation - a financial transaction between a bank and its customers. It covers various kinds of actions, such as money transfers, opening and closing accounts, issuing loans, repaying debts, and so on.
Insurance claim - a request or demand made by a policyholder to an insurance company when an insured event occurs. With an insurance claim, the policyholder asks the insurer to compensate losses, cover expenses, or pay an insurance indemnity under the terms of the insurance contract.
1. Introduction
1.1. Problem statement and relevance of the research
This dissertation addresses the problem of making financial organizations more resilient to fraud attacks that target both customer assets and the assets of the organizations themselves. Such attacks are referred to as financial fraud, or simply fraud. In most cases, attackers carry them out to obtain the funds of customers or organizations illegally. As emphasized in [1], financial fraud is a significant problem because it damages both the economy of organizations and the economy of the state; minimizing the consequences of fraudsters' activity is therefore a priority for the main participants in the financial sector: banks and insurance companies.
Advances in data storage and processing technologies have allowed financial organizations to keep records of transactions, customer data, and other information in internal databases. It has become possible not only to accumulate this information but also to use big data and artificial intelligence (AI) technologies for automated decision-making in various processes, including fraud detection [2]. At the same time, retrospective data-driven analysis of events has become widely used in information security [3]. Most financial institutions have adopted automated systems for analyzing transactions, known as fraud-monitoring systems. Their main purpose is to detect unlawful actions against customers or the organizations themselves.
The survey [4] (Section 6) classifies financial fraud into several types by industry: banking, insurance, telecommunications, and so on. Each industry in turn has subtypes that depend on how the fraud is committed. This study examines the use of machine learning methods to make fraud prevention more effective in two subtypes: bank card fraud in e-commerce and motor insurance fraud.
The main object of the study is machine learning methods whose adaptation and integration into anti-fraud processes will allow financial organizations to be more resilient to the risks associated with fraudulent activity.
Attackers' strategies differ between the subtypes studied, but the problems of applying machine learning methods are similar: the imbalance between the classes of fraudulent and legitimate transactions, examined for example in [5], and the low interpretability of the modeling results produced by these methods [6].
The behavior of fraudsters adds further difficulty to fraud detection. Over time, as attackers try to bypass the security systems of a bank or insurance company, their behavior changes, and the experts of financial organizations do not always keep pace with these changes. In addition, security experts may have their own biases about what constitutes a sign of fraud. For example, experts may assess the same fraud case differently. As a result, incidents labeled as legitimate may in fact be fraudulent, because an expert unfamiliar with a new scheme makes a mistake. For these reasons, the fraud data to which machine learning methods are applied may exhibit so-called concept drift [7], which makes machine learning models unstable over time.
These problems, together with the substantial losses caused by attackers, underline the relevance of improving the effectiveness of algorithms built to prevent fraud.
1.2. Aim and objectives of the dissertation
The main aim of the study is to develop a method that detects fraud in financial organizations more effectively by using machine learning and transactional data.
To achieve this aim, the following objectives were formulated:
1. Develop a method for preparing fraud data that reduces the negative impact on the quality of machine learning algorithms of factors such as changes in fraud scenarios and subjective expert assessment.
2. Develop new transaction attributes that improve the effectiveness of fraud detection.
3. Develop an algorithm that increases the accuracy of the fraud-monitoring system through automatic generation of decision rules.
4. Conduct experiments on real data to assess the fraud detection effectiveness achieved by the proposed methods.
1.3. State of research on the topic
Between 1980 and 1990, research on fraud detection was limited to simple statistical and econometric methods [8-10]. Today such problems are increasingly addressed with artificial intelligence, in particular machine learning methods. Fraud detection methods are of interest to both commercial companies and the scientific community. About 16 thousand scientific papers on the topic were published in 2015; in 2021 the number was 1.5 times higher.
Fraud detection algorithms can be divided into expert and statistical ones. In the first case, fraud is detected by rules that experts create manually from an analysis of typical fraudster behavior. In the second case, statistical methods, including machine learning models, are used to classify operations as fraudulent or legitimate.
According to [11], statistical algorithms are divided into classification, clustering, and graph analysis. Classification algorithms help separate fraudulent from legitimate transactions even when fraudsters disguise their activity as legitimate. Clustering algorithms are worse at recognizing disguised cases, but their advantage is the ability to detect new events indicating fraud that have not yet appeared in historical data. Graph analysis takes into account the relationships between objects in the sample. The three kinds of statistical algorithms focus on different aspects of fraud and complement one another.
From the machine learning perspective, fraud detection is a classification problem with two disjoint classes [12]. The dissertation focuses on improving classification quality by addressing class imbalance and the variability of fraudster behavior, and on creating new feature descriptions of operations (for bank fraud monitoring) and claims (for insurance fraud monitoring) from existing data.
Classification effectiveness depends on the quality of data and features [13,14]. In [13], which compares the classification results of several methods on different data sets, all methods lose effectiveness on data with many non-numeric features. In [14], the authors achieved a larger gain in effectiveness by creating new feature descriptions than by moving from simple, interpretable statistical models to more complex ones.
Some researchers hold that statistical algorithms are more effective. For example, [15] compares fraud detection procedures based on expert rules with a neural network developed by the authors. The results show that the neural network outperforms the expert rules: it detects an order of magnitude more fraud, and with higher accuracy.
Nevertheless, these methods can be used as complementary. In [16], combining a neural network aimed at anomaly detection with the expert approach gives better results than either approach alone.
In the insurance industry, detecting fraud is a difficult task both with expert approaches and with statistical methods, including machine learning. This is emphasized in many works, including [17]. Limited access to fraud data creates an additional problem. As noted in [18], only one full-fledged data set is available for studying fraud with machine learning methods. This situation hampers progress in fraud detection and leads to low classification scores.
Various approaches are used to improve classification. For example, [19] uses a score that can change over the lifetime of a claim and relies on natural language processing. The authors of [20] exploit the fact that fraudsters may distort application data: by detecting such an anomaly, an insurance company can further reduce the level of fraud. Researchers also seek to reduce the number of features used for classification and to increase the interpretability of results [21]. In [22], the authors improve fraud detection scores in motor insurance by applying genetic algorithms. The problem of class imbalance in motor insurance fraud detection is studied in [23].
A comparative table of research results obtained in different years for insurance fraud detection is given in [24].
This work continues research aimed at improving classification quality in fraud detection in banking and insurance. It considers a set of methods applied when deciding whether operations or claims are fraudulent.
1.4. Scientific novelty of the research
1. For the first time, a method is proposed for increasing effectiveness in fraud detection by correcting the target class with a neural network, which balances the data for use with machine learning methods and addresses concept drift.
2. A new way of combining the traditional expert approach with machine learning is proposed that increases the effectiveness of the fraud-monitoring system. The method uses the constituent parts of expert-created rules to generate new, more effective rules by means of machine learning.
3. Methods are proposed for creating new features of operations and claims that improve the quality of fraud detection.
2. Main results
2.1. Main results submitted for defense
1. A method has been developed that solves the class imbalance problem in machine learning and at the same time eliminates the concept drift in data that arises from changes in fraud schemes or from incorrect data labeling. The approach improves the discriminating power of the classifier by improving the quality of the training data. A detailed description of the method and the results obtained are published in [25] (Appendix C).
2. An approach is proposed that increases the effectiveness of the fraud-monitoring system by creating new attributes of operations and claims. In banking, operations are enriched by integrating customers' purchase history into the training data used to assess transfers between customers. Claims are enriched with features derived from a graph of relationships between the participants in insured events. The approaches to creating new features are described in [26] (Appendices B and D).
3. A method based on machine learning has been developed that allows financial organizations to reduce false positives of the fraud-monitoring system by introducing automatically generated decision algorithms for assessing operations. The method is published in [27] (Appendix A).
4. Experimental procedures have been developed for assessing the effectiveness of the proposed methods. A series of experiments has been conducted; the results demonstrate improved fraud detection quality in the financial sector when the developed approaches are used.
2.2. Personal contribution of the author
In the course of the dissertation research, the author developed an approach that increases the effectiveness of machine learning methods by correcting expert data labels.
The author also proposed a process for generating decision algorithms that combines the expert and statistical approaches and increases the accuracy of fraud classification without sacrificing interpretability. The research further proposed an algorithm for constructing a graph of insurance claims and a way of extracting new data from it to make machine learning methods more effective. It was shown that enriching bank transfer data with information from the customer's purchase history improves classification quality when machine learning methods are used.
A series of experiments was conducted; the results showed that the proposed approaches have the potential to increase the effectiveness of machine learning methods and can be a useful tool in fraud prevention, where accurate classification of data is required.
3. Publications and presentation of the work
3.1. High-level publications
1. Vorobyev I. A., Krivitskaya A. Reducing False Positives in Bank Anti-fraud Systems Based on Rule Induction in Distributed Tree-based Models // Computers and Security. 2022. Vol. 120, http://doi.org/10.1016/j.cose.2022.102786 (Scopus, Q1)
2. Vorobyev I. A. Fraud risk assessment in car insurance using claims graph features in machine learning // Expert Systems with Applications. 2024. Vol. 251, http://doi.org/10.1016/j.eswa.2024.124109 (Scopus, Q1)
3.2. Standard-level publications
1. Vorobyev I. A. Machine learning methods in the problem of fraud risk assessment in motor insurance // Izvestiya of Saratov University. New Series. Series: Mathematics. Mechanics. Informatics. 2024 (Scopus)
2. Festa Yu. Yu., Vorobyev I. A. A Hybrid Machine Learning Framework for E-commerce Fraud Detection // Model Assisted Statistics and Applications. 2022. Vol. 17. No. 1. P. 41-49, http://doi.org/10.3233/MAS-220006 (Scopus)
3.3. Conference and seminar presentations
1. E. V. Armensky Interuniversity Scientific and Technical Conference of Students, Postgraduates and Young Specialists (Moscow). Talk: A study of machine learning methods for detecting fraudulent actions against bank customers during operation confirmation, MIEM HSE University, 2023
2. XII Congress of Young Scientists, ITMO (Saint Petersburg). Talk: Interpretability of machine learning models and the class imbalance problem in reducing the risks of credit and financial organizations, 2023
3. XII International Scientific and Practical Conference "Mathematical and Computer Modeling in Economics, Insurance and Risk Management" (Saratov). Talk: Machine learning methods in the problem of fraud risk assessment in motor insurance, 2023
4. International Conference on Data Analytics and Computational Techniques, ICDACT-21. Talk: A Hybrid Machine Learning Framework for E-commerce Fraud Detection, 2021
5. International Congress "Modern Problems of Computer and Information Sciences", VI International Scientific Conference "Convergent Cognitive Information Technologies" (Moscow). Talk: The application of artificial intelligence for improving the efficiency of transactional fraudmonitoring, 2021
6. XI International Forum "Combating Fraud in High Technology. Antifraud Russia - 2020" (Moscow). Talk: Antifraud in Sber acquiring, 2020
4. Contents of the work
The results of the dissertation research are presented in the following sections:
1. The use of machine learning methods in fraud detection and approaches to assessing their effectiveness.
2. Architectures of fraud-monitoring systems and potential areas for their improvement.
3. Preparing data for training classifiers of fraudulent operations and claims.
4. Generating new algorithms to improve the quality of the fraud-monitoring system.
5. Methodology of experiments and studies.
4.1. The use of machine learning methods in fraud detection and approaches to assessing their effectiveness
Fraud is understood here as the theft of funds from a customer or a financial organization by professional fraudsters.
According to [11], fraud has the following specific characteristics:
1) compared with legitimate operations, fraud occurs rarely;
2) fraud is carefully thought out and planned;
3) fraudsters try to disguise their activity as legitimate;
4) fraudster behavior changes over time;
5) fraudsters often work in organized groups.
An operation or claim (hereafter, a transaction) is assessed for fraud with machine learning methods on the basis of historical data. Each transaction has its own feature description and, if it was processed by an expert or the customer was asked to confirm its legitimacy, an answer to the question of whether it shows signs of fraud. Fraud detection can then be reduced to the problem of learning from precedents [12]. Specifically, we consider a classification problem with two disjoint classes. The decision function found (hereafter, the model or classifier) is used to assess a specific transaction for fraud from its feature description (hereafter, features).
Let $X$ denote the set of transactions being assessed and $Y$ the set of answers to the question "is the transaction fraudulent?". Pairs "transaction, answer" $(x_i, y_i)$ are called precedents. Suppose the values of some function $y^*: X \to Y$ are known on a finite subset of transactions $\{x_1, \dots, x_\ell\} \subset X$, so that $y_i = y^*(x_i)$. The function $y^*$ is called the target function, and the set of pairs $X^\ell = (x_i, y_i)_{i=1}^{\ell}$ the training sample.
The problem of learning from precedents is to recover the dependence $y^*$ from the sample $X^\ell$, that is, to construct a decision function $a: X \to Y$ that approximates the target function $y^*(x)$ not only on the transactions in $X^\ell$ but on the whole set $X$.
The decision function $a$ is also called an algorithm, and in some cases a classifier, when its role in assessing a transaction is to classify it as fraudulent or legitimate. For practical use, the constructed algorithm $a$ must admit an efficient computer implementation, since financial organizations are expected to use it to analyze their transactional data stored on their servers.
Transaction attributes obtained from the processes of financial organizations (for example, operation amount, customer age, insurance payout amount) are features from the standpoint of learning from precedents. Formally, a feature is a mapping $f: X \to D_f$, where $D_f$ is the set of admissible values of the feature.
Several types of features are distinguished, depending on the nature of the data:
• $D_f = \{0, 1\}$: binary feature;
• $D_f = \mathbb{R}$: quantitative feature;
• $D_f$ is a finite set: nominal or categorical feature.
If all features in the data are of the same type, $D_{f_1} = \dots = D_{f_n}$, the data are called homogeneous; otherwise they are heterogeneous. In practice, the transaction data stored in financial organizations are heterogeneous and contain all types of features. In this study, all categorical features are converted to binary ones using well-known machine learning algorithms.
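Let $f_1, \dots, f_n$ be a set of features. The vector $(f_1(x), \dots, f_n(x))$ is called the feature description of the transaction $x \in X$. The collection of feature descriptions of all objects in the sample $X^\ell$, written as a table of size $\ell \times n$, is called the object-feature matrix:
$$F = \|f_j(x_i)\|_{\ell \times n} = \begin{pmatrix} f_1(x_1) & \dots & f_n(x_1) \\ \dots & \dots & \dots \\ f_1(x_\ell) & \dots & f_n(x_\ell) \end{pmatrix} \qquad (1)$$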
An example description of transactions for the fraud detection task is given in Table 1.
Table 1
Features of operations
| Date and time of transaction | Card operation type | Type of service | Shop MCC | Transaction amount | Fraud |
| 01.02.2024 13:03 | Purchase via pos | Car service | 5533 | 26720,00 | 0 |
| 01.02.2024 13:10 | Purchase via pos | Car service | 5533 | 1500,00 | 0 |
| 02.02.2024 14:12 | Purchase via pos | Gas station | 5541 | 2202,78 | 0 |
| 08.02.2024 10:00 | Purchase via pos | Pet Shop | 5995 | 7399,00 | 0 |
| 10.02.2024 23:00 | Purchase via ecom | P2P | 4900 | 4500,00 | 1 |
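As an illustration of how the rows of Table 1 could be turned into an object-feature matrix, the following minimal sketch uses pandas. The snake_case column names, the derived `hour` feature, and the choice to treat MCC as categorical are assumptions made for the example, not details taken from the dissertation.

```python
import pandas as pd

# Rows of Table 1; the snake_case column names are illustrative assumptions.
rows = [
    ("01.02.2024 13:03", "Purchase via pos", "Car service", 5533, 26720.00, 0),
    ("01.02.2024 13:10", "Purchase via pos", "Car service", 5533, 1500.00, 0),
    ("02.02.2024 14:12", "Purchase via pos", "Gas station", 5541, 2202.78, 0),
    ("08.02.2024 10:00", "Purchase via pos", "Pet Shop", 5995, 7399.00, 0),
    ("10.02.2024 23:00", "Purchase via ecom", "P2P", 4900, 4500.00, 1),
]
columns = ["timestamp", "operation_type", "service_type", "mcc", "amount", "fraud"]
df = pd.DataFrame(rows, columns=columns)

# Quantitative features: the amount and the hour of day (the hour is an assumed example feature).
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d.%m.%Y %H:%M")
quantitative = pd.DataFrame({"amount": df["amount"], "hour": df["timestamp"].dt.hour})

# Categorical features become binary indicator columns (one-hot encoding).
# MCC is a code rather than a quantity, so it is treated as categorical here.
categorical = df[["operation_type", "service_type", "mcc"]].astype(str)
binary = pd.get_dummies(categorical, dtype=int)

F = pd.concat([quantitative, binary], axis=1)  # object-feature matrix, one row per transaction
y = df["fraud"]                                # answers: 1 = fraud, 0 = legitimate
```

After encoding, every column of `F` is either quantitative or binary, which matches the statement that categorical features are converted to binary ones.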
In this study the set of admissible answers is $Y = \{0, 1\}$, which corresponds to classification into two disjoint classes. In the general case, if $Y = \{1, \dots, M\}$, the set of transactions $X$ is partitioned into $M$ disjoint classes $K_y = \{x \in X \mid y^*(x) = y\}$. The algorithm $a(x)$ answers the question "which class does $x$ belong to?"; in fraud detection, it answers the question of whether a transaction is fraudulent or not.
According to [12], a model of algorithms is a parametric family of mappings $A = \{g(x, \theta) \mid \theta \in \Theta\}$, where $g: X \times \Theta \to Y$ is some fixed function and $\Theta$ is the set of admissible values of the parameter $\theta$, called the parameter space.
The dissertation research is aimed at finding optimal model parameters for classifying transactions and at embedding the resulting models at various stages of the fraud detection process in a financial organization. Many approaches and techniques now exist for finding an algorithm and its optimal parameters (hyperparameters), which ultimately yield the required algorithm $a(x)$ for use in decision-making across various tasks. Together these approaches are called machine learning methods (hereafter ML). The following well-known ML methods were chosen in this study for integration into the fraud detection process.
Decision Tree¹ (DT) is the simplest and most interpretable tool used in machine learning. The modeling result can be represented as a tree structure, from which a simple decision rule is easy to extract.
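How a decision rule can be read off a fitted tree may be sketched with scikit-learn's `DecisionTreeClassifier` and `export_text`. The toy data and the feature names `amount` and `is_ecom` are hypothetical and serve only to illustrate the idea.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): [amount, is_ecom]; label 1 = fraud, 0 = legitimate.
X = [
    [26720.00, 0], [1500.00, 0], [2202.78, 0], [7399.00, 0],
    [4500.00, 1], [5200.00, 1], [300.00, 0], [9100.00, 1],
]
y = [0, 0, 0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each root-to-leaf path in the printed tree reads as an if-then decision rule.
rules = export_text(tree, feature_names=["amount", "is_ecom"])
print(rules)
```

A shallow tree is used on purpose: the fewer the splits, the shorter the rule that an expert has to read and validate.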
¹ https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
² https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Random Forest² (RF) was chosen as the base classification algorithm; it shows the best results in studies of fraud detection [28]. The method uses an ensemble of Decision Tree algorithms, each of which may not give high classification quality on its own, but whose large number makes a better result achievable. It was chosen in this study for its low sensitivity to the size of the feature space and its high classification quality when trained on heterogeneous data with categorical and quantitative features.
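The ensemble idea can be illustrated with a minimal sketch based on scikit-learn's `RandomForestClassifier`. The synthetic, imbalanced data set (5% fraud) and the number of trees are assumptions for the example and do not reproduce the experiments of the dissertation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic, imbalanced toy data (hypothetical): 2 features, 20 fraud cases out of 400.
y = np.array([1] * 20 + [0] * 380)
X = rng.normal(size=(400, 2))
X[y == 1] += 3.0  # shift the fraud cases so that the classes can be separated

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# predict_proba averages the class probabilities of the individual trees;
# column 1 is the estimated probability of the fraud class.
scores = forest.predict_proba(X)[:, 1]
```

The real-valued `scores` are what a fraud-monitoring system would later compare with a decision threshold.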
A multilayer perceptron³ (MLP) was chosen for building an algorithm on data with a small number of features. This method also shows strong results in anti-fraud research. Its hidden layers make it possible to approximate a nonlinear function for classification.
The best model in the parameter space $\Theta$ is searched for with the GridSearchCV⁴ tool, which optimizes hyperparameters by a cross-validated search over a parameter grid.
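A cross-validated grid search of this kind could look as follows. This is a sketch under stated assumptions: the grid values, the ROC AUC scoring, the three folds, and the synthetic data are chosen for illustration and are not the settings used in the dissertation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic toy data (hypothetical): 3 features, 30 fraud cases out of 300.
y = np.array([1] * 30 + [0] * 270)
X = rng.normal(size=(300, 3))
X[y == 1] += 2.0

# Candidate hyperparameter values; every combination is evaluated.
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # assumed metric; it does not depend on a decision threshold
    cv=3,               # 3-fold cross-validation (stratified for classifiers)
)
search.fit(X, y)

best_params = search.best_params_    # point of the grid with the best mean CV score
best_model = search.best_estimator_  # model refitted on all data with best_params
```

Here the grid plays the role of a finite subset of the parameter space, and the search returns the element of that subset with the best cross-validated score.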
Traditional metrics commonly used in fraud detection were chosen to evaluate the experimental results. This study considers classification into two disjoint classes $Y = \{0, 1\}$. Let $\hat{y}_i \in \mathbb{R}$ be the output of the trained model for the $i$-th transaction. To decide whether a transaction is fraudulent or legitimate, we use a threshold $th$ that converts the values $\hat{y}_i$ into disjoint classes: $\hat{y}_i^{th} = [\hat{y}_i > th]$⁵.
From a statistical standpoint, classification amounts to deciding between the null hypothesis $H_0$ that the transaction belongs to class 0 (legitimate) and the alternative $H_1$ that it belongs to class 1 (fraudulent). The decisions made can contain two kinds of error: a false positive (type I error), when a legitimate transaction is classified as fraudulent, and a false negative (type II error), when a fraudulent transaction is classified as legitimate. Changing the threshold adjusts the trade-off between these two types of error, since as the probability of a type I error grows, the probability of a type II error usually falls, and vice versa.
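The trade-off between the two kinds of error can be illustrated with a short sketch in plain Python. The model outputs, the true labels, and the three threshold values are hypothetical.

```python
def classify(scores, th):
    """Apply the rule [score > th]: 1 = fraud, 0 = legitimate."""
    return [int(s > th) for s in scores]


def error_counts(y_true, y_pred):
    """Return (type I errors, type II errors), i.e. (false positives, false negatives)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


# Hypothetical model outputs and true labels for eight transactions.
scores = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95]
y_true = [0, 0, 0, 1, 0, 1, 1, 1]

for th in (0.2, 0.5, 0.7):
    fp, fn = error_counts(y_true, classify(scores, th))
    print(f"th={th}: type I errors={fp}, type II errors={fn}")
```

On this toy data, raising the threshold from 0.2 to 0.7 removes the false positives but introduces false negatives, which is the compromise described above.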
³ https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
⁴ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
⁵ Square brackets convert a logical value to a number by the rule [false] = 0, [true] = 1.
The threshold $th$ is chosen according to the task at hand; once it is fixed, Table 2 (the confusion matrix) can be constructed:
Table 2
Confusion matrix
| Decision made | True hypothesis: H0 | True hypothesis: H1 |
| H0 | TN, H0 correctly accepted | FN, H0 incorrectly accepted (type II error) |
| H1 | FP, H0 incorrectly rejected (type I error) | TP, H0 correctly rejected |
In conventional machine learning terms, these outcomes can be stated as follows:
• TP (true positive): a fraudulent transaction is identified correctly,
• FP (false positive): a legitimate transaction is identified as fraudulent,
• TN (true negative): a legitimate transaction is identified correctly,
• FN (false negative): a fraudulent transaction is identified as legitimate.
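The four counts can be obtained for a fixed threshold with scikit-learn's `confusion_matrix`. The labels and predictions below are hypothetical; class 1 denotes fraud and class 0 a legitimate transaction.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and thresholded predictions for eight transactions.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1]

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so the flattened matrix is ordered as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```

Passing `labels=[0, 1]` explicitly fixes the order of the cells even if one of the classes happens to be absent from a batch of transactions.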
References (dissertation by Ivan Aleksandrovich Vorobyev, Candidate of Sciences, 2024)
1. Al-Hashedi K.G., Magalingam P. Financial fraud detection applying data mining techniques: A comprehensive review from 2009 to 2019 // Computer Science Review. 2021. Vol. 40. P. 100402.
2. Bao Y., Hilary G., Ke B. Artificial Intelligence and Fraud Detection. 2022. P. 223-247.
3. Bezzateev S.V. et al. Risk assessment methodology for information systems, based on the user behavior and IT-security incidents analysis // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2021. Vol. 21, № 4. P. 553-561.
4. Abdallah A., Maarof M.A., Zainal A. Fraud detection system: A survey // Journal of Network and Computer Applications. 2016. Vol. 68. P. 90-113.
5. Gupta P. et al. Unbalanced Credit Card Fraud Detection Data: A Machine Learning-Oriented Comparative Study of Balancing Techniques // Procedia Computer Science. 2023. Vol. 218. P. 2575-2584.
6. Pant P., Srivastava P. Cost-Sensitive Model Evaluation Approach for Financial Fraud Detection System // 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC). IEEE, 2021. P. 1606-1611.
7. Jin C., Feng Y., Li F. Concept drift detection based on decision distribution in inconsistent information system // Knowledge-Based Systems. 2023. Vol. 279. P. 110934.
8. Anderson O.D. A Note on "Trial by Computer"—A Case Study of the Use of Simple Statistical Techniques in the Detection of a Fraud // Journal of the Operational Research Society. 1986. Vol. 37, № 4. P. 423-427.
9. Mercer L.C.J. Fraud detection via regression analysis // Computers & Security. 1990. Vol. 9, № 4. P. 331-338.
10. Wolf D., Greenberg D. The Dynamics of Welfare Fraud: An Econometric Duration Model in Discrete Time // Journal of Human Resources. 1986. Vol. 21, № 4. P. 437-455.
11. Baesens B., Vlasselaer V.V., Verbeke W. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques. Hoboken, NJ, USA: John Wiley & Sons, Inc, 2015.
12. Vorontsov K.V. Mathematical methods of learning from precedents (machine learning theory) [in Russian] // «Машинное обучение» website, lecture course. 2011.
13. Kumari P., Mishra S.P. Analysis of Credit Card Fraud Detection Using Fusion Classifiers. 2019. P. 111-122.
14. Baesens B., Hoppner S., Verdonck T. Data engineering for fraud detection // Decision Support Systems. 2021. Vol. 150. P. 113492.
15. Ghosh, Reilly. Credit card fraud detection with a neural-network // 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences. 1994. Vol. 3. P. 621-630.
16. Baader G., Krcmar H. Reducing false positives in fraud detection: Combining the red flag approach with process mining // International Journal of Accounting Information Systems. 2018. Vol. 31. P. 1-16.
17. Nian K. et al. Auto insurance fraud detection using unsupervised spectral ranking for anomaly // The Journal of Finance and Data Science. 2016. Vol. 2, № 1. P. 58-75.
18. Subudhi S., Panigrahi S. Use of optimized Fuzzy C-Means clustering and supervised classifiers for automobile insurance fraud detection // Journal of King Saud University - Computer and Information Sciences. 2020. Vol. 32, № 5. P. 568-575.
19. Yankol-Schalck M. The value of cross-data set analysis for automobile insurance fraud detection // Research in International Business and Finance. 2022. Vol. 63. P. 101769.
20. Vandervorst F., Verbeke W., Verdonck T. Data misrepresentation detection for insurance underwriting fraud prevention // Decision Support Systems. 2022. Vol. 159. P. 113798.
21. Aslam F. et al. Insurance fraud detection: Evidence from artificial intelligence and machine learning // Research in International Business and Finance. 2022. Vol. 62. P. 101744.
22. Yan C. et al. Improved adaptive genetic algorithm for the vehicle Insurance Fraud Identification Model based on a BP Neural Network // Theoretical Computer Science. 2020. Vol. 817. P. 12-23.
23. Salmi M., Atif D. Using a Data Mining Approach to Detect Automobile Insurance Fraud. 2022. P. 55-66.
24. Soufiane E. et al. Automobile Insurance Claims Auditing: A Comprehensive Survey on Handling Awry Datasets. 2022. P. 135-144.
25. Vorobyev I. ML methods for assessing the risk of fraud in auto insurance // Izvestiya of Saratov University. Mathematics. Mechanics. Informatics.
26. Festa Y.Y., Vorobyev I.A. A hybrid machine learning framework for ecommerce fraud detection // Model Assisted Statistics and Applications. 2022. Vol. 17, № 1. P. 41-49.
27. Vorobyev I., Krivitskaya A. Reducing false positives in bank anti-fraud systems based on rule induction in distributed tree-based models // Computers & Security. 2022. Vol. 120. P. 102786.
28. Itri B. et al. Performance comparative study of machine learning algorithms for automobile insurance fraud detection // 2019 Third International Conference on Intelligent Computing in Data Sciences (ICDS). IEEE, 2019. P. 1-4.
29. Fawcett T. An introduction to ROC analysis // Pattern Recognition Letters. 2006. Vol. 27, № 8. P. 861-874.
30. Subelj L., Furlan S., Bajec M. An expert system for detecting automobile insurance fraud using social network analysis // Expert Systems with Applications. 2011. Vol. 38, № 1. P. 1039-1052.
31. Phua C., Alahakoon D., Lee V. Minority report in fraud detection // ACM SIGKDD Explorations Newsletter. 2004. Vol. 6, № 1. P. 50-59.
Appendix A. Article "Reducing false positives in bank anti-fraud systems based on rule induction in distributed tree-based models"
Vorobyev I.A., Krivitskaya A. Reducing False Positives in Bank Anti-fraud Systems Based on Rule Induction in Distributed Tree-based Models // Computers and Security. 2022. Vol. 120. doi: http://doi.org/10.1016/j.cose.2022.102786
Computers & Security 120 (2022) 102786
ELSEVIER
Contents lists available at ScienceDirect
Computers & Security
journal homepage: www.elsevier.com/locate/cose
Reducing false positives in bank anti-fraud systems based on rule induction in distributed tree-based models
Ivan Vorobyev a,*, Anna Krivitskaya b
a HSE University, Russian Federation
b Lomonosov Moscow State University, Russian Federation
ARTICLE INFO
Article history: Received 27 September 2021; Revised 14 March 2022; Accepted 2 June 2022; Available online June 2022
Keywords: Fraud detection; Payment card fraud; False positives; Rule induction; Feature engineering
ABSTRACT
Fraud detection in bank payment transactions suffers from a high number of false positives. To deal with this problem, we introduce a rules generation framework for a fraud-detection system: automatic rules generation using distributed tree-based ML (machine learning) algorithms such as Decision Tree, Random Forest and Gradient Boosting, where the components of expert rules are used as the features for the model. This approach is a combination of statistical and expert-based approaches. We apply it to the bank's card transaction data. Our dataset covers February 2021 and consists of more than 20 million records, including information on clients, transactions, and merchants. The autogenerated rules were aimed at improving the FPR (false positive rate) business metric. The framework was tested in a real fraud-monitoring system of a large bank throughout half a year. The rules obtained using this framework proved to be satisfactorily efficient while having a tangible business effect.
© 2022 Elsevier Ltd. All rights reserved.
1. Introduction
1.1. Motivation to the study
As fintech and e-commerce thrive, more and more bank payments and money transfers are facilitated through online channels, which are faster, more convenient, and safer for health in the coronavirus era. According to the Mastercard global consumer study 2020 (footnote 1), 8 out of every 10 Mastercard users all over the world make use of contactless payment methods.
The data on conducted transactions is accumulated in the databases of banks, e-commerce platforms, and other e-commerce industry players. In 2017, Visa's payment processing capability was as high as 75 000 transactions per second (Zeng, 2018). The proper use of that data allows corporations to improve their operational performance and customer experience. However, that much data is impossible to handle manually. Hence, the corporations are forced to build Big Data infrastructure and utilize Machine Learning (ML) algorithms.
Developing mechanisms to protect client funds against fraudsters is a part of banking and e-commerce corporations' strategy for improving customer experience. As payments have moved to online channels, the fraudsters have adapted as well, bringing issues of protecting the funds circulating in online channels to the fore. According to SmartMetric, in 2018 the global losses from payment fraud were higher than $24bn (footnote 2).
* Corresponding author.
E-mail address: vorobyev-ivan@yandex.ru (I. Vorobyev).
1 https://www.mastercard.com/news/press/press-releases/2020/april/mastercard-study-shows-consumers-globally-make-the-move-to-contactless-payments-for-everyday-purchases-seeking-touch-free-payment-experiences/
Despite the difficulties in detecting fraud (driven by the fraudsters imitating normal clients' behavior and using social engineering techniques), the fraud-detection systems of large corporations make it possible to detect and prevent up to 98% of fraud cases (Carminati et al., 2015). In terms of ML metrics, this means attaining a high recall.
Nowadays, the issue of a large number of false positives (FP) comes to the fore (Zoldi, 2015), posing the problem of low precision, or high false positive rate (FPR). On average, there is only 1 fraudulent transaction out of 5 transactions blocked, and every 6th user was blocked mistakenly over the year (Wedge et al., 2019).
False positives lead to money losses of corporations on investigating cases and contacting the client, as well as to increased pressure on the call center, and even to revenue losses due to declined transactions. Considering one of the possible response strategies, giving a call to the client, the cost of managing one blocked transaction ranges from 1.5 euros (according to the bank that provided us with data for our research) to 5 euros (European Central Bank; Baesens et al., 2021b). As for the e-commerce industry players, according to the Merchant Risk Council's 2017 Global Fraud Survey (footnote 3), one online retailer declines on average 2.5% of orders, at least 10% of which should have been accepted. According to Riskified (footnote 4), false positives in that case lead to a loss of 6% of the revenues.
2 https://www.businesswire.com/news/home/20191223005414/en/
In addition, the more demanding the clients become about service quality, the more false positives damage the company's reputation and decrease the customers' loyalty. According to Javelin Strategy (footnote 5), 61% of users who did not manage to conduct a transaction due to false positive blocking cut down on card usage or stopped using cards altogether.
1.2. The purpose of the paper
This paper aims at finding an approach to improve the efficiency measures of the bank fraud detection process, which consist of two key metrics: 1) fraud basis point (FBP), accounting for fraudulent transactions not blocked by the fraud-detection system, and 2) false positive rate (FPR), accounting for the fraction of false positives among all the alarms generated by the system.
The work is concentrated on the FPR metric. We propose a method to reduce FPR without deteriorating the FBP metric.
We search for a solution that combines expert and statistical approaches to fraud detection in such a way that the resulting algorithm incorporates the advantages of both and performs well as measured by the key business metrics while being highly interpretable. At the same time, the solution should be easy to implement and integrate into existing fraud-detection systems used by the corporations.
1.3. Data and methods
We utilize rule induction techniques. We generate new rules by implementing tree-based ML algorithms, the features of which consist of the components of existing experts' rules. The resulting rules are aimed at identifying cases that are currently incorrectly classified as fraud.
The data for our research includes characteristics of transactions and clients, as well as the set of expert rules that form the decision-making logic.
1.4. Definition of fraud
By the concept of fraud we mean a case of money theft from a client by professional fraudsters. We classify fraud into two main types depending on the techniques used by fraudsters: social engineering, when fraudsters persuade a client to take actions that enable them to steal the money, and others, not involving a client's participation.
Fraud, according to (Baesens et al., 2015), is characterized by the following specific characteristics:
■ Fraud is uncommon, compared to the legitimate transactions
■ Fraud is well-considered and organized
■ Fraudsters try to conceal their actions, pretending to be normal users
■ Fraud patterns are dynamic, changing over time
■ Fraudsters can work as a group
3 2017 MRC Global Fraud Survey. https://www.merchantriskcouncil.org/…
4 Riskified. The True Cost of Declined Orders. https://www.riskified.com/blog/…
5 Pascual A. Future proofing card authorization. 2015. https://www.javelinstrategy.com/coverage-area/future-proofing-card-authorization
1.5. The novelty of the proposed approach and its practical applications
From the scientific perspective, our research contributes to the literature on fraud detection methods. We propose an approach that combines expert-based and ML approaches such that the features of the model are derived from the expert algorithms and the output of the model imitates the algorithms which experts could have constructed themselves. The application of statistical methods for fine-tuning experts' algorithms improves the efficiency of fraud-detection measures, primarily in terms of the number of false positives.
Also, we discuss the practical applications of our approach applied to a fraud-detection system. According to (Pant and Srivastava, 2021), there is a problem with academic researchers as they lack the business context understanding of how real fraud-detection systems work and what the costs of blocking legitimate or not blocking fraudulent transactions are. Thus, our discussion contributes to that type of knowledge in academia.
Regarding practical applications, the outcome of the proposed algorithm is represented in the form of decision rules that can be easily integrated into the fraud-detection systems the corporations use nowadays. The method is relevant for the organizations that utilize such systems to prevent fraud (e.g. banks, payment systems, e-commerce platforms, insurance companies, fiscal authorities), as well as for other companies fostering automation of decision-making processes on the basis of decision rules derived from Big Data. In (Schneider and Xhafa, 2022), more details for the eHealth application area are provided.
Apart from the description of an algorithm, we also provide some detail on how to evaluate it in real time and match the actual real-time results with those obtained on the training set. Finally, we suggest a technique to evaluate its net business effect, given the fact that there is already some level of precision and recall achieved in corporations with the use of their current decisions.
1.6. Results
As a result of implementing the proposed framework in a bank, we tested rules targeted at detecting false positives or false negatives in a real industrial antifraud system. The rules' performance was evaluated in terms of their classification quality and net financial effect. The best-performing rules were then added to the antifraud system.
Although we indicated a need for further improvements, we have already achieved fairly good performance metrics. The average precision (measured as true positives to the sum of true positives and false positives) of a rule is 50% on the test sample, though 10% in online mode. The average recall of a rule on the test sample is [value illegible in source].
This paper is further organized as follows: Section 2 is a literature review, Section 3 provides detail on the data we used and the field we work in, Section 4 describes the algorithm we implement, Section 5 presents the results of modeling and Section 6 concludes.
2. Literature review
Fraud detection and prevention is of interest to both business and academia. While in 2015, 16,000 scientific papers were published on the topic, in 2021 it was 13 times more (footnote 6).
Fraud detection algorithms could be classified into expert and statistical approaches. In the first case, fraud is detected on the basis of rules created manually by experts who analyze typical fraud patterns. In the second case, Artificial Intelligence (AI) methods, especially Machine Learning (ML), are applied to reveal fraudulent operations.
6 Source: number of publications in Google Scholar for the query "Fraud detection"
As stated above, the high false positive rate remains one of the key problems in the field. The recent research that seeks to resolve the problem is concentrated mainly on statistical approaches. These include the feature engineering techniques that increase the efficiency of models in terms of FPR, such as automated feature engineering (Wedge et al., 2019); the deep neural networks that help to automate the decision-making process and prove to provide sufficient classification quality in fraud detection (Carrasco and Sicilia-Urban, 2020); the classical ML algorithms including clustering (Liang et al., 2015) and classification models (Severino and Peng, 2021), etc.
Very few works concern the problem of how to efficiently combine expert knowledge and AI. Most of the articles that try to account for both approaches are concentrated on the development of AI analytical and data visualization tools to assist fraud experts (Sun et al., 2020; Leite et al., 2020) and on explainable AI, which fraud experts are believed to trust more than the black-box models (Cirqueira et al., 2021), as well as on the modeling and data engineering techniques which should complement the expert-driven traditional approach (Baesens et al., 2021). There are also papers based on the idea of expertise-based feature engineering as a means to extend the typically generated recency, frequency, and temporal features (Hsin et al., 2021; Xie et al., 2019). One more way to apply experts' knowledge to improve the efficiency of statistical models is proposed in (Hao et al., 2021), where experts produce the set of rules that filter out the noisy unlabeled data from the training dataset.
Our experience, including constructing new features, performing feature selection automatically, clustering merchants' profiles, detecting abnormal behavior of clients and merchants, creating risk scores based on neural networks and so on, proves that the usage of statistical algorithms does help fraud experts to reduce false positives. But we believe that the performance of a fraud-detection system as a whole, accounting for both expert and statistical approaches, can be further improved by combining two types of intelligence, natural and artificial, in a more automated way.
In this work, we propose to complement the experts' knowledge with AI through the application of rule induction techniques. So far, rule induction was seen as a data mining technique that helps to reveal hidden patterns in data. The resulting association rules were used as a supportive tool for experts' decision making. For example, (Xie et al., 2019) employ rule induction to engineer new features over the set of rules and further use those in a Random Forest classifier. However, all the rules there were generated manually based on expertise accumulated by analysts, not automatically. (Sadgali et al., 2021) proposed an ML rules generation approach to assess the risk level associated with each transaction. They induce fuzzy association rule sets based on the Apriori algorithm and then score transactions depending on the share of rules in the set that the transaction is consistent with.
Our approach differs from the association rule mining described above on the basis of how we understand the concept of a rule. We aim to derive ready-to-use expert-like if-then rules in a conjunctive normal form, suitable for use in a fraud detection system as they are, with no need for further experts' efforts to interpret and adjust them. The approach closest to ours is that of (Youssef et al., 2021). The authors utilize the deep-learning framework CRED (continuous/discrete rule extractor via decision tree induction) to induce the if-else rules in e-commerce fraud detection. The main purpose of applying rule induction techniques was to shed light on the process of forming predictions in black-box deep learning models. The other related work is (Hasanpour et al., 2019), where classification and rule mining were integrated into a rule-based classifier by merging the Apriori associative rule induction algorithm with binary Harmony Search rule selection and the "Classification Based on Associations" algorithm for building a classifier in the form of an If-Then-Else rule list.
For the purpose of our research, we have chosen the rule-based models for rule induction such as Decision Tree, Random Forest and Gradient Boosting. According to (Hasanpour et al., 2019), these models, although not demonstrating better classification metrics than the customized and more complex ones, are satisfactory in terms of the metrics, computing resources and time spent on training and tuning the model. The last two aspects are particularly relevant for us, having billions of data records and applying the Big Data technology stack and distributed ML models.
One more thing that distinguishes our approach is that it is driven by the industrial needs of our company and is thus fully concentrated on the problem of reducing the number of false positives. We select the rules that predict the legitimate transactions, and we propose a custom set of ML and business metrics that correspond to the given task. We explain in detail how our approach can be integrated into a fraud-monitoring system with some given level of precision and recall. Finally, we test the efficiency and scalability of our approach on real data.
While conducting our research, we faced the problems typical for the industry: the large imbalance of classes (according to [5], fraudulent transactions make up less than 0.5% of the sample) and unequal costs of classification errors. The latter concerns the unequal costs of misclassifying fraudulent and legitimate observations. In particular, the cost of false positives and true positives is fixed at the level of administrative costs for investigating the case and contacting the cardholder, while the cost of false negatives depends on the sum of money stolen by the fraudster [13].
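The asymmetry of error costs described above can be illustrated with a minimal sketch. The names `Transaction` and `total_cost`, the sample data, and the default administrative cost of 1.5 euros per blocked transaction (the lower bound quoted in the introduction) are assumptions made for illustration only; this is not the cost model used by the authors.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float    # transaction amount in euros (hypothetical field)
    is_fraud: bool   # true label
    blocked: bool    # verdict of the fraud-monitoring system


def total_cost(transactions, admin_cost=1.5):
    """Fixed admin cost for every blocked transaction (TP or FP),
    stolen amount for every missed fraud (FN), nothing for TN."""
    cost = 0.0
    for t in transactions:
        if t.blocked:
            cost += admin_cost   # TP and FP cost the same to investigate
        elif t.is_fraud:
            cost += t.amount     # FN: the money is stolen
    return cost


sample = [
    Transaction(100.0, True, True),    # TP
    Transaction(40.0, False, True),    # FP
    Transaction(250.0, True, False),   # FN
    Transaction(20.0, False, False),   # TN
]
print(total_cost(sample))
```

With such a cost function, a single missed fraud of a large amount can outweigh many false positives, which is why the ratio of the two error costs is not constant within a class.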
The techniques typically used to account for class imbalance and unequal costs are: changing the model inputs by either oversampling (Kumari et al., 2019; Baesens et al., 2021a, 2021b), undersampling (Trisanto et al., 2020) or combining both of them; adjusting the weights of observations; changing the model outputs by correcting the thresholds (Sheng, 2006); or changing the classification algorithm itself by modifying the existing ones or developing new ones (Hoppner et al., 2021). If the ratio of the two types of error costs does not remain fixed within the class, only the weighting and algorithm-changing approaches work.
In the case of tree-based algorithms, there are several options for turning cost-insensitive algorithms into cost-sensitive ones: 1) splitting in a cost-sensitive manner, 2) pruning the tree in a cost-sensitive manner, or 3) using an additional cost adjustment function inside the impurity criteria (Sahin et al., 2013).
In our work, we apply undersampling and prune rules in a cost-sensitive way. We have also experimented with class weights.
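As an illustration of the undersampling mentioned above, here is a minimal sketch of random undersampling of the majority class. The function name `undersample`, the `ratio` parameter and the toy data are hypothetical; the paper does not describe its undersampling procedure at this level of detail.

```python
import random


def undersample(rows, label_key="label", ratio=1.0, seed=0):
    """Keep all class-1 rows and a random subset of class-0 rows,
    so that there are about `ratio` class-0 rows per class-1 row."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    return pos + rng.sample(neg, n_keep)


# toy dataset: 5 fraudulent rows out of 1000 (0.5%)
rows = [{"id": i, "label": 1 if i < 5 else 0} for i in range(1000)]
balanced = undersample(rows, ratio=2.0)
print(len(balanced))
```

Class weights are the alternative that keeps all rows and instead scales the contribution of each observation to the loss.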
3. Application area and data description
3.1. Business metrics of transactional antifraud quality
The business metrics to evaluate the efficiency of the fraud detection process in transactional antifraud are influenced by ML and defined by the types of transactions in the fraud-monitoring system.
All the banking operations (turnover) that pass through the antifraud system can be divided into 5 categories depending on the system verdict (Fig. 1):
- Fraud, identified - the operations that were blocked by the system and for which the client gave feedback that they were actually fraudulent
- Genuine (false positives) - the operations that were blocked by mistake
Fig. 2. One of the possible schemes of the bank's anti-fraud system, combining model-based and rule-based approaches.
... execution in the order of rules priority. For each rule, the action is defined: whether to block an operation if it satisfies the rule's conditions or "whitewash" it, excluding further checking of lower priority rules.
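The priority-ordered execution with "deny" and "whitewash" actions can be sketched as follows. This is a simplified illustration: the rule names, the transaction fields, the convention that a lower number means higher priority, and the assumption that the first matching rule decides the verdict are all hypothetical and not taken from the bank's system.

```python
def evaluate(transaction, rules):
    """Check rules in priority order. The first rule whose condition
    holds decides: 'deny' blocks the operation, 'allow' whitewashes it
    and skips all lower-priority rules."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["condition"](transaction):
            return rule["action"], rule["name"]
    return "allow", None   # no rule fired: the operation passes


rules = [
    {"name": "trusted_merchant", "priority": 1, "action": "allow",
     "condition": lambda t: t["merchant"] in {"grocery_chain"}},
    {"name": "large_night_transfer", "priority": 2, "action": "deny",
     "condition": lambda t: t["amount"] > 1000 and t["hour"] < 6},
]

print(evaluate({"merchant": "grocery_chain", "amount": 5000, "hour": 3}, rules))
print(evaluate({"merchant": "unknown_shop", "amount": 5000, "hour": 3}, rules))
```

In the first call the whitewashing rule fires first, so the lower-priority blocking rule is never checked; a rule predicting legitimate transactions reduces false positives in exactly this way.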
Thus, both the elements of artificial intelligence through which the new features for the anti-fraud system are created, and fraud analysts who directly adjust the rules, are involved in the process of transactional fraud monitoring. On the one hand, the combined approach allows the bank to adapt quickly to changes in fraudulent scenarios even if they changed just a few hours earlier. On the other hand, it facilitates developing the right strategy to train ML models, which require a large, labeled dataset and hence plenty of time to collect it.
The formation and adjustment of expert rules are regular processes which are essential to control the FBP and FPR metrics. Currently they are executed manually in the bank. For now, it is impossible to completely replace the expert approach with a model-based approach in a bank with high transactional activity. The underlying reasons for the manual control are as follows: a significant class imbalance while fitting the classification models, and the rapidly changing nature of fraud. These factors increase the likelihood of incorrect functioning of the antifraud system as the AI models deteriorate, which is reflected in the growth of rejected legitimate traffic or missed fraud. Also, fraudulent schemes are constantly changing and adapting, some of them lasting for a short period of time and being covered effectively by simple rules. Accordingly, the classifier models, once fitted, become less and less effective over time, while updating and retraining them requires time and data science resources.
However, manual creation and modification of the rules that contain numerous scorings from the model-based engine makes it difficult to manage the system, since it is getting more and more complex and interlinked. Much effort of fraud analysts is focused on reducing FBP, while the business units of the bank require cybersecurity to ensure a positive customer experience, characterized by the fraud monitoring not blocking legitimate transactions.
In this paper, we are concerned with the problem of the model-based engine being biased towards a high recall of fraud detection. We propose to manage this issue by searching for rules that reduce false positives. The rules are formed by fitting decision trees and extracting the most effective features and conditions out of those created by analysts. The rules that are formed this way are highly interpretable and easily understood.
3.3. Data description
To conduct this research, we used cross-channel data on transactional fraud from one of the largest banks in Eastern Europe. The cross-channel nature of the anti-fraud system implies that various bank products, though differing in architecture, are connected to it. This approach allows accumulating data on events from different banking service channels (e.g., a mobile application, ATM, cards, SMS-banking, acquiring, bonus programs) into a unified analytical environment.
Based on the taxonomy of fraud (Onwubiko et al., 2020), we are concentrated on financial fraud committed through the online channels (Web, Mobile, Telephone) related to bank payments.
To produce rules, fraud analysts are provided with numerous attributes of operations (Table 1).
The response that the anti-fraud system returns as a result of a transaction evaluation includes:
■ Predicted labels in the form of resolutions: 1 for the fraudulent and 0 for the legitimate transaction;
■ Rules that are triggered on a transaction, and conditions and expressions that form the rules (Fig. 6).
4. Data processing and modeling
Our ML pipeline accounts for the specifics of fraud detection:
■ A strong imbalance of classes and unequal costs of type I and type II errors;
■ A need to quickly decide whether to block or allow a transaction;
■ The lag between the time when a transaction is conducted and the time when the final resolution on the transaction is obtained.
Table 1
Categories of data on card transactions.
Data: Examples of attributes
Client profile: Expenses; Income; Geolocation; Consumer preferences
Merchant profile: Merchant category code; Merchant turnover; Payment methods
Model-based estimates of risk score: Merchant reliability; Client's propensity to carry out transactions in the particular merchant categories and banking service channels; Graph analytics of clients
Cross-product attributes: Client profile in a bank mobile app; Credit scorings; Blacklists of merchants
We are concentrated on interpretable models, although they can be less precise than the black-box ones. We have chosen the tree-based classifiers: Decision Tree, Random Forest and Gradient Boosting. The branches of a tree constitute rules.
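To make the statement that tree branches constitute rules concrete, the sketch below walks a toy binary tree and turns every root-to-leaf path into a conjunction of conditions. The nested-dictionary tree format and the feature names are assumptions made for illustration; the paper itself fits the trees with pyspark.ml, whose model objects are structured differently.

```python
def extract_rules(node, path=()):
    """Yield (conditions, predicted_class) for every leaf of a binary
    tree whose internal nodes test boolean features."""
    if "label" in node:                    # leaf node
        yield list(path), node["label"]
        return
    feat = node["feature"]
    yield from extract_rules(node["false"], path + (f"not {feat}",))
    yield from extract_rules(node["true"], path + (feat,))


# toy tree over two hypothetical boolean conditions
tree = {
    "feature": "cond_new_device",
    "false": {"label": 0},
    "true": {
        "feature": "cond_known_merchant",
        "false": {"label": 1},
        "true": {"label": 0},
    },
}

for conditions, label in extract_rules(tree):
    print(" AND ".join(conditions), "->", label)
```

Each yielded pair reads as an if-then rule: the conditions along the path are joined by "AND", and the leaf gives the predicted class.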
For quick decision-making, we create a model that is trained offline, whereas the resulting rules are used online in a fraud-detection system. The model is re-trained on new samples and the rules are corrected on a regular basis. We do not have the possibility to organize the process fully online with batch training due to the constraints dictated by our fraud-monitoring system (e.g., scenarios in which the monitoring logic could be organized).
The sample size bound us to utilize the Big Data technology stack and distributed ML models.
The pipeline of our approach consists of three main stages:
■ Data preparation and preprocessing;
■ Modeling;
■ Rules extraction and evaluation.
4.1. Data preparation and preprocessing
The first stage consists of the following steps (Fig. 3):
■ Loading historical data from the anti-fraud system;
■ Anti-fraud system emulation;
■ Dataset preprocessing: feature selection, filtering out noisy data and extra feature engineering.
The sample includes transactions over February 2021. Class 1 includes fraud missed and fraud stopped by the antifraud system. Class 0 includes false positives, as well as the sampled legitimate transactions. We sample the first minute of every hour, hence using systematic sampling with a random starting point and a fixed periodic interval. This sampling method ensured data continuity and representativeness: based on our experience, the patterns of legitimate behavior change little over such a period of time as two weeks, while legitimate operations of all possible types and patterns occur during the day due to a large flow of operations. For us, random sampling performs worse in terms of the equal distribution of sample and population, as indicated by Chi-Square and Kolmogorov-Smirnov tests.
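The sampling of one minute per hour can be sketched as follows. The function name `systematic_sample`, the `ts` field and the synthetic stream are hypothetical; the sketch only shows the selection step, not the distribution tests mentioned above.

```python
from datetime import datetime, timedelta


def systematic_sample(transactions, minute=0):
    """Keep the transactions whose timestamp falls into the given
    minute of any hour (minute=0 is the first minute)."""
    return [t for t in transactions if t["ts"].minute == minute]


start = datetime(2021, 2, 1)
# one synthetic legitimate transaction every 20 seconds during one day
stream = [{"id": i, "ts": start + timedelta(seconds=20 * i)}
          for i in range(4320)]
sampled = systematic_sample(stream)
print(len(sampled), "of", len(stream))
```

The sample keeps roughly 1/60 of the legitimate flow while still covering every hour of the day.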
At the emulation stage, we reproduce how the system would have worked on the historical data. Every transaction is checked for the correspondence of its parameters to the expressions and conditions used in expert rules. As a result, we obtain a set of boolean columns that correspond to expressions and conditions, as well as the original features themselves that form the expert rules.
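A minimal sketch of such an emulation is given below. The expressions, thresholds and transaction fields are invented for illustration and are not the bank's expert rules; the only structure taken from the text is that a condition is a group of expressions and that each expression and condition becomes one boolean column.

```python
# hypothetical expressions over raw transaction attributes
EXPRESSIONS = {
    "expr_1": lambda t: t["amount"] > 500,
    "expr_2": lambda t: t["country"] != t["home_country"],
    "expr_3": lambda t: t["device_age_days"] < 1,
}
# a condition groups expressions joined by OR
CONDITIONS = {
    "cond_1": ["expr_1", "expr_2"],
    "cond_2": ["expr_3"],
}


def emulate(transaction):
    """Return one boolean column per expression and per condition."""
    row = {name: bool(fn(transaction)) for name, fn in EXPRESSIONS.items()}
    for cond, exprs in CONDITIONS.items():
        row[cond] = any(row[e] for e in exprs)
    return row


features = emulate({"amount": 120, "country": "DE",
                    "home_country": "RU", "device_age_days": 30})
print(features)
```

The resulting boolean columns, together with the raw attributes, form the feature space on which the tree-based models are trained.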
We preprocess the dataset based on the experts' knowledge of patterns in data, especially those that add noise and result in low-quality ML models, such as noisy labels, mistakes, repeated transactions of the same user relating to the same case, etc.
4.2. Modeling
During the second stage (Fig. 4), a prepared dataset is passed to the Decision Tree, Random Forest and Gradient Boosting classification models from the pyspark.ml library (footnote 7).
Tree-based models are linear classifiers, but using conditions as features, the expressions of which are joined by "OR", allows adding non-linearity to the models.
We tuned the models' hyperparameters based on k-fold cross-validation and a customized confusion matrix as stated in (Hoppner et al., 2020). It turned out that, having so much data, it is hard to overfit the model. Also, the specific usage of model results through extracting rules rather than predicting the class label does not allow judging the classification quality of the rules based on the metrics corresponding to the decision tree which the rules came from. Thus, we believe that the computationally expensive and time-consuming grid search cross-validation step can be skipped. It is better to overfit and prune the rules in a cost-sensitive way.
Also, given such a use case, the ensemble models do not guarantee better performance, as we do not use their predictions directly. Making ensembles out of rules should be based on sets of rules rather than on ensembles. In this work, we did not account for such a possibility.
7 https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/classification.html
I. Vorobyev and A. Krivitskaya
Computers & Security 120 (2022) 102786
[Figure 5 panels: (3.1) leaves for FP reduction are selected from the built DT model; (3.2) each best leaf is a rule containing conditions joined by "AND"; (3.3).]
Fig. 5. Extracting rules from the decision tree and choosing the best of them based on metrics.
Rules:
if
    cond_1: expr_1(features) or expr_2(features)
    and cond_2: expr_3(features)
    and cond_3: expr_4(features)
then
    deny/allow
Fig. 6. The rule's structure.
After selecting the rules, we implemented a pruning technique. We tested every condition on how ML and business metrics change if it is removed from the rule. We chose to iteratively remove conditions one by one, as doing it through subsets is an NP-complete problem.
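The one-by-one removal can be sketched as a greedy backward elimination. This is an illustration under stated assumptions, not the authors' implementation: the score used here is plain precision on a toy labelled sample, whereas the text refers to a combination of ML and business metrics; the names `prune_rule` and `rule_precision` are hypothetical.

```python
def rule_precision(conditions, rows):
    """Precision of a rule whose conditions are joined by AND (toy score)."""
    fired = [r for r in rows if all(r[c] for c in conditions)]
    if not fired:
        return 0.0
    return sum(r["fraud"] for r in fired) / len(fired)


def prune_rule(conditions, score):
    """Drop conditions one at a time while the score does not get worse."""
    current = list(conditions)
    best = score(current)
    changed = True
    while changed and len(current) > 1:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            candidate_score = score(candidate)
            if candidate_score >= best:
                current, best, changed = candidate, candidate_score, True
                break  # restart the scan on the shorter rule
    return current, best


# Toy labelled sample (hypothetical): boolean condition columns a, b, c.
rows = [
    {"a": 1, "b": 1, "c": 1, "fraud": 1},
    {"a": 1, "b": 1, "c": 0, "fraud": 1},
    {"a": 1, "b": 0, "c": 1, "fraud": 0},
    {"a": 0, "b": 1, "c": 1, "fraud": 0},
    {"a": 1, "b": 1, "c": 1, "fraud": 1},
]
pruned, pruned_score = prune_rule(["a", "b", "c"], lambda conds: rule_precision(conds, rows))
```

In this toy sample condition `c` is redundant: removing it keeps precision at 1.0 while widening coverage, so the pruned rule is `a AND b`. The greedy scan evaluates a number of candidates that grows polynomially with the rule length, instead of enumerating all subsets.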
Every rule is further converted to the CNF. Thus, we obtain the rules where conditions are joined on the "AND" operator, whereas expressions within a condition are joined on the "OR" operator (Fig. 6).
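A rule in this form can be represented as a list of conditions, each being a list of expression names. The sketch below evaluates such a rule on a row of boolean columns; the representation and the name `rule_fires` are assumptions made for illustration.

```python
def rule_fires(rule, row):
    """Evaluate a CNF rule: conditions joined by AND, expressions inside by OR."""
    return all(any(row[expr] for expr in condition) for condition in rule)


# Structure mirroring Fig. 6: (expr_1 OR expr_2) AND expr_3 AND expr_4.
rule = [["expr_1", "expr_2"], ["expr_3"], ["expr_4"]]

row_hit = {"expr_1": False, "expr_2": True, "expr_3": True, "expr_4": True}
row_miss = {"expr_1": True, "expr_2": True, "expr_3": False, "expr_4": True}

print(rule_fires(rule, row_hit), rule_fires(rule, row_miss))
```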
Both the rules predicting class 0 and rules predicting class 1 can be used to reduce false positives, though in different ways (Fig. 7). Every rule has an impact on transactions that are suspected by the system ("genuine" and "fraud, identified" areas in Fig. 7), as well as on transactions not suspected by the system ("fraud, missed" and "turnover" areas in Fig. 7). The main task is:
■ For rules predicting class 0 - maximum coverage of false positives area (B, or "genuine") given the minimum coverage of fraud identified area (D);
■ For rules predicting class 1 - maximum coverage of fraud missed area (C) given the minimum coverage of turnover area (A).
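The coverage of each area by a candidate rule can be computed as in the following sketch. The mapping of areas A-D follows the description above; the record fields `fraud` and `suspected` and the function names are hypothetical.

```python
from collections import Counter


def area(is_fraud, suspected):
    """Map a transaction to the areas named in the text (A, B, C, D)."""
    if suspected:
        return "D" if is_fraud else "B"  # fraud identified / genuine (false positive)
    return "C" if is_fraud else "A"      # fraud missed / turnover


def coverage_by_area(records, fired):
    """Share of each area's transactions on which the rule fires."""
    total = Counter(area(r["fraud"], r["suspected"]) for r in records)
    hit = Counter(
        area(r["fraud"], r["suspected"]) for r, f in zip(records, fired) if f
    )
    return {k: hit[k] / total[k] for k in total}


# Toy data (hypothetical): 3 x A, 2 x B, 2 x C, 1 x D.
records = (
    [{"fraud": False, "suspected": False}] * 3
    + [{"fraud": False, "suspected": True}] * 2
    + [{"fraud": True, "suspected": False}] * 2
    + [{"fraud": True, "suspected": True}]
)
# A class-0 rule that fires on one turnover transaction and on both false positives.
fired = [True, False, False, True, True, False, False, False]
coverage = coverage_by_area(records, fired)
```

For a rule predicting class 0 the desirable pattern is high coverage of B with near-zero coverage of D, which is what the toy rule shows.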
In accordance with the different ways in which the two types of rules influence key business metrics, they are evaluated on different sets of metrics. A rule for fraud detection reduces false positives if it replaces the less efficient existing rules with no increase in FP (Table 2). A rule for FP detection reduces false positives if it works over the existing rules, telling the system to allow a transaction if the transaction seems to be an FP generated by the existing rules, with no increase in fraud missed (Table 3).
Due to the specificities in fraud detection that were discussed in Section 4.1, especially a time lag of fraud resolutions appearance, this type of metrics is suitable to evaluate rules on the held-out historical data. But if one needs to assess the rules' performance online, she is doomed to wait for two weeks till the major part of resolutions is here in order to calculate these metrics. We suggest a quick way to verify that the rule behaves like it is anticipated when training the model: we extrapolate the results obtained on the sample using the special formulas that one should construct based on the sampling strategy chosen. We rely on the anticipated number of rule triggers per minute. In our case, the formula to
Table 4
Models metrics.
Model              Dataset  Precision  Recall   FPR     Accuracy  F1 measure  PR AUC  ROC AUC
Decision Tree      train    DJ4G3      OS3S7    D.0091  GHt       0J393       0.7032  0.K74
Decision Tree      test     DJiSI      0J73J    D.SIM   tuns      DHC 15      0.20S5  0.4753
Random Forest      train    0J71I      DUG 15«  0.0057  mis       oucsoa      0.7119  D9G!
Random Forest      test     DLIJI2     05191    00205   D.402G    Dli2i2      0.1GD7  Q.U24
Gradient Boosting  train    07732      &3GB5    D.00B2  nan       U1B]        0.H357  0.5919'
Gradient Boosting  test     □.1615     OS DM    D.E252  &1759     0369]       0.2421  0.547D
Note to the table: all the models were taken with max depth 10. Though the situation in terms of difference in metrics on train and test seems to be overfitting, it is still the best configuration of parameters.
Table 5
Bootstrap confidence intervals for rules.
Metric      Mean     Confidence interval
Precision   49.159.  (4SJ0&, B0221]
Recall      DJOOS    (OiSS. 0.G7J;
F1 measure  QAX      toss*, a.ssa.;
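The text does not spell out how the intervals in Table 5 were obtained. One common option, assumed here purely for illustration, is a percentile bootstrap over the evaluated transactions; the function names and the toy data are hypothetical.

```python
import random


def precision(pairs):
    """pairs: (is_fraud, rule_fired) tuples; precision over fired transactions."""
    fired = [y for y, f in pairs if f]
    return sum(fired) / len(fired) if fired else 0.0


def bootstrap_ci(values, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy evaluation set (hypothetical): the rule fires on 40 transactions, 20 of them fraud.
pairs = [(1, True)] * 20 + [(0, True)] * 20 + [(0, False)] * 60
ci_low, ci_high = bootstrap_ci(pairs, precision)
print(round(precision(pairs), 2), (round(ci_low, 2), round(ci_high, 2)))
```

The same routine can be reused for recall or F1 by passing a different statistic function.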
We experimented with the framework for half a year. Every two weeks, we re-fitted the model using the latest data, generated new rules and put a few of the best into the anti-fraud system. The average precision of such a rule in online was approximately 10%, while keeping recall at the minimum level that provides a tangible business effect. The metrics depreciation in online mode is caused by inevitable differences between the online and offline modes of the anti-fraud system (e.g., the hierarchy of rules and other interdependences between them in online).
It is worth mentioning that the approach requires much customization based on the application area, business goals and data specificities, and regular updating, since fraud patterns change dynamically. Hence, we continue our experiments, inducing rules on a regular basis and trying to improve on metrics and model robustness. Also, we are still working on reducing the differences between train, test and online metrics.
6. Conclusions and future research
This study confirmed the possibility and efficiency of utilizing AI methods to recombine the conditions which form the expert rules used in fraud monitoring systems. We proposed a Decision Tree approach and tested it on payments data derived from a large bank.
Our main goal was to reduce the number of false positives, keeping the amount of missed fraud fixed at the current level. For this purpose, we suggested generating rules automatically, whether 1) those aimed at class 0, correcting the verdict of the fraud-detection system, and hence working over the current rules, or 2) those aimed at catching fraud with efficiency higher than the current rules' efficiency and thus correcting (replacing) the original expert rules. Currently we get rules with an average precision of 10% in online.
Our approach requires further improvement through customization and a sample representativity check. However, as follows from the ongoing results, the approach has the potential to improve the customer journey in banking and e-commerce. Furthermore, the effect can be achieved without adding new features or scores obtained by black-box classifiers to the anti-fraud system. Therefore, it guarantees no loss of interpretability of the verdict on the transaction. All in all, we consider it to be the anti-fraud analyst's support tool to induce new rules and reveal the most effective features and conditions for separating fraud.
We intend to develop our approach in several ways:
- Instance engineering: a more detailed study on sample representativeness and metrics extrapolation
- Cost-sensitivity: addressing the problem of unequal costs of classification mistakes directly through algorithm modification (e.g., adjusting the impurity formula or using sample weights based on the sum of the transaction in models)
- Various ML methods for rule induction: fuzzy logic, non-linear non-tree-based algorithms
- Feature engineering: generating new features, including complex conditions
- Unsupervised learning: clustering clients before classifying their transactions
- Further automation of the rules induction process from the generation till the performance evaluation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Ivan Vorobyev: Conceptualization, Methodology, Writing - review & editing, Visualization, Validation, Supervision. Anna Krivitskaya: Data curation, Writing - original draft, Investigation, Software.
References
Baesens, B., Höppner, S., Ortner, I., Verdonck, T., 2021a. robROSE: a robust approach for dealing with imbalanced data in fraud detection. Stat. Methods Appl. doi:10.1007/s10260-021-00573-7.
Baesens, B., Höppner, S., Verdonck, T., 2021b. Data engineering for fraud detection. Decis. Support Syst. doi:10.1016/j.dss.2021.113492, 113492.
Baesens, B., Vlasselaer, V. Van, Verbeke, W., 2015. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. John Wiley & Sons Inc., Hoboken, NJ, USA doi:10.1002/9781119146841.
Carrasco, R.S.M., Sicilia-Urbán, M.Á., 2020. Evaluation of deep neural networks for reduction of credit card fraud alerts. IEEE Access 8, 186421-186432. doi:10.1109/ACCESS.2020.3026222.
Cirqueira, D., Helfert, M., Bezbradica, M., 2021. Towards design principles for user-centric explainable AI in fraud detection. doi:10.1007/978-3-030-77772-2_2.
Hasanpour, Hesam, Ghavamizadeh, Ramak, Navi, Keivan, 2019. Improving rule-based classification using harmony search. doi:10.7287/peerj.preprints.27614v1.
Höppner, S., Baesens, B., Verbeke, W., Verdonck, T., 2021. Instance-dependent cost-sensitive learning for detecting transfer fraud. Eur. J. Oper. Res. doi:10.1016/j.ejor.2021.05.028, S0377221721004562.
Hsin, Y., Dai, T., Ti, Y., Huang, M., 2021. Interpretable electronic transfer fraud detection with expert feature constructions. In: Paper presented at the CEUR Workshop Proceedings, p. 3K2.
Kamiński, B., Jakubczyk, M., Szufel, P., 2018. A framework for sensitivity analysis of decision trees. Cent. Eur. J. Oper. Res. 26, 135-159. doi:10.1007/s10100-017-0479-6.
Leite, R.A., Gschwandtner, T., Miksch, S., Gstrein, E., Kuntner, J., 2020. NEVA: visual analytics to identify fraudulent networks. Comput. Graphics Forum 39 (6), 344-359. doi:10.1111/cgf.1+3il.
Onwubiko, C., 2020. Fraud matrix: a morphological and analysis-based classification and taxonomy of fraud. Comput. Secur. 96, 101900. doi:10.1016/j.cose.2020.101900.
Pant, P., Srivastava, P., 2021. Cost-sensitive model evaluation approach for financial fraud detection system. In: Paper presented at the Proceedings of the 2nd International Conference on Electronics and Sustainable Communication Systems, ICESC 2021, pp. 1606-1611. doi:10.1109/ICESC51422.2021.9532741.
Rao, Y., Ren, X., Duan, C., Mi, X., Cheng, J., Chen, Y., Wei, X., 2021. Knowledge-guided fraud detection using semi-supervised graph neural network. doi:10.1007/978-3-030-90888-1_29.
Sadgali, Imane, Sael, Nawal, Benabbou, Faouzia, 2021. Human behavior scoring in credit card fraud detection. IAES Int. J. Artif. Intell. 10, 698. doi:10.11591/ijai.v10.i3.pp698-706, IJ-AI.
Schneider, P., Xhafa, F., 2022. Chapter 5 - Rule-based decision support systems for eHealth: supporting actors and stakeholders of health systems. Anomaly detection and complex event processing over IoT data streams, 87-99. doi:10.1016/B978-0-12-823818-9.00015-8.
Severino, Matheus, Peng, Yaohao, 2021. Machine learning algorithms for fraud prediction in property insurance: empirical evidence using real-world microdata. Mach. Learn. Appl. 5, 100074. doi:10.1016/j.mlwa.2021.100074.
Sun, J., Li, Y., Chen, L., Lee, J., Liu, X., Zhang, Z., Xu, W., 2020. FDHelper: assist unsupervised fraud detection experts with interactive feature selection and evaluation. In: Paper presented at the Conference on Human Factors in Computing Systems - Proceedings. doi:10.1145/3313831.3376140.
Trisanto, D., Rismawati, N., Mulya, M., Kurniadi, F., 2020. Effectiveness undersampling method and feature reduction in credit card fraud detection. Int. J. Intell. Eng. Syst. 13, 173-181. doi:10.22266/ijies2020.0430.17.
Sahin, Y., Bulkan, S., Duman, E., 2013. A cost-sensitive decision tree approach for fraud detection. Expert Syst. Appl. 40, 5916-5923. doi:10.1016/j.eswa.2013.05.021.
Sheng, V., Ling, C., 2006. Thresholding for making classifiers cost sensitive. In: Proceedings of the National Conference on Artificial Intelligence, 1.
Wedge R, Kanter JM, Veeramachaneni K, Rubio SM, Perez SI. Solving the false positives problem in fraud prediction using automated feature engineering. In: Brefeld U, Curry E, Daly E, MacNamee B, Marascu A, Pinelli F, et al., editors. Mach. Learning Knowledge Discovery Databases, vol. 11053. Cham: Springer International Publishing; 2019. p. 372-88. https://doi.org/10.1007/978-3-030-10997-4_23.
Xie, Y., Liu, C., Cao, L., Li, Yan, C., Jiang, L., 2019. A feature extraction method for credit card fraud detection. In: Paper presented at the Proceedings - 2019 2nd International Conference on Intelligent Autonomous Systems, ICoIAS 2019, pp. 70-75. doi:10.1109/ICoIAS.2019.00019.
Zeng, M., 2018. Smart Business: What Alibaba's Success Reveals About the Future of Strategy. Harvard Business Review Press, Boston, Massachusetts.
Zoldi, S., 2015. Using anti-fraud technology to improve the customer experience. Comput. Fraud Secur. 2015, 18-20. doi:10.1016/S1361-3723(15)30067-1.
Lungfiejun, Elu llu, Taihui, Ц- Naniun, X&e, 2015. False positive elimination in intrusion detection based on clustering. In: 2015 12th International Conference Fuzzy Systems Knowledge Discovery. IEEE, Zhangjiajie, China, pp. 519-523. doi:10.1109/FSKD.2015.7381996.
Youssef, B., Bouchra, F., Brahim, O., 2020. Rules extraction and deep learning for e-commerce fraud detection. In: Paper presented at the Colloquium in Information Science and Technology, CIST, 2020-June, pp. 145-150. doi:10.1109/CiSt49399.2021.9357066.
Zhou, H., Sun, G., Fu, S., Wang, L., Hu, J., Gao, Y., 2021. Internet financial fraud detection based on a distributed big data approach with Node2vec. IEEE Access 9, 43378-43386. doi:10.1109/ACCESS.2021.3062467.
Ivan A. Vorobyev. Graduated from Moscow State University of Instrument Engineering and Computer Science with a degree in Applied Mathematics (2006). In 2019, got an MBA degree at Sberbank Corporate University. In 2020, entered the postgraduate course on Information Security at the Higher School of Economics. The topic of primary research: "Machine learning and artificial intelligence methods for combating fraud in credit, finance and banking". Lecturer on anti-fraud technologies and artificial intelligence at Moscow State University, Higher School of Economics, Central Bank of the Russian Federation. Tutor at Higher School of Economics, with own course on tools for countering cyber fraud for students of the Computer Security Department. Team Leader at Antifraud Acquiring in Sberbank of Russia. Twice (2020, 2021) received the prestigious VISA awards for "Lowest Gross Fraud - Acquirer" for the best indicators of anti-fraud in acquiring.
Anna D. Krivitskaya. In 2021 graduated from Lomonosov Moscow State University, BA in economics. Currently doing a master's degree in data analysis at Lomonosov Moscow State University. Working as a data scientist in the cybersecurity department of Sberbank of Russia. Published a scientific article on welfare economics in the Journal of Economic Theory (Russia).
Further reading
Carminati, M., Caron, R., Maggi, F., Epifani, I., Zanero, S., 2015. BankSealer: a decision support system for online banking fraud analysis and investigation. Comput. Secur. 53, 175-186. doi:10.1016/j.cose.2015.04.002.
Kumar P, Mishra SP. Analysis of credit card fraud detection using fusion classifiers. In: Behera HS, Nayak J, Naik B, Abraham A, editors. Comput. Intelligence Data Mining, vol. 711. Singapore: Springer Singapore; 2019. p. 111-22. https://doi.org/10.1007/978-981-10-8055-5_11.
Appendix B. Article "Fraud risk assessment in car insurance using claims graph features in machine learning"
Vorobyev I. A. Fraud risk assessment in car insurance using claims graph features in machine learning // Expert Systems with Applications. 2024. Vol. 251. doi: http://doi.org/10.1016/j.eswa.2024.124109
Expert Systems With Applications 251 (2024) 124109
Fraud risk assessment in car insurance using claims graph features in machine learning
Ivan Vorobyev
HSE University, Russian Federation
ARTICLE INFO
Keywords: Fraud detection, Insurance claims, Machine learning, Graph features, Rules
ABSTRACT
The article proposes a process for claims assessment in car insurance, which makes it possible to calculate the fraud rate on the annual set of claims using a reduced set of attributes and graph vertex properties. This approach improves the protection of insurance companies against fraudulent attacks. A method for constructing a claims graph and extracting additional features from it for evaluation is described. It is shown that in order to build a graph, it is not necessary to have data on the connection of the claim participants. Two tests were carried out on real open-source datasets with labelling of fraudulent cases. The results of the first one show the increase in classification metrics when using attributes obtained from the graph. The application of the proposed approach resulted in doubling the area under the Precision-Recall curve. The experimental results demonstrated high quality metrics for fraud detection, with a Recall rate of 80.33% and a Specificity rate of 91.05%. The second test confirmed the possibility of determining the insurance fraud level based on a decision rule, which includes the condition of claims being connected to each other. The rule is able to detect claim groups with a high concentration of fraud, in which every second participant is a fraudster.
1. Introduction
1.1. Background
Fraud in the car insurance industry is an urgent problem. Insurance companies (insurers) offer customers (insurants) a wide range of car insurance products: property damage, theft, Motor Third Party Liability (MTPL), etc. Such sets of risks are often sold in insurance offers when buying a new car, when renewing insurance contracts, or when insuring new customers with a second-hand car. The last category of clients is the riskiest in terms of insurance fraud, as insurers have little historical information about the client, and second-hand cars, having a lower cost, become an accessible tool in the hands of a fraudster.
Both clients who have a one-time opportunity to illegally enrich themselves, and groups of professional insurance fraudsters can deceive the insurance company. In the first case, insurance companies, as a rule, do not refuse to compensate for damages, since the identification and investigation of such claims may be more costly than the resulting benefit.
Professional insurance fraudsters operating in a group can have a significant impact on the loss ratio of a car insurance portfolio (the ratio of losses incurred and settlement costs to the amount of premium earned). This is primarily due to the fact that a group of intruders acts over a long time period and maximizes the insurance benefit in their favour in various ways. Such a group may involve various participants: clients, insurance agents, path master, car services, car lawyers, etc. They understand the processes of insurance and claims handling and therefore are able to stay out of sight of the insurer's security team for a long time.
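The loss ratio defined above can be written as a one-line calculation. The figures in the example are hypothetical and only illustrate how removing a fraudulent component changes the ratio.

```python
def loss_ratio(losses_incurred, settlement_costs, premium_earned):
    """(losses incurred + settlement costs) / premium earned."""
    if premium_earned <= 0:
        raise ValueError("premium_earned must be positive")
    return (losses_incurred + settlement_costs) / premium_earned


# Hypothetical portfolio, in arbitrary monetary units.
with_fraud = loss_ratio(losses_incurred=60, settlement_costs=10, premium_earned=100)
# The same portfolio if 14 units of fraudulent losses had been identified and not paid.
without_fraud = loss_ratio(losses_incurred=60 - 14, settlement_costs=10, premium_earned=100)
print(with_fraud, without_fraud)
```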
Identification of the fraudulent group and related claims will reduce the fraudulent component in the loss ratio of the car insurance portfolio. Traditionally, the insurer has two ways to deal with fraudsters: not to conclude an agreement with them or to refuse to pay the claim. To assess the success of the fight, insurers can use not only the level of direct refusals of the insurer to conclude a contract or compensate for an insured event, but also introduce an additional indicator that will show the purity of the losses structure in the insurer's portfolio for a certain period.
The development of data storage and processing technologies has allowed insurance companies to keep records of claims, policyholder data and other operational information in internal databases. It became possible not only to accumulate this information, but also to use big data technology and artificial intelligence (AI) to make decisions automatically in various processes, including fraud detection (Bao et al., 2022).
E-mail address: vocabvev ivang^yandnc.ru.
https://doi.org/10.1016/j.eswa.2024.124109
Received 29 December 2022; Received in revised form 1 February 2024; Accepted 22 April 2024
Available online 26 April 2024
0957-4174/© 2024 Elsevier Ltd. All rights reserved.
Traditionally, the process of identifying fraudulent claims consists of analyzing the history of participants and comparing the findings with the insured event under consideration. As a result, the expert makes a decision on the claim about the presence of signs of insurance fraud. The accumulation of such verdicts is also valuable information and in the future can be considered as markup for applying supervised machine learning methods.
To help experts, insurance companies are developing interactive investigative tools. In this process, tools such as social network analysis, assessment of claim participants obtained using the machine learning method, etc. are used as "hints". However, most often the final decision is made by an expert who may have his own biases regarding the effectiveness of artificial intelligence methods (P. S., 2023). Additionally, expert distrust can be caused by the lack of transparency of machine learning methods. The low interpretability of the result of the model does not always ensure the availability of full-fledged evidentiary material on the signs of fraud in the claim.
According to an industry report,1 the majority of insurance industry specialists, in spite of the development and availability of ML methods, still rely on the experience of personnel when detecting fraud. And the least popular tools include social network analysis, anomaly detection and other digital approaches. In the same report, this paradoxical situation is explained by the presence of problems in such tools: too many false positives, poor quality of internal data, limited IT resources.
Therefore, the developed approaches based on artificial intelligence methods, for their adoption in the fight against insurance fraud, should have the following features. First, the results of the application of ML methods must be interpretable and explain the decision made on the claim. Secondly, the created tools should be highly effective and give as few false positives as possible. Thirdly, it is necessary to carry out modeling on low-dimensional data in order to reduce the cost of the company's IT resources and minimise the use of sources that are unstable in terms of quality. The article is devoted to the creation of an AI-based tool for assessing the risk of fraud in auto insurance, taking into account the above limitations. As stated in (Hilal et al., 2022), the existing literature related to insurance fraud detection using ML is not extensive compared to other areas. In addition, the existing studies are mainly focused on the problem of improving the quality of classification in the context of class imbalance.
Finally, as highlighted in (Kumaraswamy et al., 2022), the creation of an effective and sought-after insurance fraud prevention tool will reduce the transfer of the cost of crime in the form of increased premiums to law-abiding customers.
Given the aforementioned business context and problems associated with the application of machine learning methods in the task of fraud detection in auto insurance, the following main research objective has been formulated in the following subsection.
1.2. The purpose of the paper
The purpose of this study is to find effective controls for management measures to mitigate the risk of auto insurance fraud. The suggested method focuses on inducing a business rule to identify groups of claims with a high concentration of fraud.
1.3. The novelty and main contributions of the proposed approach
This study focuses on leveraging machine learning tools and graph theory to assess the car insurance portfolio and specifically identify the presence of professional insurance fraud. Despite the significant efforts of many researchers, the problem of detecting fraud in auto insurance remains relevant. The majority of research is focused on the application of machine learning methods, with classification models becoming the primary tool in such tasks, extracting patterns from data accumulated by insurance companies. The novelty of this research is as follows:
1 https://www.friss.com/insight/insurance-fraud-report-2022/.
* Improving classification is achieved not by applying more complex ML methods, but by constructing new interpretable features from the already accumulated data.
* The proposed approach for constructing features extracted from a graph is applied to datasets where graph analytics were not previously used due to the absence of direct connections between events.
* Experimental results demonstrate that the use of constructed graph features allows for reducing the feature space when employing ML methods without significant loss in classification quality.
From the perspective of the contribution of this research to the literature on the detection of fraud in auto insurance, the following can be highlighted:
* A process is proposed for evaluating the loss portfolio of an insurance company, which allows monitoring the fraudulent component using a minimal set of claim attributes and participants of the insurance event.
* The stability of using graph features in the tasks of fraud detection in auto insurance has been demonstrated on datasets obtained under different conditions but with a similar set of attributes.
* The proposed method allows preserving the interpretability of classification results, which is necessary in the application field (Pant & Srivastava, 2021) for further work with claims and gathering evidence for their rejection.
* The design of two independent experiments was proposed, and their validation was conducted using real datasets of insurance claims labeled for fraud. These experiments demonstrated an improvement in the quality of fraud classification and the ability to create business rules for fraud risk control.
1.4. Details
Despite the potential for improvement in the proposed approach (such as creating new graph features, addressing class unbalance, utilizing more complex graph methods, etc.), the experimental part of this study already demonstrates the applicability of the method on independent open datasets of fraud in car insurance. As a result, high values of recall and precision in classification have been achieved, and the ability to obtain an assessment of the fraudulent component in the claims portfolio over multiple time periods has been demonstrated.
This article is further organized as follows: Section 2 is an overview of related studies, Section 3 contains information about the scope of the study and the methods used, Section 4 describes the approach implemented, Section 5 presents the simulation results, Section 6 is a conclusion.
2. Related work
Currently, fraud detection is an urgent problem due to the rapid growth of the digital sphere. E-commerce, bank payments, reviews on e-commerce, loans are all subject to attack (Chen et al., 2022; Rodrigues et al., 2022; Zhang et al., 2024). This work is related to the study of the possibility of evaluating a claim in car insurance for fraud. The review (Al-Hashedi & Magalingam, 2021) singles out car insurance fraud as a group of financial frauds that bring significant losses to the insurance industry and society as a whole. The difficulty of detecting fraud in this group, both by expert approaches and by using statistical methods, including machine learning, is emphasized in many works, for example, in (Nian et al., 2016). The reasons are the lack of labelled data due to the low proportion of fraud relative to the legitimate class, noted in general for fraud detection tasks in (Baesens et al., 2021), as well as the complexity of the process of investigating and confirming signs that indicate fraud. Therefore, some researchers focus on interpretable methods when solving problems of detecting fraud in insurance, for example, in (Sen.pal L <t Gartgadiaran, iui'd). The authors also point out the problem of labeling fraud data, in which the decision is often subjective.
An additional problem is the confidentiality of fraud data. As pointed out in (Subudhi & Panigrahi, 2020), there is probably only one data set for auto insurance claims studies at the moment. This situation leads to a rather long-term process of achieving high classification scores in the fraud detection field. A comparative metrics table obtained over different years is presented in (Soufiane et al., 2022). The paper shows how, over the course of 10 years, various researchers have improved the Specificity metric with a high Recall value (the definition of metrics is given in Section 4), seeking to affect legitimate claims less while maximizing fraud detection.
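For reference, the two metrics can be computed from confusion-matrix counts as sketched below. The formulas are the standard definitions and are assumed to match those given in Section 4; the counts in the example are hypothetical.

```python
def recall(tp, fn):
    """Share of fraudulent claims that are detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of legitimate claims that are left unaffected."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts for a claims classifier.
tp, fn, tn, fp = 8, 2, 90, 10
print(recall(tp, fn), specificity(tn, fp))
```

A high Specificity at a fixed high Recall means that few legitimate claims are flagged while most fraud is still caught.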
Various approaches are used to Improve [he metrics. For example. (Yaiikol-Schaick, 3022) lis» a type of scoring that ran vaty depending on [he validity of [he claim and uses natural language processing. In this approach, the datm Ls i otisidered for a certain [bote period, aiid Its assessment for fraud Is carried out ar [he key stages of claim settlement, which makes It possible to budd an evolving scone. Along with this, rhe authors use the textual information accumulated during the settlement of the claim, transforming It Into features for training [he model. Eventually, this combination leads iu an increase In [he effectiveness of fraud detection.
In (Vandcrvont et al.. 2022), [he authors rook advantage of [he facr That fraudsters can distort peruana] data and, If such an anomaly ks detected, the Insurance company has a chance to achieve an additional effect In reducing the fraud leueL »¿searchers also seel m reduce rhe feature space dimension and Increase [he results interpretabl Iily. for example, f Aslam e[ a]., 2;J22) Showed that "fa utt~'r "base policy", and -age of the policyholder" are the most significant features In Idend lying fraudulent claims when applying machine learning methods. In (if an ei al.. 2U2U),the authors Improve car Insurance fraud detection rates by applying generic algorithms. Using a neural network as a classifier, rhe authors optimize Irs weights using an adaptive genetic algorithm, which helped overcome several problems, such as falling Into a local minimum, slow convergence, and dependence on [he sample.
f S.i I:: i- JL A[if, 2022} explored [he prohlem of class Imbalance In the car Insurance fraud detection. To solve it, when detecting fraud In aura Insurance, the well-known methods .SMtJTIL flOKL: were used, which showed the same performance. Also, the quality metrics obtained by the liandom Forest classifier1 outperformed [he Logistic Hegression [rained on [he auto Insurance claims da[ase[. A similar superiority of liandom Forest classification metrics Is demonstrated In (Itn «[ a]., 201!), which compares dlfferenr machine learning methods as applied to a fraud detection pioblem.
In works neLaied to fraud detection In other Industries, ihe Auihors also address the problem of Inierpreiability and class imbalance. Une of thegoalslnfUI.Huiryjiskl et al., 3||Sl)l| to build an ML [ocJ [ha[ not only categorizes fraudulent loan applications, but also enables ejipens to gain insight Into car loan fraud. Ihe proposed approach induces human-understandable rules and helps to analyse fraud. The low class imbalance In (Khan et al.. and the need to achieve the effect in terms of computationally and nme-wlse prompted the authors to use an opd-mlzatlon algorithm In combination with ihe SMO'l't class balancing technique.
A look at fraud detection from the point of view of anomaly detection is discussed in detail in the review (Hilal et al., 2022). The authors show that, due to the problems that arise when using supervised learning methods, fraud detection researchers need to develop unsupervised methods, based on the anomalous nature of fraud in the financial system.
The review emphasizes that the set of methods for fraud detection in the insurance industry is more limited than for bank cards. The approach proposed in this paper aims to narrow this gap.
The large set of features in the problem of detecting fraud in the banking industry prompted the authors of (Fanai & Abbasimehr, 2023) to apply a state-of-the-art dimensionality reduction method to simultaneously preserve transaction knowledge and solve the high dimensionality problem. The authors focused on increasing the efficiency of classification and achieved results; however, they left the question of interpretation of the decisions made about the operation outside the scope of the study.
Expert systems for detecting insurance fraud, which use graph theory, have also been developed. In this article, graphs are used in a similar way, and the closest works are (Šubelj et al., 2011) and (Bodaghi & Teimourpour, 2018), which describe anomalies that occur in a claim as part of a graph and are used in further investigation of fraudster groups. Approaches from graph theory are also used in other applied problems of fraud detection. In (Shi et al., 2019), manipulations in the stock market are studied by considering the parameters of the cliques formed in the trading network. In (Nguyen et al., 2022), graph-based methods are applied to improve the explanatory power of identified fraudulent cases.
In this study, features are extracted from the vertices of the graph in which claims are grouped. A similar technique is used, for example, in (Palukuri & Marcotte, 2021), where the authors study a protein structure using machine learning methods, constructing a protein interaction graph, from which they later obtain features for a data set. There are also studies (Hu et al., 2022) that use graph theory and feature extraction to detect telecom fraud. The large set of features in the problem of detecting fraud in the banking industry prompted the authors of (Van Belle et al., 2022) to apply a state-of-the-art dimensionality reduction method to simultaneously preserve transaction knowledge and solve the high dimensionality problem. The authors focused on increasing the efficiency of classification and achieved results; however, they left the question of interpretation of the decisions made about the operation outside the scope of the study. Efficiency was achieved by introducing the GraphSAGE and FI-GRL methods, which extract information from network data in automated ways using inductive graph representation learning techniques.
Finally, in (Óskarsdóttir et al.), the authors delved into the topic of utilizing features extracted from a claim graph for the purpose of insurance fraud detection. As the edges of such a graph, the authors consider common participants involved in the process of insurance and claim settlement. These participants include policyholders, brokers, garages, and experts. Empirical evidence has demonstrated that the introduction of these features, when combined with the existing characteristics of the claim and policyholder, yields positive results in the identification and detection of fraudulent activities.
This study applies a similar concept, which has been extended to insurance datasets with no apparent claims relationships. Due to the increase in efficiency, it is proposed to reduce the dimensionality of the feature space and utilize machine learning models that are more interpretable.
The reviewed studies can be presented in the form of a table, specifically the table in Appendix A, which summarizes past research on modern approaches for detecting fraud in auto insurance and related subject areas.
3. Application area and methods
This section describes an area where fraud detection is expected to improve. The main problems in the processes of reducing the risk of fraud are described, and an approach for solving them is proposed.
3.1. Insurance fraud risk reduction processes
The insurance company process in which the fraud risk on the part of the insurant is reduced is considered. The first barrier for fraudsters is the check of the client before signing an insurance contract. In addition to actuarial calculations, the insurer can refer to internal black or white lists and external sources of client data (for example, credit history), and apply its own models for fraud risk assessment. Such procedures directly affect the insurer: they lengthen the process of selling the policy, worsening the client experience, and false denials reduce insurance premium collection. These facts force insurance companies to simplify and automate checks at this stage. In this case, the company's management focuses on the fraud detection precision, which allows professional fraudsters to penetrate the insurance portfolio.
Then the insurer analyses the stated claims and, if signs of insurance fraud are detected, refuses to pay. At this stage, both expert assessment methods by the insurer's security service and techniques using machine learning methods are used. A positive effect on fraud detection metrics was seen when combining the work of an expert and systems that allow evaluating claims using data analysis and social networks (Šubelj et al., 2011). In the current process, the insurance company focuses on the recall metric of fraud detection in order to stop a professional fraudster who has already proven himself and reduce his impact on the portfolio loss ratio.
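The difference between the two stages (precision at contract signing, recall at claims settlement) can be made concrete with a small sketch. The scores, labels, and thresholds below are invented for illustration and do not come from the study; precision is TP / (TP + FP) and recall is TP / (TP + FN).

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall for the fraud class (label 1)
    when every score at or above the threshold is flagged as fraud."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical fraud scores for eight cases; the first three are fraudulent.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1, 0.3, 0.05]

strict = precision_recall(y_true, scores, threshold=0.8)   # few false denials
loose = precision_recall(y_true, scores, threshold=0.35)   # few missed frauds
```

With the strict threshold no legitimate client is flagged but two fraudulent cases are missed; with the loose threshold every fraudulent case is caught at the cost of one false alarm.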
Fig. 1 shows the considered processes schematically.
This study proposes a method for evaluating fraud risk reduction processes in car insurance using machine learning and graph theory.
3.2. Professional fraudsters in insurance
(Šubelj et al., 2011) highlights that professional fraudsters are the biggest problem for insurance companies. They account for most of the revenue leakage in insurance.
It is not always possible to assess a claim for the risk of professional fraud at the time of making a payment decision. In some cases, only after a few incidents will an expert from the insurer's security service be able to identify the signs of fraud and draw up analytical material to challenge claims.
Evaluation of the entire portfolio of reported losses over a certain period makes it possible to automatically calculate the indicator of the presence of a fraudulent cluster in the insurer's portfolio and manage it. It is proposed to exploit the idea that fraudsters do not have sufficient resources to imitate the history of customer behaviour that does not pursue fraudulent goals. Similar to the framework of (Festa & Vorobyev, 2022), which investigates fraud with bank cards and assumes that fraudsters cannot provide dropper accounts with the daily history of a legitimate client (for example, purchases in pharmacies, pet stores, etc.), let us consider the anomalous behaviour of an insurance fraudster. In this case, the smaller the customer's history of insured events, the more attractive he is for the insurance company and the less fraud risk he can introduce. On the other hand, an organized group of fraudsters will most likely not be able to provide every car accident with a new driver, insurant, insurance object, owner, etc. With an increase in the number of participants and property involved in fraud schemes, the costs may exceed the benefits of fraudulent activities.
3.3. Machine learning methods for fraud detection
Fraud detection algorithms can be divided into expert-based and statistical approaches. In the first case, fraud is detected based on expert judgments, sometimes formalized as decision rules. This includes the analysis of typical fraud patterns carried out manually by experts during event analysis. In the second case, the classification of events into fraudulent and legitimate is done using statistical methods, particularly machine learning models.
Traditionally, statistical algorithms are divided into supervised learning (classification task), unsupervised learning (clustering task), and social network analysis. The first ones help to classify events as fraudulent or legitimate. The advantage of the second ones lies in the ability to detect new fraud patterns that have not been encountered in historical data. Social network analysis allows for considering the interconnections between objects in the dataset. The three types of statistical algorithms focus on different aspects of fraud and complement each other.
This research utilizes classification methods, as insurance companies often have labeled data on fraudulent claims. In the process of evaluating the fraudulent component, the following machine learning methods are proposed to be incorporated: Random Forest, Neural Network, and Decision Tree for generating a simple decision rule.

[Figure 1: two stages. Stage 1, entering into an insurance contract (blacklists, external sources, scoring model): the precision of fraud detection is optimized. Stage 2, claims settlement (security expert assessment, ML methods, experts + ML + social network analysis): the recall of fraud detection is optimized. The diagram also shows the ratio of the amount of claims settlement costs to the premium amount.]
Fig. 1. A possible diagram of anti-fraud processes in car insurance.
The Random Forest classifier has been chosen as the base classifier, as it has shown the best results in studies focused on detecting fraud in insurance (Itri et al., 2019). The method involves using an ensemble of decision trees, each of which may not provide high-quality classification, but by using a large number of them, a better result can be achieved. The choice in its favor in this study is due to its low sensitivity to the size of the feature space, as well as its high classification quality when trained on data with categorical and numerical features.
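The ensemble idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the classifier used in the study: each "tree" is a one-split stump fitted on a bootstrap sample with one randomly chosen feature, whereas a real Random Forest grows deeper trees and searches for the best split. All names and the toy data are assumptions.

```python
import random
from collections import Counter


def fit_stump(X, y, rng):
    """Fit a one-split tree on a bootstrap sample using one random feature."""
    n = len(X)
    while True:
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        if len({y[i] for i in idx}) == 2:            # redraw if a class is missing
            break
    feature = rng.randrange(len(X[0]))               # random feature choice
    mean = {}
    for label in (0, 1):
        values = [X[i][feature] for i in idx if y[i] == label]
        mean[label] = sum(values) / len(values)
    threshold = (mean[0] + mean[1]) / 2              # midpoint between class means
    high_class = 1 if mean[1] >= mean[0] else 0      # class lying above the threshold
    return feature, threshold, high_class


def predict_stump(stump, x):
    feature, threshold, high_class = stump
    return high_class if x[feature] > threshold else 1 - high_class


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [fit_stump(X, y, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical claims with two numerical features; label 1 marks fraud.
X = [[0.1, 0.3], [0.5, 0.2], [0.9, 0.8], [0.4, 0.6],
     [5.2, 5.9], [5.5, 5.1], [5.8, 5.4], [5.1, 5.6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
```

On this cleanly separated toy data, `predict_forest(forest, [0.2, 0.4])` yields class 0 and `predict_forest(forest, [5.5, 5.5])` yields class 1, since every stump places its threshold between the two groups.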
For the construction of a claim assessment for fraud based on a reduced set of features, a multilayer perceptron is chosen in this study. This method also shows high results in research in the field of fraud detection. When using it, the inclusion of hidden layers allows for the approximation of a nonlinear function for classification. In this study, a claim assessment is built using it based on several simple numerical characteristics, and the result of this assessment is subsequently used in conjunction with features extracted from the graph.
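Why a hidden layer matters can be shown with a toy forward pass. The weights below are set by hand (not trained) and are purely illustrative assumptions: with one hidden ReLU layer the network reproduces XOR, a function that no single linear unit can represent.

```python
def relu(z):
    return max(0.0, z)


def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a perceptron with one hidden ReLU layer
    and a single linear output unit."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(W2, hidden)) + b2


# Hand-picked weights: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1),
# output = h1 - 2 * h2, which equals XOR on binary inputs.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [1.0, -2.0]
b2 = 0.0

outputs = [mlp_forward(x, W1, b1, W2, b2)
           for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
# outputs == [0.0, 1.0, 1.0, 0.0]
```

In practice the weights are learned from labeled claims; the sketch only shows that the hidden layer is what makes the nonlinear decision boundary possible.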
4. Proposed approach
The proposed approach is based on graph theory. A graph consists of a collection of points, called vertices, and connections between these points, called edges. In the context of this study, we define $V_G = \{v_1, v_2, \ldots, v_n\}$ as a set of vertices and $E_G \subseteq \{\{v_i, v_j\} \mid v_i, v_j \in V_G \wedge i \neq j\}$ as a set of edges. With these definitions, an undirected graph can be succinctly represented as $G = (V_G, E_G)$.
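The definition above maps directly onto a pair of sets. The following sketch is one possible representation, with the helper names `make_graph` and `degree` chosen for illustration: edges are stored as unordered pairs, so (v1, v2) and (v2, v1) denote the same edge, and self-loops are rejected because the definition requires i ≠ j.

```python
def make_graph(vertices, edges):
    """Build an undirected graph G = (V, E) as a pair of sets.
    Each edge is a frozenset {u, v} with u != v and both ends in V."""
    vertex_set = set(vertices)
    edge_set = set()
    for u, v in edges:
        if u == v:
            raise ValueError("self-loops are excluded (i != j)")
        if u not in vertex_set or v not in vertex_set:
            raise ValueError("edge endpoints must be vertices of the graph")
        edge_set.add(frozenset((u, v)))
    return vertex_set, edge_set


def degree(graph, vertex):
    """Number of edges incident to the given vertex."""
    _, edge_set = graph
    return sum(1 for edge in edge_set if vertex in edge)


# The pair ("v2", "v1") repeats an existing edge and is stored only once.
G = make_graph(["v1", "v2", "v3"],
               [("v1", "v2"), ("v2", "v1"), ("v2", "v3")])
```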
To identify abnormal behaviour of clients in an insurance company, it is proposed to compile an undirected graph. The claims for compensation for an insured event are the vertices; the objects of accidents (drivers, policyholders, etc.) as well as the incident itself are the edges. The graph is built for a certain period, for example, a calendar year. Fig. 2 shows several situational cases considered:
1. Clients without insured events: zero loss ratio (LR); zero fraud probability (FrP);
2. Clients with one insured event: low loss ratio; low fraud probability;
3. Customers with multiple accidents: high loss ratio; average fraud probability;
4. Clients organizing into a cycle: A had an accident with client B, who had an accident with client C, but A was already in an accident with C: high loss ratio; high fraud probability.
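A minimal sketch of how such a claims graph could be assembled and turned into per-claim features follows. The claim identifiers, participant labels, and the two features (degree and connected-component size) are assumptions for illustration; the study's actual feature set may differ. Claims are vertices, and two claims are linked whenever they share a participant or object.

```python
from collections import defaultdict
from itertools import combinations


def build_claim_graph(claims):
    """Link every pair of claims that share at least one participant."""
    by_participant = defaultdict(set)
    for claim_id, participants in claims.items():
        for participant in participants:
            by_participant[participant].add(claim_id)
    adjacency = {claim_id: set() for claim_id in claims}
    for linked in by_participant.values():
        for a, b in combinations(sorted(linked), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def vertex_features(adjacency):
    """Degree and connected-component size for every claim."""
    features, seen = {}, set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:                       # depth-first traversal
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        for node in component:
            features[node] = {"degree": len(adjacency[node]),
                              "component_size": len(component)}
    return features


# Hypothetical claims for one calendar year with their participants.
claims = {
    "c1": {"driver:A", "vehicle:V1"},
    "c2": {"driver:A", "driver:B"},
    "c3": {"driver:B", "driver:C"},
    "c4": {"driver:C", "vehicle:V1"},
    "c5": {"driver:D"},
}
features = vertex_features(build_claim_graph(claims))
```

Here claim `c5` is isolated (degree 0, component size 1), which corresponds to a single random insured event, while `c1` to `c4` form one connected group of four claims that reuse the same drivers and vehicle.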
The analysis and identification of such relationships using graph theory is used in the expert systems of insurance companies' security services (Šubelj et al., 2011), including for compiling the evidence base. A non-fraudulent insurance event must be random, as in case 2, and the high accident frequency of the client must be explained by the driver's inexperience; then case 3 is possible. If there are numerous abnormal cases 4 in the insurance company's portfolio of claims, then there is a high probability that an organized fraudulent group is present.
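Case 4 can be detected mechanically. The sketch below is an illustration under stated assumptions rather than the study's procedure: it relies on the fact that a connected simple undirected graph with V vertices is a tree exactly when it has V - 1 edges, so any connected component with at least as many edges as vertices must contain a cycle. The adjacency data are hypothetical.

```python
def components_with_cycles(adjacency):
    """Return the connected components (as sorted vertex lists) that
    contain at least one cycle, using the edge-count criterion E >= V."""
    seen = set()
    cyclic = []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:                       # depth-first traversal
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        # Every undirected edge is counted once from each endpoint.
        edge_count = sum(len(adjacency[node]) for node in component) // 2
        if edge_count >= len(component):
            cyclic.append(sorted(component))
    return cyclic


# A, B and C form the cycle from case 4; D and E share a single accident;
# F has no connections.
adjacency = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
suspicious = components_with_cycles(adjacency)
# suspicious == [["A", "B", "C"]]
```

The share of claims that fall into such cyclic components could then serve as a rough portfolio-level signal of organized groups.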
The estimates of LR and FrP presented in Fig. 2 are based on some domain knowledge. This assumption allows considering insurance cases from the point of view of their relationships and highlighting the most risky ones.