Entropic topic models and methods of their aggregation: dissertation and author's abstract, VAK RF specialty 00.00.00, Doctor of Science Sergei Nikolaevich Koltsov

  • Koltsov, Sergei Nikolaevich
  • Doctor of Science
  • 2023, National Research University Higher School of Economics
  • VAK RF specialty 00.00.00
  • Number of pages: 191
Koltsov, Sergei Nikolaevich. Entropic topic models and methods of their aggregation: Doctor of Science dissertation: 00.00.00 (Other specialties). National Research University Higher School of Economics, 2023. 191 pp.

Table of contents of the dissertation by Doctor of Science Sergei Nikolaevich Koltsov

Introduction

1. Analytical review of the scientific literature

1.1. Approaches to the problem of choosing the number of clusters

1.2. Overview of model types in topic modeling

1.3. Application of entropy principles in topic modeling

1.4. Stability of topic models

2. Entropic topic model based on the parameterized Renyi and Sharma-Mittal entropies

2.1. Entropic topic model based on Renyi entropy

2.2. Relation between free energy and Renyi entropy in topic models

2.3. Entropic model based on Sharma-Mittal entropy

2.4. Entropic topic model based on Sharma-Mittal entropy

2.5. Hierarchical entropic topic model

2.6. Experimental verification of applying Renyi and Tsallis entropies in topic models

2.7. Numerical experiments on semantic stability in topic models

2.8. Experimental verification of Sharma-Mittal and Renyi entropies as quality measures for estimating the number of topics and the semantic coherence of topic models

2.8.1. Experiments on applying Renyi entropy

2.8.2. Experiments on applying the Sharma-Mittal entropy quality measure to topic models

2.9. Experiments on applying Renyi entropy to the analysis of hierarchical topic models

2.9.1. The hPAM model

2.9.2. The hLDA model

2.9.3. The hARTM model

3. Fractal model for evaluating topic model results

3.1. Experiments on determining the fractal dimension in topic models

4. Method for aggregating topic models based on a renormalization procedure

4.1. Introduction to renormalization theory

4.2. General statement of the aggregation problem as a renormalization procedure in topic modeling

4.3. Renormalization procedure for the VLDA model based on the E-M algorithm

4.4. Renormalization procedure for the LDA model based on Gibbs sampling

4.5. Renormalization procedure for the pLSA model

4.6. Renormalization experiments

4.6.1. Renormalization of the LDA GB model ('Lenta' dataset)

4.6.2. Renormalization of the LDA GB model ('20Newsgroups' dataset)

4.6.3. Renormalization of the LDA GB model (French-language dataset)

4.6.4. Renormalization of the VLDA model ('Lenta' dataset)

4.6.5. Renormalization of the VLDA model ('20Newsgroups' dataset)

4.6.6. Renormalization of the VLDA model (French-language dataset)

4.6.7. Renormalization of the pLSA model ('Lenta' dataset)

4.6.8. Renormalization of the pLSA model ('20Newsgroups' dataset)

4.6.9. Renormalization of the pLSA model (French-language dataset)

4.7. Speed comparison of renormalization-based aggregation methods

5. Granulated variant of the topic model

5.1. Regularization of a topic model by specifying a local density of the word-topic distribution

5.1.1. Rectangular regularization kernel (granulated sampling method, GLDA)

5.1.2. Epanechnikov kernel (ELDA)

5.1.3. Triangular kernel (TLDA)

5.2. Study of the stability of the ARTM, GLDA, pLSA, and SLDA models

5.2.1. Experiments on determining the stability of topic models

Conclusion

Funding

References

Appendices

Appendix A. Paper «Analysis and tuning of hierarchical topic models based on Renyi entropy approach»

Appendix B. Paper «Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi entropy»

Appendix C. Paper «Renormalization Analysis of Topic Models»

Appendix D. Paper «Estimating Topic Modeling Performance with Sharma-Mittal Entropy»

Appendix E. Paper «Application of Renyi and Tsallis entropies to topic modeling optimization»

Appendix F. Paper «A thermodynamic approach to selecting a number of clusters based on topic modeling»

Appendix G. Paper «Gibbs Sampler Optimization for Analysis of a Granulated Medium»

Appendix H. Paper «Topic modelling for qualitative studies»


Introduction to the dissertation (part of the author's abstract) on the topic «Entropic topic models and methods of their aggregation»

Introduction

The analysis of large text data has become one of the most in-demand research areas in the modern world, driven by the development of electronic means for storing and transmitting such data. In volume, these data are becoming comparable to physical mesoscopic systems, which is why machine learning models built on a mathematical formalism borrowed from statistical physics can be used for their analysis. An example of such a model is topic modeling based on a sampling procedure, where the formalism of the Potts model is used to compute the distribution of words over topics [15]. The task of topic modeling is to extract a set of distributions of observed quantities (texts or images and their elements) over latent variables called topics.

To date, many topic models have been developed [1], with different ways of defining the latent distributions and different quality measures. Yet despite the fairly wide use of these models in various fields, a number of problems remain unsolved and limit the application of topic modeling (TM).

One of the main problems is determining the number of components in the mixture of distributions, since the parameter that sets the size of the mixture must be specified explicitly in topic models. It should be noted that an approach has been developed in topic modeling whose authors postulate automatic selection of the number of topics; however, such models have many hidden parameters that substantially affect the modeling results [2], and, moreover, this kind of model is unable to determine the number of topics in a dataset correctly.

The second problem is the instability of topic modeling: the results of topic modeling are not identical across different runs of the model on the same dataset with the same parameters. On the one hand, this problem is related to the non-uniqueness of matrix decompositions (for topic models based on the E-M algorithm); on the other hand, it is related to the presence of many local minima and maxima in the integrand (for topic models based on Gibbs sampling).

The third unsolved problem, which follows from the second, concerns the development of regularization procedures, which can be used both to improve the stability of TM and for other purposes [1]. Regularization is understood as adding prior information to topic models in the form of various relations and constraints, which reduces the number of possible solutions. A large number of generative models with regularizers have been proposed in the literature, but there is no clear criterion for choosing a combination of regularizers and for selecting the regularization coefficients.

The problems listed above naturally affect the quality of topic modeling. At present, the main measures of topic model quality are Shannon entropy, the Kullback-Leibler divergence, the log-likelihood, and perplexity. In addition, it is known that word distributions, at least in European languages, follow power laws, which are characteristic of complex statistical systems. It is also known that the behavior of complex systems is investigated more effectively with methods developed within the mathematical formalism borrowed from the theory of complex systems.

Goals and objectives of the research

The goal of the dissertation is to develop and study a new class of computational topic models, entropic topic models, aimed at making progress on the problem of determining optimal hyperparameters of topic models, including detecting whether a dataset has a flat or a hierarchical structure, and at developing stable models for clustering text collections.

Results obtained:

1. An entropic topic model based on one-parameter entropies (the Renyi and Tsallis entropies). The model is implemented for the following generative algorithms: 1) LDA (Gibbs sampling algorithm), 2) pLSA (E-M algorithm), 3) VLDA (E-M algorithm), 4) GLDA (Gibbs sampling algorithm).

2. An entropic topic model based on the two-parameter Sharma-Mittal entropy. The model is implemented for the following generative models: 1) pLSA (E-M algorithm), 2) LDA (Gibbs sampling algorithm), 3) ARTM with sparsing regularizers for the matrices Φ and Θ (E-M algorithm).

3. A hierarchical entropic topic model. The model is implemented for the following generative hierarchical models: 1) hLDA, 2) hPAM, 3) hARTM, 4) the HCA cluster analysis algorithm ('complete' linkage method).

4. A fractal model for evaluating the performance of generative topic models. The model is implemented for the following algorithms: 1) pLSA (E-M algorithm), 2) ARTM (E-M algorithm), 3) LDA (Gibbs sampling algorithm).

5. A method for aggregating topic models based on a renormalization procedure. The method is implemented for the following algorithms: 1) VLDA (E-M algorithm), 2) LDA (Gibbs sampling algorithm), 3) pLSA (E-M algorithm).

6. The aggregation method is implemented for three variants of merging topics: 1) merging based on the minimum of Renyi entropy, 2) random merging of topics, 3) merging based on the minimum of the Kullback-Leibler divergence.

7. A granulated topic model based on the Gibbs sampling procedure. The model is implemented for three variants of the local topic distribution function: 1) GLDA, 2) ELDA, 3) TLDA.

The author's personal contribution includes:

• the general mathematical formulation of the entropic model based on the one-parameter Renyi entropy, published in two single-author papers;

• organizing and taking part in large-scale computational experiments on the applicability of the entropic model for assessing the quality of topic models of different types;

• a leading role in the mathematical formulation of the entropic model based on the two-parameter Sharma-Mittal entropy and in verifying this model in a series of computational experiments;

• the formulation of the fractal model for evaluating the performance of generative topic models and the computational experiments testing this model;

• the mathematical formulation of the method for aggregating topic models via a renormalization procedure and the computational experiments verifying the effectiveness of this procedure;

• the general mathematical formulation of the granulated sampling method.

Eight papers on the topic of this dissertation have been published in journals of Q1-Q2 level (WoS), as well as 11 papers indexed in Scopus.

Scientific novelty:

1. The use of the one-parameter Renyi entropy and the two-parameter Sharma-Mittal entropy for optimizing topic models is proposed for the first time.

2. It is shown for the first time that a quality measure based on parameterized entropies outperforms traditional measures such as the log-likelihood or perplexity, since it allows tuning the hyperparameter values of topic models and the number of distributions in the mixture simultaneously.

3. A fractal model for evaluating the performance of generative topic models is proposed for the first time; it demonstrates the self-similar behavior of topic models, which makes it possible to apply a renormalization procedure to them.

4. A renormalization procedure for topic models is proposed for the first time, and its effectiveness for fast determination of the optimal number of distributions in the mixture is demonstrated.

5. A granulated variant of the topic model is proposed that outperforms other topic models in stability.

Higher-level publications (Q1-Q2 in WoS and Scopus)

1. Koltcov S., Ignatenko V., Terpilovskii M., Rosso P. Analysis and tuning of hierarchical topic models based on Renyi entropy approach // PeerJ Computer Science. 2021. Vol. 7. Open access: https://peerj.com/articles/cs-608/

2. Koltsov S., Ignatenko V., Boukhers Z., Staab S. Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi entropy // Entropy. 2020. Vol. 22. No. 4. pp. 1-13.

3. Koltcov S., Ignatenko V. Renormalization Analysis of Topic Models // Entropy. 2020. Vol. 22. No. 5. pp. 1-23.

4. Koltsov S., Ignatenko V., Koltsova O. Estimating Topic Modeling Performance with Sharma-Mittal Entropy // Entropy. 2019. Vol. 21. No. 7. pp. 1-29.

5. Koltsov S. Application of Renyi and Tsallis entropies to topic modeling optimization // Physica A: Statistical Mechanics and its Applications. 2018. Vol. 512. pp. 1192-1204.

6. Koltcov S.N. A thermodynamic approach to selecting a number of clusters based on topic modeling // Technical Physics Letters. 2017. Vol. 43. No. 12. pp. 90-95.

7. Koltsov S.N., Nikolenko S.I., Koltsova E.Yu. Gibbs Sampler Optimization for Analysis of a Granulated Medium // Pis'ma v Zhurnal Tekhnicheskoi Fiziki. 2016. Vol. 42. No. 16. pp. 21-25.

8. Nikolenko S., Koltcov S., Koltsova O. Topic modelling for qualitative studies // Journal of Information Science. 2017. Vol. 43. No. 1. pp. 88-102.

Standard-level publications on the research topic (Scopus)

1. Koltsov S., Ignatenko V., Pashakhin S. How many clusters? An Entropic Approach to Hierarchical Cluster Analysis, in: Intelligent Computing: SAI 2020, Volume 3 (Vol. 1230, Book 3). Cham: Springer, 2020. pp. 560-569.

2. Koltsov S., Ignatenko V. Renormalization approach to the task of determining the number of topics in topic modeling, in: Intelligent Computing: SAI 2020, Volume 1 (Vol. 1228, Part 1). Switzerland: Springer, 2020. pp. 234-247.

3. Ignatenko V., Koltcov S., Staab S., Boukhers Z. Fractal approach for determining the optimal number of topics in the field of topic modeling // Journal of Physics: Conference Series. 2019. Vol. 1163. No. 1. pp. 1-6.

4. Koltsov S., Pashakhin S., Dokuka S. A Full-Cycle Methodology for News Topic Modeling and User Feedback Research, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10th International Conference on Social Informatics, SocInfo 2018, St. Petersburg. Cham: Springer, 2018. pp. 308-321.

5. Mavrin A., Filchenkov A., Koltsov S. Four Keys to Topic Interpretability in Topic Modeling, in: Artificial Intelligence and Natural Language, 7th International Conference, AINL 2018, St. Petersburg, Russia, October 17-19, 2018, Proceedings (Issue 930). Switzerland: Springer, 2018. pp. 117-129.

6. Koltsov S., Nikolenko S.I., Koltsova O., Filippov V., Bodrunova S. Stable Topic Modeling with Local Density Regularization, in: Internet Science, Proceedings of the 3rd Conference INSCI 2016, Lecture Notes in Computer Science, Vol. 9934. Switzerland: Springer, 2016. pp. 176-188.

7. Koltsov S., Nikolenko S.I., Koltsova O., Bodrunova S. Stable topic modeling for web science: Granulated LDA, in: WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. Elsevier, 2016. pp. 342-343.

8. Koltsov S., Koltsova O., Nikolenko S.I. Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content, in: Proceedings of the WebSci '14 ACM Web Science Conference, Bloomington, IN, USA, June 23-26, 2014. NY: ACM, 2014. pp. 161-165.

9. Nikolenko S.I., Koltsov S., Koltsova O. Measuring Topic Quality in Latent Dirichlet Allocation, in: Proceedings of the Philosophy, Mathematics, Linguistics: Aspects of Interaction 2014 Conference. St. Petersburg: The Euler International Mathematical Institute, 2014. pp. 149-157.

10. Koltsov S., Ignatenko V., Pashakhin S. Fast Tuning of Topic Models: An Application of Renyi Entropy and Renormalization Theory, in: Proceedings of the 5th International Electronic Conference on Entropy and Its Applications, Vol. 46, Issue 1. MDPI AG, 2020. Ch. 5. pp. 1-8.

11. Bodrunova S., Koltsov S., Koltsova O., Nikolenko S.I., Shimorina A. Interval Semi-supervised LDA: Classifying Needles in a Haystack, in: Proceedings of the 12th Mexican International Conference on Artificial Intelligence (MICAI 2013), Part I: Advances in Artificial Intelligence and Its Applications. Berlin: Springer, 2013. pp. 265-274.

1. Analytical review of the scientific literature

1.1. Approaches to the problem of choosing the number of clusters

This review considers the studies most relevant and useful for the present dissertation. The main difficulty in finding the optimal number of clusters in cluster analysis and in topic modeling is the choice of the function on which such a search is based. A discussion of many clustering quality measures, including functions for choosing the number of clusters, is given in [10, 11]. These and other works show that the minimum within-cluster distance is often used for this purpose in cluster analysis. The problem with this and similar quality measures, however, is that such a measure is a monotonically increasing (or decreasing) function of the number of clusters. Accordingly, transformation procedures are required to extract features from such functions. In [3], an algorithm for determining the optimal number of topics based on rate-distortion theory was formulated. A modernization of this approach within non-extensive statistical physics for image clustering was implemented in [4].

Other approaches to this problem exist in cluster analysis [5, 6, 7]. Tibshirani et al. [6] proposed the 'gap statistic' approach. Its key idea is to measure the difference between a null reference distribution and the distribution obtained by clustering data drawn from that same distribution. This difference is measured for different numbers of clusters, and the corresponding curve is then constructed. Within this approach, the authors assume that the optimal number of clusters corresponds to the situation where the logarithm of the mean within-cluster distance falls below the analogous logarithm computed on the null reference distribution. In essence, this is an analogue of measuring the dependence of entropy on the number of clusters relative to the initial entropy. In [8], a clustering procedure based on the maximum entropy principle is proposed, where the authors also rely on the classical (Shannon) entropy. In [9], however, the clustering method is implemented using the principle of maximizing Tsallis entropy by varying the parameter q.
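
A minimal sketch of the gap-statistic idea described above, assuming scikit-learn k-means as the clustering routine and a uniform reference distribution; the toy dataset X and the range of cluster counts are illustrative placeholders, not part of the original works.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k, seed=0):
    """Log of the pooled within-cluster dispersion for k-means with k clusters."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    W = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
            for c in range(k))
    return np.log(W)

def gap_statistic(X, k_range, n_refs=10, seed=0):
    """Gap(k) = mean log-dispersion on uniform reference data minus log-dispersion on the data."""
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = {}
    for k in k_range:
        ref_logW = [within_dispersion(rng.uniform(mins, maxs, size=X.shape), k, seed)
                    for _ in range(n_refs)]
        gaps[k] = np.mean(ref_logW) - within_dispersion(X, k, seed)
    return gaps

# toy usage: three Gaussian blobs, so the gap should favor k around 3
X = np.vstack([np.random.default_rng(1).normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 3), (0, 3)]])
print(gap_statistic(X, range(1, 7)))
```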

Of all the existing approaches in cluster analysis, the most interesting and substantive one is based on free-energy minimization [12]. Its main idea is as follows: each element of a statistical system is characterized by the probabilities of belonging to different clusters. Accordingly, for each element one can define an internal energy (expressed through the probability of the element belonging to a cluster) and compute the free energy of the whole system. Temperature becomes a free parameter of such a system and is varied in order to minimize the free energy. A drawback of that work is that the model was tested only on clusters with Gaussian distributions. Moreover, as the authors' calculations show, the free-energy function looks like a monotonic function without a clear minimum.

This dissertation builds on the ideas of [12] but differs in two respects: first, temperature is treated as the number of clusters; second, instead of free energy, parameterized entropies are considered, which possess an explicit minimum. The theoretical foundations of entropic topic models are presented in Chapter 2.

1.2. Overview of model types in topic modeling

To date, more than forty different topic models have been proposed in the scientific literature on topic modeling, and the number of papers that use topic modeling exceeds several hundred. Three main kinds of models can be distinguished: 1. Flat (plain) topic models with various types of regularization [13, 14, 15, 16]. 2. Hierarchical topic models [17, 18, 19, 20]. 3. Topic models with neural network components, which use either embeddings of various types or neural network layers [21, 22]. The most complete surveys of model variants and quality measures are given in [1, 23]. Overall, two main algorithms for determining the distributions of words over topics and of topics over documents dominate the literature: 1. The Expectation-Maximization (E-M) algorithm, in which the word-document matrix F is represented as a product of two matrices, F = ΦΘ, where Φ is the matrix of word distributions over topics and Θ is the matrix of topic distributions over documents. 2. An algorithm that determines the probability of a word belonging to a topic as a multidimensional integral; here the probability is computed via Gibbs sampling. Despite the different mathematical formalisms of these algorithms, both give similar results [24]. Therefore, the problems discussed below apply to both kinds of algorithms.
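
A minimal sketch of the E-M factorization F ≈ ΦΘ on a toy word-document count matrix, assuming a plain pLSA-style update; this is an illustration of the general scheme, not the specific implementations used in the dissertation.

```python
import numpy as np

def plsa_em(F, T, iters=50, seed=0):
    """Toy pLSA: factorize a word-document count matrix F (W x D) into
    Phi (W x T, words over topics) and Theta (T x D, topics over documents)."""
    rng = np.random.default_rng(seed)
    W, D = F.shape
    Phi = rng.random((W, T)); Phi /= Phi.sum(axis=0, keepdims=True)
    Theta = rng.random((T, D)); Theta /= Theta.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # E-step: posterior p(t | d, w) for every word-document pair
        P = Phi[:, :, None] * Theta[None, :, :]            # W x T x D
        P /= P.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate Phi and Theta from expected counts
        n_wtd = F[:, None, :] * P                           # W x T x D
        Phi = n_wtd.sum(axis=2); Phi /= Phi.sum(axis=0, keepdims=True)
        Theta = n_wtd.sum(axis=0); Theta /= Theta.sum(axis=0, keepdims=True)
    return Phi, Theta

# toy usage: two clearly separated word groups should split into two topics
F = np.array([[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 6, 4], [1, 0, 3, 5]], dtype=float)
Phi, Theta = plsa_em(F, T=2)
print(np.round(Phi, 2)); print(np.round(Theta, 2))
```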

For topic modeling, the problem of finding the optimal number of topics/clusters is equally relevant and even harder to solve, for the following reasons. First, this search is tied to the linguistic definition of a topic, which causes considerable difficulties, since it is hard to formulate a linguistic criterion for separating two topics on a set of documents. Moreover, topic models often generate poorly interpretable topics that are difficult to regard as topics at all. Second, in topic modeling, as in cluster analysis, it is hard to formulate an adequate functional dependence that would, on the one hand, characterize the topic model and, on the other hand, be a function of the number of topics and the hyperparameters. Nevertheless, several works have attempted to solve the problem of choosing the number of topics specifically in topic modeling. The authors of [25], drawing on ideas from cluster analysis, treated a topic as a semantic cluster (a set of words) within which a within-cluster distance can be computed. As the function to be minimized, they used the cosine measure: the number of topics at which the average cosine measure over all topics reaches its minimum is taken as optimal. Another approach was proposed by Arun et al. [26], who search for the minimum of the Kullback-Leibler divergence as the number of topics is varied. To that end, the authors compute the SVD of the matrices Φ and Θ and then calculate the Kullback-Leibler divergence between the two vectors of singular values. In this case, the optimal number of topics corresponds to the situation where both matrices are described by the same number of singular values. These two approaches have the following drawbacks. First, it is unclear how the minima of the chosen functions relate to the entropy principle widely used in information theory. Second, adding another computation stage, namely the SVD and the Kullback-Leibler divergence, substantially limits the applicability of Arun's approach to big data: Arun and colleagues searched for the minimum of the Kullback-Leibler divergence on text collections of no more than 2,500 texts. Third, both approaches ignore the influence of the initial distribution on the results of topic modeling, although such influence is known to exist [27]. Fourth, they do not take into account the semantic instability inherent in topic models [28].
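
A hedged sketch of the Arun et al. measure as described above; Phi (topics by words), Theta (documents by topics), and doc_lengths are assumed to come from an already fitted model and are placeholders here.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two normalized vectors."""
    p = p / p.sum(); q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def arun_divergence(Phi, Theta, doc_lengths):
    """Symmetric KL between the singular-value spectrum of Phi (T x W) and the
    length-weighted topic distribution induced by Theta (D x T); the optimal
    number of topics is taken at the minimum of this quantity over T."""
    sv = np.sort(np.linalg.svd(Phi, compute_uv=False))[::-1]   # singular values of Phi
    topic_mass = np.sort(doc_lengths @ Theta)[::-1]            # corpus-wide topic proportions
    return kl(sv, topic_mass) + kl(topic_mass, sv)
```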

The topic model based on the additive regularization algorithm (ARTM), proposed in [16], deserves separate mention. This model searches for the maximum of a linear combination of the log-likelihood and a set of regularizers, with the strength of each regularizer controlled by a coefficient. Despite the wide use of this model in the Russian-language scientific literature, it has one substantial drawback: the theory of additive regularization does not formulate a principle for choosing the values of the regularization coefficients; they must be set explicitly before topic modeling is run. This dissertation proposes a solution to that problem.

One of the main quality measures of topic models is the maximum of the log-likelihood [1] and the related perplexity measure. In general, the log-likelihood makes it possible to tune the hyperparameters of 'flat' topic models but does not allow determining the optimal number of topic clusters. Moreover, the log-likelihood is of little use for tuning hierarchical topic models [2], where there is an additional unsolved problem of choosing the number of topics at each level of the hierarchy, alongside the traditional problem of setting the hyperparameter values.

In addition, the coherence measure is actively used in topic modeling; it estimates the connectedness of topics in a topic solution [67]. The essence of this measure is to count how often high-probability words of a topic co-occur in high-probability documents. High topic coherence corresponds to the best solution. This measure also does not allow determining the optimal number of components in the mixture of distributions because of its monotonic behavior.
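
A small illustrative sketch of a co-occurrence coherence score in the spirit of [67]; the exact weighting and document selection used in the original work may differ, and docs / top_words are toy placeholders.

```python
import numpy as np
from itertools import combinations

def umass_coherence(top_words, docs):
    """Co-occurrence coherence of one topic: sum over word pairs of
    log((co-document frequency + 1) / document frequency)."""
    doc_sets = [set(d) for d in docs]
    def df(w):             # number of documents containing w
        return sum(w in d for d in doc_sets)
    def codf(w1, w2):      # number of documents containing both words
        return sum(w1 in d and w2 in d for d in doc_sets)
    return sum(np.log((codf(w_m, w_l) + 1) / max(df(w_l), 1))
               for w_m, w_l in combinations(top_words, 2))

# toy usage: a coherent word pair scores higher than an unrelated one
docs = [["cat", "dog", "pet"], ["cat", "dog"], ["stock", "market", "price"], ["market", "price"]]
print(umass_coherence(["cat", "dog"], docs), umass_coherence(["cat", "price"], docs))
```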

Thus, topic models of different types share the following unsolved problems: 1. It is unclear how to determine the optimal number of clusters in a topic solution. 2. The existing quality measures are not universal, that is, they do not work for all models. 3. There is no quality measure that would allow tuning several model parameters simultaneously (including hyperparameters, the number of clusters, and semantic coherence).

This work proposes a solution to these problems through the use of parameterized entropies in topic models. The theoretical and experimental evaluation of applying parameterized entropies in topic models is given in Chapter 2.

1.3. Application of entropy principles in topic modeling

The next part of the review is devoted to the use of the annealing model for extracting latent distributions in topic modeling. The works [29, 30] use the classical variant of the annealing algorithm based on a Markov process. Tsallis and Stariolo [74] proposed a modified (generalized) variant of simulated annealing, but this algorithm has not been applied in the field of machine learning.

Zhu et al. [31] proposed the 'maximum entropy discrimination latent Dirichlet allocation' (MedLDA) model, whose essence is the introduction into the log-likelihood of a Kullback-Leibler divergence term that acts as an entropy regularizer. Also noteworthy is the work [32], which proposes a topic model in which the minimum of Shannon entropy computed over words is used to determine the optimal value of the regularization coefficient. A substantial drawback of that work is that the proposed model was tested on datasets labeled with only two topics.
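
A hedged sketch of the selection rule attributed to [32]: choose the regularization coefficient that minimizes the Shannon entropy computed over words. The entropy is assumed here to be taken from the Φ matrix of a fitted model; fit_model and the coefficient grid are hypothetical placeholders.

```python
import numpy as np

def word_shannon_entropy(phi):
    """Shannon entropy of the word distribution aggregated over topics (phi: W x T)."""
    p = phi.mean(axis=1)                  # average word probability across topics
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# hypothetical selection loop: pick the regularization coefficient with minimal word entropy
# entropies = {tau: word_shannon_entropy(fit_model(corpus, tau).phi) for tau in [0.0, 0.1, 0.5, 1.0]}
# best_tau = min(entropies, key=entropies.get)
```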

1.4. Stability of topic models

Despite the large number of works devoted to topic models, the number of works assessing their stability is small. The stability problem of topic models stems from the way topic models operate.

Solving the topic modeling problem is equivalent to a stochastic matrix factorization in which a large matrix F, containing documents d and words w, is approximated by the product of two matrices Θ and Φ of smaller dimension. However, the stochastic matrix factorization is not determined uniquely, only up to a non-degenerate transformation [16]. If F = ΦΘ is a solution, then F = (ΦS)(S⁻¹Θ) is also a solution for any non-degenerate S for which the matrices Φ' = ΦS and Θ' = S⁻¹Θ are stochastic. In terms of the TM algorithm, the ambiguity in reconstructing the multidimensional density of the mixture of distributions means that, starting from different initial approximations, the algorithm converges to different points of the solution set. As a result, different runs of the algorithm on the same input data produce different contents of the matrices Θ and Φ. Problems whose solution is not unique or is unstable are called ill-posed. A general approach to solving them is given by Tikhonov regularization [38]. The essence of regularization is to add prior information, which narrows the set of solutions. Regularization is implemented either by imposing constraints on the matrices Θ and Φ [16] or by modifying the sampling procedure [72, 73].
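
A short numerical illustration of the non-uniqueness argument above: the same matrix F admits different stochastic factorizations related by a non-degenerate mixing matrix S. The concrete S and Θ used here are arbitrary examples chosen so that both factorizations stay stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 6, 2, 4
Phi = rng.random((W, T)); Phi /= Phi.sum(axis=0)             # words x topics, columns sum to 1
Theta = np.array([[0.6, 0.4, 0.5, 0.7],
                  [0.4, 0.6, 0.5, 0.3]])                      # topics x docs, columns sum to 1

# Non-degenerate mixing matrix S whose columns (and those of its inverse) sum to 1
S = np.array([[0.9, 0.2],
              [0.1, 0.8]])
Phi2, Theta2 = Phi @ S, np.linalg.inv(S) @ Theta

print(np.allclose(Phi @ Theta, Phi2 @ Theta2))                # True: same product F
print((Phi2 >= 0).all(), (Theta2 >= 0).all())                 # both factors remain stochastic here
```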

Several works on assessing the stability of topic solutions can be singled out. Griffiths and Steyvers [15] proposed using the symmetric Kullback-Leibler divergence (KLB) to assess the similarity between two topics from different topic solutions; however, that work did not investigate the applicability of this measure in practical experiments in detail. A modified variant of the symmetric KLB measure was proposed by Koltsov et al. [28], where a complete practical algorithm for assessing the stability of topic models over three runs is also presented. Belford et al. [39] proposed the 'Average Descriptor Set Difference' (ADSD) measure, which characterizes the average number of matching words in two topic solutions. The same work also considers the 'Average Term Stability' measure based on the averaged Jaccard distance. Its authors propose a way of extracting stable topics using a 'K-Fold ensemble approach', which they test on LDA with Gibbs sampling and on NMF (non-negative matrix factorization), a model close to TM. Greene et al. [40] also used the 'Average Jaccard' (AJ) measure to determine the optimal number of topics in labeled English-language datasets. De Waal [41] showed that perplexity is of little use for assessing the stability of a topic model, since, first, it depends on the size of the dataset, which makes it difficult to compare different datasets, and, second, it behaves as a monotonically decreasing function.
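
A minimal sketch of an Average-Jaccard-style stability score between two runs, assuming each topic is represented by its list of top words; the matching rule (best match per topic) is a simplification of the published procedures.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def average_jaccard(run_a, run_b):
    """Match every topic of run A with the closest topic of run B by the Jaccard
    similarity of their top-word lists and average the best-match scores."""
    return sum(max(jaccard(ta, tb) for tb in run_b) for ta in run_a) / len(run_a)

# toy usage: two runs, each topic given by its top words
run1 = [["cat", "dog", "pet"], ["stock", "market", "price"]]
run2 = [["market", "price", "trade"], ["dog", "cat", "animal"]]
print(average_jaccard(run1, run2))   # 0.5: each topic finds a partial match
```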

The works mentioned above are aimed only at developing and testing stability measures. Nevertheless, there are a few works that propose modifying the topic model itself in order to increase its stability. Koltsov et al. [42] showed that the choice of regularization coefficients in LDA with Gibbs sampling and in ARTM substantially affects the stability of the topic model, and proposed a granulated variant of the sampling procedure (GLDA) that yields an extremely high level of topic model stability. A detailed description of GLDA is given in Chapter 5 of this dissertation. That work also showed that adding regularizers affects the stability of the topic model.


References of the dissertation research by Doctor of Science Sergei Nikolaevich Koltsov, 2023

1. Chauhan, Uttam and Apurva Shah. "Topic Modeling Using Latent Dirichlet allocation." ACM Computing Surveys (CSUR) 54 (2022): 1 - 35.

2. Koltsov, Sergei, Vera Ignatenko, Maxim Terpilovskii and Paolo Rosso. "Analysis and tuning of hierarchical topic models based on Renyi entropy approach." PeerJ Computer Science 7 (2021): n. pag.

3. Sugar, Catherine A. and Gareth M. James. "Finding the number of clusters in a data set: An information theoretic approach." (2003).

4. Stephens, Greg J., Thierry Mora, Gasper Tkacik and William Bialek. "Statistical thermodynamics of natural images." Physical review letters 110 1 (2013): 018701 .

5. Mirkin, Boris G.. "Clustering for data mining - a data recovery approach." Computer science and data analysis series (2005).

6. Tibshirani, Robert, Guenther Walther and Trevor J. Hastie. "Estimating the number of clusters in a data set via the gap statistic." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2000): n. pag.

7. Fujita, André, Daniel Y. Takahashi and Alexandre Galvao Patriota. "A non-parametric method to estimate the number of clusters." Comput. Stat. Data Anal. 73 (2014): 27-39.

8. Aldana-Bobadilla, Edwin and Ángel Fernando Kuri Morales. "A Clustering Method Based on the Maximum Entropy Principle." Entropy 17 (2015): 151-180.

9. Ramírez-Reyes, Abdiel, Alejandro Raúl Hernández-Montoya, Gerardo Herrera-Corral and Ismael Domínguez-Jiménez. "Determining the Entropic Index q of Tsallis Entropy in Images through Redundancy." Entropy 18 (2016): 299.

10. Milligan, Glenn W. and Martha Cooper. "An examination of procedures for determining the number of clusters in a data set." Psychometrika 50 (1985): 159-179.

11. SH Cha, Taxonomy of nominal type histogram distance measures, Proceedings of the American conference on applied mathematics, 325-330, 2008.

12. Rose, Gurewitz and Fox. "Statistical mechanics and phase transitions in clustering." Physical review letters 65 8 (1990): 945-948.

13. Hofmann, Thomas. "Probabilistic Latent Semantic Indexing." ACM SIGIR Forum 51 (2017): 211 - 218.

14. Blei, David M., A. Ng and Michael I. Jordan. "Latent Dirichlet Allocation." J. Mach. Learn. Res. 3 (2003): 993-1022.

15. Griffiths, Thomas L. and Mark Steyvers. "Finding scientific topics." Proceedings of the National Academy of Sciences of the United States of America 101 (2004): 5228 - 5235.

16. Vorontsov, Konstantin V., Anna Potapenko and Alexander Plavin. "Additive Regularization of Topic Models for Topic Selection and Sparse Factorization." SLDS (2015).

17. Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal and David M. Blei. "Hierarchical Dirichlet Processes." Journal of the American Statistical Association 101 (2006): 1566 -1581.

18. Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal and David M. Blei. "Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes." NIPS (2004).

19. Mimno, David, Wei Li and Andrew McCallum. "Mixtures of hierarchical topics with Pachinko allocation." ICML '07 (2007).

20. Belyy A. V., Seleznova M. S., Sholokhov A. K., Vorontsov K. V. Quality Evaluation and Improvement for Hierarchical Topic Modeling, Computational Linguistics and Intellectual Technologies. Dialogue 2018. pp. 110-123

21. Dieng, Adji B., Francisco J. R. Ruiz and David M. Blei. "Topic Modeling in Embedding Spaces." Transactions of the Association for Computational Linguistics 8 (2020): 439-453.

22. Miao, Yishu, Edward Grefenstette and Phil Blunsom. "Discovering Discrete Latent Topics with Neural Variational Inference." ArXiv abs/1706.00359 (2017): n. pag.

23. Daud, Ali, Juan-Zi Li, Lizhu Zhou and Faqir Muhammad. "Knowledge discovery through directed probabilistic topic models: a survey." Frontiers of Computer Science in China 4 (2009): 280-301.

24. Asuncion, Arthur U., Max Welling, Padhraic Smyth and Yee Whye Teh. "On Smoothing and Inference for Topic Models." UAI (2009).

25. Cao, Juan, Tian Xia, Jintao Li, Yongdong Zhang and Sheng Tang. "A density-based method for adaptive LDA model selection." Neurocomputing 72 (2009): 1775-1781.

26. Arun, R., V. Suresh, C. E. Veni Madhavan and M. Narasimha Murty. "On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations." PAKDD (2010).

27. Roberts, Margaret E., Brandon M Stewart and Dustin Tingley. "Navigating the Local Modes of Big Data: The Case of Topic Models." Computational Social Science (2016).

28. Koltsov, Sergei, Sergei I. Nikolenko, Olessia Koltsova and Svetlana Bodrunova. "Stable topic modeling for web science: granulated LDA." Proceedings of the 8th ACM Conference on Web Science (2016): n. pag.

29. Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov and David Mimno. "Evaluation methods for topic models." ICML '09 (2009).

30. Foulds, James R. and Padhraic Smyth. "Annealing Paths for the Evaluation of Topic Models." UAI (2014).

31. Zhu, Jun, Amr Ahmed and Eric P. Xing. "MedLDA: maximum margin supervised topic models." J. Mach. Learn. Res. 13 (2012): 2237-2278.

32. Sristy, Nagesh Bhattu and Durvasula V. L. N. Somayajulu. "Entropy Regularization for Topic Modelling." I-CARE 2014 (2014).

33. Kadanoff, Leo P.. "Statistical Physics: Statics, Dynamics and Renormalization." (2000).

34. Wilson, Kenneth G.. "Renormalization Group and Critical Phenomena. I. Renormalization Group and the Kadanoff Scaling Picture." Physical Review B 4 (1971): 3174-3183.

35. Olemskoi, A. Synergetics of Complex Systems: Phenomenology and Statistical Theory, KRASAND Publ. House, Moscow, 2009, 384 p. (in Russian).

36. Carpinteri, Alberto, Bernardino Chiaia and Giuseppe Andrea Ferro. "Size effects on nominal tensile strength of concrete structures: multifractality of material ligaments and dimensional transition from order to disorder." Materials and Structures 28 (1995): 311-317.

37. Essam, John W.. "Potts models, percolation, and duality." Journal of Mathematical Physics 20 (1979): 1769-1773.

38. Tikhonov, A.N. and Arsenin, V.Y. Solutions of Ill-Posed Problems. Winston, New York, (1977).

39. Belford, Mark, Brian Mac Namee and Derek Greene. "Stability of topic modeling via matrix factorization." Expert Syst. Appl. 91 (2018): 159-169.

40. Greene, Derek, Derek O'Callaghan and Padraig Cunningham. "How Many Topics? Stability Analysis for Topic Models." ECML/PKDD (2014).

41. De Waal, A., Barnard, E.: Evaluating topic models with stability. In: 19th Annual Symposium of the Pattern Recognition Association of South Africa (2008).

42. Koltsov, Sergei, Sergei I. Nikolenko, Olessia Koltsova, Vladimir Filippov and Svetlana Bodrunova. "Stable Topic Modeling with Local Density Regularization." INSCI (2016). Lecture Notes in Computer Science series Vol. 9934. Switzerland : Springer, (2016)

43. Derbanosov, R. and M. Bakhanova. "Stability of topic modeling via modality regularization." Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference Dialogue (2020).

44. Agrawal, Amritanshu, Wei Fu and Tim Menzies. "What is wrong with topic modeling? And how to fix it using search-based software engineering." Inf. Softw. Technol. 98 (2018): 74-88.

45. Koltcov, Sergei. "A thermodynamic approach to selecting a number of clusters based on topic modeling." Technical Physics Letters 43 (2017): 584-586.

46. Koltcov, Sergei. "Application of Renyi and Tsallis entropies to topic modeling optimization." Physica A: Statistical Mechanics and its Applications (2018): n. pag.

47. Tsallis, Constantino. "Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World." (2009).

48. Beck, Christian. "Generalised information and entropy measures in physics." Contemporary Physics 50 (2009): 495 - 510.

49. Mora, Thierry and Aleksandra M. Walczak. "Renyi entropy, abundance distribution, and the equivalence of ensembles." Physical review. E 93 5 (2016): 052418.

50. Beck, Christian and Friedrich Schogl. "Thermodynamics of chaotic systems." (1993).

51. Klimontovich, Yu. L. Statistical Theory of Open Systems (Yanus, Moscow, 1995; Springer, Dordrecht, 1995).

52. Sharma, Bhu Dev and Asha Garg. "Nonadditive Measures of Average Charge for Heterogeneous Questionnaires." Inf. Control. 41 (1979): 232-242.

53. Nielsen, Frank and Richard Nock. "A closed-form expression for the Sharma-Mittal entropy of exponential families." Journal of Physics A: Mathematical and Theoretical 45 (2011): n. pag.

54. Jaccard, P. "The distribution of the flora in the alpine zone." New Phytologist 11 (1912): 37-50.

55. Parker, Austin J., Kelly B. Yancey and Matthew P. Yancey. "Regular Language Distance and Entropy." MFCS (2017).

56. Koltsov, Sergei, Vera Ignatenko and Olessia Koltsova. "Estimating Topic Modeling Performance with Sharma-Mittal Entropy." Entropy 21 (2019): n. pag.

57. Koltsov, Sergei, Vera Ignatenko and Sergei Pashakhin. "How Many Clusters? An Entropic Approach to Hierarchical Cluster Analysis." SAI (2020).

58. News Dataset from Usenet. Available online: http://qwone.com/~jason/20Newsgroups/ (accessed on 31 October 2019).

59. Basu, Sugato, Ian Davidson and Kiri L. Wagstaff. "Constrained Clustering: Advances in Algorithms, Theory, and Applications." (2008).

60. Lesche, Bernhard. "Instabilities of Renyi entropies." Journal of Statistical Physics 27 (1982): 419-422.

61. Blei, David M., Thomas L. Griffiths, Michael I. Jordan and Joshua B. Tenenbaum. "Hierarchical Topic Models and the Nested Chinese Restaurant Process." NIPS (2003).

62. Mimno, David, Wei Li and Andrew McCallum. "Mixtures of hierarchical topics with Pachinko allocation." ICML '07 (2007).

63. Chirkova, Nadezhda. "Additive Regularization for Hierarchical Multimodal Topic Modeling.". Machine Learning and Data Analysis, 2:187-200. (2016)

64. Ignatenko, Vera, Sergei Koltsov, Steffen Staab and Zeyd Boukhers. "Fractal approach for determining the optimal number of topics in the field of topic modeling." Journal of Physics: Conference Series (2019): n. pag.

65. Schroeder, Manfred. "Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise." (1991).

66. Koltcov, Sergei and Vera Ignatenko. "Renormalization Analysis of Topic Models." Entropy 22 (2020): n. pag.

67. Mimno, David, Hanna M. Wallach, Edmund M. Talley, Miriam Leenders and Andrew McCallum. "Optimizing Semantic Coherence in Topic Models." EMNLP (2011).

68. Koltsov, Sergei and Vera Ignatenko. 2020. "Renormalization Analysis of Topic Models" Entropy 22, no. 5: 556. https://doi.org/10.3390/e22050556.

69. Rosenblatt, Murray. "Remarks on Some Nonparametric Estimates of a Density Function." Annals of Mathematical Statistics 27 (1956): 832-837.

70. Epanechnikov, V. A. Nonparametric estimation of multidimensional probability density. Theory Probab. Appl. 14, 153-158, 1973.

71. Koltsov, Sergei, Sergei I. Nikolenko and E. Y. Koltsova. "Gibbs sampler optimization for analysis of a granulated medium." Technical Physics Letters 42 (2016): 837-839.

72. Newman, David, Edwin V. Bonilla and Wray L. Buntine. "Improving Topic Coherence with Regularized Topic Models." NIPS (2011).

73. Andrzejewski, David and Xiaojin Zhu. "Latent Dirichlet Allocation with Topic-in-Set Knowledge." HLT-NAACL 2009 (2009).

74. Tsallis, Constantino and Daniel A. Stariolo. "Generalized simulated annealing." Physica A-statistical Mechanics and Its Applications 233 (1996): 395-406.

Appendix A: Koltcov S., Ignatenko V., Terpilovskii M., Rosso P. Analysis and tuning of hierarchical topic models based on Renyi entropy approach // PeerJ Computer Science. 2021. Vol. 7. Open access: https://peerj.com/articles/cs-608/

Analysis and tuning of hierarchical topic models based on Renyi entropy approach

Sergei Koltcov1, Vera Ignatenko1, Maxim Terpilovskii1 and Paolo Rosso1,2

1 Laboratory for Social and Cognitive Informatics, National Research University Higher School of Economics, St. Petersburg, Russia

2 Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Valencia, Spain

Submitted 10 February 2021. Accepted 31 May 2021. Published 29 July 2021.

Corresponding author: Sergei Koltcov, skoltsov@hse.ru

Academic editor: Arkaitz Zubiaga

DOI 10.7717/peerj-cs.608

© Copyright 2021 Koltcov et al. Distributed under Creative Commons CC-BY 4.0 (open access).

ABSTRACT

Hierarchical topic modeling is a potentially powerful instrument for determining topical structures of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of hierarchy, remains a challenging task. In this paper, we propose an approach based on Renyi entropy as a partial solution to the above problem. First, we introduce a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical approach to obtaining the "correct" number of topics in hierarchical topic models and show how model hyperparameters should be tuned for that purpose. We test this approach on the datasets with the known number of topics, as determined by the human mark-up, three of these datasets being in the English language and one in Russian. In the numerical experiments, we consider three different hierarchical models: hierarchical latent Dirichlet allocation model (hLDA), hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that the hLDA model possesses a significant level of instability and, moreover, the derived numbers of topics are far from the true numbers for the labeled datasets. For the hPAM model, the Renyi entropy approach allows determining only one level of the data structure. For hARTM model, the proposed approach allows us to estimate the number of topics for two levels of hierarchy.

Subjects: Data Mining and Machine Learning, Data Science, Natural Language and Speech

Keywords: Topic modeling, Renyi entropy, Hierarchical topic models, Optimal number of topics

INTRODUCTION

The large flow of news generated by TV channels, electronic news sources and social media is very often represented as a hierarchical system. In such a system, news items or messages are divided into a number of global topics, such as politics, sports, or health. Within each of the main topics, the documents usually demonstrate a considerable diversity and can be divided into subtopics, such as COVID-19, healthcare availability or healthy food within the global topic of health. The hierarchical division of information content is highly convenient and seems to reflect fundamental cognitive features of humans (Cohen, 2000; Palmer, 1977; Taylor et al, 2015). Therefore, starting from 2004, active development of probabilistic topic models that allow for identifying hierarchical topical


structures in large datasets has begun (Blei et al, 2003; Li & McCallum, 2006). However, such models have a set of parameters, which need to be tuned to obtain a topical solution of higher quality. Correspondingly, a problem of tuning hierarchical topic models arises. A solution to this problem is hindered by a number of factors. First, there are no generally accepted or appropriate metrics of quality that take into account the features of hierarchical modeling, namely, the relations between pairs of parent and child topics. Second, the number of publicly available datasets with the hierarchical mark-up, which can be used to tune and compare hierarchical models, is very limited. Third, when applying hierarchical topic models on real datasets, the type of topical structure of data (non-hierarchical against hierarchical, as well as the number of levels in the case of hierarchical structure) is usually not known in advance. Fourth, hierarchical topic models, just like flat topic models, possess a certain level of semantic instability, which means that different runs of the algorithm on the same source data lead to different solutions, which, in turn, correspond to different local maxima of the model posterior: the model may arrive to different local maxima depending on the randomness in initialization and sampling. An analysis and discussion of topic models instability can be found in work (Koltsov et al., 2016). This problem complicates the search for optimal model hyperparameters on a given dataset. Thus, investigation and assessment of the ability to tune hierarchical topic models is an important task.

In this work, we investigate the behavior of three hierarchical models, namely, hierarchical latent Dirichlet allocation (hLDA) (Blei et al., 2003), hierarchical Pachinko allocation (hPAM) (Mimno, Li & McCallum, 2007), and hierarchical additive regularization of topic models (hARTM) (Chirkova & Vorontsov, 2016), in terms of two metrics: log-likelihood and Renyi entropy. We conduct experiments on four marked-up collections, two of which are non-hierarchical and two others have two-level hierarchical mark-up. The latter means that assessors have assigned each news item to a lower-level topic which in turn has been assigned to a higher-level topic, and have also determined the overall number of topics at each level and the number of lower-level topics within each higher-level topic. The goal of our research is threefold. First, we aim to estimate the ability of hierarchical models to identify whether the data structure is hierarchical or non-hierarchical. Second, we seek to develop a metric of quality that is suitable for hierarchical models, notably for topic number search at different levels of hierarchy. Third, based on this metric, we aim to offer an approach for tuning model hyperparamters so as to find the true number of topics or a value close to it. For this, we propose an extension of our entropic approach (Koltcov, 2018), which was developed earlier for non-hierarchical topic models, as a partial solution to the hierarchical model tuning. The approach is based on the information theory, according to which the information maximum corresponds to the entropy minimum. This means that topic modeling solutions with minimal entropy are likely to be the most informative which indicates a potential utility of entropic approach for tuning all model hyperparameters. However, in this paper we focus on topic number optimization specifically.

To simplify the structure of this work, an overview of hierarchical models and existing metrics of quality is provided in Appendix A. Thus, our work consists of the following


parts. 'Entropic approach to hierarchical topic modeling' describes the entropic approach for estimating the quality of topic models and introduces the adaptation of this approach to hierarchical models. 'Computational experiments' outlines the design of our computer experiments for each of the considered models. 'Numerical results' contains an analysis of the behavior of hierarchical models under variation of hyperparameters and the number of topics for the three models on four datasets. 'Discussion' interprets the obtained results and reviews the possible limitations of the proposed approach. 'Conclusion' summarizes our findings and contains practical recommendations for choosing between hierarchical models. Appendix B contains additional experiments with different types of pre-processing that investigate the influence of the latter on behavior of Renyi entropy.

ENTROPIC APPROACH TO HIERARCHICAL TOPIC MODELING

The entropic approach for topic models is based on the concept that an information system consisting of a large number of words and documents represents a statistical system. The state of such a statistical system can be characterized by the value of entropy. It is well known that the maximum entropy of a statistical system corresponds to either chaos or a uniform distribution of the system elements. However, unlike real physical systems, a text collection can be subject to procedures (for instance, clustering or topic modeling) that change the value of entropy by ordering the data. Based on the fact that entropy equals minus information S = - I (Beck, 2009), one can implement the process of document clustering or topic modeling in such a way that leads to the entropy minimum (information maximum), which corresponds to a highly non-equilibrium distribution. Since the procedure of clustering of text collections strongly depends on the values of model parameters, which include the number of topics as a parameter, an algorithm of model tuning can be organized in the form of searching for parameters that would lead to minimum entropy. Let us note that modern widely used topic models usually include no hyperparameters optimization algorithms (or include only for some of them). Thus, this is the user who has to select the values of model parameters based on some external metrics of quality.

Recent research (Koltcov, 2018; Koltcov et al., 2020) has demonstrated that the most convenient metric for estimating the state of a textual statistical system is Renyi entropy, whose calculation is entirely based on the words' probabilities of belonging to a particular cluster or topic obtained in the process of topic modeling (matrix Φ = {φ_wt}, where φ_wt is the probability of word w in topic t; for more details we refer the reader to Appendix A). Calculation of Renyi entropy for a topic model is based on two observable variables: (1) Gibbs-Shannon entropy of the model; (2) internal energy of the model. Gibbs-Shannon entropy S can be calculated as follows: S = ln(ρ) = ln(N/(W·T)), where N is the number of words with φ_wt > 1/W, W is the number of unique words in the dataset, and T is the number of topics. The internal energy E of a topic model can be expressed in the following way: E = −ln(P̃) = −ln((1/T)·Σ_{w,t} φ_wt · 1{φ_wt > 1/W}), where Σ_{w,t} φ_wt · 1{φ_wt > 1/W} is the sum of probabilities of words, each of which is above the threshold 1/W. Then, Renyi entropy can be calculated through free energy (F = −q·E + S) in the following way:

S^R_q = F / (q − 1) = (q·ln(P̃) + ln(ρ)) / (q − 1),

where the deformation parameter q = 1/T is the inverse number of topics. Thus, Renyi entropy of a topic model explicitly contains the 'number of topics' as a deformation parameter. Application of Renyi entropy for tuning different non-hierarchical topic models is considered in work (Koltcov et al., 2020).
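
A minimal sketch of the Renyi entropy computation following the reconstruction of the formulas above; phi is assumed to be the word-topic matrix of a fitted model, and fit_model in the usage comment is a hypothetical placeholder.

```python
import numpy as np

def renyi_entropy(phi):
    """Renyi entropy of a topic solution from the word-topic matrix phi (W x T),
    following the definitions above: rho = N/(W*T), P_tilde = (1/T) * sum of
    probabilities above 1/W, and S_R = (q*ln(P_tilde) + ln(rho)) / (q - 1), q = 1/T."""
    W, T = phi.shape
    q = 1.0 / T
    mask = phi > 1.0 / W                      # word-topic entries above the threshold
    N = mask.sum()                            # number of such entries
    P_tilde = phi[mask].sum() / T             # probability mass above the threshold, per topic
    rho = N / (W * T)
    return (q * np.log(P_tilde) + np.log(rho)) / (q - 1)

# usage sketch: scan the number of topics and keep the value with minimal Renyi entropy
# (fit_model is a hypothetical placeholder returning a model with a .phi attribute)
# entropies = {T: renyi_entropy(fit_model(corpus, T).phi) for T in range(2, 51)}
```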

As shown in Blei et al. (2003) and Mimno, Li & McCallum (2007), the structure in hierarchical topic models can be represented as a graph, where each node at each level is a topic t of the corresponding degree of abstractness. While flat topic modeling results in obtaining one distribution of probabilities of all words over all topics (represented in matrix Φ), and one distribution of probabilities of all documents over all topics (represented in matrix Θ), hierarchical topic modeling constructs one pair of these matrices per each level of hierarchy (Φ_i, Θ_i, where i is the level of the hierarchy). The number of levels, and most often the number of nodes-topics per each level, is a value pre-selected by a user. In some models, topics of a lower level are united in clusters each of which is nested in only one topic of a higher level; in other models, hierarchy levels are unrelated and only differ by the number of topics, so that lower levels contain more topics. At all levels, matrices Φ_i, Θ_i use the same number of words W and documents D, which equals the total number of words and documents in the collection, respectively. However, the share of words with high probabilities (i.e., probabilities above the threshold) is the lower, the higher is the level of hierarchy.

As it can be seen, instead of a single topical solution, hierarchical topic modeling obtains a set of solutions whose number equals the number of levels L. Each solution can be characterized by the following variables: (1) the number of topics T_i on level i; (2) the number of words N_i with high probabilities: φ^i_wt > 1/W; (3) the sum of probabilities above the threshold: P̃_i = Σ_{w,t} φ^i_wt · 1{φ^i_wt > 1/W}. Based on these variables, one can calculate the internal energy E_i and Gibbs-Shannon entropy S_i of the solution of level i with respect to the equilibrium state of that level: E_i = −ln(P̃_i / T_i), S_i = ln(N_i / (W·T_i)). Using S_i and E_i, one can determine the free energy and Renyi entropy of level i. Free energy is expressed as follows: F_i = E_i − T_i · S_i. Renyi entropy of level i can be expressed in the following way: S^R_i = F_i / (q⁻¹ − 1), where q = 1/T_i is the deformation parameter characterizing each level of the hierarchy.

By measuring the value of entropy at each level while consecutively increasing the number of topics, it is possible to find a solution optimal in terms of information theory. In such a design, the process of clustering of words by topics starts with the minimum information (maximum entropy), when all the elements (words) of the statistical system are assigned to one or two topics, and ends also with the maximum entropy, when all the elements are almost uniformly distributed over topics (when the number of topics is large). The locations of the global minimum and of several possible local minima of Renyi entropy, as a function of the number of topics, are determined by the features of the data. Thus, Renyi entropy $S^R$ serves as a measure of disequilibrium for a given system. By additionally varying other hyperparameters, one can estimate how optimal each combination of them is.

RESULTS

Data

In our numerical experiments, the following datasets were used:

• 'Lenta' dataset (from lenta.ru news agency, available at https://www.kaggle.com/yutkin/corpus-of-russian-news-articles-from-lenta). This dataset contains 8,630 documents with a vocabulary of 23,297 unique words in the Russian language. Each of these documents is manually assigned to a class from a set of 10 topic classes. Since some of the topics are closely related to each other, 7-10 topics can describe the documents of this dataset.

• '20 Newsgroups' dataset (http://qwone.com/~jason/20Newsgroups/) is a widely used dataset in the field of topic modeling. It consists of 15,425 news articles in the English language with 50,965 unique words. Each of the news items is assigned to one of 20 topic groups. Since some of these topics might be combined, 14-20 topics can describe this dataset's documents according to Basu, Davidson & Wagstaff (2008).

• 'WoS' dataset (available at https://data.mendeley.com/datasets/9rw3vkcfy4/1) is a dataset in the English language with a two-level hierarchical mark-up. The original dataset contains 46,985 abstracts of published papers available from the Web of Science. The first level contains seven categories (domains): computer science, electrical engineering, psychology, mechanical engineering, civil engineering, medical science, and biochemistry. The second level is comprised of 134 specific topics (areas), each of which belongs to one of the categories of the first level. The number of unique words is 80,337. This dataset is often used as a benchmark for hierarchical classification (Sinha et al., 2018). However, this dataset exhibits a highly unbalanced distribution of the number of documents per sub-category. For instance, some sub-categories contain more than 700 documents, while others are represented by less than 50 documents. Therefore, we also consider a balanced subset of this dataset (described below), where poorly presented topics, i.e., topics with a small number of documents, were deleted.

• 'Balanced WoS' dataset (available at https://data.mendeley.com/datasets/9rw3vkcfy4/1) is a class-balanced subset of the 'WoS' dataset, which contains 11,967 abstracts. The first level contains seven categories and the second level consists of 33 areas.

• 'Amazon' dataset (available at https://www.kaggle.com/kashnitsky/hierarchical-text-classification/version/1) is a dataset with a three-level hierarchical mark-up. It contains 40,000 product reviews in English from Amazon. The vocabulary of this dataset consists of 31,486 unique words. Level 1 of the hierarchical mark-up contains six categories, level 2 contains 64 categories, and level 3 contains 510 categories. We consider only the first two levels since the third level contains 'unknown' labels. Let us note that the original dataset is highly imbalanced. Some sub-categories contain less than 50 documents, while other sub-categories contain more than 2,000 documents. Therefore, a balanced subset of this dataset is also considered.

• 'Balanced Amazon' dataset is a subset of the 'Amazon' dataset that includes only sub-categories with the number of documents above 500. As a result, level 1 contains six categories and level 2 contains 27 sub-categories. The total number of documents is 32,774, and the number of unique words is 28,422.

Statistical features of the above datasets are summarized in Table 1.

Computational experiments

In our numerical experiments, we used implementations of hPAM and hLDA models from tomotopy package, version 0.9.1 (https://bab2min.github.io/tomotopy/v0.9.1/en/), and the implementation of hARTM model from BigARTM package, version 0.10.1 (https:// bigartm.readthedocs.io/en/stable/). With each of them, we performed topic modeling on both types of our datasets—those with the hierarchical and the flat mark-up. Correspondingly, the results are divided into two parts. In the first part, we analyze the applicability of the models for the datasets with flat structure, while focusing on the datasets with hierarchical mark-up in the second part. Since topic modeling possesses a certain level of instability that leads to fluctuations in the word probabilities, all calculations were performed at least six times (for each combination of hyperparameters), and then the results were averaged. For each model we calculated Renyi entropy, perplexity and log-likelihood, the two latter having been chosen as the most common metrics for hyperparameters tuning in topic modeling. However, as perplexity is in fact a reciprocal geometric mean of log-likelihood (see Appendix A), its behavior has turned out to be nearly identical to the latter. Therefore, we do not report it in our analysis. In this paper we also leave out metrics related to semantic coherence or semantic stability, such as standard topic coherence (Mimno et al., 2011), tf-idf coherence (Nikolenko, Koltcov & Koltsova, 2017), Jaccard index or their extensions, since their use needs significant adaptation to our task and therefore deserves a separate investigation. Thus, while these metrics, after an adaptation, can be used independently, a more promising avenue is to include them as an entropy parameter into two-parametric Sharma-Mittal entropy, as is shown in Koltcov, Ignatenko & Koltsova (2019). However, this is beyond the scope of this paper.

After entropy and log-likelihood are calculated for a sequence of solutions with a varying number of topics, we plot the values of these metrics as functions of the number of topics and investigate whether the minima of these graphs, if any, fall on or near the value of the number of topics suggested by the human mark-up. As we do not infer the methods of calculation of either entropy or log-likelihood from the data, we do not divide our collections into training and test sets, and both metrics are calculated on the entire datasets. In an additional set of experiments described in Appendix B, we evaluate the influence of three different types of text preprocessing on the location of the minimum of entropy as a function of the number of topics.
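As a rough illustration of this protocol (again our own sketch, not the authors' pipeline), the fragment below averages the entropy of repeated runs for each candidate number of topics and reports the location of the minimum; `train_phi` is a hypothetical callable that fits a model with a given number of topics and random seed and returns its word-topic matrix, and `renyi_entropy` is the helper sketched above.

```python
import numpy as np

def entropy_curve(topic_range, n_runs, train_phi):
    """Mean Renyi entropy per number of topics, averaged over n_runs random restarts."""
    means = []
    for T in topic_range:
        values = [renyi_entropy(train_phi(T, seed=run)) for run in range(n_runs)]
        means.append(np.mean(values))
    means = np.asarray(means)
    best_T = list(topic_range)[int(np.argmin(means))]  # candidate optimal number of topics
    return best_T, means
```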

Below we describe specific features of the main experiments for each of the three tested algorithms.


Table 1 Statistics of the datasets.

Dataset         Number of documents  Vocabulary size  Labeling          Number of topics (T1 / T2)

Lenta           8,630                23,297           Non-hierarchical  10 / -
20 Newsgroups   15,425               50,965           Non-hierarchical  20 / -
WoS             46,985               80,337           Hierarchical      7 / 134
Bal. WoS        11,967               36,488           Hierarchical      7 / 33
Amazon          40,000               31,486           Hierarchical      6 / 64
Bal. Amazon     32,774               28,422           Hierarchical      6 / 27

hPAM model

The hPAM model depends on the following parameters: (1) the number of topics at the second and the third levels; (2) hyperparameter $\eta$; (3) hyperparameter $\alpha$. The number of topics at the first level of hPAM model is always set to one. Moreover, in this model, the user can only set the initial value of parameter $\alpha$, and then the algorithm tunes it during the modeling. Our experiments demonstrated that variation of the initial value of $\alpha$ (denoted $\alpha_0$) does not influence the results of modeling, namely, different values of $\alpha_0$ lead to almost the same topic solutions and the same final value of $\alpha$. Therefore, in the rest of the paper, we fix $\alpha_0$ to 0.0001. For a more detailed description of the model, we refer the reader to Appendix A.

In our work, hPAM model was studied in two stages. First, we investigated the second hierarchical level in the following way: the number of topics on the third level was fixed ($T_2 = 1$), while hyperparameter $\eta$ was varied in the range (0.001, 1) and the number of topics on the second level ($T_1$) was varied in the range (2, 200). For each solution, we calculated Renyi entropy and log-likelihood of the second level of the hierarchy. Then, we found and fixed several pairs of values of $T_1$ and $\eta$ that were the closest to the minimum of Renyi entropy. Second, we investigated the third hierarchical level, where the number of topics $T_2$ on the third level was varied under the condition of a fixed pair of $T_1$ and $\eta$. For each combination of $T_2$ and the pair ($T_1$; $\eta$), Renyi entropy of the third level was calculated.
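A sketch of the first stage of this grid is given below. It assumes that tomotopy's HPAMModel accepts k1 (super-topics), k2 (sub-topics), alpha and eta in its constructor, that topic-word distributions are available through get_topic_word_dist, and that topic 0 is the root followed by the k1 super-topics; these API details are assumptions to be checked against the tomotopy 0.9.1 documentation rather than taken as given.

```python
import numpy as np
import tomotopy as tp

def hpam_second_level_phi(docs, t1, eta, iters=200):
    """Fit hPAM with t1 super-topics and a single sub-topic, return the level-2 phi matrix."""
    mdl = tp.HPAMModel(k1=t1, k2=1, alpha=0.0001, eta=eta)  # assumed constructor signature
    for tokens in docs:                                     # docs: list of lists of tokens
        mdl.add_doc(tokens)
    mdl.train(iters)
    # Assumed topic layout: id 0 is the root, ids 1..t1 are the super-topics.
    phi = np.array([mdl.get_topic_word_dist(t) for t in range(1, t1 + 1)]).T
    return phi, mdl.ll_per_word

# Stage 1 grid over eta and the number of second-level topics:
# for eta in (0.001, 0.01, 0.2, 0.3, 0.5, 0.7, 1.0):
#     for t1 in range(2, 201):
#         phi, ll = hpam_second_level_phi(docs, t1, eta)
#         print(eta, t1, renyi_entropy(phi), ll)
```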

hLDA model

The hLDA model has the following parameters: (1) depth of the hierarchy; (2) hyperparameter $\alpha$, which is tuned by the model automatically; (3) hyperparameter $\gamma$; (4) hyperparameter $\eta$. hLDA model is non-parametric; therefore, it infers the number of topics on each level automatically.

In this work, we studied the dependence of the number of inferred topics on the parameter $\eta$, which was varied in the range (0.001, 1). The depth of the hierarchy was set to three, as in the experiments in the original work (Blei et al., 2003). The influence of parameter $\gamma$ was not investigated in this work. Since this model is highly unstable and can produce different numbers of topics for runs with the same values of hyperparameters, we ran the model 10 times for each value of $\eta$. Then, we estimated the range of the derived numbers of topics on the second and the third levels. The mean values of Renyi entropy and log-likelihood were calculated for each level.
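The sketch below illustrates this repeated-run protocol for hLDA, assuming tomotopy's HLDAModel takes depth, eta, gamma and seed in its constructor and exposes the number of currently active topics as live_k; both assumptions should be verified against the package documentation, and the per-level breakdown used in the paper would require additional bookkeeping.

```python
import tomotopy as tp

def hlda_inferred_topics(docs, eta, depth=3, n_runs=10, iters=200):
    """Run hLDA several times with the same eta and collect the inferred numbers of topics."""
    counts = []
    for seed in range(n_runs):
        mdl = tp.HLDAModel(depth=depth, eta=eta, gamma=0.1, seed=seed)  # assumed signature
        for tokens in docs:
            mdl.add_doc(tokens)
        mdl.train(iters)
        counts.append(mdl.live_k)   # assumed: number of non-empty topics across all levels
    return min(counts), max(counts)
```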

hARTM model

The hARTM model has the following parameters: (1) the number of topics at each level of the hierarchy; (2) seed, a parameter describing the initialization procedure (it defines the work of the random number generator). This model was also studied in two stages. First, the number of topics on the second level was fixed ($T_2 = 1$), and the number of topics on the first level was varied in the range of (2, 200) topics. Based on the location of the minimum of Renyi entropy, the optimal $T_1$ was chosen. Second, the number of topics on the second level was varied under the condition of fixed $T_1$. For each run of the model, the parameter seed was randomly selected to investigate the variability of the model output.
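The two-stage protocol for hARTM can be sketched roughly as follows, assuming BigARTM's Python API provides artm.hARTM with add_level, fit_offline and per-level get_phi, and that add_level accepts num_topics and parent_level_weight; these names are assumptions based on the BigARTM documentation and should be double-checked for version 0.10.1.

```python
import artm

def fit_hartm(batch_vectorizer, dictionary, t1, t2, seed, passes=20):
    """Fit a two-level hARTM hierarchy and return the phi matrices of both levels (API assumed)."""
    hier = artm.hARTM(dictionary=dictionary, seed=seed)
    level1 = hier.add_level(num_topics=t1)                         # coarse level
    level2 = hier.add_level(num_topics=t2, parent_level_weight=1)  # fine level tied to level1
    hier.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=passes)
    return level1.get_phi().values, level2.get_phi().values

# Stage 1: vary t1 with t2 = 1 and pick t1 at the Renyi entropy minimum;
# stage 2: fix that t1, vary t2, and draw a new random seed for every run.
```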

Numerical results

hPAM model (investigation of the second hierarchical level)

Figures 1-4 contain the results of the first stage of our test for hPAM model. They demonstrate the behavior of Renyi entropy under variation of $\eta$ and the number of topics on the second level of the hierarchy for different datasets. A similar pattern of Renyi entropy behavior for all datasets is observed in the ranges of small (about 2-3) and large (about 100-200) numbers of topics. These ranges correspond to the two extreme states of the statistical system characterized by entropy maximum. Moreover, one can see that the location and the value of the minimum of Renyi entropy significantly depend on the parameter $\eta$. Large values of $\eta$ ($\eta > 0.7$) lead to significant fluctuations in Renyi entropy for large numbers of topics, which complicates finding the entropy minimum, the more so the higher the number of topics is. Correspondingly, a further increase of $\eta$ is inadvisable.



Figure 2 Renyi entropy curves (hPAM). 20 Newsgroups dataset.


Figure 3 Renyi entropy curves (hPAM). Balanced WoS dataset.


The behavior of Renyi entropy for this model allows us to find approximations of the optimal number of topics on the second level for different datasets, as well as the optimal value of $\eta$, by means of selecting the values that correspond to the minimum entropy. Since we test the model on the datasets with human mark-up, we can estimate the error of the found approximation of the number of topics.


Figure 4 Renyi entropy curves (hPAM). Balanced Amazon dataset.


Table 2 Minimum points of Renyi entropy for hPAM model. Potentially interesting combinations of parameters that were used in our calculations at the second stage are highlighted in bold.

η       Lenta           20 Newsgroups    Bal. WoS        WoS             Bal. Amazon     Amazon
        min.    T1      min.    T1       min.    T1      min.    T1      min.    T1      min.    T1
0.001   3.54    7       3.27    9        3.66    8       3.25    8       3.68    19      3.93    14
0.01    3.18    6       3.06    10       2.89    10      2.96    9       3.6     16      3.74    15
0.2     3.13    6       3.19    13       2.82    12      2.67    14      3.45    13      3.47    12
0.3     3.12    8       3.17    10       2.79    12      2.68    8       3.4     7       3.39    7
0.5     3.15    7       3.14    9        2.77    10      2.66    9       3.34    6       3.29    6
0.7     3.23    5       3.08    8        2.87    8       2.95    5       3.44    5       4.01    3
1       3.38    4       3.09    6        2.96    6       3.39    6       3.93    3       3.93    3

For non-hierarchical datasets (Lenta and 20 Newsgroups), the error corresponds to ±2 topics (Figs. 1 and 2). For hierarchical datasets, we obtain an error of ±4 topics (Figs. 3 and 4).

The list of values of $\eta$ and the corresponding numbers of topics with the values of average minimum Renyi entropy on the second hierarchical level is given in Table 2 ($T_1$ refers to the number of topics on the second level of the hierarchy). Potentially interesting combinations of parameters that were used in our calculations at the second stage are highlighted in bold.

Calculation of log-likelihood under variation of $T_1$ and $\eta$ demonstrates that this metric is not useful for selecting the optimal values of $\eta$ and $T_1$, since it has large fluctuations in the entire range of $T_1$ (Figs. 5 and 6).



Figure 7 Renyi entropy curves (hPAM). (A) Lenta dataset. (B) Balanced WoS dataset. (C) WoS dataset.


It should be noted that such fluctuations are typical for both balanced and unbalanced datasets. Moreover, parameter $\eta$ visibly influences the location of a spike. Therefore, the estimation of the number of topics on the third level is much less accurate than that on the second level. Our calculations demonstrate that in the region of large fluctuations the model deteriorates. This means that the number of words with high probabilities becomes constant and does not change with the growth of the number of topics. Correspondingly, the sum of high probabilities also becomes constant, i.e., statistical features of obtained solutions do not change. In this case, Renyi entropy changes only because the number of topics changes while other variables are constant. Thus, due to the features of hPAM model, the selection of the number of topics on the third level is complicated. Moreover, the dependence of log-likelihood on $T_2$ has similar behavior as for the second hierarchical level and, thus, does not allow us either to choose the right number of topics for hPAM model or to determine whether the dataset has a hierarchical or flat structure.

hLDA

Our calculations demonstrate that hLDA model cannot be used in real applications since it infers very different numbers of topics for different runs with the same parameters. Moreover, the inferred numbers of topics are far away from the true number for the considered datasets. In addition, the order of dispersion and the mean value of the predicted number of topics on each level significantly depend on the value of parameter $\eta$. Due to high instability for $\eta < 0.3$ and incorrect output for $\eta > 0.3$, there is no sense in applying Renyi entropy approach to this model. Finally, log-likelihood metric does not


minimum is close to the human mark-up, namely seven topics for Lenta dataset and 14 topics for 20 Newsgroups dataset. The behavior of Renyi entropy on the second level of the hierarchy is almost identical to that of the first level; therefore, we do not provide figures. Let us note that the Renyi entropy curve for hARTM model does not have sharp jumps compared to the hPAM model. Thus, the entropy approach can be successfully used for determining the structure of non-hierarchical datasets modeled with hARTM.

For hierarchical datasets, we observe two minima of Renyi entropy, which allows us to identify two levels in the data structure. The first (global) minimum of Renyi entropy approximately corresponds to the first level of the hierarchical mark-up of the dataset, and the second minimum corresponds to the second level. For instance, Renyi entropy for the balanced WoS dataset has a global minimum at the value of 10 topics, while human mark-up suggests seven topics. The second (local) minimum of Renyi entropy corresponds to 36-42 topics, and the mark-up discerns 34 topics. For the balanced Amazon dataset, the first minimum of Renyi entropy corresponds to 10 topics, with 6 topics on the first level provided by the mark-up. The second minimum is found at 38 topics, whereas the mark-up indicates 27 topics on the second level. Thus, the estimation of the number of topics on the second level has a larger error compared to that of the first level. However, to the best of our knowledge, Renyi entropy approach provides the best accuracy of estimating the number of topics for this hierarchical model. Moreover, our approach allows us to determine the very presence of hierarchical structure in the data. Let us note that balancing the datasets allows improving the results of topic modeling in terms of topic salience, making them strongly pronounced. This is achieved by obtaining higher probabilities of the most probable words (that is, by obtaining more skewed distributions of word probabilities within a topic). This, in turn, leads to the formation of a more pronounced entropy minimum and, on average, to lower values of entropy across the runs of the algorithm with different parameters. Figures 8C and 8D demonstrate the difference between Renyi entropy for the balanced and unbalanced datasets. The effect of balancing is especially pronounced in the region of entropy minima, while for large numbers of topics the effect is less observable. This happens because the increase in the number of topics leads to an almost uniform distribution of word probabilities regardless of the dataset content.

Figure 9 demonstrates the behavior of log-likelihood depending on the number of topics for non-hierarchical and hierarchical datasets. Let us note that the behavior is monotone and does not allow determining the dataset structure.

DISCUSSION

As our experiments demonstrate, hLDA model is very unstable, which means that its different runs with the same parameters produce radically different topical structures of the same data. This instability can be significantly reduced by changing parameter $\eta$, which controls the sparsity of topics. However, how to choose the optimal value of $\eta$ is an open question, since neither log-likelihood nor Renyi entropy allows us to tune this parameter.



Figure 9 Log-likelihood curves (hARTM). (A) Lenta dataset. (B) Balanced WoS dataset.


A somewhat more promising result has been obtained for hPAM model, which depends on the parameter $\eta$ and the number of topics on different levels. Variation of these parameters in experiments demonstrated that Renyi entropy allows determining only one level of data structure for this model, namely, the number of topics on the second level. Variation of the number of topics on the next level leads to significant fluctuations of Renyi entropy for both non-hierarchical and hierarchical datasets. Correspondingly, it seems that sharp jumps of Renyi entropy on the third level are related to the model's features rather than to the datasets' structure. Additionally, the behavior of log-likelihood as a function of the number of topics and parameter $\eta$ does not allow determining the dataset structure since the maxima and the minima of log-likelihood do not correspond to the structure of mark-up either for hierarchical or non-hierarchical datasets.

Among the considered models, hARTM is the most stable since the procedure of initialization of this model has almost no effect on the obtained values of Renyi entropy. Moreover, Renyi entropy for this model demonstrates that for the datasets with non-hierarchical mark-up (for both the English-language and the Russian-language datasets), there is only one minimum, which is located close to the number of topics from the mark-up. Further increase in the number of topics does not lead to jumps (in contrast to hPAM model) but leads to a smooth growth of Renyi entropy which coincides with the results of modeling with non-hierarchical topic models (such as Latent Dirichlet allocation (Blei, Ng & Jordan, 2003; Griffiths & Steyvers, 2004) and granulated Latent Dirichlet Allocation (Koltcov et al, 2016)) on the same datasets (Koltcov et al., 2020). However, for the datasets with hierarchical mark-up, Renyi entropy has two clear minima. One of these minima corresponds to the number of topics from the mark-up of the first level, and the second is close to the values from the mark-up of the second level. The difference between the number of topics obtained by searching for Renyi entropy minimum and the results of the mark-up is about 3-4 topics for the first level and 8-10 topics for the second level. Thus, estimation of the number of topics on the second level has a poorer quality.

To the best of our knowledge and based on our results, our approach is the first method allowing to detect the presence of hierarchical or non-hierarchical structure in the text data, and, thus, can be used as a hierarchy detector. However, it is useful only for hARTM model among the considered ones. At the same time, the behavior of log-likelihood significantly depends on the values of hyperparameters, and often it has no clear maxima, which, in turn, complicates the application of this metric for determining the optimal number of topics on different hierarchical levels. Moreover, since human mark-up also usually has its own limitations to precision or inter-annotator agreement, the achieved accuracy of prediction of the topic number with Renyi entropy can be viewed as considerable.

Based on the additional experiments with different types of data pre-processing presented in Appendix B, we also conclude that removing short and rare words leads to the decrease in Renyi entropy (and, correspondingly, to the increase in information). At the same time, the location of a global minimum stays almost unchanged. However, extensive pre-processing may lead to disappearance of local minima at the second hierarchical level. Therefore, it is recommended to remove short words and not to remove words with a frequency above 5. In general, cutting off the long tail of the frequency distribution affects the results of the entropy calculation, since Renyi entropy is calculated as the difference between high frequency words and low frequency words.

The proposed Renyi entropy approach has some limitations related to the method of Renyi entropy calculation and to the distribution of words in text collections. First, these distributions have long and heavy tails. Words from such tails are assigned to topics with small probabilities. Since Renyi entropy measures the difference between words with high probabilities and words with small probabilities, it can only detect topics that are comprised of highly probable words. However, lower levels of hierarchy tend to contain narrower topics, each of which is comprised of words with smaller probabilities. Correspondingly, the ability to distinguish the sets of topics disappears as the number of hierarchy level increases. Thus, Renyi entropy approach is suitable for determining the number of topics and values of hyperparameters for one or two levels of hierarchical structures. Another limitation of our approach is the presence of fluctuations in Renyi entropy values. These fluctuations are related to fluctuations of word probabilities, which, in turn, are caused by the stochastic nature of topic modeling. The problem of minimizing the instability of topic modeling has not yet been solved, but the stabilization of topic models through word embeddings can be potentially interesting. The application of Renyi entropy approach to topic models with word embeddings is beyond the scope of this work and could be considered a next potential stage in the development of the entropic approach to topic modeling.

CONCLUSION

In this work, we investigated the ability of hierarchical topic models to correctly determine the hierarchical structure in data by applying three topic modeling algorithms to four datasets with a human mark-up of topics. Two of these datasets have a flat mark-up, while the mark-up of the two others is two-level. Three of the datasets are in the English language, and one is in Russian. Thus, we analyze the models and test our approach on a variety of data structures and languages. We first formulated the principle of Renyi entropy calculation for hierarchical topic models. Next, we consecutively analyzed the chosen topic models using log-likelihood metric and Renyi entropy. Based on the results of our calculations, we can conclude the following.

First, our Renyi entropy-based approach can be extended to hierarchical topic models, since the accuracy of the approximations of the optimal topic number for some of such models is not inferior to that for non-hierarchical models demonstrated in Koltcov (2018) and Koltcov et al. (2020). Second, our calculations on the test datasets demonstrated that hLDA model in its current form is not applicable to practical tasks due to its extreme instability (although it is open to regularization in future research). Third, for hPAM model, the proposed Renyi entropy approach allows selecting the number of topics only for one level of the hierarchy. Determining the number of topics on the next levels is complicated by the large fluctuations of entropy for large numbers of topics. Thus, hPAM model, in conjunction with Renyi entropy as a metric to optimize, can be used for modeling datasets with non-hierarchical topical structures. Although this is suboptimal, let us note that log-likelihood metric does not allow tuning most of the model parameters on either hierarchical or flat data. Fourth, based on our calculations, it can be concluded that hARTM model provides the best results as, when applied to this model, Renyi entropy approach allows detecting both the hierarchical and non-hierarchical structures of the datasets. In the case of non-hierarchical dataset structure, we observe only one minimum of Renyi entropy under variation of the number of topics for both languages. In the case of a two-level dataset structure, Renyi entropy has two minima, each of which approximately corresponds to the topic number at the respective level of the hierarchy.

hARTM model is thus the easiest for tuning and, additionally, the most stable among the considered algorithms, while Renyi entropy is obviously a better quality metric compared to log-likelihood. We can also conclude that Renyi entropy is better than perplexity, as the latter behaves very similarly to log-likelihood, from which it is derived. Thus, for the purposes of topic number optimization, it is recommended to select other model parameters based on the minimum of Renyi entropy.

The main limitation of Renyi entropy approach is that it can detect the true number of topics only for one or two levels of a topic hierarchy. Topic numbers at the subsequent levels are hard to predict since the probabilities of words at these levels get smaller and more uniform, thus determining the formation of less pronounced topics with unclear boundaries that, furthermore, are much more prone to random shifts from one run to another. This, in turn, hinders discrimination between genuine local minima of Renyi entropy and its random fluctuations. The problem of detecting poorly pronounced topics and the related task of topic model stabilization can possibly be solved by means of word embeddings technology, which appears to be a promising direction for future research.

Our paper also provides important insights for the use of the entropic approach in a broader task of topic modeling parameter optimization. In general, optimal model parameters (be it $\alpha$, $\eta$ or the number of topics) are those which produce a solution with the best value of one or more metrics of quality. However, when such a metric cannot be optimized directly, as it is in the case of the number of topics, or its optimization is very time-consuming, an optimizable substitute metric should be found, of which Renyi entropy is a promising example. While in this work we have shown the utility of Renyi entropy metric for finding an optimal number of topics, our earlier work (Koltcov, Ignatenko & Koltsova, 2019) also demonstrates that its two-parametric extension (Sharma-Mittal entropy) can help selecting semantically stable solutions (which is usually very computationally complex), i.e., it can point at such combinations of parameters that produce similar sets of most probable words in topics across multiple runs of the algorithm. Application of two-parametric entropy to hierarchical topic models is a matter of future work.

However, apart from the stability and correctness of topic number identification, topic interpretability is of significant importance for topic modeling quality. This has not been explored yet in our framework. Furthermore, the very measurement of interpretability needs development. The few existing metrics, such as standard coherence (Mimno et al., 2011) or tf-idf coherence (Nikolenko, Koltcov & Koltsova, 2017), aim at measuring interpretability of individual topics, while methods of generalizing them to entire topic solutions are multiple, and none has been tested. Moreover, the correspondence of coherence to interpretability as defined by humans is far from perfect (Nikolenko, Koltcov & Koltsova, 2017). Nor is it obvious that a single topic modeling solution could be found combining the best interpretability, stability and the optimal number of topics; these different measurements of quality might be competing. Our research, in addition to our previous investigation in Koltcov, Ignatenko & Koltsova (2019), points to the prospect of applicability of the entropic approach to simultaneous parameter optimization in topic modeling. For this, more research into topic interpretability and into multi-parametric entropies should be carried out.

APPENDIX A

Classic probabilistic topic models

Let us briefly discuss the most widely used topic models and the existing quality metrics in the field of topic modeling. Assume that we have a document collection $D$ with a vocabulary of all unique words denoted by $W$. A document $d$ contains a sequence of words $\{w_1, w_2, \ldots, w_n\}$, where $w_i \in W$. So, the words and the documents are the only observable variables. The goal of topic modeling is to retrieve hidden topics in the document collection, where each topic is characterized by its distribution over the vocabulary. Thus, the primary goal of topic modeling is to find the word distribution for each topic and to find the proportions of topics in each document. In most probabilistic topic models, it is assumed that, first, there exists a finite number $T$ of topics, and each entry of a word $w$ in document $d$ is associated with a certain topic $t$. Second, it is assumed that the order of words in documents is not important for TM ('bag-of-words' model) and, in addition, the order of documents in the collection is not important either. Third, a conditional independence assumption states that document $d$ and word $w$ are independent conditioned on the latent topic $t$, i.e., words, conditioned on the latent class $t$, are generated independently of the specific document identity $d$ (Hofmann, 2001).

Two basic topic models are probabilistic latent semantic analysis (pLSA) (Hofmann, 1999) and its Bayesian version called Latent Dirichlet Allocation (LDA) (Blei, Ng & Jordan, 2003). Mathematically, the probability of word $w$ in document $d$ can be expressed as follows (Hofmann, 1999):

$$p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d) = \sum_{t \in T} \phi_{wt}\, \theta_{td},$$

where $t$ is a topic, $p(w \mid t)$ is the distribution of words by topics, and $p(t \mid d)$ is the distribution of topics by documents. The output of TM is represented with two matrices, namely, matrix $\Phi := \{\phi_{wt}\} = \{p(w \mid t)\}$ containing the distribution of words by topics and matrix $\Theta := \{\theta_{td}\} = \{p(t \mid d)\}$ containing the distribution of topics by documents (or, in other words, the proportions of topics in documents). Many modifications of the LDA model were developed for various specific applications. However, these models share the same practical issue: to build a model, the user has to set the number of topics, which is unknown in advance in many cases.

Generally, standard quality metrics such as perplexity (Heinrich, 2004; Newman et al., 2009; Zhao et al., 2015), log-likelihood (Wallach, Mimno & McCallum, 2009; Heinrich, 2004; Griffiths & Steyvers, 2004), and semantic coherence (Mimno et al., 2011; Stevens et al., 2012) are used to select the number of topics or to tune other model parameters. Perplexity is a metric that evaluates the efficiency of the model in predicting the data. For LDA-based models, the perplexity of a document collection consisting of $D$ documents can be calculated as follows (Heinrich, 2004):

$$\mathrm{perplexity}(D) = \exp\left(-\frac{\sum_{d=1}^{D} \ln p(\mathbf{w}_d)}{\sum_{d=1}^{D} N_d}\right),$$

where $N_d$ is the number of words in document $d$, and $c_{dw}$ is the number of times word $w$ was observed in document $d$. A lower perplexity value means better model quality. Perplexity is closely related to likelihood; namely, perplexity is a reciprocal geometric mean of the likelihood, where the likelihood of a document $d$ is expressed as (Heinrich, 2004)

$$p(\mathbf{w}_d) = \prod_{w} \left(\sum_{t} \phi_{wt}\, \theta_{td}\right)^{c_{dw}}.$$

In turn, log-likelihood of document collection D can be calculated for LDA-based models as follows (Wallach, Mimno & McCallum, 2009; Heinrich, 2004):

$$\ln\left(p(D \mid \Phi, \Theta)\right) = \sum_{d=1}^{D} \sum_{w} c_{dw} \ln\left(\sum_{t} \phi_{wt}\, \theta_{td}\right).$$

The number of topics and the other model parameters are selected when finding maximum log-likelihood (Griffiths & Steyvers, 2004).
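For reference, a minimal NumPy sketch of these two metrics for an LDA-style model is shown below; it assumes that $\Phi$ (W x T), $\Theta$ (T x D) and a document-term count matrix (D x W) are available as dense arrays, which is an assumption made for illustration only.

```python
import numpy as np

def log_likelihood(phi, theta, counts):
    """Collection log-likelihood: sum_d sum_w c_dw * ln( sum_t phi_wt * theta_td )."""
    p_wd = phi @ theta                                      # (W, D) matrix of p(w | d)
    return float((counts * np.log(p_wd.T + 1e-12)).sum())   # small constant avoids log(0)

def perplexity(phi, theta, counts):
    """Perplexity as the exponent of the negative per-token log-likelihood."""
    return float(np.exp(-log_likelihood(phi, theta, counts) / counts.sum()))
```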

Semantic coherence is another type of quality metric. It aims at measuring the interpretability of inferred topics (Mimno et al., 2011) rather than the predictive power of the model. Semantic coherence can be calculated as an average of the topic coherence scores, where each topic coherence score is expressed as


$$\mathrm{coherence}(t; W^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \ln\frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})},$$

where $W^{(t)} = (v_1^{(t)}, \ldots, v_M^{(t)})$ is a list of the $M$ most probable words in topic $t$, $D(v)$ is the number of documents containing word $v$, and $D(v, v')$ is the number of documents where words $v$ and $v'$ co-occur. The typical value for $M$ is 5-20 words.
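A direct implementation of this score for a single topic might look as follows (a sketch under the assumption that each document is represented by the set of words it contains); the overall model score is then the average over topics.

```python
import math

def topic_coherence(top_words, doc_word_sets):
    """Coherence of one topic: sum over ordered word pairs of ln((D(v_m, v_l) + 1) / D(v_l)).

    top_words: the M most probable words of the topic, ordered by decreasing probability.
    doc_word_sets: one set of words per document in the collection.
    """
    def doc_freq(*words):
        return sum(1 for s in doc_word_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occur = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occur + 1) / max(doc_freq(top_words[l]), 1))
    return score
```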

The discussion on the application of the above metrics to the task of determining the number of topics for pLSA and LDA models and the limitations of these metrics can be found in Koltcov, Ignatenko & Koltsova (2019) and Koltcov et al. (2020).

Nonparametric topic models

Further development of topic models occurred in the direction of nonparametric models. The main idea of nonparametric topic modeling is to infer the model structure (namely, the number of topics) from the data. Theoretically, nonparametric models are able to select the number of topics automatically according to available data. Such models introduce a prior distribution on potentially infinite partitions of integers using some stochastic process that would give an advantage in the form of a higher prior probability for solutions with fewer topics. In works (Teh et al, 2006; Teh et al., 2004), a topic model based on hierarchical Dirichlet process (HDP) was first proposed. This model can be considered as an infinite extension of LDA model (Heinrich, 2008). More complicated models that are based on the Indian buffet process are considered in works (Chen, Zhou & Carin, 2012; Williamson et al., 2010). Detailed surveys on nonparametric models can be found in Hjort et al. (2010) and Rasmussen & Williams (2006). However, nonparametric models possess a set of parameters that can significantly influence the inferred number of topics and results of TM in general (Vorontsov, Potapenko & Plavin, 2015). Moreover, in work (Koltcov et al., 2020), we demonstrated that, in real applications, the number of topics inferred by HDP model does not correspond to the number of topics obtained with human judgment. Thus, the application of this type of models is complicated in practice.

Hierarchical topic models

The next important step in the development of topic models was dictated by the intention to organize topics into a hierarchy. It resulted in the development of hierarchical topic models. In contrast to flat topic models (such as pLSA or LDA), the hierarchical topic models are able to discover a topical hierarchy, which comprises levels of topical abstraction. Usually, each node in the hierarchy corresponds to a topic, which, in turn, is represented by the distribution over words. Different hierarchical topic models are based on different prior assumptions on the distribution of topics and on the type of hierarchical structure to be inferred. The two most widely used in practice hierarchical topic models (Liu et al., 2016) are hierarchical latent Dirichlet allocation (hLDA) (Blei et al, 2003) and hierarchical Pachinko allocation (hPAM) (Mimno, Li & McCallum, 2007).

hLDA model is a hierarchical and nonparametric extension of LDA model. In the framework of this model, it is assumed that each sub-topic (child topic) has a single parent topic, thus providing a tree of topics. Moreover, it is assumed that all topics of a document are found within a path in that tree. This imposes significant restrictions on the inferred topical components of documents. Thus, for instance, a document cannot be devoted to several specific sub-topics within the assumptions of the model. Another feature of this model is that the first level of the hierarchy always contains one topic (the root of the hierarchy). hLDA model learns topic hierarchies based on the nested Chinese Restaurant Process (nCRP), which specifies a prior for the model. The nested Chinese Restaurant Process is a hierarchical version of the Chinese Restaurant Process and is used to express uncertainty about possible $L$-level trees. To illustrate nCRP, assume that there are infinitely many Chinese restaurants with infinitely many tables. One restaurant is associated with a root (level 1), and each table of this restaurant has a card with a reference to another restaurant (level 2). Tables of those restaurants, in turn, have references to other restaurants (level 3), and this structure repeats. Each restaurant is referred to exactly once; therefore, the structure produces an infinitely-branched tree of restaurants. Imagine that a tourist enters the root restaurant and selects a table according to

$$p(\text{occupied table } i \mid \text{previous customers}) = \frac{m_i}{\gamma + m - 1}, \qquad p(\text{new table} \mid \text{previous customers}) = \frac{\gamma}{\gamma + m - 1}, \qquad (2)$$

where $m_i$ is the number of previous customers at table $i$, $m$ is the number of customers in the restaurant, including the tourist, and $\gamma$ is a parameter that controls how often a customer chooses a new table. The next day, the tourist goes to the restaurant identified on the card of the table chosen the day before and selects a table according to Eq. (2). This process is repeated for $L$ days. After $M$ tourists proceed through this process, the collection of their paths describes a particular $L$-level tree. In terms of topic modeling, customers correspond to documents, and restaurants correspond to topics. In the framework of hLDA model, the following prior distributions are assumed: (1) nCRP prior with hyperparameter $\gamma$ on possible trees; (2) symmetric Dirichlet prior with hyperparameter $\eta$ on the distribution of words by topics $\phi_{wt}$; (3) $L$-dimensional Dirichlet prior with hyperparameter $\alpha$ on the mixing proportions $\theta_d$ of the topics along the path from the root to the leaf.
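A toy simulation of the table choice in Eq. (2) for one restaurant is shown below; it is only an illustration of the prior, not part of the hLDA inference.

```python
import random

def choose_table(table_counts, gamma):
    """Pick a table according to Eq. (2).

    table_counts: numbers of previous customers at the occupied tables.
    Returns an index; len(table_counts) denotes opening a new table.
    """
    m = sum(table_counts) + 1                         # customers including the newcomer
    weights = [m_i / (gamma + m - 1) for m_i in table_counts]
    weights.append(gamma / (gamma + m - 1))           # probability of a new table
    return random.choices(range(len(weights)), weights=weights)[0]

# In nCRP, this choice is repeated at every level: the chosen table's card points to the
# restaurant (topic) visited on the next day, so L days of choices define an L-level path.
```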

The generative process of hLDA model can be described as follows:

• For each node $t$, draw $\phi_t \sim \mathrm{Dir}(\eta)$.

• For each document $d$: draw a path of topics $c_d = \{c_{d1}, \ldots, c_{dL}\}$ according to nCRP with parameter $\gamma$ (2); draw the topic mixing proportion $\theta_d \sim \mathrm{Dir}(\alpha)$. For each position $n$ of a word in the document, choose the level assignment $z_{dn} \sim \mathrm{Mult}(\theta_d)$ (as the level is chosen in the path, the topic is determined), then draw a word from the chosen topic: $w_{dn} \sim \mathrm{Mult}(\phi_{c_{d, z_{dn}}})$.

Thus, a document is drawn by choosing an $L$-level path through the restaurants (topics) and then sampling the words from the $L$ chosen topics. The inference can be approximated by means of collapsed Gibbs sampling (Blei, Griffiths & Jordan, 2010; Chen et al., 2018). The expressions for the assessment of the $z_{dn}$ and $c_d$ variables are the following:

$$p(z_{dn} = l \mid w_{dn} = v, W, z_{-dn}, c) \propto \left(a_{dl}^{-dn} + \alpha\right) \frac{b_{c_{dl}, v}^{-dn} + \eta}{b_{c_{dl}}^{-dn} + V\eta},$$

$$p(c_d = \bar{c} \mid W, z, c_{-d}) \propto p(\bar{c} \mid c_{-d}) \prod_{l=1}^{L} \frac{B\left(b_{\bar{c}_l}^{-d} + b_{\bar{c}_l}^{d} + \eta\right)}{B\left(b_{\bar{c}_l}^{-d} + \eta\right)},$$

where $W$ is the vocabulary (set of words), $V$ is the vocabulary size, $z = \{z_d\}_{d=1}^{D}$, the superscript $-dn$ means all the tokens excluding token $w_{dn}$, $c = \{c_d\}_{d=1}^{D}$, $a_{dl}^{-dn}$ is a counter that equals the number of tokens at level $l$ in document $d$ excluding token $w_{dn}$, $b_{c_{dl}, v}^{-dn}$ is a counter that equals the number of tokens of the word $v$ assigned to topic $c_{dl}$ excluding the current token $w_{dn}$, $b_{c_{dl}}^{-dn}$ is a counter that equals the number of tokens assigned to topic $c_{dl}$ excluding the current token $w_{dn}$, $\bar{c} = \{\bar{c}_1, \ldots, \bar{c}_L\}$ is a path in the hierarchy, $b_{\bar{c}_l}^{-d}$ is a counter that equals the number of tokens assigned to topic $\bar{c}_l$ excluding the tokens of document $d$, $b_{\bar{c}_l}^{d}$ is the number of tokens of document $d$ that were assigned to topic $\bar{c}_l$, $B(\cdot)$ is the multivariate beta function, and $p(\bar{c} \mid c_{-d})$ is computed according to nCRP (2). The details can be found in Chen et al. (2018). Then, matrix $\Phi$ is calculated according to

$$\phi_{wt} = \frac{c_{wt} + \eta}{\sum_{v=1}^{V} c_{vt} + V\eta},$$

where the counter $c_{wt}$ equals the number of instances word $w$ was assigned to topic $t$.

We would like to note that hLDA model has three hyperparameters: (1) $\alpha$ is a parameter of the Dirichlet distribution which controls smoothing over levels in the tree; (2) $\eta$ is a hyperparameter of another Dirichlet distribution, which controls the sparsity of topics; (3) $\gamma$ is a parameter of the nested Chinese restaurant process, which controls how often a document will choose new, i.e., not previously encountered, paths.

In hPAM model, in contrast to hLDA, it is assumed that each child topic can be related to each node (topic) from the upper level, thus resulting in a directed acyclic graph of topics. Therefore, in hPAM model, a child topic has several parent topics. It reduces the necessity of inferring the correct tree structure, which is crucial in hLDA. Moreover, it allows the documents to contain several specific sub-topics. hPAM model always infers a three-level hierarchy, where the first level corresponds to a root topic, the second level corresponds to super-topics, and the third level contains sub-topics. Another important difference between hLDA and hPAM models is that hLDA is a nonparametric model, i.e., it infers the number of topics on each level automatically, while hPAM model is parametric; therefore, a user has to manually select the number of topics on each level. Thus, the number of topics on each level of the hierarchy is a model parameter. Moreover, hPAM model has two hyperparameters: $\eta$ is a parameter of a prior Dirichlet distribution for $\phi_t$, which controls the sparseness of topics, and $\alpha$ is a parameter of a prior Dirichlet distribution for $\theta$. Let us note that since hPAM constructs a directed acyclic graph, each node (topic) at a given level has a distribution over all nodes on the next lower level. Let $T_1$ be the number of 'super-topics' and $T_2$ be the number of 'sub-topics'. The generative process of hPAM model is as follows (hPAM model 2):

• For each vertex (topic) $t$ of the graph, draw $\phi_t \sim \mathrm{Dir}(\eta)$.

• For each document $d$, sample $\theta_0$ from a $(T_1 + 1)$-dimensional Dirichlet distribution with hyperparameter $\alpha_0$, and sample a distribution $\theta_T$ from a $(T_2 + 1)$-dimensional Dirichlet distribution with hyperparameter $\alpha_T$ for each super-topic. The first component of $\theta_0$ determines the probability of the event that a word is generated by the root topic, while the other $T_1$ components of $\theta_0$ determine the distribution of the root topic over super-topics. Analogously, the first component of $\theta_T$ defines the probability of the event that a word is generated by the super-topic $T$, and the other $T_2$ components determine the distribution of the super-topic $T$ over sub-topics.

• For each word $w$: sample a super-topic $z_T \sim \mathrm{Mult}(\theta_0)$. If $z_T = 0$, sample a word from $\phi_0 \sim \mathrm{Dir}(\eta)$ (the root topic). Otherwise, sample a sub-topic $z_t \sim \mathrm{Mult}(\theta_{z_T})$. If $z_t = 0$, then sample a word from $\phi_{z_T} \sim \mathrm{Dir}(\eta)$. Otherwise, sample a word from $\phi_{z_t}$.

Thus, the matrix of the distribution of words by topics has dimension $W \times (T_1 + T_2 + 1)$. The inference algorithm is based on the Gibbs sampling method (Mimno, Li & McCallum, 2007). The sampling distribution for a given word $w$ in document $d$ is as follows:

$$p(z_T = x, z_t = y \mid w_{dn} = w, W, z_{-dn}) \propto \left(a_{dx}^{-dn} + \alpha_{0,x}\right) \cdot \frac{b_{dxy}^{-dn} + \alpha_{x,y}}{\sum_{y'} \left(b_{dxy'}^{-dn} + \alpha_{x,y'}\right)} \cdot \frac{c_{wk}^{-dn} + \eta}{\sum_{v=1}^{V} \left(c_{vk}^{-dn} + \eta\right)}, \qquad (3)$$

where $x \in \{0, 1, \ldots, T_1\}$ is the index of a super-topic if $x \neq 0$; $y \in \{0, 1, \ldots, T_2\}$ is the index of a sub-topic if $y \neq 0$; $z_{-dn}$ is the set of topic assignments for all other words; counter $a_{dx}^{-dn}$ equals the number of tokens (excluding the current token $w_{dn}$) in document $d$ that were assigned to super-topic $x$, $x \neq 0$, and to the root topic if $x = 0$; $\alpha_{0,x}$ means the $x$-th component of vector $\alpha_0$; counter $b_{dxy}^{-dn}$ equals the number of tokens (excluding the current token $w_{dn}$) in document $d$ that were assigned to super-topic $x$ and sub-topic $y$; counter $c_{wk}^{-dn}$ equals the number of tokens of the word $w$ that were generated by topic $k$, where $k$ is the root topic if $x = 0$, $k = 1, \ldots, T_1$ is a super-topic if $y = 0$, and $k = T_1 + 1, \ldots, T_1 + T_2$ is a sub-topic otherwise. Then, a pair of indices $(x, y)$ is sampled from the distribution (3) and assigned to the current word $w_{dn}$. Matrix $\Phi$ is calculated according to

$$\phi_{wt} = \frac{c_{wt} + \eta}{\sum_{v=1}^{V} c_{vt} + V\eta},$$

where the counter $c_{wt}$ equals the number of instances word $w$ was generated by topic $t$. One can use the fixed point update equations described in Minka (2000) to optimize the asymmetric Dirichlet parameters $\alpha$. Thus, in most of the publicly available model implementations, hyperparameter $\eta$ is fixed and has to be set by a user, while hyperparameters $\alpha_i$ are optimized during the training step.


Besides hLDA and hPAM, many other hierarchical topic models have been developed, for instance, nested hierarchical Dirichlet process (nHDP) (Paisley et al., 2015), hierarchical probabilistic latent semantic analysis (hPLSA) (Gaussier et al., 2002), topic hierarchies of Hierarchical Dirichlet Processes (hHDP) (Zavitsanos, Paliouras & Vouros, 2011), hierarchical latent tree analysis (HLTA) (Chen et al., 2017), hierarchical stochastic block model (hSBM) (Gerlach, Peixoto & Altmann, 2018), and hierarchical Additive Regularization of Topic Models (hARTM) (Chirkova & Vorontsov, 2016).

To relax the restriction of hLDA that each document follows one path down the tree of topics, nHDP model was proposed (Paisley et al., 2015). Two main changes with respect to hLDA model are the following: (1) each word follows its own path to a topic; (2) each document has a distribution on possible paths in a shared tree. In the framework of this model, all documents share a global nCRP drawn according to the stick-breaking construction. Let us denote this tree by $T$. From a root Dirichlet process $G_{i_0}$, a path is followed by drawing $\phi_{i_{l+1}} \sim G_{i_l}$ (where $G_{i_l}$ denotes the Dirichlet process for the children of node $i_l$) for $l = 0, 1, 2, \ldots$, where $i_0$ is the root index and $i_l = (i_1, \ldots, i_l)$ indexes the Dirichlet process associated with the topic $\theta_{i_l}$. For each document $d$, a tree $T_d$ is constructed, where for each $G_{i_l}$ a corresponding $G_{i_l}^{(d)} \in T_d$ is drawn according to the Dirichlet process $G_{i_l}^{(d)} \sim \mathrm{DP}(\beta G_{i_l})$. Using this construction, each document has a tree with document-specific transition probabilities defined over the same subset of nodes. To generate a document with this tree, for each node $i_l$ a document-specific beta random variable is drawn. Namely, given that the path for word $w_{dn}$ is at node $i_l$, the path stops there with probability $U_{d, i_l} \sim \mathrm{Beta}(\gamma_1, \gamma_2)$. If we do not select topic $\theta_{i_l}$, then we continue by selecting node $i_{l+1}$ according to $G_{i_l}^{(d)}$. Thus, for word $w_{dn}$ in document $d$, the probability that it is generated by topic $\phi_{dn} = \theta_{i_l}$ is as follows: $\Pr(\phi_{dn} = \theta_{i_l} \mid T_d, U_d) = \left(\prod_{m=0}^{l-1} G_{i_m}^{(d)}(\{\theta_{i_{m+1}}\})\right) \cdot \left(U_{d, i_l} \prod_{m=1}^{l-1} (1 - U_{d, i_m})\right)$. The left term is the probability of path $i_l$, and the right term is the probability that the word selects the $l$-th topic, but not the first $l - 1$ topics; $i_m$ denotes the first $m$ values in $i_l$ ($m < l$). The authors propose a stochastic variational inference (Paisley et al., 2015), which is scalable to large datasets, for this model.

In the framework of hPLSA, the generative process is as follows. First, a document class $a$ is chosen with probability $p(a)$ from a predefined number of classes. Second, a document $d$ is chosen according to the conditional probability $p(d \mid a)$. Third, a topic for each word position is chosen according to the class-conditional probability $p(t \mid a)$. Fourth, a word is sampled according to $p(w \mid t)$. In the hierarchical structure of hPLSA, the document classes are the leaves of the hierarchy, while topics occupy non-leaf nodes of the hierarchy. pLSA model is a special case of hPLSA since, if only one topic per class is sampled, then the result corresponds to the flat topic solution of pLSA. If a topic is shared among classes, then it is placed at a higher level of the hierarchy. hPLSA model has certain limitations, which are analogous to the limitations of pLSA, namely, it possesses a large number of parameters that have to be estimated, and this number grows linearly with the size of the dataset. It can lead to model overfitting.

In the framework of hHDP, two models are proposed. The first one (hvHDP) results in a hierarchy where internal nodes are represented as probability distributions over topics and over words. Thus, in hvHDP model, all nodes can be considered as topics. In the second model (htHDP), only leaf nodes are represented as distributions over words. Thus, only the leaf nodes are essentially the topics. Both models are non-parametric and exploit the mixture model of hierarchical Dirichlet processes (HDP). At each level of the hierarchy, there is a Dirichlet process for each document and a global Dirichlet process over all the Dirichlet processes at that level. Thus, each level is associated with an HDP. These assumptions allow inferring the number of nodes on each level automatically. According to the authors (Zavitsanos, Paliouras & Vouros, 2011), hPAM and hLDA are the closest relatives of hvHDP in terms of inferred hierarchical structure. Indeed, analogously to hLDA, hHDP model is able to infer the number of topics automatically. In addition, analogously to hPAM, hHDP allows a child node to have several parent nodes, thus providing a more flexible structure. In turn, htHDP resembles the PAM model (Li & McCallum, 2006) since the words in both models are generated only at the leaf level. In contrast to PAM, htHDP is fully non-parametric and is able to infer the depth of the hierarchy and the number of nodes at each level. However, hHDP has a set of hyperparameters $(H, \alpha, \gamma)$ that can potentially lead to different hierarchies in terms of the depth and the number of nodes (topics) on each level.

HLTA is a probabilistic hierarchical topic model, however, it has significant differences with respect to LDA-based models. First, HLTA models a collection of documents without specifying a document generation process. The latent variables are unobserved attributes of the documents. Second, each observed variable is related to a word and is a binary variable that represents the presence or absence of the word in a document. Third, topics in HLTA are clusters of documents. Namely, each binary latent variable in HLTA partitions a document collection into two soft clusters of documents. The document clusters are interpreted as topics. HLTA provides a tree-structured model, where the word variables are at the leaves and the latent variables are at the internal nodes. In turn, each latent variable can be described by a set of top words according to their mutual information, i.e., by a set of words that are the best ones to characterize the difference between the two clusters due to the fact that their occurrence probabilities in the two clusters differ the most. Latent variables at high levels of the hierarchy correspond to more general topics, while latent variables at low levels correspond to more specific topics. The construction of the hierarchy is based on the subsequent application of flat models. The details of the model construction can be found in the original work (Chen et al, 2017).

The authors of hSBM model propose an approach that relates topic modeling and community detection in complex networks. Here, the word-document matrix is considered a bipartite network, and the problem of inferring topics becomes a problem of inferring communities. The authors claim that their nonparametric approach automatically determines the depth of the hierarchy and the number of groups for both documents and words. Let us note that this model clusters both documents and words into hierarchical categories separately. Thus, hSBM model splits the network into groups on different levels, which are organized as a hierarchical tree. The construction of the hierarchy is based on the subsequent application of stochastic block models (which were originally developed for community detection in networks), yielding a hierarchy of nested stochastic block models, where each level clusters the groups of the levels below. A detailed description of the adaptation of stochastic block models to topic modeling can be found in Gerlach, Peixoto & Altmann (2018).

hARTM is a hierarchical version of the earlier proposed Additive Regularization of Topic Models (ARTM) approach. In the framework of hARTM, a topic is allowed to have several parent topics and, moreover, the authors claim that the model can automatically determine the number of sub-topics for each topic (Chirkova & Vorontsov, 2016). However, the number of topics on each level of the hierarchy has to be specified by a user. To construct a hierarchy, it is proposed to learn several flat topic models and then to tie them via regularization. Thus, for an already learned $\Phi^{l}$, i.e., the matrix containing the distribution of words by topics on level $l$, it is proposed to implement the matrix decomposition $\Phi^{l} \approx \Phi^{l+1} \Psi$, where matrix $\Psi = \{p(t^{l+1} \mid t^{l})\}$ contains interlevel distributions of sub-topics $t^{l+1}$ in parent topics $t^{l}$, and $\Phi^{l+1}$ is the matrix containing the distribution of words by sub-topics $t^{l+1}$, learned with additional sparsing regularizers. So, one can infer the hierarchy level by level via finding parent topics for each sub-topic using interlevel regularizers. A more detailed description of the model and the model inference can be found in Chirkova & Vorontsov (2016).
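To illustrate only the decomposition $\Phi^{l} \approx \Phi^{l+1}\Psi$ (this is not the regularized ARTM inference itself), a column-stochastic $\Psi$ can be recovered from two already learned levels by non-negative least squares, as in the sketch below.

```python
import numpy as np
from scipy.optimize import nnls

def interlevel_psi(phi_parent, phi_child):
    """Approximate Psi in phi_parent ~ phi_child @ Psi, column by column.

    phi_parent: (W, T_parent) matrix of the coarser level,
    phi_child:  (W, T_child) matrix of the finer level.
    The result is normalized so that each column of Psi is a distribution
    of child topics within the corresponding parent topic.
    """
    t_parent = phi_parent.shape[1]
    psi = np.zeros((phi_child.shape[1], t_parent))
    for j in range(t_parent):
        psi[:, j], _ = nnls(phi_child, phi_parent[:, j])
    return psi / psi.sum(axis=0, keepdims=True)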

In addition to a wide variety of unsupervised hierarchical topic models, many semi-supervised and supervised extensions of hLDA have been proposed. These include, for instance, supervised hierarchical latent Dirichlet allocation (SHLDA) (Nguyen, Boyd-Graber & Resnik, 2013), constrained hierarchical latent Dirichlet allocation (constrained-hLDA) (Wang et al., 2014), hierarchical labeled-LDA (HLLDA) (Petinot, McKeown & Thadani, 2011), and semi-supervised hierarchical latent Dirichlet allocation (SSHLDA) (Mao et al., 2012).

Although many hierarchical topic models have been developed, there are still no well-established quality metrics for topic hierarchies (Chen et al., 2017; Zavitsanos, Paliouras & Vouros, 2011; Belyy et al., 2018). Therefore, hierarchical topic models are often compared by the same quality metrics as flat models, namely, by means of per-word log-likelihood (Chambers, Smyth & Steyvers, 2010; Chen et al., 2017), perplexity (Zavitsanos, Paliouras & Vouros, 2011) or semantic coherence (Chen et al., 2017). However, log-likelihood and perplexity are criticized for their inability to account for the interpretability of topic solutions, which, in turn, is essential for end users. Moreover, it was demonstrated that improved likelihood may lead to lower interpretability (Chang et al., 2009). In contrast, semantic coherence is closer to the human evaluation of topic modeling output; however, it measures only topic interpretability and does not take into account the parent-child relations in topic hierarchies (Belyy et al., 2018). Therefore, semantic coherence only partially reflects the quality of a hierarchical model, ignoring the hierarchical relations of topics.
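For reference, one common coherence variant is the document co-occurrence coherence of Mimno et al. (2011); the sketch below computes it for the top words of a single topic and, as noted above, says nothing about parent-child relations (helper and variable names are hypothetical):

import math
from itertools import combinations

def topic_coherence(top_words, docs_per_word):
    """Semantic coherence of one topic in the spirit of Mimno et al. (2011).
    top_words: the topic's top words, ordered from most to least probable;
    docs_per_word: dict mapping a word to the set of ids of documents containing it.
    Sums log((D(w_hi, w_lo) + 1) / D(w_hi)) over word pairs, where w_hi is
    ranked above w_lo and D counts (co-)occurrences over documents."""
    score = 0.0
    for w_hi, w_lo in combinations(top_words, 2):  # w_hi precedes w_lo in the ranking
        co_docs = len(docs_per_word[w_hi] & docs_per_word[w_lo])
        score += math.log((co_docs + 1) / len(docs_per_word[w_hi]))
    return score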

APPENDIX B

To estimate the influence of dataset pre-processing on the behavior of Renyi entropy in hierarchical models, the following experiments were conducted. First, let us describe the additional pre-processing steps. At the first stage, words of length less than three letters were removed. At the second stage, in addition to short words, rare words with frequencies below 5 were also removed. For instance, after the initial pre-processing used in the main text of the paper, the 20 Newsgroups dataset contains 50,965 unique words. After the first stage of additional pre-processing, the vocabulary size is 25,956, and after the second stage it is 19,086 unique words. Analogously, the balanced WoS dataset contains 36,488 unique words in the experiments described in the main text. After the first stage of additional pre-processing it contains 31,725 unique words, and 14,526 words after the second stage. For the obtained datasets with different types of pre-processing, hPAM and hARTM models were run to study the behavior of Renyi entropy in the same way as in the 'Computational experiments' section of the main body of the article.
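A minimal sketch of these two additional pre-processing stages is given below (the thresholds follow the description above; function and variable names are hypothetical):

from collections import Counter

def extra_preprocess(tokenized_docs, min_len=3, min_freq=5, stage=2):
    """Stage 1: drop words shorter than min_len letters.
    Stage 2: additionally drop rare words whose corpus frequency is below min_freq.
    tokenized_docs: list of documents, each a list of word tokens."""
    docs = [[w for w in doc if len(w) >= min_len] for doc in tokenized_docs]
    if stage >= 2:
        freq = Counter(w for doc in docs for w in doc)
        docs = [[w for w in doc if freq[w] >= min_freq] for doc in docs]
    return docs

# The resulting vocabulary size can then be checked, e.g.:
# len({w for doc in extra_preprocess(corpus, stage=1) for w in doc})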

hPAM model (Influence of pre-processing on entropy of the second hierarchical level)

Figure A1 demonstrates Renyi entropy curves on the second hierarchical level for the 20 Newsgroups dataset for different values of parameter η. Here and further, "pre-processing 0" refers to the initial pre-processing; "pre-processing 1" refers to the first stage of additional pre-processing; "pre-processing 2" refers to the second stage of additional pre-processing. Based on the calculations, one can conclude the following. Removing only short words slightly worsens Renyi entropy, the worsening consisting only in a rise of the entropy curve. The range of topics where minima of Renyi entropy are observed corresponds to 6-12 topics for the dataset with the first stage of additional pre-processing. The closest result to the human markup is observed at η = 0.3. Removing short and rare words raises the entropy curve and shifts the minima points. The range of minima points for the second stage of pre-processing corresponds to 8-14 topics. Since the human markup corresponds to 14-20 topics, one can conclude that additional pre-processing with removal of short and rare words has a positive effect on the results of topic modeling on a dataset with non-hierarchical labeling.
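As a reminder, the classical Rényi entropy of order $q$ of a distribution $p = (p_1, \dots, p_n)$ is given below; the quantity plotted in Figures A1-A5 is the adaptation of this entropy to topic models described in the main body of the paper, so the formula is recalled here only as the underlying definition:

$$ S_q^{R}(p) = \frac{1}{1-q}\,\ln\sum_{i=1}^{n} p_i^{\,q}, \qquad q > 0,\ q \neq 1 . $$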

Figure A2 demonstrates the results for the balanced WoS dataset. One can see that removing only short words raises the whole Renyi entropy curve; however, the overall structure of the curve does not change. The range of topics with minimum Renyi entropy corresponds to 6-10 topics, i.e., there is a small variation associated with the change in parameter η. Removing short and rare words leads to a downward shift of Renyi entropy. Moreover, variation of parameter η does not significantly influence the location of the minimum under this type of pre-processing. In general, the Renyi entropy minimum corresponds to six topics, which is very close to the human markup. Thus, extensive pre-processing is necessary to obtain a better result of topic modeling.

hPAM model (Influence of pre-processing on entropy of the third hierarchical level)

Figure A3 demonstrates Renyi entropy curves on the third hierarchical level for the 20 Newsgroups dataset for fixed values of η. The results demonstrate that removing short and rare words improves Renyi entropy. For instance, the sharp fluctuations appearing for the hPAM model may disappear with more extensive pre-processing.

Figure A4 demonstrates Renyi entropy curves on the third hierarchical level for the balanced WoS dataset. One can see that in the region of the first entropy minimum, the difference between Renyi entropy curves for different pre-processing procedures is very small. A significant difference occurs where the hPAM model deteriorates, that is, for more than 25 topics. In addition, the second stage of pre-processing leads to smaller entropy values.

hARTM model (Influence of pre-processing on entropy of the first hierarchical level)

Figure A5 demonstrates Renyi entropy curves for the hARTM model on the first hierarchical level for two datasets, 20 Newsgroups and balanced WoS. One can see that different types of pre-processing have almost no effect on the resulting Renyi entropy curves for the 20 Newsgroups dataset, i.e., the dataset with a non-hierarchical structure. For all types of pre-processing, the minimum corresponds to 14 topics, which is close to the human markup. Thus, the hARTM model is stable and reproduces the results in terms of Renyi entropy. For the balanced WoS dataset, we obtain that Renyi entropy curves are similar for the initial and

Author Contributions

• Sergei Koltcov conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

• Vera Ignatenko conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

• Maxim Terpilovskii performed the experiments, performed the computation work, prepared figures and/or tables, and approved the final draft.

• Paolo Rosso analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

Datasets and code are available at GitHub: https://github.com/hse-scila/hierarchical-models.

20 Newsgroups dataset is available at GitHub: https://github.com/hse-scila/hierarchical-models/blob/master/datasets/20news_lem.csv.

Lenta dataset is available at GitHub: https://github.com/hse-scila/hierarchical-models/blob/master/datasets/lenta_lem.csv.

Amazon dataset is available at GitHub: https://github.com/hse-scila/hierarchical-models/blob/master/datasets/Amazon_lem.csv.

Amazon balanced dataset is available at GitHub: https://github.com/hse-scila/hierarchical-models/blob/master/datasets/Amazon_balanced_lem.csv.

WoS dataset is available at GitHub: https://github.com/hse-scila/hierarchical-models/blob/master/datasets/WoS_lem.csv.

WoS balanced dataset is available at GitHub: https://github.com/hse-scila/hierarchical-models/blob/master/datasets/WoS_balanced_lem.csv.

Supplemental Information

Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.608#supplemental-information.

REFERENCES

Basu S, Davidson I, Wagstaff K. 2008. Constrained clustering: advances in algorithms, theory, and applications. Chapman & Hall/CRC data mining and knowledge discovery series. First Edition. Boca Raton: Taylor & Francis Group.

Beck C. 2009. Generalised information and entropy measures in physics. Contemporary Physics 50(4):495-510 DOI 10.1080/00107510902823517.

Belyy AV, Seleznova MS, Sholokhov AK, Vorontsov KV. 2018. Quality evaluation and improvement for hierarchical topic modeling. In: Komp'juternaja Lingvistika i Intellektual'nye Tehnologii.

Blei DM, Griffiths TL, Jordan MI. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM 57(2):1-30 DOI 10.1145/1667053.1667056.


Blei DM, Jordan MI, Griffiths TL, Tenenbaum JB. 2003. Hierarchical topic models and the nested Chinese restaurant process. In: Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03. Cambridge, MA, USA: MIT Press, 17-24.

Blei DM, Ng AY, Jordan MI. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3:993-1022.

Chambers AL, Smyth P, Steyvers M. 2010. Learning concept graphs from text with stick-breaking priors. In: Proceedings of the 23rd International Conference on Neural Information Processing Systems—Volume 1, NIPS'10. Red Hook: Curran Associates, Inc., 334-342.

Chang J, Boyd-Graber J, Gerrish S, Wang C, Blei DM. 2009. Reading tea leaves: how humans interpret topic models. In: Proceedings of the 22Nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, Inc., 288-296.

Chen J, Zhu J, Lu J, Liu S. 2018. Scalable training of hierarchical topic models. Proceedings of the VLDB Endowment 11(7):826-839 DOI 10.14778/3192965.3192972.

Chen P, Zhang NL, Liu T, Poon LK, Chen Z, Khawar F. 2017. Latent tree models for hierarchical topic detection. Artificial Intelligence 250(2):105-124 DOI 10.1016/j.artint.2017.06.004.

Chen X, Zhou M, Carin L. 2012. The contextual focused topic model. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'12. New York, NY, USA: Association for Computing Machinery, 96-104.

Chirkova N, Vorontsov KV. 2016. Additive regularization for hierarchical multimodal topic modeling. Machine Learning and Data Analysis 2(2):187-200 DOI 10.21469/22233792.2.2.05.

Cohen G. 2000. Hierarchical models in cognition: do they have psychological reality? European Journal of Cognitive Psychology 12(1):1-36 DOI 10.1080/095414400382181.

Gaussier E, Goutte C, Popat K, Chen F. 2002. A hierarchical model for clustering and categorising documents. In: Crestani F, Girolami M, Van Rijsbergen CJ, eds. Advances in Information Retrieval. Berlin, Heidelberg: Springer, 229-247.

Gerlach M, Peixoto TP, Altmann EG. 2018. A network approach to topic models. Science Advances 4(7):eaaq1360 DOI 10.1126/sciadv.aaq1360.

Griffiths TL, Steyvers M. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(Suppl. 1):5228-5235 DOI 10.1073/pnas.0307752101.

Heinrich G. 2004. Parameter estimation for text analysis. Technical report. Available at http://www.arbylon.net/publications/text-est.pdf.

Heinrich G. 2008. Infinite LDA—implementing the HDP with minimum code complexity. Available at http://www.arbylon.net/publications/ilda.pdf.

Hjort NL, Holmes C, Müller P, Walker SG. 2010. Bayesian nonparametrics. Cambridge: Cambridge University Press.

Hofmann T. 1999. Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 50-57.

Hofmann T. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42(1/2):177-196 DOI 10.1023/A:1007617005950.

Koltcov S. 2018. Application of Renyi and Tsallis entropies to topic modeling optimization. Physica A: Statistical Mechanics and its Applications 512(7075):1192-1204 DOI 10.1016/j.physa.2018.08.050.


Koltcov S, Ignatenko V, Boukhers Z, Staab S. 2020. Analyzing the influence of hyper-parameters and regularizers of topic modeling in terms of Renyi entropy. Entropy 22(4):394 DOI 10.3390/e22040394.

Koltcov S, Ignatenko V, Koltsova O. 2019. Estimating topic modeling performance with Sharma-Mittal entropy. Entropy 21(7):660 DOI 10.3390/e21070660.

Koltcov S, Nikolenko SI, Koltsova O, Bodrunova S. 2016. Stable topic modeling for web science: granulated LDA. In: Proceedings of the 8th ACM Conference on Web Science, WebSci '16. New York: ACM, 342-343.

Koltsov S, Nikolenko S, Koltsova O, Filippov V, Bodrunova S. 2016. Stable topic modeling with local density regularization. In: Bagnoli F, et al., ed. Internet Science. INSCI2016. Lecture Notes in Computer Science. Vol. 9934. Cham: Springer DOI 10.1007/978-3-319-45982-0_16.

Li W, McCallum A. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd International Conference on Machine Learning, ICML'06. New York, NY, USA: Association for Computing Machinery, 577-584.

Liu L, Tang L, He L, Zhou W, Yao S. 2016. An overview of hierarchical topic modeling. In: 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC). Vol. 1. 391-394.

Mao X-L, Ming Z-Y, Chua T-S, Li S, Yan H, Li X. 2012. SSHLDA: a semi-supervised hierarchical topic model. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL'12. Stroudsburg, PA, USA: Association for Computational Linguistics, 800-809.

Mimno D, Li W, McCallum A. 2007. Mixtures of hierarchical topics with pachinko allocation. In: Proceedings of the 24th International Conference on Machine Learning, ICML'07. New York, NY, USA: Association for Computing Machinery, 633-640.

Mimno D, Wallach HM, Talley E, Leenders M, McCallum A. 2011. Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 262-272.

Minka T. 2000. Estimating a Dirichlet distribution. Technical report. Available at https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf.

Newman D, Asuncion A, Smyth P, Welling M. 2009. Distributed algorithms for topic models. Journal of Machine Learning Research 10:1801-1828.

Nguyen V-A, Boyd-Graber J, Resnik P. 2013. Lexical and hierarchical topic regression. In: Proceedings of the 26th International Conference on Neural Information Processing Systems— Volume 1, NIPS'13. Red Hook, NY, USA: Curran Associates Inc, 1106-1114.

Nikolenko SI, Koltcov S, Koltsova O. 2017. Topic modelling for qualitative studies. Journal of Information Science 43(1):88-102 DOI 10.1177/0165551515617393.

Paisley J, Wang C, Blei DM, Jordan MI. 2015. Nested hierarchical dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(2):256-270 DOI 10.1109/TPAMI.2014.2318728.

Palmer SE. 1977. Hierarchical structure in perceptual representation. Cognitive Psychology 9(4):441-474 DOI 10.1016/0010-0285(77)90016-0.

Petinot Y, McKeown K, Thadani K. 2011. A hierarchical model of web summaries. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers—Volume 2, HLT'11. Vol. 2. Stroudsburg: Association for Computational Linguistics, 670-675.

Rasmussen C, Williams C. 2006. Gaussian processes for machine learning. In: Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press.


Sinha K, Dong Y, Cheung JCK, Ruths D. 2018. A hierarchical neural attention-based text classifier. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 817-823.

Stevens K, Kegelmeyer P, Andrzejewski D, Buttler D. 2012. Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL'12. Stroudsburg, PA, USA: Association for Computational Linguistics, 952-961.

Taylor P, Hobbs JN, Burroni J, Siegelmann HT. 2015. The global landscape of cognition: hierarchical aggregation as an organizational principle of human cortical networks and functions. Scientific Reports 5(1):428 DOI 10.1038/srep18112.

Teh YW, Jordan MI, Beal MJ, Blei DM. 2004. Sharing clusters among related groups: hierarchical dirichlet processes. In: Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS'04. Cambridge, MA, USA: MIT Press, 1385-1392.

Teh YW, Jordan MI, Beal MJ, Blei DM. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476):1566-1581 DOI 10.1198/016214506000000302.

Vorontsov K, Potapenko A, Plavin A. 2015. Additive regularization of topic models for topic selection and sparse factorization. In: Gammerman A, Vovk V, Papadopoulos H, eds. Statistical Learning and Data Sciences. Cham: Springer International Publishing, 193-202.

Wallach HM, Mimno D, McCallum A. 2009. Rethinking LDA: why priors matter. In: Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'09. Red Hook: Curran Associates, Inc., 1973-1981.

Wang W, Xu H, Yang W, Huang X. 2014. Constrained-hLDA for topic discovery in Chinese microblogs. In: Tseng VS, Ho TB, Zhou Z-H, Chen ALP, Kao H-Y, eds. Advances in Knowledge Discovery and Data Mining. Cham: Springer International Publishing, 608-619.

Williamson S, Wang C, Heller KA, Blei DM. 2010. The IBP compound Dirichlet process and its application to focused topic modeling. In: Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10. Madison, WI, USA: Omnipress, 1151-1158.

Zavitsanos E, Paliouras G, Vouros GA. 2011. Non-parametric estimation of topic hierarchies from texts with hierarchical Dirichlet processes. Journal of Machine Learning Research 12:2749-2775.

Zhao W, Chen J, Perkins R, Liu Z, Ge W, Ding Y, Zou W. 2015. A heuristic approach to determine an appropriate number of topics in topic modeling. In: Proceedings of the 12th Annual MCBIOS Conference. BioMed Central Ltd.
