Methods and Problems of Decentralized Deep Learning: dissertation and author's abstract topic, VAK RF specialty 00.00.00, Candidate of Sciences Ryabinin Maksim Konstantinovich
- VAK RF specialty: 00.00.00
- Number of pages: 132
Table of contents of the dissertation, Candidate of Sciences Ryabinin Maksim Konstantinovich
1 Introduction
2 Key results and conclusions
3 Content of the work
3.1 Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts
3.2 Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
3.3 Distributed Deep Learning in Open Collaborations
4 Conclusion
Appendix A Article. Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts
Appendix B Article. Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices
Appendix C Article. Distributed Deep Learning in Open Collaborations
Introduction of the dissertation (part of the author's abstract) on the topic "Methods and Problems of Decentralized Deep Learning"
1 Introduction
Topic of the thesis
Over the past decade, deep learning has demonstrated remarkable results, outperforming other machine learning methods on a variety of tasks and domains. Recent years have seen a dramatic growth in the size of neural networks due to a significant impact of the model scale on its resulting capabilities [45; 46]. This presents a challenge to the progress of the broader scientific community: as the resources needed to obtain or exceed state-of-the-art models continue to grow, research in the field becomes less and less accessible to everybody outside of organizations with the most funding. In this work, we argue that a potential solution to this challenge is decentralization: instead of obtaining all resources from a centralized high-performance computing (HPC) cluster, we can leverage idle hardware resources of volunteers who are potentially distributed around the globe. Inspired by successes of volunteer computing in other scientific fields [7; 20; 48], we propose deep learning methods that are applicable for general large-scale training and take the unique challenges of volunteer computing into account.
More specifically, this work introduces the Decentralized Mixture-of-Experts layer, a sparse neural network architecture that meets the above challenges and naturally handles both node failures and large numbers of irregularly participating peers. Next, we consider training across networks of volunteers in the data-parallel setting: this requires a method that can quickly aggregate model parameters or gradients in the presence of network failures. To this end, we develop Moshpit All-Reduce, an efficient fault-tolerant method for parameter averaging. Using this method, we propose Moshpit SGD, a distributed training algorithm that can be applied to networks of heterogeneous and unreliable devices. Lastly, we propose Distributed Deep Learning in Open Collaborations, a practical approach to large-scale collaborative pretraining. This approach combines an adaptive averaging strategy, global gradient accumulation, and careful system design to enable distributed training with workers that have highly diverse network conditions, computational performance, and participation time.
Relevance of the work
The growing size of models is at the heart of many recent advancements in deep learning. Today, the most capable models are routinely reaching the scale of tens and hundreds of billions of parameters [27; 35]: these developments are supported by studies [46] that demonstrate increasing gains in quality or even novel properties [18] of neural networks at larger sizes. Correspondingly, the size of training datasets is also growing: as recent works suggest [54], the number of examples might be as important as the model size when training a neural network with a fixed compute budget. Both of these scaling directions require an immense amount of computational resources: all state-of-the-art models are trained in HPC clusters with hundreds or even thousands of specialized accelerators and dedicated high-speed networking solutions.
Predictably, acquiring the computational resources to train such large models can be difficult for an average researcher. Renting even one deep learning accelerator for a month may cost several thousands of dollars, and building a cluster is often outside the budget constraints for organizations with modest funding. This dramatically limits the availability of state-of-the-art research to a set of laboratories that can afford to run large-scale experiments with billion-scale neural networks. In turn, this results in a smaller potential for replicating or adapting the latest results to new datasets, an inability to analyze or improve the training process of large models, and overall difficulties in contributing to further scientific progress in deep learning.
In this work, we explore an alternative approach to large-scale deep learning that does not involve expensive supercomputers. We take inspiration from successful cases of leveraging volunteer resources in other sciences, such as computational biology [20] or astrophysics [17]. The most famous example of such projects is Berkeley Open Infrastructure for Network Computing (BOINC) [7], which became the first "supercomputer" reaching the exaflop scale [19]. However, directly applying existing methods for distributed deep learning in such conditions is difficult because of multiple infrastructure-related challenges.
Specifically, the most popular methods for efficient distributed training [22; 40; 47] are not designed to handle node failures or connectivity issues: in the most severe cases, even one disconnected peer can jeopardize the entire training procedure or significantly inhibit its progress. At the same time, workers in a volunteer computing setup possess a much higher degree of heterogeneity: each personal computer might have a unique hardware and networking setup, and this diversity needs to be taken into account when designing such decentralized training systems. Lastly, the communication links across cluster nodes can be orders of magnitude faster than the standard Internet connections of collaborative experiment participants, which also impacts our design choices. Hence, we develop methods that aim to maximize the distributed training performance in the conditions outlined above.
The first work described in this thesis focuses on the goal of training models that can exceed the limits of a single device in the context of decentralized training. Trading off generality for performance, this work introduces Decentralized Mixture-of-Experts (DMoE), which is a specialized layer designed to be sharded across the computers of volunteers. Similarly to standard Mixture-of-Experts models [2], the DMoE layer consists of independent sublayers called experts that get assigned to the input based on the output of the gating function. We propose a natural extension of this architecture for fault-tolerant training and show that DMoE is not sensitive to communication latency. Another important difference is that the DMoE experts are located by other nodes using distributed hash tables (DHT), a fault-tolerant decentralized key-value storage. This mitigates the need for a centralized entity that would track available experts, which might not be feasible in larger collaborations without incurring significant costs. To efficiently find the most relevant experts for a given input, we propose a structured gating function that factorizes the set of experts into a predefined multidimensional grid.
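To illustrate how a gating function factorized over a grid can narrow down experts without scoring every one of them, the following is a minimal sketch rather than the thesis implementation. It assumes that each expert is identified by a tuple of grid coordinates, that the score of an expert is the sum of per-dimension scores, and that selection proceeds by beam search over coordinate prefixes. The function name `top_k_experts` and the `available` set (a stand-in for what a DHT lookup might return) are hypothetical.

```python
def top_k_experts(scores, k, available=None):
    """Pick up to k experts from a grid by beam search over coordinate prefixes.

    scores[j][i] is the (assumed additive) gating score of index i along
    grid dimension j. `available` is an optional set of coordinate tuples
    of experts that are currently reachable; it is a hypothetical stand-in
    for a DHT lookup. Returns a list of (coordinates, total_score) pairs.
    """
    beam = [((), 0.0)]
    for depth, dim_scores in enumerate(scores, start=1):
        # Extend every surviving prefix by every index of this dimension.
        candidates = [
            (prefix + (i,), total + s)
            for prefix, total in beam
            for i, s in enumerate(dim_scores)
        ]
        if available is not None:
            # Keep only prefixes that lead to at least one reachable expert.
            live_prefixes = {expert[:depth] for expert in available}
            candidates = [c for c in candidates if c[0] in live_prefixes]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]
    return beam


# A 3x3 grid of experts with per-dimension scores.
grid_scores = [[0.1, 2.0, 0.5], [1.0, 0.0, 3.0]]
best = top_k_experts(grid_scores, k=2)
best_live = top_k_experts(grid_scores, k=2, available={(1, 0), (2, 2), (0, 1)})
```

With a full grid, each step only scores `k` prefixes times the grid size instead of all experts. When some experts are missing, the beam search is a heuristic and may skip a reachable expert whose prefix was pruned early.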
The subsequent part of this work addresses the problem of data-parallel training with volunteers. Our rationale for that is twofold: first, even if we use mixture-of-experts in each model layer, we still need to have parameters of the gating function and the embedding layer that are consistent across the collaboration. Second, with memory-efficient training methods (such as lower numeric precision [32] or parameter sharing [3]), it might be possible to train models that can fit consumer GPUs yet still require large amounts of computation to achieve the best quality.
In the second paper covered in this thesis, we study methods for efficiently aggregating the model gradients for distributed training. The family of communication-optimal methods, known as All-Reduce [37], is not fault-tolerant by default and thus unsuitable for our goals. On the other hand, more robust methods for decentralized training, such as Gossip [49; 52], require many communication rounds to achieve consistency across the network. We propose Moshpit All-Reduce, an iterative averaging algorithm that combines the fault tolerance of Gossip-based methods with the efficiency of All-Reduce. In each round, it arranges the participants into independent groups and ensures that peers within one group are assigned to different groups in the next round. Moshpit SGD, a distributed optimization algorithm based on Moshpit All-Reduce, has convergence rates equivalent to standard distributed SGD (more specifically, Local-SGD [51]) yet exhibits much higher large-scale training performance in slower networks with node failures, as we demonstrate in our experiments.
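The group-based averaging described above can be sketched on an idealized example. The code below is an illustration under stated assumptions, not the algorithm as published: it assumes the peers form a complete grid, that each peer holds a single number, and that no peer fails. In round `r`, peers that agree on every coordinate except the `r`-th form a group and replace their values with the group mean. The function name `grid_group_average` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def grid_group_average(values, dims):
    """Average values held by peers arranged in a grid, one dimension per round.

    values maps a peer's grid coordinates (a tuple of length dims) to its
    local number. Peers sharing all coordinates except the r-th are grouped
    in round r, so two peers that met in one round differ in coordinate r
    and therefore land in different groups in the next round.
    """
    values = dict(values)
    for r in range(dims):
        groups = defaultdict(list)
        for coords in values:
            key = coords[:r] + coords[r + 1:]  # drop the r-th coordinate
            groups[key].append(coords)
        for members in groups.values():
            mean = sum(values[c] for c in members) / len(members)
            for c in members:
                values[c] = mean
    return values


# Nine peers on a 3x3 grid holding the values 0.0 through 8.0.
peers = {c: float(i) for i, c in enumerate(product(range(3), repeat=2))}
averaged = grid_group_average(peers, dims=2)
```

On a complete grid, every peer holds the exact global mean after `dims` rounds. If a peer dropped out, only its own group in that round would be affected, which is the intuition behind the fault tolerance claimed in the text.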
Finally, the third work presents Distributed Deep Learning in Open Collaborations (DeDLOC), an approach that takes node heterogeneity into account and alleviates the issue of slower communication speeds of volunteer-oriented distributed training. Specifically, we propose an adaptive averaging strategy that assigns training and gradient aggregation tasks to workers based on their performance to minimize the overall time of averaging, the fundamental communication phase in data-parallel training. We also design a decentralized tracking mechanism for the total accumulated batch size, which is necessary to enable the dynamic participation of peers. Aside from ablation studies, the paper presents the results of the first collaborative language model pretraining experiment: an effort organized by the authors and a community of volunteers has resulted in sahajBERT, a Bengali masked language model that performs competitively with both monolingual and multilingual baselines [25; 56].
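The role of tracking the total accumulated batch size can be sketched as follows. This is a simplified illustration and not the DeDLOC implementation: a plain dictionary stands in for the decentralized tracking mechanism, each peer is assumed to hold the mean gradient over the samples it has processed, and weighting peers by their sample counts is an assumption made for this example. The names `ready_to_average` and `weighted_average` are hypothetical.

```python
def ready_to_average(samples_per_peer, target_batch_size):
    """Return True once the peers have jointly processed enough samples."""
    return sum(samples_per_peer.values()) >= target_batch_size


def weighted_average(grads, samples_per_peer):
    """Combine per-peer mean gradients, weighting each peer by its sample count.

    grads maps a peer name to a list of floats (its mean gradient);
    samples_per_peer maps the same names to the number of samples processed.
    """
    total = sum(samples_per_peer[p] for p in grads)
    dim = len(next(iter(grads.values())))
    return [
        sum(grads[p][j] * samples_per_peer[p] for p in grads) / total
        for j in range(dim)
    ]


# A fast peer and a slow peer contribute to the same global step.
samples = {"fast_peer": 30, "slow_peer": 10}
grads = {"fast_peer": [1.0, 2.0], "slow_peer": [5.0, -2.0]}
if ready_to_average(samples, target_batch_size=32):
    global_grad = weighted_average(grads, samples)
```

Because the step is triggered by the shared sample count rather than by a fixed set of workers, peers can join or leave between steps without changing the effective batch size.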
Moreover, the methods we develop can be applied not only in the volunteer computing scenario. Specifically, cloud providers frequently offer preemptible (or spot) instances at a cost that can be 2-3 times lower than the cost of on-demand servers [5; 21]. Spot instances, however, have the disadvantage of non-guaranteed availability: if the demand for nodes with their hardware configuration increases, some of these instances might become unavailable until the demand recedes. In principle, these conditions make applying traditional high-performance distributed methods infeasible. Usually, efficient training relies on reliable uptime and high communication speeds, both of which are difficult to achieve in preemptible environments. Still, the target setting of this work considers most challenges that arise from using spot instances. Hence, as we show in our experiments below, the proposed methods can be applied to heterogeneous volunteer hardware and to more homogeneous, yet still unstable, preemptible cloud servers.
The goal of this work is to develop practical large-scale distributed training methods for slowly-connected networks consisting of heterogeneous and unreliable nodes.
Conclusion of the dissertation on the topic "Other specialties", Ryabinin Maksim Konstantinovich
5 Conclusion
In this work, we proposed DeDLOC, a collaborative deep learning approach that enables large-scale collective distributed training on whatever computers are available to participants, regardless of hardware and network limitations. We demonstrated with several experiments that this is a viable approach that maintains its efficiency in a broad range of conditions. Finally, we report the first real collaborative training run of such a scale and share our findings on volunteer activity to pave the road for similar experiments in the future.
An essential property of collaborative training is its environmental impact. While all distributed training experiments have a negative impact due to carbon emissions [107], DeDLOC has one unique advantage. Due to the ability to utilize heterogeneous low-end devices, it can prolong the effective lifespan of existing computers. We discuss other aspects of environmental impact in Appendix J.
One issue that needs to be addressed before starting collaborative experiments is the need to gather a community of volunteers. Although our proposed authentication mechanism (see Appendix I.5) allows acknowledging participants for their contributions (briefly discussed in Appendix I.2), the best approach to recruit volunteers is an open question: one needs to take into account both the resources of community members and their motivation for training a specific model.
References of the dissertation research, Candidate of Sciences Ryabinin Maksim Konstantinovich, 2023
