Training generative probabilistic models for mass spectrometry data recognition: dissertation and author's abstract, VAK RF specialty 05.13.17, Candidate of Sciences Sulimov Pavel Andreevich
- VAK RF specialty: 05.13.17
- Number of pages: 119
Table of contents of the dissertation, Candidate of Sciences Sulimov Pavel Andreevich
Contents
1 Introduction
1.1 The relevance of research
1.2 Aims and objectives of research
1.3 Importance of work
1.4 Publications
1.5 The organization of the thesis
2 Mass spectrometry
2.1 Biological sample preparation
2.2 Mass spectrometer and data generation
2.3 Database searching-based spectrum identification
2.3.1 Peptide database
2.3.2 Candidate peptides
2.3.3 Score functions
2.4 Search result validation
2.4.1 Decoy peptides
2.4.2 FDR estimation
2.4.3 Q-value calculation
2.5 De Novo and hybrid spectrum identification methods
3 Scoring functions properties
3.1 Discriminative property
3.2 Calibration property
3.3 Unbiasedness property
3.4 Universality property
3.5 Learning new score functions
4 Summary of the thesis articles
4.1 Training of BoltzMatch
4.2 Diversifying regularization
4.3 Evaluation of BoltzMatch in spectrum identification
4.4 PSM score calibration with Tailor methods
4.4.1 The main results of BoltzMatch in spectrum annotation
4.5 Interpretation of BoltzMatch
4.6 Biased score functions
4.6.1 Bias test of BoltzMatch
5 Conclusions
Acknowledgments
List of abbreviations and conventions
Bibliography
List of figures
List of tables
List of algorithms
Appendices
A Article "Annotation of tandem mass spectrometry data using stochastic neural networks in shotgun proteomics"
B Article "Guided Layer-wise Learning for Deep Models using Side Information"
C Article "Tailor: A Nonparametric and Rapid Score Calibration Method for Database Search-Based Peptide Identification in Shotgun Proteomics"
D Article "Bias in False Discovery Rate Estimation in Mass-Spectrometry-Based Peptide Identification"
Introduction of the dissertation (part of the author's abstract) on the topic "Training generative probabilistic models for mass spectrometry data recognition"
1 Introduction
Mass spectrometry is used to study and identify molecules in a mixture of samples. The spectrometer generates a kind of molecular fingerprint, called a spectrum, which is then used to identify or annotate the original molecule that could have generated it.
Mass spectrometry has gained attention in various fields, including molecular biology, forensics, the pharmaceutical industry, and medicine. For instance, in environmental contaminant analysis, mass spectrometry can be used to test food and beverages for contamination or adulteration. Soil analysis can be carried out with mass spectrometers to estimate the amount of pesticides or hormones used in cultivation. In forensic analysis, mass spectrometry can be used to confirm drug abuse or to identify explosive residues or fire accelerants in arson investigations. In pharmaceutical analysis, the main applications are determining the structures of drugs and metabolites and screening for metabolites in biological systems. In clinical research and clinical drug development, the mass spectrometer is used in disease screening, in drug therapy monitoring to track the protein composition of the cells under study, and in the identification of infectious agents for targeted therapies.
This thesis focuses on protein identification in a biological mixture from spectrum data obtained with tandem mass spectrometry.
1.1 The relevance of research
On the one hand, a tremendous amount of data has accumulated in biological data repositories owing to the sharp drop in the cost of data storage devices over the past few years. On the other hand, recent advances in computing hardware, such as graphics processing units (GPUs), allow researchers to develop computationally expensive and data-intensive methods in a short time. As a consequence, several deep-learning-based methods for spectrum data annotation have been published recently, such as MS2PIP [9], pDeep [64], Prosit [21], and DeepNovo [48], to name a few.
1.2 Aims and objectives of research
In general, the goal of my research project was to develop more accurate spectrum identification methods. I developed a novel method, called BoltzMatch, which is based on a stochastic neural network and can annotate spectrum data more accurately; moreover, in contrast to other deep, black-box models, BoltzMatch is interpretable. At the time of writing this thesis, BoltzMatch achieved state-of-the-art performance in spectrum annotation. More specifically, the development of the BoltzMatch method involved the following tasks:
1. BoltzMatch required a novel score calibration method, termed Tailor, to ensure that spectrum annotation scores obtained with it are normalized and thus comparable.
2. The training of BoltzMatch required the development of a novel regularization method. This method was termed diversifying regularization. I showed that the regularization helps train arbitrary deep, stochastic neural networks as well.
3. I showed that machine learning methods may overfit and result in biased error estimation due to their high model capacity. I showed that BoltzMatch does not acquire bias during its training.
1.3 Importance of work
Incorrect spectrum annotation may lead experimental scientists and practitioners to misleading conclusions about their experiments and to inaccurate decisions, for instance, in selecting the right drug therapy. Therefore, it is important to develop reliable and accurate methods to annotate and identify spectra and, in fact, any type of data.
1.4 Publications
My PhD research work has resulted in four main articles, three of which have been published in Q1 journals. Ranking is based on Scopus and Web of Science.
First-tier publications.
1. Sulimov P., Voronkova A., Kertesz-Farkas A. Annotation of tandem mass spectrometry data using stochastic neural networks in shotgun proteomics. Bioinformatics, Q1 journal, 2020, doi: https://doi.org/10.1093/bioinformatics/btaa206
2. Sulimov P., Kertesz-Farkas A. Tailor: A Nonparametric and Rapid Score Calibration Method for Database Search-Based Peptide Identification in Shotgun Proteomics. Journal of Proteome Research, Q1 journal, 2020, doi: https://doi.org/10.1021/acs.jproteome.9b00736
3. Danilova Y., Voronkova A., Sulimov P., Kertesz-Farkas A. Bias in False Discovery Rate Estimation in Mass-Spectrometry-Based Peptide Identification. Journal of Proteome Research, Q1 journal, 2019, doi: https://doi.org/10.1021/acs.jproteome.8b00991
Second-tier publications.
4. Sulimov P., Sukmanova E., Chereshnev R., Kertesz-Farkas A. Guided Layer-Wise Learning for Deep Models Using Side Information. In: van der Aalst W. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2019. Communications in Computer and Information Science, Q3 book series, vol 1086. Springer, 2020, doi: https://doi.org/10.1007/978-3-030-39575-9_6
Reports at conferences and seminars.
5. Guided Layer-wise Learning for Deep Models using Side Information, 8th International Conference - Analysis of Images, Social networks and Texts, 17-19 July 2019, Kazan, Russia.
6. Generative probabilistic modelling of peptide-spectrum matching in tandem mass spectrometry, X International forum "Biotechnology: State Of The Art and Perspectives", 25-27 February 2019, Moscow, Russia.
7. Modification of Restricted Boltzmann Machines for peptide identification, Annual Interuniversity Scientific and Technical Conference of Students, Postgraduates and Young Specialists named after E.V. Armensky, MIEM HSE, 19 February 2018, Moscow, Russia.
8. High-dimensional Generative Probabilistic Models for Peptide-spectrum Matching in Tandem Mass Spectrometry, II Russian-French workshop "Big Data and Applications", 12-13 October 2017, Moscow, Russia.
1.5 The organization of the thesis
The thesis is organized as follows. Chapter 2 provides a brief overview of mass spectrometry and discusses biological sample preparation, data generation, the database searching-based spectrum identification process, and result validation. Chapter 3 describes the properties of scoring functions in more detail and explains why violating them could weaken the statistical power of a scoring function. Finally, Chapter 4 summarizes the author's results from the articles written during the research.
Conclusion of the dissertation in the specialty "Theoretical Foundations of Informatics", Sulimov Pavel Andreevich
CONCLUSIONS
The increasing complexity of peptide identification pipelines and advances in mass spectrometric instrumentation pose new challenges for maintaining fair FDR estimation, and old approaches may fail to reveal bias in modern methods. For instance, the null test (24) and the decoy-decoy search approaches are ineffective at pinpointing the bias we showed in this Letter because these methods do not utilize the target peptide set (see Supplementary Note S1.2), and thus they do not consider the differences between the distributions of, or the correlation between, the theoretical spectra of the target and decoy peptides. Granholm et al. proposed a so-called semilabeled method to demonstrate biased features in Percolator using so-called entrapment sequences (25). Unfortunately, the semilabeled method is not able to identify the bias in the enzInt feature because it does not involve the peptide-level decoy generation procedure. Consequently, the fairness of the FDR estimation in new methods might need to be reassessed with new validation techniques.
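For readers unfamiliar with the quantity under discussion, the following is a minimal Python sketch of the standard target-decoy FDR and q-value estimate (1). It is not the pipeline used in this work. It assumes separate target and decoy searches over equally sized peptide sets, scores where higher is better, and no tied scores; the function name `target_decoy_qvalues` and the toy scores are hypothetical.

```python
from typing import List, Tuple


def target_decoy_qvalues(psms: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Estimate q-values for target PSMs with the simple target-decoy ratio.

    psms: (score, is_decoy) pairs; a higher score means a better match.
    Returns (score, q_value) pairs for the target PSMs, best score first.
    Assumption: FDR at a threshold is estimated as
    (#decoys at or above it) / (#targets at or above it), capped at 1.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    targets = 0
    decoys = 0
    fdrs: List[Tuple[float, float]] = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            fdrs.append((score, min(1.0, decoys / targets)))

    # The q-value of a PSM is the smallest estimated FDR among all
    # thresholds that still accept it, so sweep from worst to best score.
    qvalues: List[Tuple[float, float]] = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append((score, running_min))
    qvalues.reverse()
    return qvalues


if __name__ == "__main__":
    toy = [(5.0, False), (4.0, True), (3.0, False), (2.0, False)]
    for score, q in target_decoy_qvalues(toy):
        print(f"score={score:.1f} q={q:.3f}")
```

The estimate is fair only if incorrect target matches and decoy matches are equally likely to receive a given score. That is the assumption a biased score function violates, and because this sketch compares nothing but counts, it cannot detect such a violation on its own.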
References of the dissertation research, Candidate of Sciences Sulimov Pavel Andreevich, 2020
REFERENCES
(1) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207.
(2) Kertesz-Farkas, A.; Reiz, B.; Myers, M. P.; Pongor, S. Database searching in mass spectrometry based proteomics. Curr. Bioinf. 2012, 7, 221-230.
(3) He, K.; Fu, Y.; Zeng, W.-F.; Luo, L.; Chi, H.; Liu, C.; Qing, L.Y.; Sun, R.-X.; He, S.-M. A Theoretical Foundation of the Target-Decoy Search Strategy for False Discovery Rate Control in Proteomics. 2015, arXiv:1501.00537. arXiv.org e-Print archive. https://arxiv.org/abs/1501.00537 (accessed Dec 2018).
(4) Levitsky, L. I.; Ivanov, M. V.; Lobas, A. A.; Gorshkov, M. V. Unbiased false discovery rate estimation for shotgun proteomics based on the target-decoy approach. J. Proteome Res. 2017, 16, 393-397.
(5) Gupta, N.; Bandeira, N.; Keich, U.; Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 2011, 22, 1111-1120.
(6) Yates, J. R.; Eng, J. K.; McCormack, A. L.; Schieltz, D. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 1995, 67, 1426-1436.
(7) Fenyo, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75, 768-774.
(8) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011, 10, 1794-1805.
(9) O'Neil, C. Weapons of Math Destruction; Crown Random House, 2016.
(10) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923.
(11) McIlwain, S.; Tamura, K.; Kertesz-Farkas, A.; Grant, C. E.; Diament, B.; Frewen, B.; Howbert, J. J.; Hoopmann, M. R.; Kall, L.; Eng, J. K.; et al. Crux: rapid open source protein tandem mass spectrometry analysis. J. Proteome Res. 2014, 13, 4488-4491.
(12) Rodriguez, J.; Gupta, N.; Smith, R. D.; Pevzner, P. A. Does trypsin cut before proline? J. Proteome Res. 2008, 7, 300-305.
(13) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. OLAV: Towards high-throughput tandem mass spectrometry data identification. Proteomics 2003, 3, 1454-1463.
(14) Feng, J.; Naiman, D. Q.; Cooper, B. Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies. Bioinformatics 2007, 23, 2210-2217.
(15) Halloran, J. T.; Bilmes, J. A.; Noble, W. S. Learning peptide-spectrum alignment models for tandem mass spectrometry. Uncertainty Artif. Intell. 2014, 30, 320.
(16) Bish, R. A.; Fregoso, O. I.; Piccini, A.; Myers, M. P. Conjugation of complex polyubiquitin chains to WRNIP1. J. Proteome Res. 2008, 7, 3481-3489.
(17) Diament, B. J.; Noble, W. S. Faster SEQUEST searching for peptide identification from tandem mass spectra. J. Proteome Res. 2011, 10, 3871-3879.
(18) Pease, B. N.; Huttlin, E. L.; Jedrychowski, M. P.; Talevich, E.; Harmon, J.; Dillman, T.; Kannan, N.; Doerig, C.; Chakrabarti, R.; Gygi, S. P.; Chakrabarti, D. Global analysis of protein expression and phosphorylation of three stages of Plasmodium falciparum intra-erythrocytic development. J. Proteome Res. 2013, 12, 4028-4045.
(19) Howbert, J. J.; Noble, W. S. Computing exact p-values for a cross-correlation shotgun proteomics score function. Mol. Cell. Proteomics 2014, 13, 2467-2479.
(20) Wang, G.; Wu, W. W.; Zhang, Z.; Masilamani, S.; Shen, R.-F. Decoy methods for assessing false positives and false discovery rates in shotgun proteomics. Anal. Chem. 2009, 81, 146-159.
(21) Granholm, V.; Kall, L. Quality assessments of peptide-spectrum matches in shotgun proteomics. Proteomics 2011, 11, 1086-1093.
(22) Klammer, A. A.; MacCoss, M. J. Effects of modified digestion schemes on the identification of proteins from complex mixtures. J. Proteome Res. 2006, 5, 695-700.
(23) Keich, U.; Tamura, K.; Noble, W. S. Averaging Strategy To Reduce Variability in Target-Decoy Estimates of False Discovery Rate. J. Proteome Res. 2019, 18, 585-593.
(24) Zhang, S.-R.; Shan, Y.-C.; Jiang, H.; Liu, J.-H.; Zhou, Y.; Zhang, L.-H.; Zhang, Y.-K. The Null-Test for peptide identification algorithm in Shotgun proteomics. J. Proteomics 2017, 163, 118-125.
(25) Granholm, V.; Noble, W. S.; Kall, L. On using samples of known protein content to assess the statistical calibration of scores assigned to peptide-spectrum matches in shotgun proteomics. J. Proteome Res. 2011, 10, 2671-2678.