Исследование вариантов трансформера для различных задач обработки длинных документов / Investigation of transformer options for various long documents processing tasks — Candidate of Sciences dissertation and abstract topic, VAK RF specialty 00.00.00, Al Adel Arij

  • Al Adel Arij
  • Candidate of Sciences
  • 2024, Moscow Institute of Physics and Technology (National Research University)
  • VAK RF specialty 00.00.00
  • Number of pages: 108
Al Adel Arij. Investigation of transformer options for various long documents processing tasks: Candidate of Sciences dissertation: 00.00.00 — Other specialties. Moscow Institute of Physics and Technology (National Research University). 2024. 108 p.

Table of contents of the dissertation — Candidate of Sciences Al Adel Arij

Contents

Introduction

1 Transformer usage for different NLP tasks

1.1 Key transformer application areas overview

1.1.1 Translation

1.1.2 Masked language modeling

1.1.3 Summarization Task

1.2 Transformers for processing long documents overview

1.2.1 Models overview

1.2.2 Positional encoding in transformers overview

1.2.3 Proposed attention patterns for long document processing overview

1.3 Chapter conclusion

2 Memory transformer design and mathematical representation

2.1 Memory transformer

2.1.1 Memory transformer with hierarchical attention

2.1.2 Mathematical representations and model details

2.1.3 Model variants used for the MLM and QA tasks

2.2 Chapter conclusion

3 Long documents translation

3.1 Related work

3.2 Task, datasets and evaluation metrics

3.2.1 Task

3.2.2 Dataset

3.2.3 Evaluation metric

3.3 Chunking approach

3.4 T5MemAttention Design

3.4.1 Design variants for translation task

3.5 Experimental setup

3.6 Experimental results and conclusion

3.6.1 Sentence level translation

3.6.2 Context-aware translation

3.6.3 Discussion

3.7 Chapter conclusion

4 Question answering in long context

4.1 Related work

4.2 Data sets and metrics

4.2.1 Data sets

4.2.2 Metrics

4.3 Results and discussion

4.3.1 Results of pre-training task MLM

4.3.2 Results of fine-tuning proposed model on HotpotQA

4.3.3 Results of modified variants on MLM task

4.3.4 Results of modified model variants on HotpotQA data set

4.3.5 Results discussion

4.3.6 Limitations and future work

4.4 Chapter conclusion

5 Summarizing long documents

5.1 Task definition

5.2 Importance of text summarization

5.3 Text Summarization Types

5.3.1 Extractive Text Summarization

5.3.2 Abstractive Text Summarization

5.3.3 Hybrid Text Summarization

5.4 Summarizing Long Documents Using Transformers

5.5 The proposed approach

5.6 Datasets

5.7 Metrics

5.8 Experimental setup

5.9 Results and discussion

5.9.1 Attention analysis

5.10 Chapter conclusion and Discussion

6 Conclusion

6.1 Thesis conclusion

6.2 Limitations and future recommendations

Abbreviations

Bibliography

List of Figures

List of Tables


Introduction of the dissertation (part of the abstract) on the topic "Investigation of transformer options for various long documents processing tasks" (Исследование вариантов трансформера для различных задач обработки длинных документов)

Introduction

Transformers have been limited to finite input lengths since their proposal [99], because attention in the transformer needs to attend to every token in the input.

Scientific relevance of the research. Transformers have emerged as the state of the art in most machine-learning tasks owing to their exceptional performance. The attention mechanism is a crucial component of transformers, and different attention variants have been proposed for various tasks, resulting in numerous transformer models. However, the primary bottleneck in the use of transformers for natural language processing (NLP) tasks is attention complexity, which grows with the length of the input text.

In recent years, there has been a growing trend in NLP research towards processing long documents, driven by the need to address challenges in fields such as research, education, media, law, finance, health, and more [69]. However, processing long documents with transformers is both time-consuming and expensive, as attention must be computed between every pair of tokens in the input.

In general, pretraining and fine-tuning transformers is very expensive because attention must be computed between every pair of tokens in the input; as noted above, this cost grows with the input length. Further structural investigation is therefore needed to harness the power of the transformer for processing long documents. Hence, this thesis is about adapting the transformer to long-document processing through new structural modifications.
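As a minimal, self-contained illustration of this quadratic bottleneck (not code from the dissertation), the PyTorch sketch below computes single-head scaled dot-product attention; the score matrix has one entry per pair of input tokens, so doubling the input length roughly quadruples memory and compute.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # Score matrix is (seq_len, seq_len): one entry per pair of tokens.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v, scores.numel()

d_model = 64
for seq_len in (512, 1024, 2048):
    x = torch.randn(seq_len, d_model)
    _, n_scores = scaled_dot_product_attention(x, x, x)
    print(f"seq_len={seq_len}: {n_scores} pairwise attention scores")
# Doubling the input length quadruples the number of scores (and the cost).
```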

The goal and tasks of the dissertation

The primary objective of this dissertation is to explore several structural modifications of the encoder-decoder model for effectively processing longer inputs. These alterations comprise a variety of attention, masking, position bias, and additional memory techniques incorporated into the transformer architecture. The proposed techniques are then evaluated on various tasks such as translation, masked language modeling, question answering, and summarization tasks. The main focus is to devise an efficient approach to handle lengthy text inputs.

The major tasks in this PhD work include:

• Systematically study and review publications on the application of the attention mechanism in transformers for different tasks, concentrating on long text inputs;

• Propose and implement a new transformer structure using different masking patterns, position bias algorithms, additional memory tokens added to the inputs, and a new attention scheme, so that the model can process long inputs;

• Apply the proposed model designs to four specific tasks: translation, masked language modeling, question answering with context, and summarization;

• Based on the conducted experiments on given tasks, analyze the performance of the different implemented designs.

Scientific novelty What sets this work apart is its primary focus on modifications of the attention structure in the transformer that enable it to consume longer inputs. This dissertation introduces a transformer with additional memory for processing long documents. This model and its variants were applied to various tasks on long text inputs, such as translation, question answering, and summarization. This work differs from previous works in the following:

1. Proposing new structural modifications of the encoder-decoder architecture, based on the T5 architecture, for processing inputs longer than the standard input length;

2. Investigating the effectiveness of these modifications applied to translation, masked language modeling, question answering with long-range context, and summarization of long inputs.

Theoretical value of the work in the dissertation

The theoretical value of the work in the dissertation lies in the comprehensive exploration of the proposed variants of transformer modifications for processing long input.

The thesis overcomes existing attention issues (limited transformer input length) by suggesting new structural modifications to the encoder-decoder transformer structure. These new modifications led to breaking attention barriers for processing long-range inputs.

Furthermore, this thesis presents a new use of internal global tokens, called memory tokens, an appropriate masking technique, and an appropriate use of the original relative positional encoding to relate the chunked input with its related memory slots.
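To make the chunk/memory masking idea concrete, the sketch below is an illustrative reconstruction only (chunk length, number of memory slots, and their layout are assumptions, not the dissertation's exact design): it builds a boolean attention mask in which each chunk's tokens attend to their own chunk and to the memory slots, while the memory slots attend to everything, so that memory is the only path for information exchange between chunks.

```python
import torch

def chunked_memory_mask(n_chunks: int, chunk_len: int, n_mem: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Layout assumption (illustrative only): the first n_mem positions are memory
    slots shared by all chunks, followed by n_chunks blocks of chunk_len tokens.
    Chunk tokens see their own chunk and the memory; memory sees everything,
    so memory slots are the only route for information flow between chunks.
    """
    total = n_mem + n_chunks * chunk_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:n_mem, :] = True            # memory attends to all tokens
    mask[:, :n_mem] = True            # all tokens attend to memory
    for c in range(n_chunks):
        lo = n_mem + c * chunk_len
        hi = lo + chunk_len
        mask[lo:hi, lo:hi] = True     # block-diagonal: within-chunk attention only
    return mask

# Example: 3 chunks of 4 tokens with 2 shared memory slots.
print(chunked_memory_mask(3, 4, 2).int())
```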

Practical value of the work in the dissertation: The proposed modifications were applied to different tasks using different data sets. They made it possible to consume long-range inputs, to exchange information, and to compress data. Models trained as part of the dissertation work are made publicly available.

Statements to be defended

• The developed encoder-decoder architecture, supplied with a sentence selector for in-context translation, improves translation quality;

• The proposed T5 modifications for MLM on separate input chunks have proven the ability to process longer inputs by introducing a memory system for information flow between chunks, where the memory slots are the only connection between the chunks;

• The proposed architecture, based on additional trainable parameters, attention masking, and reuse of pre-trained weights, has proven its efficiency in scaling the input length for summarization tasks compared with the T5 and SLED models.

Presentations and validation of the research results

The main findings and contributions of the dissertation were presented and discussed at four conferences:

• International Conference Engineering and Telecommunication (En&T), Dolgoprudny, Russian Federation, November 24-25, 2021.

• 64th All-Russian Scientific Conference MIPT, Dolgoprudny, Russian Federation, November 29 - December 03, 2021

• XXIV International Conference on Neuroinformatics, Moscow, Russia, October 17-21, 2022.

• XXV International Conference on Neuroinformatics, Moscow, Russia, October 23-27, 2023.

Publications The main results on the dissertation topic are presented in 4 printed publications: 3 in periodical scientific journals indexed by Scopus and 1 in conference abstracts.

The author's contribution to the co-authored works: in the first paper, the author planned all the experiments; studied, implemented, and experimented with the model at all stages; implemented the code; trained the models and baselines; analyzed the results; and prepared the publication. The author made a full contribution to all stages of the second and third papers.

Dissertation structure

The dissertation is organized into six chapters:

• Introduction presents the relevance of the research, the goal and tasks of the dissertation, its scientific novelty, theoretical and practical value, the statements to be defended, the publications, and the structure of the dissertation.

• Chapter 1 gives an overview of natural language processing, starting from its definition and ending with the main applications: translation, question-answering systems, masked language modeling, and summarization. It provides an overview of each task and of related work on long inputs.

• Chapter 2 is dedicated to the evolution of the model and its mathematical representation throughout the design process. It introduces the preliminary design of the model used in the translation experiments and covers the modification variants used for subsequent tasks. Additionally, this chapter provides a detailed description of the model components and their mathematical representation.

• Chapter 3 describes the proposed model design variants for the translation task in more detail, the data set used for English–Russian translation, the results, and a discussion.

• Chapter 4 presents the model used for masked language modeling and question-answering tasks, obstacles, solutions, results, discussion of results, limitations, and future work.

• Chapter 5 is dedicated to the model modifications that add memory only to the encoder, together with the new masking and relative positional encoding, and investigates the effectiveness of these modifications on summarization data sets.

• Conclusion presents the conclusion of the overall work that was presented in this thesis, its limitations, and future directions.

Похожие диссертационные работы по специальности «Другие cпециальности», 00.00.00 шифр ВАК

Conclusion of the dissertation on the topic "Other specialties", Al Adel Arij

Chapter 6 Conclusion

6.1 Thesis conclusion

This chapter provides a concise overview of the key findings of the thesis. It presents the theoretical, practical, and experimental outcomes, highlighting the significance of the research in enhancing the transformer architecture to consume longer input. Additionally, it addresses the limitations of the study and proposes recommendations and future research directions.

Since 2017, a plethora of works have emerged that are based on transformers for processing long documents. These works have demonstrated the dominance of transformer applications over most natural language processing (NLP) tasks, including translation, reading comprehension, summarization, and others. Most of these transformer-based applications depend on modifications or additions to the original transformer design. As a result, the transformer has become an increasingly popular tool for processing long documents and handling complex NLP tasks.

The tasks presented in this dissertation were conducted as part of ongoing efforts to explore new avenues and make meaningful contributions to the field. The focus was on modifying transformer designs to address input bottlenecks, with the T5 transformer undergoing design alterations at different stages of development for three key tasks: translation, QA, and summarization. The design modifications implemented have the potential to yield significant benefits, and this research represents a valuable contribution to the field of transformer design.

The main purpose of this dissertation is to investigate different modifications of the encoder-decoder model that can effectively process longer inputs. These modifications include various attention, masking, position bias, and memory techniques integrated into the transformer architecture. The proposed techniques were evaluated on several tasks, such as translation, masked language modeling, question answering, and summarization. The primary objective was to develop an efficient approach to handling lengthy text inputs.

Scientific novelty What sets this work apart is its primary focus on modifications of the attention structure in the transformer that enable it to consume longer inputs. This dissertation introduces a transformer with additional memory for processing long documents. This model and its variants were applied to various tasks on long text inputs, such as translation, question answering, and summarization. This work differs from previous works in the following:

1. Proposing new structural modifications of the encoder-decoder architecture, based on the T5 architecture, for processing inputs longer than the standard input length;

2. Investigating the effectiveness of these modifications applied to translation, masked language modeling, question answering with long-range context, and summarization of long inputs.

Theoretical value of the work in the dissertation

The theoretical value of the work in the dissertation lies in the comprehensive exploration of the proposed variants of transformer modifications for processing long input.

The thesis overcomes existing attention issues (limited transformer input length) by suggesting new structural modifications to the encoder-decoder transformer structure. These new modifications led to breaking attention barriers for processing long-range inputs.

Furthermore, this thesis presents a new use of internal global tokens, called memory tokens, an appropriate masking technique, and an appropriate use of the original relative positional encoding to relate the chunked input with its related memory slots.

Practical value of the work in the dissertation: The proposed modifications were applied to different tasks using different data sets. They made it possible to consume long-range inputs, to exchange information, and to compress data. Models trained as part of the dissertation work are made publicly available.

The key outcomes of the study encompass the following:

1. Structural modifications of the original encoder-decoder transformer architecture have been proposed and implemented.

2. Memory slots have been found to have compression properties; further investigation is necessary to fully understand their role in information flow between input chunks.

3. To process longer inputs of up to 3072 tokens with precision, the transformer relies on accurate masking techniques and tailored relative positional encoding. The transformer architecture is thus endowed with the power to process longer inputs with greater efficiency and accuracy.

In conclusion, these significant modifications to the transformer architecture have resulted in improved performance and greater efficiency in handling longer inputs. Further research is needed to fully understand the compression properties of memory slots and their impact on information flow in the transformer.

6.2 Limitations and future recommendations

The first version of the model design incurred excessive computational cost, even though it provided better results in terms of the BLEU metric; the following chapters attempted to overcome this issue. Also, this model was applied to English-to-Russian translation of sentences with context, but an application to translating complete documents has not yet been presented due to time constraints. Moreover, at the time of the translation experiments there were no appropriate metrics for document-level translation evaluation such as [43]. This was one of the experimental limitations, and research on document-level translation metrics was out of the scope of this study.

The second group of model structural modifications was tested on MLM and question-answering tasks. The question-answering task means finding an answer to a question. The question may or may not be combined with a context, and the answer can be produced in an extractive or abstractive way. In this dissertation, our focus is on abstractive answers to questions combined with a context.

The results of the study of these modifications for processing long documents on the question-answering task were published in the paper "Global memory transformer for processing long documents" [1]. In that paper, both the baseline and the proposed model were pre-trained from scratch on the MLM task using a pre-training pipeline implemented in PyTorch based on the T5 pre-training objective. The results of the proposed model on the MLM task were superior to the T5 baseline on the data set used. The results on the HotpotQA data set for the question-answering task were not satisfactory: neither the baseline nor the proposed model was able to manage the task using a linear optimizer, while both models started to tackle the task with a constant optimizer. Since the proposed model was able to surpass the baseline on the MLM task, the reason could be one of the following. First, the pre-training data set may have been too small to capture all the knowledge needed for the downstream QA task, or its nature may not have been suitable or sufficient for QA. Second, the nature of the pre-training data set may not match the downstream HotpotQA data set. It is worth mentioning that HotpotQA answers have an extractive rather than an abstractive character, which can be an important reason to use another data set more suitable for generative models such as the transformers used in this study. Nevertheless, the proposed model still has an attractive compression property for the chunked input. The conclusion of these experiments emphasizes the compression ability of the memory slots and their role in information flow between input chunks.
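For reference, the T5 pre-training objective corrupts the input by replacing randomly selected spans with sentinel tokens and trains the decoder to reconstruct the removed spans. The toy example below only illustrates this input/target format (the spans are hand-picked; this is not the dissertation's actual pipeline or data).

```python
# Illustrative example of the T5 span-corruption (MLM) objective format.
tokens = "the memory slots connect the input chunks together".split()

# Suppose spans (memory slots) and (input chunks) are masked (hand-picked here).
corrupted = ["the", "<extra_id_0>", "connect", "the", "<extra_id_1>", "together"]
target = ["<extra_id_0>", "memory", "slots", "<extra_id_1>", "input", "chunks", "<extra_id_2>"]

print("encoder input :", " ".join(corrupted))
print("decoder target:", " ".join(target))
```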

Researcher's notes on this group of experiments: it would be of great scientific merit to conduct a fair comparison study of all models aimed at processing long documents, using the same large model architecture and the same data. The field needs such studies to decouple the effects of algorithms and models from other factors such as training details, large computation, and big data.

For the last group of structural modifications, detailed in Sec. 5.9, analyzing the memory content reveals that memory tokens with the same positional encoding store identical tokens for each related chunk. Analyzing the attention maps showed active interaction between memory tokens and their related chunks, as well as active interaction between chunk tokens and most of the memory tokens in all layers. It should be noted that this interaction was present from the first layers, where the memory was still empty. In this case, it is worth investigating a more creative application of the memory, as in previous chapters: using separate attention for each memory slot and its related chunk, so that an initial memory representation of the chunk is obtained before attention is applied to the related chunk, and giving the memory tokens in each slot different positional encodings. This requires experiments with different design variants and an investigation of the best layer in which to use the memory slots. Research in this direction is very broad, and many variants are possible to investigate.

At the end of this thesis, I would start and end by praising "ALLAH," who is known for His mercy and kindness. I want to express my appreciation to "ALLAH," who is the most generous and compassionate. I also want to thank the Holy Prophet "MOHAMAD" (Peace be upon him), for always inspiring me.

The pursuit of a doctorate degree is an arduous and complex journey through uncharted territories. I would like to take this opportunity to express my deepest gratitude to those who have accompanied and encouraged me on this journey. Without their unlimited support, this thesis would not have been possible.

I extend my sincere appreciation to my supervisor. His invaluable guidance and direction throughout the thesis writing process were instrumental in completing this work.

Upon introspection into how many candidates have successfully defended their thesis, I have realized that this accomplishment is not the most significant one in my life. However, this journey has exposed me to a world that has had a profound impact on me. I have learned a lot about myself and have gained a comprehensive understanding of the degree of determination that one must possess to succeed as a human being. I am immensely grateful to the Moscow Institute of Physics and Technology for granting me the chance to conduct my research and for providing me with the opportunity to embark on such a transformative journey. The invaluable experience that I gained has significantly contributed to my academic and professional development. I would like to express my sincere gratitude to the institute for its tenacious support and encouragement throughout the research process.

Lastly, I would like to express my deepest appreciation to my country Syria, and my family, whose unwavering mental support and boundless love have played an instrumental role in my success. Their constant presence and encouragement throughout my journey have been a source of inspiration that kept me motivated during the most challenging times. I am profoundly grateful for their unrelenting love, support, and prayers, as well as their resolute commitment to my personal and academic growth. I am forever indebted to them for their invaluable contribution to my achievements.

References of the dissertation research, Candidate of Sciences Al Adel Arij, 2024

Bibliography

1. Arij Al Adel. "Global memory transformer for processing long documents". In: ArXiv abs/2212.01650 (2022). URL: https://api.semanticscholar.org/CorpusID:254246977.

2. Arij Al Adel and Mikhail S. Burtsev. "Memory transformer with hierarchical attention for long document processing". In: 2021 International Conference Engineering and Telecommunication (En&T) (2021), pp. 1-7.

3. Joshua Ainslie et al. "ETC: Encoding Long and Structured Inputs in Transformers". In: EMNLP. 2020.

4. Nikolay Arefyev, Dmitry Kharchev, and Artem Shelmanov. "NB-MLM: Efficient Domain Adaptation of Masked Language Models for Sentiment Analysis". In: EMNLP. 2021.

5. V. Aribandi et al. "ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning". In: ArXiv abs/2111.10952 (2022).

6. Mikel Artetxe et al. "Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation". In: Conference on Computational Natural Language Learning. 2018.

7. Alexei Baevski et al. "Cloze-driven Pretraining of Self-attention Networks". In: Conference on Empirical Methods in Natural Language Processing. 2019.

8. Guangsheng Bao et al. "G-Transformer for Document-Level Machine Translation". In: ACL. 2021.

9. Iz Beltagy, Matthew E. Peters, and Arman Cohan. "Longformer: The Long-Document Transformer". In: ArXiv abs/2004.05150 (2020).

10. Samuel R. Bowman et al. "Generating Sentences from a Continuous Space". In: CoNLL. 2016.

11. Mikhail S Burtsev et al. "Memory transformer". In: arXiv preprint arXiv:2006.11527 (2020).

12. Avi Caciularu et al. "CDLM: Cross-Document Language Modeling". In: Conference on Empirical Methods in Natural Language Processing. 2021. URL: https://api.semanticscholar.org/CorpusID:237415234.

13. Yang Trista Cao et al. "On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations". In: ArXiv abs/2203.13928 (2022).

14. Zihang Chen et al. "Quora Question Pairs". In: 2017.

15. Ta-Chung Chi et al. "KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation". In: ArXiv abs/2205.09921 (2022). URL: https://api.semanticscholar.org/CorpusID:248965309.

16. Rewon Child et al. "Generating Long Sequences with Sparse Transformers". In: ArXiv abs/1904.10509 (2019).

17. Rewon Child et al. "Generating long sequences with sparse transformers". In: arXiv preprint arXiv:1904.10509 (2019).

18. Krzysztof Choromanski et al. "Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers". In: ArXiv abs/2006.03555 (2020).

19. Krzysztof Choromanski et al. "Rethinking Attention with Performers". In: ArXiv abs/2009.14794 (2020).

20. Ronan Collobert and Jason Weston. "A unified architecture for natural language processing: deep neural networks with multitask learning". In: ICML '08. 2008.

21. Marcella Cornia et al. "Meshed-memory transformer for image captioning". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10578-10587.

22. Peng Cui and Le Hu. "Sliding Selector Network with Dynamic Memory for Extractive Summarization of Long Documents". In: North American Chapter of the Association for Computational Linguistics. 2021.

23. Zihang Dai et al. "Transformer-XL: Attentive Language Models beyond a Fixed-Length Context". In: ArXiv abs/1901.02860 (2019).

24. Jacob Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: NAACL. 2019.

25. Yukun Feng et al. "Learn To Remember: Transformer with Recurrent Memory for Document-Level Machine Translation". In: NAACL-HLT. 2022.

26. Shen Gao et al. "From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information". In: International Joint Conference on Artificial Intelligence. 2020.

27. Bogdan Gliwa et al. "SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization". In: Proceedings of the 2nd Workshop on New Frontiers in Summarization. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 70-79. DOI: 10.18653/v1/D19-5409. URL: https://www.aclweb.org/anthology/D19-5409.

28. Max Grusky, Mor Naaman, and Yoav Artzi. "Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies". In: North American Chapter of the Association for Computational Linguistics. 2018.

29. Mandy Guo et al. "LongT5: Efficient Text-To-Text Transformer for Long Sequences". In: NAACL-HLT. 2022.

30. Qipeng Guo et al. "Star-transformer". In: arXiv preprint arXiv:1902.09113 (2019).

31. Ankit Gupta and Jonathan Berant. "GMAT: Global Memory Augmentation for Transformers". In: ArXiv abs/2006.03274 (2020).

32. Ankit Gupta and Jonathan Berant. "Gmat: Global memory augmentation for transformers". In: arXiv preprint arXiv:2006.03274 (2020).

33. Adi Haviv et al. "Transformer Language Models without Positional Encodings Still Learn Positional Information". In: ArXiv abs/2203.16634 (2022).

34. Junxian He et al. "CTRLsum: Towards Generic Controllable Text Summarization". In: Conference on Empirical Methods in Natural Language Processing. 2020.

35. Karl Moritz Hermann et al. "Teaching Machines to Read and Comprehend". In: NIPS. 2015.

36. Karl Moritz Hermann et al. "Teaching Machines to Read and Comprehend". In: Advances in Neural Information Processing Systems (NIPS). 2015. URL: http://arxiv.org/abs/1506.03340.

37. Cheng-Zhi Anna Huang et al. "Music Transformer: Generating Music with Long-Term Structure". In: ICLR. 2019.

38. Luyang Robby Huang et al. "Efficient Attentions for Long Document Summarization". In: North American Chapter of the Association for Computational Linguistics. 2021.

39. Zhiheng Huang et al. "Improve Transformer Models with Better Relative Position Embeddings". In: Findings. 2020. URL: https://api.semanticscholar.org/CorpusID:221995630.

40. Maor Ivgi, Uri Shaham, and Jonathan Berant. "Efficient Long-Text Understanding with Short-Text Models". In: ArXiv abs/2208.00748 (2022).

41. Sebastian Jaszczur et al. "Sparse is Enough in Scaling Transformers". In: Neural Information Processing Systems. 2021.

42. Frederick Jelinek et al. "Perplexity—a measure of the difficulty of speech recognition tasks". In: Journal of the Acoustical Society of America 62 (1977).

43. Yu Jiang et al. "BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation". In: North American Chapter of the Association for Computational Linguistics. 2021. URL: https://api.semanticscholar.org/CorpusID:248572535.

44. Mandar Joshi et al. "SpanBERT: Improving Pre-training by Representing and Predicting Spans". In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 64-77.

45. Marcin Junczys-Dowmunt. "Microsoft Translator at WMT 2019: Towards Large-Scale Document-Level Neural Machine Translation". In: WMT. 2019.

46. Xiaomian Kang et al. "Dynamic Context Selection for Document-level Neural Machine Translation via Reinforcement Learning". In: ArXiv abs/2010.04314 (2020).

47. Amirhossein Kazemnejad et al. "The Impact of Positional Encoding on Length Generalization in Transformers". In: ArXiv abs/2305.19466 (2023). URL: https://api.semanticscholar.org/CorpusID:258987259.

48. Urvashi Khandelwal et al. "Sample Efficient Text Summarization Using a Single Pre-Trained Transformer". In: ArXiv abs/1905.08836 (2019).

49. Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. "Reformer: The Efficient Transformer". In: ArXiv abs/2001.04451 (2020).

50. Huan Yee Koh et al. "An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics". In: ACM Computing Surveys 55 (2022), pp. 1-35.

51. Wojciech Kryscinski et al. "Evaluating the Factual Consistency of Abstractive Text Summarization". In: ArXiv abs/1910.12840 (2019).

52. Wojciech Kryscinski et al. "Neural Text Summarization: A Critical Evaluation". In: Conference on Empirical Methods in Natural Language Processing. 2019.

53. Taku Kudo and John Richardson. "Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing". In: arXiv preprint arXiv:1808.06226 (2018).

54. Tom Kwiatkowski et al. "Natural Questions: A Benchmark for Question Answering Research". In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 453-466.

55. Juho Lee et al. "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". In: International Conference on Machine Learning. 2018.

56. Mike Lewis et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". In: ACL. 2020.

57. Tatiana Likhomanenko et al. "CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings". In: Neural Information Processing Systems. 2021. URL: https://api.semanticscholar.org/CorpusID:235358538.

58. Chin-Yew Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In: Annual Meeting of the Association for Computational Linguistics. 2004.

59. Pierre Lison, Jorg Tiedemann, and Milen Kouylekov. "OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora". In: International Conference on Language Resources and Evaluation. 2018.

60. Hui Liu and Xiaojun Wan. "Video Paragraph Captioning as a Text Summarization Task". In: Annual Meeting of the Association for Computational Linguistics. 2021.

61. Peter J. Liu et al. "Generating Wikipedia by Summarizing Long Sequences". In: ArXiv abs/1801.10198 (2018).

62. Xuanqing Liu et al. "Learning to Encode Position for Transformer with Continuous Dynamical Model". In: ArXiv abs/2003.09229 (2020).

63. Yang Liu and Mirella Lapata. "Text Summarization with Pretrained Encoders". In: ArXiv abs/1908.08345 (2019).

64. Yinhan Liu et al. "Multilingual Denoising Pre-training for Neural Machine Translation". In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 726-742.

65. Yinhan Liu et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach". In: ArXiv abs/1907.11692 (2019).

66. Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization". In: International Conference on Learning Representations. 2017.

67. Andrew L. Maas et al. "Learning Word Vectors for Sentiment Analysis". In: Annual Meeting of the Association for Computational Linguistics. 2011.

68. Joel M. Mackenzie et al. "CC-News-En: A Large English News Corpus". In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020).

69. Dimitris Mamakas et al. "Processing Long Legal Documents with Pre-trained Transformers: Modding LegalBERT and Longformer". In: ArXiv abs/2211.00974 (2022). URL: https://api.semanticscholar.org/CorpusID:253254835.

70. Potsawee Manakul and Mark John Francis Gales. "Long-Span Summarization via Local Attention and Content Selection". In: Annual Meeting of the Association for Computational Linguistics. 2021.

71. Sameen Maruf, Andre F. T. Martins, and Gholamreza Haffari. "Selective Attention for Context-aware Neural Machine Translation". In: NAACL. 2019.

72. Lesly Miculicich et al. "Document-Level Neural Machine Translation with Hierarchical Attention Networks". In: ArXiv abs/1809.01576 (2018).

73. Gianluca Moro and Luca Ragazzi. "Semantic Self-Segmentation for Abstractive Summarization of Long Documents in Low-Resource Regimes". In: AAAI Conference on Artificial Intelligence. 2022.

74. Sharan Narang et al. "Do Transformer Modifications Transfer Across Implementations and Applications?" In: ArXiv abs/2102.11972 (2021).

75. Masato Neishi and Naoki Yoshinaga. "On the Relation between Position Information and Sentence Length in Neural Machine Translation". In: Conference on Computational Natural Language Learning. 2019. URL: https://api.semanticscholar.org/CorpusID:208163680.

76. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. "Representation Learning with Contrastive Predictive Coding". In: ArXiv abs/1807.03748 (2018).

77. Arka Pal et al. "Giraffe: Adventures in Expanding Context Lengths in LLMs". In: ArXiv abs/2308.10882 (2023). URL: https://api.semanticscholar.org/CorpusID:261048876.

78. Kishore Papineni et al. "Bleu: a Method for Automatic Evaluation of Machine Translation". In: Annual Meeting of the Association for Computational Linguistics. 2002.

79. Jiezhong Qiu et al. "Blockwise Self-Attention for Long Document Understanding". In: ArXiv abs/1911.02972 (2019). URL: https://api.semanticscholar.org/CorpusID:207847640.

80. Alec Radford et al. "Language Models are Unsupervised Multitask Learners". In: 2019.

81. Jack W. Rae et al. "Compressive Transformers for Long-Range Sequence Modelling". In: ArXiv abs/1911.05507 (2020).

82. Colin Raffel et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". In: ArXiv abs/1910.10683 (2020).

83. Pranav Rajpurkar et al. "SQuAD: 100,000+ Questions for Machine Comprehension of Text". In: Conference on Empirical Methods in Natural Language Processing. 2016.

84. Jan Rosendahl et al. "Analysis of Positional Encodings for Neural Machine Translation". In: IWSLT. 2019.

85. Alexander M. Rush, Sumit Chopra, and Jason Weston. "A Neural Attention Model for Abstractive Sentence Summarization". In: Conference on Empirical Methods in Natural Language Processing. 2015.

86. A. See, Peter J. Liu, and Christopher D. Manning. "Get To The Point: Summarization with Pointer-Generator Networks". In: ArXiv abs/1704.04368 (2017).

87. Abigail See, Peter J. Liu, and Christopher D. Manning. "Get To The Point: Summarization with Pointer-Generator Networks". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1073-1083. DOI: 10.18653/v1/ P17-1099. URL: https://aclanthology.org/P17-1099.

88. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. "Self-Attention with Relative Position Representations". In: NAACL. 2018.

89. Peter Shaw et al. "Generating Logical Forms from Graph Representations of Text and Entities". In: ArXiv abs/1905.08407 (2019).

90. Koustuv Sinha et al. "The Curious Case of Absolute Position Embeddings". In: ArXiv abs/2210.12574 (2022). URL: https://api.semanticscholar.org/CorpusID:253098922.

91. Richard Socher et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". In: Conference on Empirical Methods in Natural Language Processing. 2013.

92. Sainbayar Sukhbaatar et al. "Adaptive Attention Span in Transformers". In: Annual Meeting of the Association for Computational Linguistics. 2019.

93. Sainbayar Sukhbaatar et al. "Augmenting self-attention with persistent memory". In: arXiv preprint arXiv:1907.01470 (2019).

94. Zewei Sun et al. "Rethinking Document-level Neural Machine Translation". In: FINDINGS. 2022.

95. Xin Tan et al. "Hierarchical Modeling of Global Context for Document-Level Neural Machine Translation". In: EMNLP. 2019.

96. Yi Tay et al. "Scale Efficiently: Insights from Pretraining and Finetuning Transformers". In: International Conference on Learning Representations. 2022. URL: https://api.semanticscholar.org/CorpusID:260498358.

97. Yi Tay et al. "Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?" In: ArXiv abs/2207.10551 (2022).

98. Trieu H. Trinh and Quoc V. Le. "A Simple Method for Commonsense Reasoning". In: ArXiv abs/1806.02847 (2018).

99. Ashish Vaswani et al. "Attention is All you Need". In: NIPS. 2017.

100. Jesse Vig, Alexander R. Fabbri, and Wojciech Kryscinski. "Exploring Neural Models for Query-Focused Summarization". In: ArXiv abs/2112.07637 (2021).

101. Elena Voita, Rico Sennrich, and Ivan Titov. "When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion". In: ACL. 2019.

102. Yu-An Wang and Yun-Nung Chen. "What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding". In: EMNLP. 2020.

103. Sinong Wang et al. "Linformer: Self-Attention with Linear Complexity". In: ArXiv abs/2006.04768 (2020).

104. Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. "Constructing Datasets for Multi-hop Reading Comprehension Across Documents". In: Transactions of the Association for Computational Linguistics 6 (2017), pp. 287-302.

105. Alexander Wettig et al. "Should You Mask 15% in Masked Language Modeling?" In: ArXiv abs/2202.08005 (2022).

106. Chien Sheng Wu et al. "Controllable Abstractive Dialogue Summarization with Sketch Supervision". In: ArXiv abs/2105.14064 (2021).

107. Qingyang Wu et al. "Memformer: The Memory-Augmented Transformer". In: arXiv preprint arXiv:2010.06891 (2020).

108. Lee Xiong et al. "Open Domain Web Keyphrase Extraction Beyond Language Modeling". In: Conference on Empirical Methods in Natural Language Processing. 2019.

109. Hongfei Xu et al. "Efficient Context-Aware Neural Machine Translation with Layer-Wise Weighting and Input-Aware Gating". In: IJCAI. 2020.

110. Zhilin Yang et al. "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering". In: EMNLP. 2018.

111. Manzil Zaheer et al. "Big Bird: Transformers for Longer Sequences". In: ArXiv abs/2007.14062 (2020).

112. Manzil Zaheer et al. "Big Bird: Transformers for Longer Sequences." In: NeurIPS. 2020.

113. Yury Zemlyanskiy et al. "ReadTwice: Reading Very Large Documents with Memories". In: ArXiv abs/2105.04241 (2021).

114. Han Zhang et al. "A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models". In: ArXiv abs/2201.05337 (2022).

115. Jiacheng Zhang et al. "Improving the Transformer Translation Model with Document-Level Context". In: EMNLP. 2018.

116. Jingqing Zhang et al. "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization". In: ArXiv abs/1912.08777 (2020).

117. Linlin Zhang. "Context-Adaptive Document-Level Neural Machine Translation". In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), pp. 6232-6236.

118. Pei Zhang et al. "Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation". In: EMNLP. 2020.

119. Xingxing Zhang, Furu Wei, and M. Zhou. "HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization". In: ArXiv abs/1905.06566 (2019).

120. Zhengyan Zhang et al. "ERNIE: Enhanced Language Representation with Informative Entities". In: Annual Meeting of the Association for Computational Linguistics. 2019.

121. Yao Zhao, Mohammad Saleh, and Peter J. Liu. "SEAL: Segment-wise Extractive-Abstractive Long-form Text Summarization". In: ArXiv abs/2006.10213 (2020).

122. Zaixiang Zheng et al. "Towards Making the Most of Context in Neural Machine Translation". In: IJCAI. 2020.

123. Yukun Zhu et al. "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". In: 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 19-27.
