<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">lingngu</journal-id><journal-title-group><journal-title xml:lang="ru">Вестник НГУ. Серия: Лингвистика и межкультурная коммуникация</journal-title><trans-title-group xml:lang="en"><trans-title>NSU Vestnik. Series: Linguistics and Intercultural Communication</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1818-7935</issn><publisher><publisher-name>Новосибирский государственный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.25205/1818-7935-2025-23-1-80-92</article-id><article-id custom-type="elpub" pub-id-type="custom">lingngu-906</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>КОМПЬЮТЕРНАЯ И ПРИКЛАДНАЯ ЛИНГВИСТИКА</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>COMPUTER AND APPLIED LINGUISTICS</subject></subj-group></article-categories><title-group><article-title>Автоматическая саммаризация родительских чатов в WhatsApp</article-title><trans-title-group xml:lang="en"><trans-title>Automatic Summarization of Parental Chats on WhatsApp</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0009-0001-9548-3273</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Дмитриева</surname><given-names>К. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Dmitrieva</surname><given-names>K. A.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Дмитриева Кристина Александровна, стажер-исследователь</p><p>Санкт-Петербург</p></bio><bio xml:lang="en"><p>Kristina A. Dmitrieva, Research Assistant</p><p>Saint Petersburg</p></bio><email xlink:type="simple">kadmitrieva@hse.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0009-0005-4124-1956</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Жолус</surname><given-names>М. Р.</given-names></name><name name-style="western" xml:lang="en"><surname>Zholus</surname><given-names>M. R.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Жолус Марина Романовна, стажер-исследователь, инженер-программист АО «Эврика»</p><p>Санкт-Петербург</p></bio><bio xml:lang="en"><p>Marina R. 
Zholus, Research Assistant, Software Engineer</p><p>Saint Petersburg</p></bio><email xlink:type="simple">mrzholus@edu.hse.ru</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Национальный исследовательский университет «Высшая школа экономики»<country>Россия</country></aff><aff xml:lang="en">HSE University<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>04</day><month>07</month><year>2025</year></pub-date><volume>23</volume><issue>1</issue><fpage>80</fpage><lpage>92</lpage><permissions><copyright-statement>Copyright &#x00A9; Дмитриева К.А., Жолус М.Р., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Дмитриева К.А., Жолус М.Р.</copyright-holder><copyright-holder xml:lang="en">Dmitrieva K.A., Zholus M.R.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://lingngu.elpub.ru/jour/article/view/906">https://lingngu.elpub.ru/jour/article/view/906</self-uri><abstract><p>Автоматическая саммаризация текста – одна из ключевых задач NLP, предполагающая создание краткой версии исходного текста. В современном мире, где объемы потребляемой человеком информации неустанно растут, задаче саммаризации уделяется все больше внимания. Автореферирование предполагает два основных подхода: экстрактивный и абстрактивный. Последний заключается в автоматическом создании саммари текста, в котором могут содержаться слова и предложения, не встречающиеся в источнике. Этот подход зачастую требует использования нейросетевых моделей, и для его реализации необходимы большие наборы специальным образом размеченных данных. Несмотря на значительные успехи в абстрактивной саммаризации публицистических и научных текстов, методы и датасеты, используемые для работы с монологическими документами, не всегда применимы для саммаризации диалогов. Кроме того, хотя создано достаточно много англоязычных датасетов для саммаризации текстов различных доменов, существующие наборы данных для автоматического аннотирования текстов на русском языке пока немногочисленны. Настоящая статья посвящена разработке и описанию русскоязычного диалогового датасета для саммаризации сообщений в родительских чатах и последующему обучению модели абстрактивной саммаризации для русского языка на авторском наборе диалоговых данных. В качестве материала выступил родительский чат с учителем в мессенджере WhatsApp. Процесс ручной разметки датасета включал в себя разбиение всех сообщений чата на отдельные диалоги, создание саммари и присвоение тематических меток для каждого разговора. В результате был создан датасет, содержащий 616 диалогов, в общей сложности состоящих из 3380 сообщений. Для файн-тьюнинга были выбраны модели-трансформеры ruT5, mT5 и RuGPT (ruT5 и RuGPT были предварительно обучены на русскоязычном датасете для автоматической саммаризации новостей), а для оценки их качества – метрики ROUGE-1, ROUGE-2, ROUGE-L, BLEU и BERTScore. 
В результате модели ruT5, дообученной на авторском датасете, удалось превзойти бейзлайн по всем пяти метрикам.</p></abstract><trans-abstract xml:lang="en"><p>Automatic text summarization is one of the main tasks of natural language processing (NLP), which consists in creating a shorter version of the source text. In today’s world, the amount of information consumed by people is constantly increasing; therefore, more and more emphasis is being placed on the task of summarization. There are two main approaches to automatic text summarization: extractive and abstractive. The latter involves the automatic creation of a summary that may contain words and phrases not present in the source. This approach usually requires the use of neural network models, which creates a demand for large sets of specially labeled data. Despite significant advances in the summarization of scientific and news articles, the methods and datasets applied to monologue documents are not always suitable for dialogue summarization. Moreover, although a considerable number of English-language summarization datasets exist for various domains, the datasets available for Russian are still few. The paper is devoted to the development and description of a Russian-language dialogue dataset for summarizing group chat messages and to the subsequent fine-tuning of abstractive summarization models for Russian on this custom dataset. A parental chat with a teacher on WhatsApp was used as the material for the dataset. Manual labeling of the dataset consisted in dividing the entire group chat into separate dialogues, writing a summary, and adding topic labels for each dialogue. As a result, a dataset was created that includes 616 dialogues with a total of 3380 messages. The ruT5, mT5, and RuGPT transformer models were selected for fine-tuning; the ruT5 and RuGPT models had been pre-trained on a Russian-language dataset for automatic news summarization. The ROUGE-1, ROUGE-2, ROUGE-L, BLEU, and BERTScore metrics were used to evaluate the quality of the models. Ultimately, the ruT5 model fine-tuned on the custom dataset outperformed the baseline model on all five metrics.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>автоматическая саммаризация текста</kwd><kwd>диалоговая саммаризация</kwd><kwd>машинное обучение</kwd><kwd>трансформеры</kwd><kwd>обработка естественного языка</kwd></kwd-group><kwd-group xml:lang="en"><kwd>automatic text summarization</kwd><kwd>dialogue summarization</kwd><kwd>machine learning</kwd><kwd>transformers</kwd><kwd>dataset</kwd><kwd>NLP</kwd></kwd-group><funding-group xml:lang="ru"><funding-statement>Исследование подготовлено по материалам проекта «Текст как Big Data: методы и модели работы с большими текстовыми данными», выполняемого в рамках Программы фундаментальных исследований НИУ ВШЭ в 2024 году.</funding-statement></funding-group><funding-group xml:lang="en"><funding-statement>The article is based on the materials of the project “Text as Big Data: methods and models of working with large text data”, carried out within the framework of the HSE Fundamental Research Program in 2024.</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">An C., Zhong M., Chen Y., Wang D., Qiu X., Huang X. Enhancing Scientific Papers Summarization with Citation Graph. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35(14), pp. 12498–12506. 
https://doi.org/10.1609/aaai.v35i14.17482</mixed-citation><mixed-citation xml:lang="en">An C., Zhong M., Chen Y., Wang D., Qiu X., Huang X. Enhancing Scientific Papers Summarization with Citation Graph. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35(14), pp. 12498–12506. https://doi.org/10.1609/aaai.v35i14.17482</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Ben Abacha A., Yim W., Fan Y., Lin T. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2291–2302. Available at: https://aclanthology.org/2023.eacl-main.168.pdf (аccessed: June 23, 2024).</mixed-citation><mixed-citation xml:lang="en">Ben Abacha A., Yim W., Fan Y., Lin T. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2291–2302. Available at: https://aclanthology.org/2023.eacl-main.168.pdf (аccessed: June 23, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Budzianowski P., Wen T., Tseng B.H., Casanueva I., Ultes S., Ramadan O., et al. MultiWOZ – A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 5016–5026. https://doi.org/10.18653/v1/d18-1547</mixed-citation><mixed-citation xml:lang="en">Budzianowski P., Wen T., Tseng B.H., Casanueva I., Ultes S., Ramadan O., et al. MultiWOZ – A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 5016–5026. https://doi.org/10.18653/v1/d18-1547</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Bylieva D., Lobatyuk V., Novikov M. Parent Chats in Education System: During and after the Pandemic Outbreak. Education Sciences, 2023, vol. 13(8), pp. 778–794. https://doi.org/10.3390/educsci13080778</mixed-citation><mixed-citation xml:lang="en">Bylieva D., Lobatyuk V., Novikov M. Parent Chats in Education System: During and after the Pandemic Outbreak. Education Sciences, 2023, vol. 13(8), pp. 778–794. https://doi.org/10.3390/educsci13080778</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Carletta J., Ashby S., Bourban S., Flynn M., Guillemot M., Hain T., et al. The AMI Meeting Corpus: A Pre-announcement. Lecture Notes in Computer Science, 2006, pp. 28–39. https://doi.org/10.1007/11677482_3</mixed-citation><mixed-citation xml:lang="en">Carletta J., Ashby S., Bourban S., Flynn M., Guillemot M., Hain T., et al. The AMI Meeting Corpus: A Pre-announcement. Lecture Notes in Computer Science, 2006, pp. 28–39. https://doi.org/10.1007/11677482_3</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Chen Y., Liu Y., Chen L., Zhang Y. DialogSum: A Real-Life Scenario Dialogue Summarization Dataset. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 5062–5074. 
Available at: https://aclanthology.org/2021.findings-acl.449.pdf (аccessed: June 24, 2024).</mixed-citation><mixed-citation xml:lang="en">Chen Y., Liu Y., Chen L., Zhang Y. DialogSum: A Real-Life Scenario Dialogue Summarization Dataset. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 5062–5074. Available at: https://aclanthology.org/2021.findings-acl.449.pdf (аccessed: June 24, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Chowdhury S.B.R., Monath N., Dubey A., Zaheer M., McCallum A., Ahmed A., Chaturvedi S. Incremental Extractive Opinion Summarization Using Cover Trees. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2401.08047 (аccessed: June 24, 2024).</mixed-citation><mixed-citation xml:lang="en">Chowdhury S.B.R., Monath N., Dubey A., Zaheer M., McCallum A., Ahmed A., Chaturvedi S. Incremental Extractive Opinion Summarization Using Cover Trees. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2401.08047 (аccessed: June 24, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Cohan A., Dernoncourt F., Kim D. S., Bui T., Kim S., Chang W., Goharian N. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, vol. 2, pp. 615–621. https://doi.org/10.18653/v1/n18-2097</mixed-citation><mixed-citation xml:lang="en">Cohan A., Dernoncourt F., Kim D. S., Bui T., Kim S., Chang W., Goharian N. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, vol. 2, pp. 615–621. https://doi.org/10.18653/v1/n18-2097</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Dutta S., Chandra V., Mehra K., Ghatak S., Das A. K., Ghosh S. Summarizing Microblogs during Emergency Events: A Comparison of Extractive Summarization Algorithms. International Conference on Emerging Technologies in Data Mining and Information Security, 2018. Available at: https://www.researchgate.net/publication/325593717_Summarizing_Microblogs_during_Emergency_Events_A_Comparison_of_Extractive_Summarization_Algorithms (аccessed: June 25, 2024).</mixed-citation><mixed-citation xml:lang="en">Dutta S., Chandra V., Mehra K., Ghatak S., Das A. K., Ghosh S. Summarizing Microblogs during Emergency Events: A Comparison of Extractive Summarization Algorithms. International Conference on Emerging Technologies in Data Mining and Information Security, 2018. Available at: https://www.researchgate.net/publication/325593717_Summarizing_Microblogs_during_Emergency_Events_A_Comparison_of_Extractive_Summarization_Algorithms (аccessed: June 25, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Feigenblat G., Gunasekara R. C., Sznajder B., Joshi S., Konopnicki D., Aharonov R. TWEET-SUMM A Dialog Summarization Dataset for Customer Service. Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 245–260. 
https://doi.org/10.18653/v1/2021.findings-emnlp.24</mixed-citation><mixed-citation xml:lang="en">Feigenblat G., Gunasekara R. C., Sznajder B., Joshi S., Konopnicki D., Aharonov R. TWEET-SUMM A Dialog Summarization Dataset for Customer Service. Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 245–260. https://doi.org/10.18653/v1/2021.findings-emnlp.24</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Feng X., Feng X., Qin B. A Survey on Dialogue Summarization: Recent Advances and New Frontiers. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, pp. 5453–5460. https://doi.org/10.24963/ijcai.2022/764</mixed-citation><mixed-citation xml:lang="en">Feng X., Feng X., Qin B. A Survey on Dialogue Summarization: Recent Advances and New Frontiers. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, pp. 5453–5460. https://doi.org/10.24963/ijcai.2022/764</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Ghosh A., Acharya A., Jha P., Gaudgaul A., Majumdar R., Saha S., Chadha A., Jain R., Sinha S., Agarwal S. MedSUMM: A Multimodal Approach to Summarizing Code-Mixed Hindi-English clinical queries. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2401.01596 (аccessed: June 24, 2024).</mixed-citation><mixed-citation xml:lang="en">Ghosh A., Acharya A., Jha P., Gaudgaul A., Majumdar R., Saha S., Chadha A., Jain R., Sinha S., Agarwal S. MedSUMM: A Multimodal Approach to Summarizing Code-Mixed Hindi-English clinical queries. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2401.01596 (аccessed: June 24, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Gliwa B., Mochol I., Biesek M., Wawer A. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. Proceedings of the 2nd Workshop on New Frontiers in Summarization, 2019, pp. 70–79. https://doi.org/10.18653/v1/d19-5409</mixed-citation><mixed-citation xml:lang="en">Gliwa B., Mochol I., Biesek M., Wawer A. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. Proceedings of the 2nd Workshop on New Frontiers in Summarization, 2019, pp. 70–79. https://doi.org/10.18653/v1/d19-5409</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Gusev I. Dataset for Automatic Summarization of Russian News. In: Communications in computer and information science, 2020, pp. 122–134. https://doi.org/10.1007/978-3-030-59082-6_9</mixed-citation><mixed-citation xml:lang="en">Gusev I. Dataset for Automatic Summarization of Russian News. In: Communications in computer and information science, 2020, pp. 122–134. https://doi.org/10.1007/978-3-030-59082-6_9</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Hasan T., Bhattacharjee A., Islam Md. S., Mubasshir K., Li Y., Kang Y. B., Rahman S., Shahriyar R. XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4693–703. 
https://doi.org/10.18653/v1/2021.findings-acl.413</mixed-citation><mixed-citation xml:lang="en">Hasan T., Bhattacharjee A., Islam Md. S., Mubasshir K., Li Y., Kang Y. B., Rahman S., Shahriyar R. XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4693–703. https://doi.org/10.18653/v1/2021.findings-acl.413</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Hermann K. M., Kočiský T., Grefenstette E., Espeholt L., Kay W., Suleyman M., Blunsom P. Teaching Machines to Read and Comprehend. Neural Information Processing Systems, 2015, vol. 28, pp. 1693–1701. Available at: http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf (аccessed: June 25, 2024).</mixed-citation><mixed-citation xml:lang="en">Hermann K. M., Kočiský T., Grefenstette E., Espeholt L., Kay W., Suleyman M., Blunsom P. Teaching Machines to Read and Comprehend. Neural Information Processing Systems, 2015, vol. 28, pp. 1693–1701. Available at: http://papers.nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf (аccessed: June 25, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Janin A., Baron D., Edwards J. A., Ellis D. P. W., Gelbart D., Morgan N., et al. The ICSI meeting corpus. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 Proceedings (ICASSP ’03). 2003. Available at: https://www.researchgate.net/publication/4015071_The_ICSI_meeting_corpus (аccessed: June 23, 2024).</mixed-citation><mixed-citation xml:lang="en">Janin A., Baron D., Edwards J. A., Ellis D. P. W., Gelbart D., Morgan N., et al. The ICSI meeting corpus. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 Proceedings (ICASSP ’03). 2003. Available at: https://www.researchgate.net/publication/4015071_The_ICSI_meeting_corpus (аccessed: June 23, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Jin H., Yang Z., Meng D., Wang J., Tan J. A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2403.02901 (аccessed: June 24, 2024).</mixed-citation><mixed-citation xml:lang="en">Jin H., Yang Z., Meng D., Wang J., Tan J. A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2403.02901 (аccessed: June 24, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Khalman M., Zhao Y., Saleh M. ForumSum: A Multi-Speaker Conversation Summarization Dataset. Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4592– 4599. https://doi.org/10.18653/v1/2021.findings-emnlp.391</mixed-citation><mixed-citation xml:lang="en">Khalman M., Zhao Y., Saleh M. ForumSum: A Multi-Speaker Conversation Summarization Dataset. Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4592– 4599. 
https://doi.org/10.18653/v1/2021.findings-emnlp.391</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">Koupaee M., Wang W. Y. WikiHow: A Large Scale Text Summarization Dataset. arXiv (Cornell University). 2018. Available at: https://arxiv.org/pdf/1810.09305 (аccessed: June 24, 2024).</mixed-citation><mixed-citation xml:lang="en">Koupaee M., Wang W. Y. WikiHow: A Large Scale Text Summarization Dataset. arXiv (Cornell University). 2018. Available at: https://arxiv.org/pdf/1810.09305 (аccessed: June 24, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit21"><label>21</label><citation-alternatives><mixed-citation xml:lang="ru">Lin C. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004. Available at: https://aclanthology.org/W04-1013.pdf (аccessed: June 27, 2024).</mixed-citation><mixed-citation xml:lang="en">Lin C. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004. Available at: https://aclanthology.org/W04-1013.pdf (аccessed: June 27, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit22"><label>22</label><citation-alternatives><mixed-citation xml:lang="ru">Liu C., Wang P., Xu J., Zang L., Ye J. Automatic Dialogue Summary Generation for Customer Service. KDD ’19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining. 2019, pp. 1957–1965. https://doi.org/10.1145/3292500.3330683</mixed-citation><mixed-citation xml:lang="en">Liu C., Wang P., Xu J., Zang L., Ye J. Automatic Dialogue Summary Generation for Customer Service. KDD ’19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining. 2019, pp. 1957–1965. https://doi.org/10.1145/3292500.3330683</mixed-citation></citation-alternatives></ref><ref id="cit23"><label>23</label><citation-alternatives><mixed-citation xml:lang="ru">Liu L., Lu Y., Yang M., Qu Q., Zhu J., Li H. Generative Adversarial Network for Abstractive Text Summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32(1). Available at https://arxiv.org/abs/1711.09357 (аccessed: June 25, 2024).</mixed-citation><mixed-citation xml:lang="en">Liu L., Lu Y., Yang M., Qu Q., Zhu J., Li H. Generative Adversarial Network for Abstractive Text Summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32(1). Available at https://arxiv.org/abs/1711.09357 (аccessed: June 25, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit24"><label>24</label><citation-alternatives><mixed-citation xml:lang="ru">Liu Y. Fine-tune BERT for Extractive Summarization. arXiv (Cornell University). 2019. Available at: https://arxiv.org/pdf/1903.10318.pdf (аccessed: June 23, 2024).</mixed-citation><mixed-citation xml:lang="en">Liu Y. Fine-tune BERT for Extractive Summarization. arXiv (Cornell University). 2019. Available at: https://arxiv.org/pdf/1903.10318.pdf (аccessed: June 23, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit25"><label>25</label><citation-alternatives><mixed-citation xml:lang="ru">Luhn H. P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 1958, vol. 2(2), pp. 159–165. https://doi.org/10.1147/rd.22.0159</mixed-citation><mixed-citation xml:lang="en">Luhn H. P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 1958, vol. 2(2), pp. 159–165. 
https://doi.org/10.1147/rd.22.0159</mixed-citation></citation-alternatives></ref><ref id="cit26"><label>26</label><citation-alternatives><mixed-citation xml:lang="ru">Lyu M. R., Cheng P., Li X., Balian P., Bian J., Wu Y. Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2403.13089 (аccessed: June 24, 2024).</mixed-citation><mixed-citation xml:lang="en">Lyu M. R., Cheng P., Li X., Balian P., Bian J., Wu Y. Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning. arXiv (Cornell University). 2024. Available at: https://arxiv.org/abs/2403.13089 (аccessed: June 24, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit27"><label>27</label><citation-alternatives><mixed-citation xml:lang="ru">Malykh V., Chernis K., Artemova E., Piontkovskaya I. SumTitles: a Summarization Dataset with Low Extractiveness. Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 5718–5730. https://doi.org/10.18653/v1/2020.coling-main.503</mixed-citation><mixed-citation xml:lang="en">Malykh V., Chernis K., Artemova E., Piontkovskaya I. SumTitles: a Summarization Dataset with Low Extractiveness. Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 5718–5730. https://doi.org/10.18653/v1/2020.coling-main.503</mixed-citation></citation-alternatives></ref><ref id="cit28"><label>28</label><citation-alternatives><mixed-citation xml:lang="ru">Moratanch N., Gopalan С. A survey on Extractive Text Summarization. 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP). 2017. Available at: https://ieeexplore.ieee.org/document/7944061 (аccessed: June 25, 2024).</mixed-citation><mixed-citation xml:lang="en">Moratanch N., Gopalan С. A survey on Extractive Text Summarization. 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP). 2017. Available at: https://ieeexplore.ieee.org/document/7944061 (аccessed: June 25, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit29"><label>29</label><citation-alternatives><mixed-citation xml:lang="ru">Napoles C., Gormley M. R., Van Durme B. Annotated Gigaword. Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), 2012, pp. 95–100. Available at: https://aclanthology.org/W12-3018.pdf (аccessed: June 23, 2024).</mixed-citation><mixed-citation xml:lang="en">Napoles C., Gormley M. R., Van Durme B. Annotated Gigaword. Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), 2012, pp. 95–100. Available at: https://aclanthology.org/W12-3018.pdf (аccessed: June 23, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit30"><label>30</label><citation-alternatives><mixed-citation xml:lang="ru">Narayan S., Cohen S. B., Lapata M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1797–1807. https://doi.org/10.18653/v1/d18-1206</mixed-citation><mixed-citation xml:lang="en">Narayan S., Cohen S. B., Lapata M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. 
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1797–1807. https://doi.org/10.18653/v1/d18-1206</mixed-citation></citation-alternatives></ref><ref id="cit31"><label>31</label><citation-alternatives><mixed-citation xml:lang="ru">Nedoluzhko A., Singh M., Hledíková M., Ghosal T., Bojar O. ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech. Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3174–3182. Available at: https://aclanthology.org/2022.lrec-1.340/ (аccessed: June 23, 2024).</mixed-citation><mixed-citation xml:lang="en">Nedoluzhko A., Singh M., Hledíková M., Ghosal T., Bojar O. ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech. Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3174–3182. Available at: https://aclanthology.org/2022.lrec-1.340/ (аccessed: June 23, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit32"><label>32</label><citation-alternatives><mixed-citation xml:lang="ru">Papineni K., Roukos S., Ward T., Zhu W. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311–318. Available at: https://aclanthology.org/P02-1040.pdf (аccessed: June 27, 2024).</mixed-citation><mixed-citation xml:lang="en">Papineni K., Roukos S., Ward T., Zhu W. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002, pp. 311–318. Available at: https://aclanthology.org/P02-1040.pdf (аccessed: June 27, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit33"><label>33</label><citation-alternatives><mixed-citation xml:lang="ru">Rameshkumar R., Bailey P. Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5121–34. https://doi.org/10.18653/v1/2020.acl-main.459</mixed-citation><mixed-citation xml:lang="en">Rameshkumar R., Bailey P. Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5121–34. https://doi.org/10.18653/v1/2020.acl-main.459</mixed-citation></citation-alternatives></ref><ref id="cit34"><label>34</label><citation-alternatives><mixed-citation xml:lang="ru">Shukla A., Bhattacharya P., Poddar S., Mukherjee R., Ghosh K., Goyal P., Ghosh S. Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation. arXiv (Cornell University), 2022. Available at: https://arxiv.org/abs/2210.07544 (аccessed: June 22, 2024).</mixed-citation><mixed-citation xml:lang="en">Shukla A., Bhattacharya P., Poddar S., Mukherjee R., Ghosh K., Goyal P., Ghosh S. Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation. arXiv (Cornell University), 2022. Available at: https://arxiv.org/abs/2210.07544 (аccessed: June 22, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit35"><label>35</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang S., Çelikyılmaz A., Gao J., Bansal M. EmailSum: Abstractive Email Thread Summarization. 
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, vol. 1, pp. 6895–6909. https://doi.org/10.18653/v1/2021.acl-long.537</mixed-citation><mixed-citation xml:lang="en">Zhang S., Çelikyılmaz A., Gao J., Bansal M. EmailSum: Abstractive Email Thread Summarization. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, vol. 1, pp. 6895–6909. https://doi.org/10.18653/v1/2021.acl-long.537</mixed-citation></citation-alternatives></ref><ref id="cit36"><label>36</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang T., Kishore V., Wu F., Weinberger K. Q., Artzi Y. BERTScore: Evaluating Text Generation with BERT. arXiv (Cornell University), 2020. Available at: https://arxiv.org/pdf/1904.09675 (аccessed: June 27, 2024).</mixed-citation><mixed-citation xml:lang="en">Zhang T., Kishore V., Wu F., Weinberger K. Q., Artzi Y. BERTScore: Evaluating Text Generation with BERT. arXiv (Cornell University), 2020. Available at: https://arxiv.org/pdf/1904.09675 (аccessed: June 27, 2024).</mixed-citation></citation-alternatives></ref><ref id="cit37"><label>37</label><citation-alternatives><mixed-citation xml:lang="ru">Zhong M., Yin D., Yu T., Zaidi A., Mutuma M., Jha R., et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 5905–5921. https://doi.org/10.18653/v1/2021.naacl-main.472</mixed-citation><mixed-citation xml:lang="en">Zhong M., Yin D., Yu T., Zaidi A., Mutuma M., Jha R., et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 5905–5921. https://doi.org/10.18653/v1/2021.naacl-main.472</mixed-citation></citation-alternatives></ref><ref id="cit38"><label>38</label><citation-alternatives><mixed-citation xml:lang="ru">Zhu C., Liu Y., Mei J., Zeng M. MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 5927–5934. https://doi.org/10.18653/v1/2021.naacl-main.474</mixed-citation><mixed-citation xml:lang="en">Zhu C., Liu Y., Mei J., Zeng M. MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 5927–5934. https://doi.org/10.18653/v1/2021.naacl-main.474</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
