NSU Vestnik. Series: Linguistics and Intercultural Communication

AUTOMATIC EXTRACTION OF FORMULAIC EXPRESSIONS FROM RUSSIAN TEXTS

https://doi.org/10.25205/1818-7935-2018-16-2-5-18

Abstract

This paper describes the automatic extraction of linguistic items that we call formulaic expressions (FEs) from Russian drama texts. By FEs we mean multiword constructions that contain no variables and are used as reactions to verbal stimuli. We treat FEs as a specific kind of construction within the framework of Construction Grammar. They are therefore to be described in the Constructicon project, a web platform that presents the constructions of a language in a form suited to automatic search by various criteria. To facilitate the compilation of a comprehensive FE list, we developed a module for automatic FE extraction. Implementing the module involved several stages, including manual annotation of dramatic texts. First, we described the features of FEs and how they differ from other syntactic items such as parenthetical words, lexical verbs, and meaningful parts of the sentence. Next, two annotators marked FEs in 24 dramatic texts, and 46 texts were annotated semi-automatically. We then used the 34 dramatic texts with the highest inter-annotator agreement. FE extraction involves splitting the text into fragments that correspond to clauses, predicting from a feature set whether each fragment is an FE, and compiling the final list of FEs. For prediction, we use a uniformly weighted vote of four classifiers (Random Forest Classifier, Logistic Regression, Ridge Classifier, Support Vector Classifier), which outperformed both a rule-based baseline and classifiers outside the ensemble. We also compared the prediction quality of systems based on different feature sets and chose the one that uses all the features. The best quality achieved so far is a precision of 0.30 and a recall of 0.73 (F1-score 0.42). Further development includes improving the preprocessing stage and employing the left context, where the FE stimulus is located. We are also considering distributional semantic models such as word2vec for word embeddings, as well as neural networks.
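The three extraction steps named in the abstract (splitting the text into fragments, voting on each fragment, compiling the final list) can be illustrated with a small sketch. This is not the authors' module. The punctuation-based splitter, the four toy voters, and every function name below are hypothetical stand-ins for the trained classifiers and the feature set, and the tie rule (a 2:2 split counts as "not an FE") is an assumption.

```python
import re
from typing import Callable, List

# A voter maps a text fragment to 1 (formulaic expression) or 0 (not one).
# In a real system each voter would be a trained classifier; these are toys.
Voter = Callable[[str], int]


def split_into_fragments(text: str) -> List[str]:
    """Split text into rough clause-like fragments on punctuation.

    Assumption: punctuation is a crude proxy for clause boundaries.
    """
    parts = re.split(r"[.!?,;:]+", text)
    return [p.strip() for p in parts if p.strip()]


def uniform_vote(fragment: str, voters: List[Voter]) -> int:
    """Return 1 only if a strict majority of equally weighted voters says 1.

    Assumption: a tie (for example 2:2 with four voters) yields 0.
    """
    votes = sum(voter(fragment) for voter in voters)
    return 1 if votes * 2 > len(voters) else 0


def extract_fes(text: str, voters: List[Voter]) -> List[str]:
    """Return the unique fragments predicted as FEs, in order of appearance."""
    found: List[str] = []
    for fragment in split_into_fragments(text):
        if uniform_vote(fragment, voters) and fragment not in found:
            found.append(fragment)
    return found


# Hypothetical hand-written rules standing in for the four classifiers.
TOY_VOTERS: List[Voter] = [
    lambda f: int(len(f.split()) <= 3),                          # short fragment
    lambda f: int(f.split()[0].lower() in {"ну", "да", "вот"}),  # starts with a particle
    lambda f: int(not any(ch.isdigit() for ch in f)),            # contains no digits
    lambda f: int(len(f) <= 12),                                 # few characters
]

if __name__ == "__main__":
    sample = "Ну и ну! Я купил три книги вчера вечером. Ну и ну!"
    print(extract_fes(sample, TOY_VOTERS))
```

On the sample text, the repeated fragment "Ну и ну" receives all four votes and is listed once, while the longer clause receives a single vote and is rejected.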

About the Authors

S. Yu. Puzhaeva
National Research University - Higher School of Economics
Russian Federation


E. A. Gerasimenko
National Research University - Higher School of Economics
Russian Federation


E. S. Zakharova
National Research University - Higher School of Economics
Russian Federation


E. V. Rakhilina
National Research University - Higher School of Economics; Vinogradov Institute of Russian Language RAS
Russian Federation


For citations:


Puzhaeva S.Yu., Gerasimenko E.A., Zakharova E.S., Rakhilina E.V. AUTOMATIC EXTRACTION OF FORMULAIC EXPRESSIONS FROM RUSSIAN TEXTS. NSU Vestnik. Series: Linguistics and Intercultural Communication. 2018;16(2):5-18. (In Russ.) https://doi.org/10.25205/1818-7935-2018-16-2-5-18

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7935 (Print)