
NSU Vestnik. Series: Linguistics and Intercultural Communication


A NEW TOOLKIT FOR NATURAL TEXT PROCESSING WITH THE TXM PLATFORM AND ITS APPLICATION TO A CORPUS FOR ANALYSIS OF TEXTS PROPAGATING EXTREMIST VIEWS

https://doi.org/10.25205/1818-7935-2018-16-3-19-31

Abstract

The TXM platform provides a wide range of corpus analysis tools, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. The only TXM extension available by default is TreeTagger, which performs automated morphological analysis and lemmatization during corpus import. However, each token can be supplied with a number of additional features that enable more advanced text analysis. In this work we present a set of tools developed for even more extensive, complex, and flexible corpus analysis with TXM, relying both on tools previously developed by our team and on publicly available software libraries. We focus in particular on a stemming technique based on word structural patterns and on noun phrase recognition, which together make it possible to run more sophisticated and powerful queries and analyses that are not limited to word forms. The structural pattern stemming method is based on a set of language-specific rules that separate a word stem from all its affixes. Noun phrase recognition is based on rules that detect subordination and coordination relations among nouns. These extensions improve the performance of the statistical tools used by TXM, such as specificity scores and correspondence analysis. The new set of tools has been tested on a corpus that includes texts marked as "extremist" by experts along with "neutral" texts from similar domains. The corpus of approximately 900,000 words is divided into eight subcorpora: neutral texts are contrasted with seven thematic subcorpora considered extremist (namely aggressive, fascist, ideological, nationalistic, religious, separatist, and terroristic). Specificity analysis detects the words (or other structural units) that are significantly more or less frequent in a given subcorpus than in the entire corpus.
The specificity scores of selected units can be compared across all the subcorpora to assess how different or similar they are. Correspondence analysis produces a chart in which the subcorpora are represented as points in a two-dimensional space according to how similar they are in the frequency of selected units. All tests demonstrated a significant difference between the neutral texts on the one hand and the texts marked as extremist on the other. Two "extremist" subcorpora, religious and ideological, showed similar results and can probably be merged. These findings encourage further research on fully automatic or computer-aided expert recognition of extremist texts.
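
To make the idea of a specificity score more concrete, the following is a minimal sketch in Python. It assumes the hypergeometric model commonly used for lexical specificity; whether TXM computes its scores in exactly this way is an assumption, not something stated above. The function and parameter names (`specificity`, `f`, `F`, `t`, `T`) are hypothetical and chosen only for illustration. The score is the signed base-10 logarithm of the tail probability: positive when a unit is over-represented in a subcorpus, negative when it is under-represented.

```python
from math import exp, lgamma, log10


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via lgamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(k: int, F: int, t: int, T: int) -> float:
    """P(X = k): probability that a unit with F occurrences in a corpus
    of T tokens occurs exactly k times in a subcorpus of t tokens."""
    if k < max(0, F - (T - t)) or k > min(F, t):
        return 0.0
    return exp(log_comb(F, k) + log_comb(T - F, t - k) - log_comb(T, t))


def specificity(f: int, F: int, t: int, T: int) -> float:
    """Signed specificity score (assumed convention, see lead-in).

    f: observed frequency of the unit in the subcorpus
    F: frequency of the unit in the whole corpus
    t: size of the subcorpus in tokens
    T: size of the whole corpus in tokens
    """
    low, high = max(0, F - (T - t)), min(F, t)
    expected = F * t / T
    if f >= expected:
        # Over-representation: probability of seeing f or more occurrences.
        p = sum(hypergeom_pmf(k, F, t, T) for k in range(f, high + 1))
        sign = 1.0
    else:
        # Under-representation: probability of seeing f or fewer occurrences.
        p = sum(hypergeom_pmf(k, F, t, T) for k in range(low, f + 1))
        sign = -1.0
    p = min(p, 1.0)  # guard against floating-point overshoot
    if p == 0.0:
        return sign * float("inf")
    return -sign * log10(p)


if __name__ == "__main__":
    # Hypothetical counts: a unit occurring 120 times in a 900,000-token
    # corpus, 60 of them in a 100,000-token subcorpus.
    print(specificity(f=60, F=120, t=100_000, T=900_000))
```

Computing the score for the same unit in each subcorpus gives one value per subcorpus, which is the kind of comparison across subcorpora that the abstract describes.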

About the Authors

A. M. Lavrentiev
IHRIM Research Lab, CNRS & ENS de Lyon
Russian Federation


F. N. Solovyev
Institute of Physical and Technical Informatics
Russian Federation


M. I. Suvorova
Federal Research Center - Computer Science and Control RAS
Russian Federation


A. I. Fokina
National Research University - Higher School of Economics
Russian Federation


A. M. Chepovskiy
National Research University - Higher School of Economics
Russian Federation




For citations:


Lavrentiev A.M., Solovyev F.N., Suvorova M.I., Fokina A.I., Chepovskiy A.M. A NEW TOOLKIT FOR NATURAL TEXT PROCESSING WITH THE TXM PLATFORM AND ITS APPLICATION TO A CORPUS FOR ANALYSIS OF TEXTS PROPAGATING EXTREMIST VIEWS. NSU Vestnik. Series: Linguistics and Intercultural Communication. 2018;16(3):19-31. (In Russ.) https://doi.org/10.25205/1818-7935-2018-16-3-19-31



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7935 (Print)