NSU Vestnik. Series: Linguistics and Intercultural Communication

Distinctive Features of Association Measures Applied to Chinese Character Bigram Extraction Tasks

https://doi.org/10.25205/1818-7935-2022-20-2-64-80

Abstract

Researchers studying professional discourse can now compile their own text collections and analyze them with linguistic software tools. For Chinese discourse, however, automatic word segmentation remains unreliable. One way to extract lexical units from Chinese texts is to apply statistical association measures for collocations to Chinese character bigrams. This work presents a comparative analysis of seven statistical association measures as a means of extracting disyllabic lexical units (binomes) from unsegmented Chinese character text. The analysis focuses on the lexical, grammatical and frequency characteristics of the bigrams that receive the highest values of each measure. Comparing these characteristics reveals the distinctive features of the measures and, in particular, which measure best suits which linguistic task. The material of the study is a collection of 560 military-related news texts in Chinese totaling more than 720 thousand characters.

The results show that the measures fall into three groups according to the characteristics of the bigrams they rank highest. The first group comprises MI, MS and logDice, which favor rare bigrams whose components have limited combinability, such as Chinese disyllabic single-morpheme words (“lianmianzi”). These measures do not extract terms well but can be used to search for phraseologically bound components. The measures of the second group, t-score and log-likelihood, are frequency-oriented and give results similar to plain frequency analysis, but they handle non-lexical bigrams better; log-likelihood also lowers the rank of numerals and pronouns somewhat and is best at picking out the typical vocabulary of professional discourse. The third group comprises MI3 and MI.log-f, which strike a balance between the opposite tendencies of the first two groups. MI3 is considered the most universal measure and could be used to compare different corpora or text collections. The study concludes that applying statistical association measures to Chinese character bigrams is feasible and appropriate, provided that the specific properties of each measure are matched to the research task.
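To make the seven measures concrete, the following is a minimal sketch of how they might be computed for character bigrams in unsegmented text. The abstract does not give the formulas, so the sketch assumes the formulations commonly used in corpus tools (for example, MI.log-f as MI multiplied by the natural logarithm of the bigram frequency plus one, and logDice with the constant 14); the article's own definitions may differ. The function name `bigram_scores`, the positional counting of components, and the toy sample string are illustrative assumptions, not part of the study.

```python
import math
from collections import Counter


def bigram_scores(text, min_freq=1):
    """Score every adjacent character pair with seven association measures.

    Assumptions (not taken from the article): whitespace is dropped, N is the
    number of bigram positions, f_x counts a character in first position and
    f_y counts a character in second position.
    """
    chars = [c for c in text if not c.isspace()]
    pairs = list(zip(chars, chars[1:]))
    n = len(pairs)
    f_xy = Counter(pairs)
    f_x = Counter(a for a, _ in pairs)
    f_y = Counter(b for _, b in pairs)

    def term(observed, expected):
        # One summand of the log-likelihood ratio; an empty cell contributes 0.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    scores = {}
    for (a, b), o11 in f_xy.items():
        if o11 < min_freq:
            continue
        fa, fb = f_x[a], f_y[b]
        mi = math.log2(o11 * n / (fa * fb))

        # Observed and expected cells of the 2x2 contingency table.
        o12, o21, o22 = fa - o11, fb - o11, n - fa - fb + o11
        e11, e12 = fa * fb / n, fa * (n - fb) / n
        e21, e22 = (n - fa) * fb / n, (n - fa) * (n - fb) / n

        scores[a + b] = {
            "MI": mi,
            "MI3": math.log2(o11 ** 3 * n / (fa * fb)),
            "MI.log-f": mi * math.log(o11 + 1),
            "MS": min(o11 / fa, o11 / fb),
            "logDice": 14 + math.log2(2 * o11 / (fa + fb)),
            "t-score": (o11 - e11) / math.sqrt(o11),
            "log-likelihood": 2 * (term(o11, e11) + term(o12, e12)
                                   + term(o21, e21) + term(o22, e22)),
        }
    return scores


if __name__ == "__main__":
    sample = "葡萄很甜葡萄很贵我很好"  # toy string, not from the study corpus
    result = bigram_scores(sample)
    # Rank bigrams by one measure, highest first.
    for bigram, s in sorted(result.items(), key=lambda kv: -kv[1]["MI3"]):
        print(bigram, round(s["MI3"], 3), round(s["t-score"], 3))
```

Sorting the same dictionary by different keys gives a quick way to see how the rankings diverge between measures on a real collection; on a toy string this short the differences are not meaningful.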

About the Author

D. S. Korshunov
Military University of Radio Electronics
Russian Federation

Dmitry S. Korshunov, Candidate of Sciences (Philology)

SPIN 7282-7336

Cherepovets





For citations:


Korshunov D.S. Distinctive Features of Association Measures Applied to Chinese Character Bigram Extraction Tasks. NSU Vestnik. Series: Linguistics and Intercultural Communication. 2022;20(2):64-80. (In Russ.) https://doi.org/10.25205/1818-7935-2022-20-2-64-80


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7935 (Print)