China Terminology ›› 2021, Vol. 23 ›› Issue (3): 59-67. doi: 10.12339/j.issn.1673-8578.2021.03.009

Extracting Terms from Russian Texts Based on Multi Strategies

TANG Juxiang1, SUN Yihui1, LIAO Xiao2, LIU Jianguo3, YU Juan1

  • Received: 2021-05-11  Online: 2021-07-05  Published: 2021-06-28

Abstract:

Russian is one of the working languages of the United Nations and an official language of many countries, including Russia. With the advancement of the Belt and Road Initiative and the acceleration of globalization, Russian text data has become an important information resource for managerial decision-making in related organizations, and Russian text mining has thus become an important decision-making tool. However, Russian text mining methods are still far from mature. This is especially true of term extraction, an essential step whose quality affects the accuracy of Russian text modeling. This paper proposes a Russian term extraction method that combines multiple strategies, including Russian POS analysis, grammatical rules, and string frequency statistics, to automatically extract Russian words and multiword expressions. Experiments on the United Nations Parallel Corpus and the Taiga Corpus show that the proposed method achieves an accuracy of approximately 85%, which is much higher than that of conventional methods such as the n-gram method. The proposed method can be used to create lexicons for Russian text mining applications such as text topic discovery, text classification, and text clustering.

Key words: Russian text mining, term extraction, POS tag, frequent word-string
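
The combination of strategies named in the abstract (POS tags, grammatical rules, and word-string frequency) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_terms`, the tag names, the allowed POS patterns, the frequency threshold, and the tiny hand-tagged corpus are all assumptions made for illustration. A real pipeline would obtain the tags from a Russian morphological analyzer and would normalize inflected forms.

```python
from collections import Counter

# Hypothetical POS patterns standing in for "grammatical rules".
# These are illustrative assumptions, not the rules used in the paper.
ALLOWED_PATTERNS = {
    ("NOUN",),
    ("ADJ", "NOUN"),
    ("NOUN", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
}

def extract_terms(tagged_sentences, patterns=ALLOWED_PATTERNS,
                  min_freq=2, max_len=3):
    """Return {word_string: frequency} for candidate terms.

    A word string of 1..max_len tokens is kept only if its POS sequence
    matches one of `patterns` and it occurs at least `min_freq` times.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for n in range(1, max_len + 1):
            for start in range(len(sentence) - n + 1):
                window = sentence[start:start + n]
                tags = tuple(tag for _, tag in window)
                if tags in patterns:
                    counts[" ".join(word for word, _ in window)] += 1
    return {term: freq for term, freq in counts.items() if freq >= min_freq}

# Toy corpus, tagged by hand (an assumption; no analyzer is called here).
corpus = [
    [("извлечение", "NOUN"), ("терминов", "NOUN"), ("из", "PREP"),
     ("русских", "ADJ"), ("текстов", "NOUN")],
    [("автоматическое", "ADJ"), ("извлечение", "NOUN"), ("терминов", "NOUN")],
    [("анализ", "NOUN"), ("русских", "ADJ"), ("текстов", "NOUN")],
]

terms = extract_terms(corpus)
for term, freq in sorted(terms.items()):
    print(term, freq)
```

In this sketch the POS patterns discard word strings such as "терминов из" that cross a preposition, while the frequency threshold discards strings that match a pattern but occur only once. Plain n-gram counting would keep every frequent string regardless of its grammatical shape.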
