China Terminology (中国科技术语), 2021, Vol. 23, Issue 3: 59-67. doi: 10.12339/j.issn.1673-8578.2021.03.009

A Multi-Strategy Fusion Method for Extracting Terms from Russian Text

TANG Juxiang1, SUN Yihui1, LIAO Xiao2, LIU Jianguo3, YU Juan1

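  1. School of Economics and Management, Fuzhou University, Fuzhou, Fujian 350108
  2. School of Internet Finance and Information Engineering, Guangdong University of Finance, Guangzhou, Guangdong 510521
  3. Institute of Accounting and Finance, Shanghai University of Finance and Economics, Shanghai 200433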
  • Received: 2021-05-11 Online: 2021-07-05 Published: 2021-06-28
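  • About the authors: TANG Juxiang (b. 1996), female, is a master's student at the School of Economics and Management, Fuzhou University. Her research interests are data mining and business intelligence. Contact: 1767365964@qq.com
    YU Juan (b. 1981), female, PhD, is a professor at the School of Economics and Management, Fuzhou University, and a member of the Data Science and Knowledge Systems Engineering Committee of the Systems Engineering Society of China. Her main research areas are data mining and information and knowledge management systems, and she has led and completed several projects funded by the National Natural Science Foundation of China and the National Social Science Fund of China. Contact: yujuan@fzu.edu.cn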
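  • Funding:
    National Natural Science Foundation of China project "Research on Methods for Fusing Heterogeneous Organizational Data Based on Ontology Learning and Ontology Mapping" (71771054)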

Extracting Terms from Russian Texts Based on Multi Strategies

TANG Juxiang1, SUN Yihui1, LIAO Xiao2, LIU Jianguo3, YU Juan1

  • Received: 2021-05-11 Online: 2021-07-05 Published: 2021-06-28

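Abstract (translated from the Chinese):

Russian is one of the working languages of the United Nations and the official language of Russia and several other countries. As the Belt and Road Initiative advances and globalization accelerates, Russian text data has become an important source of information for the managerial decisions of the organizations concerned, and Russian text mining has accordingly become an important decision-support method. Research on Russian text mining methods, however, is still far from mature. In particular its key foundation, the extraction of terms from Russian text, performs poorly, which limits the accuracy of Russian text modeling. This paper therefore proposes a multi-strategy fusion method for extracting terms from Russian text. It combines Russian part-of-speech analysis, grammatical rules, and string-frequency statistics to automatically extract Russian terms, including both single words and phrases. Experiments on the United Nations Parallel Corpus and the Taiga Corpus show that the proposed method reaches a precision above 85% while maintaining high recall, significantly outperforming the commonly used n-gram method, and that it can supply an effective lexicon for text mining applications such as topic discovery, text classification, and text clustering.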

Keywords (translated from the Chinese): Russian text mining, term extraction, part-of-speech tagging, frequent word strings

Abstract:

Russian is one of the working languages of the United Nations and the official language of several countries, including Russia. With the advancement of the Belt and Road Initiative and the acceleration of globalization, Russian text data has become an important information resource for the managerial decision-making of related organizations, and Russian text mining has thus become an important decision-support method. However, Russian text mining methods are still far from mature. In particular, Russian term extraction, the step on which they depend, performs poorly, which limits the accuracy of Russian text modeling. This paper proposes a Russian term extraction method that combines multiple strategies, including Russian POS analysis, grammatical rules, and string frequency statistics, to automatically extract Russian words and multiword expressions. Experiments on the United Nations Parallel Corpus and the Taiga Corpus show that the proposed method achieves a precision above 85% while maintaining a high recall rate, significantly outperforming the commonly used n-gram method. The proposed method can be used to create lexicons for Russian text mining applications such as text topic discovery, text classification, and text clustering.

Keywords: Russian text mining, term extraction, POS tagging, frequent word-string
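
The abstract names three ingredients: POS analysis, grammatical rules, and string frequency statistics. The following minimal Python sketch shows one way such strategies could be combined. It is not the authors' implementation: the POS patterns, the `min_freq` threshold, the function name `extract_terms`, and the hand-tagged toy sentences are all assumptions made for illustration. A real pipeline would obtain the tags from a Russian POS tagger rather than from hard-coded pairs.

```python
from collections import Counter

# Hypothetical grammatical rules: POS sequences allowed for a candidate term.
# These patterns are an illustrative assumption, not the paper's rule set.
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}


def extract_terms(tagged_sentences, min_freq=2, max_len=2):
    """Count word strings whose POS sequence matches a pattern,
    then keep only those occurring at least min_freq times."""
    counts = Counter()
    for sent in tagged_sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                window = sent[i:i + n]
                tags = tuple(tag for _, tag in window)
                if tags in PATTERNS:
                    counts[" ".join(word for word, _ in window)] += 1
    return {term: c for term, c in counts.items() if c >= min_freq}


# Toy, hand-tagged input (lower-cased surface forms); tags are assumed given.
corpus = [
    [("генеральная", "ADJ"), ("ассамблея", "NOUN"),
     ("приняла", "VERB"), ("резолюцию", "NOUN")],
    [("генеральная", "ADJ"), ("ассамблея", "NOUN"),
     ("одобрила", "VERB"), ("доклад", "NOUN")],
]

terms = extract_terms(corpus)
print(terms)
```

In this sketch the POS filter discards strings such as "ассамблея приняла" that a plain n-gram counter would keep, and the frequency threshold drops words seen only once. Because the sketch counts surface forms, inflected variants of one word are counted separately unless the input is lemmatized first.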

Chinese Library Classification (CLC) number: