China Terminology ›› 2022, Vol. 24 ›› Issue (1): 14-25. doi: 10.12339/j.issn.1673-8578.2022.01.002

• Special Column: Computational Terminology •

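Bilingual Terminology Alignment Based on Chinese-English Monolingual Terminological Bank

XIANG Lu1,2, ZHOU Yu1,2,3, ZONG Chengqing1,2

  1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190
  2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049
  3. Fanyu AI Research Institute / Beijing Zhongke Fanyu Technology Co., Ltd. (凡语AI研究院/北京中科凡语科技有限公司), Beijing 100080
  • Received: 2021-07-30 Revised: 2021-10-09 Online: 2022-01-05 Published: 2021-12-27
  • About the authors: XIANG Lu (born 1988), female, is a PhD candidate at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Her main research interests are human-machine dialogue systems, text generation, and natural language processing. Contact: lu.xiang@nlpr.ia.ac.cn
    ZONG Chengqing (born 1963), male, PhD, is a research professor at the Institute of Automation, Chinese Academy of Sciences, a chair professor at the University of Chinese Academy of Sciences, a Fellow of the China Computer Federation, and a Fellow of the Chinese Association for Artificial Intelligence. He works mainly on natural language processing and machine translation, has published the monographs 《统计自然语言处理》 (Statistical Natural Language Processing) and 《文本数据挖掘》 (Text Data Mining; Chinese and English editions), and has published more than 200 papers. Contact: cqzong@nlpr.ia.ac.cn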

Abstract:

Bilingual terminology banks are an important resource in natural language processing and matter for multilingual applications such as cross-lingual information retrieval and machine translation. Bilingual term pairs are usually obtained either by human translation or by automatic extraction from bilingual parallel corpora. However, human translation requires domain expertise and is time-consuming and labor-intensive, and large domain-specific parallel corpora are hard to come by. Monolingual terminology banks in different languages for the same domain, by contrast, are relatively easy to obtain. This paper therefore proposes a method that builds a bilingual term list by automatically aligning terms from the monolingual terminology banks of two languages. First, multiple online machine translation engines translate each source term, and a voting mechanism selects a target-side "pseudo" term. Second, the pseudo term is used as a query to retrieve a set of candidate terms from the target-language terminology bank. Finally, an mBERT-based semantic matching model re-ranks the candidates to yield the final bilingual term pair. Experiments on Chinese-English terminology alignment in three domains (computer science, civil engineering, and medicine) show that the proposed method improves the accuracy of bilingual term extraction.
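The three stages named in the abstract (voting, retrieval, re-ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the machine translation outputs are passed in as plain strings rather than fetched from real engines, retrieval uses character-trigram Jaccard similarity as an assumed stand-in for whatever retrieval the paper uses, and the mBERT-based matcher is replaced by a pluggable `scorer` callable.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set


def vote_pseudo_term(translations: Sequence[str]) -> str:
    """Majority vote over engine outputs (ties go to the first one seen)."""
    normalized = [t.strip().lower() for t in translations if t.strip()]
    if not normalized:
        raise ValueError("no translations to vote on")
    counts = Counter(normalized)
    best = max(counts.values())
    return next(t for t in normalized if counts[t] == best)


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams (assumed retrieval metric)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def retrieve_candidates(pseudo_term: str, term_bank: Sequence[str],
                        top_k: int = 3) -> List[str]:
    """Return the top_k bank terms most similar to the pseudo term."""
    ranked = sorted(term_bank, key=lambda t: jaccard(pseudo_term, t),
                    reverse=True)
    return ranked[:top_k]


def align_term(source_term: str, translations: Sequence[str],
               term_bank: Sequence[str],
               scorer: Callable[[str, str], float],
               top_k: int = 3) -> str:
    """Vote, retrieve, then re-rank candidates with the given scorer.

    `scorer(source_term, candidate)` stands in for the mBERT-based
    semantic matching model; any cross-lingual scorer can be plugged in.
    """
    pseudo = vote_pseudo_term(translations)
    candidates = retrieve_candidates(pseudo, term_bank, top_k)
    return max(candidates, key=lambda c: scorer(source_term, c))
```

In use, `translations` would hold the outputs of several online engines for one source term, `term_bank` the target-language monolingual terminology bank, and `scorer` a wrapper around a fine-tuned multilingual matching model. Keeping the scorer as a parameter lets the cheap voting and retrieval stages be tested without loading any neural model.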

Keywords: bilingual terminology, monolingual terminological bank, terminology alignment, semantic matching
