Research on Automatic Extraction of Scientific Terminology from Texts Based on Self-Attention

doi:10.3969/j.issn.1673-8578.2021.02.003

China Terminology, 2021, Vol. 23, Issue 2: 20-26.

ZHAO Songge¹, ZHANG Hao², CHANG Baobao¹

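1. Institute of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
2. School of Software and Microelectronics, Peking University, Beijing 102600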

Received: 2020-12-16 Online: 2021-04-25 Published: 2021-04-07
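About the authors: ZHAO Songge (1995-), male, graduate student at the School of Electronics Engineering and Computer Science, Peking University. Research interests include term extraction and natural language generation. Contact: zhaosongge@pku.edu.cn.
ZHANG Hao (1993-), male, master's student at the School of Software and Microelectronics, Peking University. Research interests include term extraction, semantic search, and video recommendation. Contact: hao-zhang@pku.edu.cn.
CHANG Baobao (1971-), PhD, associate professor at the School of Electronics Engineering and Computer Science, Peking University. Main research area: natural language processing. Has led several projects funded by the National Natural Science Foundation of China and the National Social Science Fund of China, and has published nearly one hundred papers in domestic and international conferences and journals, including top international conferences such as ACL, EMNLP, COLING, IJCAI, and AAAI. As a principal member, has received the First Prize for Science and Technology Progress of the Ministry of Education, the First Prize for Science and Technology Progress of the Chinese Institute of Electronics, and the Second Prize of the National Science and Technology Progress Award, among others. Serves on the editorial boards of China Terminology and the Journal of Chinese Information Processing, and is a member of the Computational Linguistics Committee of the Chinese Information Processing Society of China and of the Natural Language Understanding Committee of the Chinese Association for Artificial Intelligence. Contact: chbb@pku.edu.cn.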
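Funding:
National Natural Science Foundation of China project "Research on Data-to-Text Generation Based on Deep Learning" (61876004); research project of the China National Committee for Terminology in Science and Technology, "Research on Scientific Terminology Extraction Based on Deep Learning" (2017001)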

Abstract

Scientific terms are specific words that denote particular scientific concepts. Term extraction is an important step in the automatic processing of scientific terminology and matters for downstream tasks such as machine translation, information retrieval, and question answering. Traditional manual extraction of scientific terms is labor-intensive. One automatic approach casts term extraction as a sequence labeling problem and trains a tagging model with supervised learning, but it is held back by the lack of a large-scale annotated corpus of scientific terms. This paper introduces distant supervision to generate a large-scale annotated training corpus, and proposes a Bi-LSTM architecture with a self-attention mechanism to improve extraction results. The new model is found to be far better at discovering new scientific terms than a traditional machine learning model (CRF).

Key words: scientific terminology extraction, distant supervision, self-attention

ZHAO Songge, ZHANG Hao, CHANG Baobao. Research on Automatic Extraction of Scientific Terminology from Texts Based on Self-Attention[J]. China Terminology, 2021, 23(2): 20-26.

Figures and tables (7)

Figure 1 Automatic construction of an annotated corpus using distant supervision
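
The idea behind Figure 1 can be illustrated with a minimal sketch: sentences are matched against an existing term dictionary and every match is turned into sequence labels, so no manual annotation is needed. The greedy longest-match strategy, the B/I/O tag set, the `max_len` limit, and the toy dictionary below are all assumptions made for illustration; the source does not state how the authors perform the matching.

```python
from typing import List, Set, Tuple


def distant_label(tokens: List[str],
                  term_dict: Set[Tuple[str, ...]],
                  max_len: int = 6) -> List[str]:
    """Tag tokens with B/I/O by greedy longest match against a term dictionary.

    Hypothetical helper: the matching strategy is an assumption, not the
    procedure described in the paper.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so a short term nested inside a
        # longer dictionary term does not split it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in term_dict:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


# Toy dictionary and sentence, invented for this example.
term_dict = {("self", "attention"), ("neural", "network"), ("attention",)}
tokens = ["a", "self", "attention", "neural", "network", "model"]
tags = distant_label(tokens, term_dict)
print(list(zip(tokens, tags)))
```

Labels produced this way are noisy: terms missing from the dictionary are tagged `O`, which is one reason a model that generalizes beyond the dictionary is of interest.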

Figure 2 Schematic of the single-layer Bi-LSTM extraction model

Figure 3 Self-attention Bi-LSTM extraction model
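
As a rough illustration of the self-attention step named in Figure 3, the sketch below applies scaled dot-product self-attention to a list of per-token hidden vectors. It is written under several assumptions: the source does not say which attention variant is used, the learned query/key/value projections of a real model are omitted, and the `hidden` vectors are hand-written stand-ins for Bi-LSTM outputs.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(hidden: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product self-attention with queries = keys = values = hidden.

    Returns (contexts, weights): weights[i][j] is how strongly token i
    (the query) attends to token j.
    """
    d = len(hidden[0])
    contexts: List[Vector] = []
    weights: List[Vector] = []
    for q in hidden:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in hidden]
        w = softmax(scores)
        weights.append(w)
        # Context vector: attention-weighted sum of all hidden vectors.
        contexts.append([sum(w[j] * hidden[j][c] for j in range(len(hidden)))
                         for c in range(d)])
    return contexts, weights


# Stand-in hidden states for a three-token sentence (illustrative values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = self_attention(hidden)
for row in weights:
    print([round(x, 3) for x in row])
```

Each row of `weights` is a probability distribution over the sentence for one query token, which is the kind of matrix an attention heat map displays.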

Dataset	Count
Training set	60 000
Test set	100
Terms in the scientific terminology dictionary	64 748

Table 1 Size of the annotated corpus

CRF template	Unigram	Bigram	Number of template features	$F_{term}$
Template 1	17	1	15190338	0.505
Template 2	23	1	15452469	0.508
Template 3	41	1	30641751	0.506
Template 4	59	1	42838950	0.511
Template 5	95	1	155853543	0.501

Table 2 CRF template feature selection and results

Model	$F_{term}$
CRF template 4	0.511
1Layer_BiLSTM	0.586
2Layer_BiLSTM	0.626
3Layer_BiLSTM	0.633
S_att 1Layer_BiLSTM	0.601
S_att 2Layer_BiLSTM	0.639
S_att 3Layer_BiLSTM	0.637

Table 3 Comparison of model results
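
The tables report a term-level F score, but this page does not define it. One plausible reading, shown here purely as an assumption, is an F1 over exactly matched term spans decoded from B/I/O tags. The helper names `bio_to_spans` and `term_f1` are hypothetical and the tag sequences are made up.

```python
from typing import List, Set, Tuple


def bio_to_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Decode B/I/O tags into a set of (start, end) spans, end exclusive."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag in ("B", "O") and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            start = i  # tolerate an I that has no preceding B
    return spans


def term_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over exactly matching term spans (assumed definition of the F score)."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Made-up tag sequences: the second predicted term is cut one token short.
gold = ["O", "B", "I", "B", "I", "O"]
pred = ["O", "B", "I", "B", "O", "O"]
print(term_f1(gold, pred))
```

Under this definition a partially recovered term counts as both a false positive and a false negative, so the score is stricter than token-level accuracy.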

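Figure 4 Visualization of self-attention (the X axis shows the sentence and the Y axis shows the query word in the attention mechanism; color encodes the attention weight, and the darker the color, the larger the weight)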
