Techniques of Automatic Term Extraction: Current State and Reflections

doi:10.12339/j.issn.1673-8578.2022.01.001

China Terminology, 2022, Vol. 24, Issue 1: 3-13. doi: 10.12339/j.issn.1673-8578.2022.01.001

• Special Column on Computational Terminology •

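CHANG Baobao (常宝宝)

MOE Key Laboratory of Computational Linguistics, Peking University, Beijing 100871

Received: 2021-08-04  Revised: 2021-10-19  Online: 2022-01-05  Published: 2021-12-27
About the author: CHANG Baobao (born 1971), Ph.D., is an associate professor at the School of Electronics Engineering and Computer Science, Peking University, whose main research area is natural language processing. Chang has led several projects funded by the National Natural Science Foundation of China and the National Social Science Fund of China, among others, and has published nearly one hundred papers in Chinese and international conferences and journals, including top international conferences such as ACL, EMNLP, COLING, IJCAI, and AAAI. As a principal team member, Chang has received the First Prize of the Science and Technology Progress Award of the Ministry of Education, the First Prize of the Science and Technology Progress Award of the Chinese Institute of Electronics, and the Second Prize of the National Science and Technology Progress Award, among others. Contact: chbb@pku.edu.cn.
Funding:
Research project of the China National Committee for Terminology in Science and Technology, "Research on Deep Learning-Based Extraction of Scientific and Technical Terms" (2017001); National Natural Science Foundation of China project "Research on Deep Learning-Based Data-to-Text Generation" (61876004)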

Abstract

This paper gives a brief overview of the automatic term extraction (ATE) task: its definition, main approaches, and evaluation metrics. For traditional ATE approaches, it introduces the concepts of unithood and termhood, using pointwise mutual information, the t-value, tf-idf weighting, and C/NC-value as examples. For automatic term labelling, it focuses on modelling the task as sequence labelling. In terms of extraction performance, existing techniques still fall short of what practical applications require, and the paper suggests several directions worth exploring.

Keywords: automatic term extraction, automatic term labelling, unithood, termhood, machine learning
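As a rough illustration of the unithood measures named in the abstract, the sketch below scores adjacent word pairs with pointwise mutual information and a t-score. It is not the paper's implementation: the function name `bigram_association`, the toy sentence, and the use of the token count as the normalizer for both unigram and bigram probabilities are assumptions made for this example, and the t-score uses the common approximation in which the sample variance is taken to be the observed bigram probability.

```python
import math
from collections import Counter


def bigram_association(tokens):
    """Return {(x, y): (pmi, t_score)} for adjacent token pairs.

    Hypothetical helper for illustration only. Probabilities are
    maximum-likelihood estimates, with len(tokens) used as the
    normalizer for both unigrams and bigrams (a simplification).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), c_xy in bigrams.items():
        # Expected count of the pair if x and y occurred independently.
        expected = unigrams[x] * unigrams[y] / n
        # PMI = log2( P(x, y) / (P(x) * P(y)) ) = log2(observed / expected).
        pmi = math.log2(c_xy / expected)
        # t-score approximation: (observed - expected) / sqrt(observed).
        t_score = (c_xy - expected) / math.sqrt(c_xy)
        scores[(x, y)] = (pmi, t_score)
    return scores


# Toy input (an assumption, not data from the paper).
tokens = ("the neural network model uses a neural network layer "
          "and a deep model").split()
scores = bigram_association(tokens)

# Rank candidate pairs by each measure, highest first.
by_pmi = sorted(scores, key=lambda b: scores[b][0], reverse=True)
by_t = sorted(scores, key=lambda b: scores[b][1], reverse=True)
print("top pair by PMI:    ", by_pmi[0])
print("top pair by t-score:", by_t[0])
```

On this toy input the t-score ranks the repeated pair ("neural", "network") first, while PMI gives its highest score to a pair of words that each occur only once. This reflects a well-known tendency of PMI to favour rare events, which is one reason such measures are usually combined with frequency thresholds or with termhood measures.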

CLC numbers: TP391; H083

CHANG Baobao. Techniques of Automatic Term Extraction: Current State and Reflections[J]. China Terminology, 2022, 24(1): 3-13.
