Word Embedding: An Important Term in Computational Linguistics

China Terminology, 2020, Vol. 22, Issue 3: 24-32. doi: 10.3969/j.issn.1673-8578.2020.03.004

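LU Xiaolei, WANG Fanke

Xiamen University, Xiamen, Fujian 361005

Received: 2020-01-02  Revised: 2020-05-17  Online: 2020-06-25  Published: 2020-07-20
About the author: LU Xiaolei (b. 1988), female, PhD, assistant professor at Xiamen University; main research interest: computational linguistics. Contact: luxiaolei@xmu.edu.cn.
Funding:
Youth Project of the Humanities and Social Sciences Fund of the Ministry of Education, "Construction and Application of a Cloud Platform for Foreign-related Legal Machine Translation under the 'Belt and Road' Strategy" (18YJCZH117); Fujian Provincial Education and Research Project for Young and Middle-aged Teachers, "Construction of a Corpus-based Cloud Platform for Legal English Teaching" (JZ180061); Fundamental Research Funds for the Central Universities project, "Research on Machine Translation Based on Semantic Models" (20720191053)

English title: Word Embedding: Concepts and Applications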

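Abstract (translated from the Chinese):

Over the past few years, natural language processing (NLP) has advanced rapidly, and text representation has become central to computational linguistics. Distributed word-vector representations in particular have shown great potential and practical effectiveness in expressing semantics. Starting from its foundations in linguistic theory, this article introduces word embedding, an important term in computational linguistics. It discusses two ways of representing word vectors, discrete and distributed, and describes applications of word embeddings in diachronic linguistics, such as the study of semantic change. On that basis, it points out the limitations of computing semantics with word vectors and summarizes two approaches to word-sense disambiguation: unsupervised and knowledge-base-driven. Finally, the article suggests that combining large-scale knowledge bases with word embeddings may be an important direction for future research on text representation.

Abstract (original English):

This article focuses on word embedding, a feature-learning technique in natural language processing that maps words or phrases to low-dimensional vectors. Beginning with the linguistic theories concerning contextual similarity, namely the “distributional hypothesis” and the “context of situation”, the article introduces two ways of representing text numerically: one-hot and distributed representation. It then presents statistics-based language models (such as the co-occurrence matrix and singular value decomposition) as well as neural network language models (NNLM, such as continuous bag-of-words and skip-gram). It also analyzes how word embedding can be applied to the study of word-sense disambiguation and diachronic linguistics.

Keywords: natural language processing, text representation, word embedding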

Chinese Library Classification (CLC) number: H083; TP391.1

Cite this article: LU Xiaolei, WANG Fanke. Word Embedding: Concepts and Applications[J]. China Terminology, 2020, 22(3): 24-32.

	never	trouble	until	you	sleep	is	a	friend
never	0	2	0	0	1	0	0	0
trouble	2	2	2	1	0	1	0	0
until	0	2	0	0	0	0	0	0
you	0	1	0	0	0	0	0	0
sleep	1	0	0	0	0	0	0	0
is	0	1	0	0	0	0	1	0
a	0	0	0	0	0	1	0	1
friend	0	0	0	0	0	0	1	0
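
The abstract names the co-occurrence matrix and singular value decomposition (SVD) as statistics-based ways of obtaining word vectors. The following is a minimal sketch of that idea, assuming NumPy is available; the counts are copied from the co-occurrence table above, while the function name `svd_embeddings` and the choice of two dimensions are illustrative assumptions rather than the authors' own settings.

```python
import numpy as np

# Vocabulary and counts copied from the co-occurrence table above.
vocab = ["never", "trouble", "until", "you", "sleep", "is", "a", "friend"]
cooc = np.array([
    [0, 2, 0, 0, 1, 0, 0, 0],
    [2, 2, 2, 1, 0, 1, 0, 0],
    [0, 2, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
], dtype=float)


def svd_embeddings(matrix, k):
    """Return one k-dimensional vector per row via truncated SVD (hypothetical helper)."""
    u, s, _vt = np.linalg.svd(matrix)
    # Keep the k largest singular directions and scale them by their singular values.
    return u[:, :k] * s[:k]


K = 2  # illustrative target dimensionality, not taken from the article
vectors = svd_embeddings(cooc, K)

for word, vec in zip(vocab, vectors):
    print(f"{word:8s} {np.round(vec, 3)}")
```

Each word's sparse 8-dimensional row of counts is compressed into a dense 2-dimensional vector, which is the step that turns a count table into a distributed representation.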

from gensim.models import Word2Vec  # import the Word2Vec class

model = Word2Vec.load("word60.model")  # load the trained 60-dimensional word-vector model
model.wv.most_similar("语言学")  # get the words most closely related to “语言学” (linguistics)

Word		Relatedness (0~1)
语言文学 (language and literature)		0.7331274747848511
语义学 (semantics)		0.7309782505035410
语用学 (pragmatics)		0.7293801307678223
语音学 (phonetics)		0.7213563323020935
语法学 (grammar)		0.7048327326774597
文体学 (stylistics)		0.7019374966621399
词汇学 (lexicology)		0.6951244473457336
翻译学 (translation studies)		0.6944329142570496
	……
竺可桢 (Zhu Kezhen)		0.3720016777515411
分配律 (distributive law)		0.3720000088214874
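
The relatedness scores listed above come from gensim's `most_similar`, which ranks words by the cosine similarity of their vectors. The sketch below shows that computation on its own, assuming NumPy; the three-dimensional vectors and the English labels in `toy_vectors` are invented for illustration and are not values from the `word60.model` file, and `rank_by_similarity` is a hypothetical helper, not part of gensim.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_by_similarity(query, vectors, topn=3):
    """Rank every other word by cosine similarity to `query` (hypothetical helper)."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, invented purely for illustration.
toy_vectors = {
    "linguistics": [0.9, 0.8, 0.1],
    "semantics": [0.8, 0.9, 0.2],
    "phonetics": [0.7, 0.6, 0.3],
    "distributive_law": [0.1, 0.2, 0.9],
}

ranked = rank_by_similarity("linguistics", toy_vectors)
for word, score in ranked:
    print(f"{word}\t{score:.4f}")
```

In this toy data, as in the results above, words from the same field point in nearly the same direction and score close to 1, while an unrelated term such as a mathematical law scores much lower.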
