Sequence-Labeling Chinese Word Segmentation Based on LSTM Networks

Bibliographic Details
Published in: 计算机应用研究 (Application Research of Computers), Vol. 34, No. 5, pp. 1321-1324
Main Authors: 任智慧, 徐浩煜, 封松林, 周晗, 施俊
Format: Journal Article
Language: Chinese
Published: 2017
Author Affiliations: School of Communication and Information Engineering, Shanghai University, Shanghai 200444; University of Chinese Academy of Sciences, Beijing 100049; Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210
ISSN: 1001-3695
DOI: 10.3969/j.issn.1001-3695.2017.05.009

More Information
Summary: The currently dominant approaches to Chinese word segmentation are traditional machine learning methods based on character tagging, but these methods require features to be manually configured and extracted from Chinese text, suffer from high-dimensional feature dictionaries, and take a long time to train on CPUs. To address these problems, this paper proposes an improved method based on an LSTM (long short-term memory) network model that performs Chinese word segmentation with different word-position tag sets and pre-trained character embeddings. Comparative experiments on corpora commonly used in Chinese word segmentation evaluations show that the LSTM-based method outperforms current traditional machine learning methods; a six-tag word-position set combined with pre-trained character embeddings achieves the relatively best segmentation performance; training deep neural network models on GPUs greatly reduces training time; and the LSTM-based method can readily be generalized to other sequence labeling tasks in natural language processing.
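To make the six-tag word-position labeling mentioned in the summary concrete, the sketch below converts a pre-segmented sentence into per-character tags. It assumes the commonly cited six-tag set {B, B2, B3, M, E, S}; the paper's exact tag inventory and the helper names here (`word_to_tags`, `sentence_to_tags`) are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch: map each word of a segmented sentence to per-character
# tags under a six-tag scheme {B, B2, B3, M, E, S}. The paper's exact tag set
# may differ; this follows a commonly used six-tag convention.

def word_to_tags(word: str) -> list[str]:
    """Return one word-position tag per character of a single word."""
    n = len(word)
    if n == 1:
        return ["S"]                        # single-character word
    if n == 2:
        return ["B", "E"]                   # begin, end
    if n == 3:
        return ["B", "B2", "E"]             # begin, second character, end
    if n == 4:
        return ["B", "B2", "B3", "E"]
    # longer words: B, B2, B3, then M for the remaining middle, E at the end
    return ["B", "B2", "B3"] + ["M"] * (n - 4) + ["E"]

def sentence_to_tags(words: list[str]) -> tuple[list[str], list[str]]:
    chars, tags = [], []
    for w in words:
        chars.extend(list(w))
        tags.extend(word_to_tags(w))
    return chars, tags

# Example: a sentence already segmented into words
chars, tags = sentence_to_tags(["我们", "喜欢", "自然语言处理"])
print(list(zip(chars, tags)))
```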
Bibliography: Currently, the dominant state-of-the-art methods for Chinese word segmentation are character tagging methods based on traditional machine learning technology. However, traditional machine learning methods have several disadvantages: features must be artificially configured and extracted from Chinese texts, the dictionary dimension is high, and training on CPUs alone is time-consuming. This paper proposed an improved method based on a long short-term memory (LSTM) network model, which used different tag sets and added pre-trained character embeddings to perform Chinese word segmentation. Compared with the best results in Bakeoff and state-of-the-art methods, this paper conducted experiments on common evaluation corpora. The results demonstrate that traditional machine learning methods are exceeded by the LSTM network-based method. By using a six-tag set and adding pre-trained character embeddings, the proposed method reaches the relatively highest performance on Chinese word segmentation, and training deep neural network models on GPUs greatly reduces training time. Moreover, the LSTM-based method can easily be generalized and applied to other sequence labeling tasks in natural language processing.
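As a rough illustration of the architecture the abstract describes, here is a minimal LSTM character-tagging model with an embedding layer that can be initialized from pre-trained character embeddings, written in PyTorch. The framework choice, the class name `LSTMSegmenter`, and all hyper-parameters are assumptions made for the sketch, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an LSTM sequence labeler for
# character-based Chinese word segmentation with a pre-trainable embedding layer.
import torch
import torch.nn as nn

class LSTMSegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags, pretrained=None):
        super().__init__()
        # Character embedding layer; optionally initialized from pre-trained
        # character embeddings (a (vocab_size, embed_dim) tensor).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if pretrained is not None:
            self.embed.weight.data.copy_(pretrained)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)   # per-character tag scores

    def forward(self, char_ids):                     # char_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))       # (batch, seq_len, hidden_dim)
        return self.out(h)                           # (batch, seq_len, num_tags)

# Toy usage: a batch of 2 sentences, 5 characters each, six output tags.
device = "cuda" if torch.cuda.is_available() else "cpu"   # GPU shortens training
model = LSTMSegmenter(vocab_size=5000, embed_dim=100, hidden_dim=128, num_tags=6).to(device)
char_ids = torch.randint(0, 5000, (2, 5), device=device)
gold_tags = torch.randint(0, 6, (2, 5), device=device)
scores = model(char_ids)
loss = nn.CrossEntropyLoss()(scores.reshape(-1, 6), gold_tags.reshape(-1))
loss.backward()
print(scores.shape, float(loss))
```

Moving the model and tensors to a GPU, as in the toy usage above, reflects the abstract's point that GPU training greatly reduces the training time of the deep neural network model.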