Problem description
- Development environment:
- Red Hat Linux 6.5, gcc 5.0, CRF++ 0.58
- template
- Boson_train.txt
- Boson_test.txt
- The first column is the word, the second column is the POS tag, and the third column is the NER tag.
- When I try to train the NER model, I run the command "crf_learn -f 3 -c 4.0 template Boson_train crf_model" and get this message: "reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". I don't understand C++, so I can't fix the problem myself.
- 1. Changed the encoding of the dataset. I used Notepad++ to convert it from "UTF-8 without BOM" to "UTF-8". It didn't work.
- 2. Changed the delimiter from '\t' to ' ' (space). It didn't work.
- 3. I thought the template might be wrong, so I tested with crf++0.58/example/seg/template. It worked. But that template is simple, so I tried /example/JapaneseNE/template, which is more similar to my feature template. It didn't work. Then I ran the JapaneseNE example itself, and it works well. So I'm confused. Can anyone help me?
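Before changing encodings or delimiters, it may help to confirm that every token line in the training file has the same number of columns, since CRF++ expects a fixed column count across all tokens. The following is only a minimal sketch: the helper names `column_counts` and `inconsistent_lines` are hypothetical, and it assumes columns are separated by spaces or tabs and sentences by blank lines.

```python
from collections import Counter


def column_counts(lines):
    """Count how many token lines have each number of whitespace-separated columns."""
    counts = Counter()
    for line in lines:
        if not line.strip():  # blank lines separate sentences; skip them
            continue
        counts[len(line.split())] += 1
    return counts


def inconsistent_lines(lines):
    """Return (line_number, text) for token lines whose column count
    differs from the most common column count in the data."""
    counts = column_counts(lines)
    if not counts:
        return []
    expected = counts.most_common(1)[0][0]
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if line.strip() and len(line.split()) != expected
    ]


if __name__ == "__main__":
    # Small in-memory sample; in practice, read the lines from Boson_train.txt.
    sample = [
        "浙江 ns B_product_name\n",
        "在线 b I_product_name\n",
        "\n",
        "也 d\n",  # missing its tag column
    ]
    for number, text in inconsistent_lines(sample):
        print(f"line {number}: {text!r}")
```

An empty result only says the column counts are uniform; it does not rule out a problem in the template itself.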
Template
- U00:%x[-2,0]
- U01:%x[-1,0]
- U02:%x[0,0]
- U03:%x[1,0]
- U04:%x[2,0]
- U05:%x[-2,0]/%x[-1,0]/%x[0,0]
- U06:%x[-1,0]/%x[0,0]/%x[1,0]
- U07:%x[0,0]/%x[1,0]/%x[2,0]
- U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
U10:%x[-2,1]/%x[0,1]
U19:%x[2,0]/%x[2,1]
U20:%x[-1,2]
- 浙江 ns B_product_name
- 在线 b I_product_name
- 杭州 ns I_product_name
- 4 m B_time
- 月 m I_time
- 25 m I_time
- 日 m I_time
- 讯 ng Out
- ( x Out
- 记者 n Out
- x Out
- x B_person_name
- 施宇翔 nr I_person_name
- x Out
- 通讯员 n B_person_name
- x Out
- 方英 nr B_person_name
- ) x Out
- 毒贩 n Out
- 很 zg Out
- " x Out
- 时髦 nr Out
- " x Out
- , x Out
- 用 p Out
- 微信 vn B_product_name
- 交易 n Out
- 毒品 n Out
- 。 x Out
- 没 v Out
- 料想 v Out
- 警方 n B_person_name
- 也 d Out
Recommended answer
You were debugging in the right direction. The issue is indeed with your template file.
Your training data has 3 columns (column 0: word, column 1: pos-tag, and column 2: tag).
You cannot use the tag as a feature, but your template file references it (i.e., column 2) in many feature definitions (see U20 to U29). Your training should work after removing or correcting these.
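To find the offending definitions quickly, one option is to scan the template for `%x[row,col]` macros whose column index points at the tag column. The sketch below is illustrative only: the function name `offending_features` and the regular expression are assumptions, not part of CRF++, and it assumes the tag is the last column of the training data.

```python
import re

# Matches CRF++ template macros such as %x[-1,2] and captures (row, column).
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def offending_features(template_lines, n_columns):
    """Return template lines that reference the tag column or beyond.

    n_columns is the number of columns in the training data; the last
    column (index n_columns - 1) is assumed to hold the tag to predict.
    """
    tag_column = n_columns - 1
    bad = []
    for line in template_lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        columns = [int(col) for _row, col in MACRO.findall(line)]
        if any(col >= tag_column for col in columns):
            bad.append(line)
    return bad


if __name__ == "__main__":
    # A few definitions taken from the template in the question.
    template = [
        "U09:%x[0,0]/%x[1,0]",
        "U10:%x[-2,1]/%x[0,1]",
        "U19:%x[2,0]/%x[2,1]",
        "U20:%x[-1,2]",
    ]
    for line in offending_features(template, n_columns=3):
        print("references the tag column:", line)
```

With 3-column data, only columns 0 and 1 are usable as features, so any line this reports would need to be removed or rewritten to use the word or POS column.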
Hope this helps :)
You can also check out these video tutorials for a better understanding of template files and training NER with CRF++:
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4