Problem description
I want to use OpenNLP to tokenize Thai words. I downloaded OpenNLP and the Thai tokenizer model and ran the following:
./bin/opennlp POSTagger -lang th -model thai.tok.bin < sentence.txt > output.txt
I put the thai.tok.bin that I downloaded in the directory I run the command from. sentence.txt contains this text: กินอะไรยังนาย. However, this is the only output I got:
Usage: opennlp POSTagger model < sentences
Execution time: 0.000 seconds
I'm pretty new to OpenNLP. Please let me know if anyone knows how to get output from it.
Recommended answer
The models from your link are outdated. First, you need some manual steps to convert the model.
- Download the file thai.tok.bin.gz and extract it to an empty folder. Rename the extracted file thai.tok.bin to token.model.
- In the same folder, create a file named manifest.properties with the following contents:
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=TokenizerME
useAlphaNumericOptimization=false
Now you can zip the files. If you are using Linux, you can use the following command: zip thai.tok.bin token.model manifest.properties
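The tokenizer conversion steps above can be sketched as a small shell script. This is only a sketch under assumptions: the function names `write_tokenizer_manifest` and `build_tokenizer_model` and the working-folder argument are made up for illustration, and it assumes `gunzip` and `zip` are installed.

```shell
#!/bin/sh
# Sketch: repackage the old Thai tokenizer model in the zip layout described above.
# The function names are hypothetical helpers, not part of OpenNLP.

# Write manifest.properties with the keys from the step above
# (Manifest-Version is written as plain 1.0).
write_tokenizer_manifest() {
    # $1: folder to write manifest.properties into
    cat > "$1/manifest.properties" <<'EOF'
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=TokenizerME
useAlphaNumericOptimization=false
EOF
}

# Unpack the download, name the model token.model, add the manifest, zip both.
build_tokenizer_model() {
    # $1: path to thai.tok.bin.gz, $2: empty working folder
    src=$1
    work=$2
    mkdir -p "$work" || return 1
    # Decompress straight into the new file name instead of renaming afterwards.
    gunzip -c "$src" > "$work/token.model" || return 1
    write_tokenizer_manifest "$work" || return 1
    # Zip from inside the folder so both entries sit at the archive root.
    (cd "$work" && zip -q thai.tok.bin token.model manifest.properties)
}
```

Running `build_tokenizer_model thai.tok.bin.gz thai-token.bin` from the download folder should leave the repackaged model at thai-token.bin/thai.tok.bin.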
Try your model:
sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt
Loading Tokenizer model ... done (0,097s)
กินอะไร ยังนาย
Average: 333,3 sent/s
Total: 1 sent
Runtime: 0.003s
Execution time: 0,108 seconds
Now that you have the updated tokenizer, you can do the same with the POS tagger model.
- Download the file thai.tag.bin.gz and extract it to an empty folder. Rename the extracted file thai.tag.bin to pos.model.
- In the same folder, create a file named manifest.properties with the following contents:
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=POSTaggerME
Now you can zip the files. If you are using Linux, you can use the following command: zip thai.pos.bin pos.model manifest.properties
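A misnamed model file or a wrong Component-Name may otherwise only surface when OpenNLP tries to load the archive, so a quick check of the folder before zipping can help. The sketch below is one way to do that; `check_model_dir` is a hypothetical helper, not an OpenNLP tool, and it only checks the two file names and the Component-Name line given in the steps above.

```shell
#!/bin/sh
# Sketch: sanity-check a model folder before zipping it.
# check_model_dir is a hypothetical helper, not part of OpenNLP.
check_model_dir() {
    # $1: folder, $2: expected model file name, $3: expected Component-Name
    dir=$1
    model=$2
    component=$3
    [ -f "$dir/$model" ] || { echo "missing $model" >&2; return 1; }
    [ -f "$dir/manifest.properties" ] || {
        echo "missing manifest.properties" >&2
        return 1
    }
    # -x matches whole lines and -F disables regex, so stray characters fail too.
    grep -qxF "Component-Name=$component" "$dir/manifest.properties" || {
        echo "manifest does not declare Component-Name=$component" >&2
        return 1
    }
    echo "ok: $dir"
}
```

For the POS tagger folder this would be called as `check_model_dir . pos.model POSTaggerME`, and for the tokenizer folder as `check_model_dir . token.model TokenizerME`.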
Finally, we can try the two models combined:
sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt > thai_tokens.txt
sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt
The result is:
กินอะไร_VACT ยังนาย_NCMN
Please let me know if this is the expected result.