How do I set delimiters for the PTB tokenizer?
Question
I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement like this:
go to room no. #2145
or
go to room no. *2145
the tokenizer splits #2145 into two tokens: # and 2145. Is there any way to configure the tokenizer so that it doesn't treat # and * as delimiters?
Recommended answer
A quick solution is to use this option:
(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");
This will cause the tokenizer to just tokenize on white space. Do you need it to do anything other than tokenize on white space?
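To make the trade-off concrete, here is a minimal plain-Java sketch of what whitespace-only tokenization produces for the two example sentences. It does not call CoreNLP; the class name WhitespaceTokenizeSketch and the helper tokenize are hypothetical and only approximate the effect of the option by splitting on runs of whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WhitespaceTokenizeSketch {

    // Hypothetical helper (not part of CoreNLP): splits on runs of whitespace only,
    // so symbols such as '#' and '*' stay attached to the text that follows them.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("go to room no. #2145")); // [go, to, room, no., #2145]
        System.out.println(tokenize("go to room no. *2145")); // [go, to, room, no., *2145]
        // Side effect: nothing else is split either, e.g. contractions stay whole.
        System.out.println(tokenize("I'm in room #2145"));    // [I'm, in, room, #2145]
    }
}
```

Under this approach #2145 and *2145 each remain a single token, but punctuation and contractions are no longer separated from words, which is why the answer asks whether whitespace splitting alone is enough.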