Problem description
Is there an option in Stanford CoreNLP's tokenizer to prevent tokens from containing a space?
E.g. if the sentence is "my phone is 617 1555-6644", the substring "617 1555" should be split into two different tokens.
I am aware of the option normalizeSpace, but I don't want tokens to contain any space, including non-breaking spaces.
Recommended answer
You can try setting the tokenize.whitespace option to true, but this will tokenize always and only on whitespace. For example, "it's" will no longer tokenize to "it 's".
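To make the trade-off concrete, here is a minimal Python sketch of what "always and only on whitespace" means. It does not call CoreNLP and is not CoreNLP's implementation; the helper name whitespace_tokenize is hypothetical, and whether CoreNLP's own whitespace tokenizer treats a non-breaking space as whitespace is not something this sketch establishes.

```python
def whitespace_tokenize(text):
    """Hypothetical helper: split on runs of whitespace and apply no other rule."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space U+00A0, and drops empty strings.
    return text.split()


# The phone number is broken at the space, so no token contains one.
phone_tokens = whitespace_tokenize("my phone is 617 1555-6644")
print(phone_tokens)

# The cost: clitics are no longer separated, so "it's" stays one token.
clitic_tokens = whitespace_tokenize("it's here")
print(clitic_tokens)
```

With this kind of splitting no token can contain a space, but every tokenization rule that does not depend on whitespace (clitics, punctuation attached to words) is lost, which is the limitation the answer warns about.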