Question
Is there a way to provide the PTBTokenizer with a set of delimiter characters to use when splitting tokens?

I was testing the behaviour of this tokenizer and realized that there are some characters, like the vertical bar '|', for which the tokenizer divides a substring into two tokens, and others, like the slash or the hyphen, for which it returns a single token.
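For reference, here is a minimal sketch of how one might inspect this behaviour with PTBTokenizer. The constructor usage follows the CoreNLP documentation; the sample strings are illustrative, and the exact splits can vary between CoreNLP versions:

```java
import java.io.StringReader;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        // Illustrative inputs: per the question, '|' is split off,
        // while '/' and '-' are kept inside a single token.
        String[] samples = { "foo|bar", "foo/bar", "foo-bar" };
        for (String s : samples) {
            // Empty options string -> default tokenizer behaviour
            PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                    new StringReader(s), new CoreLabelTokenFactory(), "");
            StringBuilder out = new StringBuilder(s + " ->");
            while (tokenizer.hasNext()) {
                out.append(" [").append(tokenizer.next().word()).append("]");
            }
            System.out.println(out);
        }
    }
}
```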
Answer
There's no simple way to do this with the PTBTokenizer, no. You can do some pre-processing and post-processing to get what you want (a post-processing sketch follows the list below), though there are two concerns worth mentioning:
- All models distributed with CoreNLP are trained on the standard tokenizer behavior. If you change how the input to these later components is tokenized, there's no guarantee that those components will work predictably.
- If you do enough pre- and post-processing (and aren't using any later components, as mentioned in #1), it may be simpler to just steal the PTBTokenizer implementation and write your own.
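As an illustration of the post-processing route, the following sketch re-splits already-produced tokens on an extra set of delimiters, emitting each delimiter as its own token. This is not a CoreNLP API; the class name, the resplit helper, and the delimiter set are assumptions made for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenPostSplitter {
    // Hypothetical delimiter set for this sketch: split tokens on '/' and '-'
    private static final String EXTRA_DELIMITERS = "/-";

    // Re-split tokens produced by the tokenizer on the extra delimiters,
    // keeping each delimiter character as a token of its own.
    public static List<String> resplit(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            StringBuilder current = new StringBuilder();
            for (char c : token.toCharArray()) {
                if (EXTRA_DELIMITERS.indexOf(c) >= 0) {
                    if (current.length() > 0) {
                        out.add(current.toString());
                        current.setLength(0);
                    }
                    out.add(String.valueOf(c)); // emit the delimiter itself
                } else {
                    current.append(c);
                }
            }
            if (current.length() > 0) {
                out.add(current.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [foo, /, bar, baz]
        System.out.println(resplit(Arrays.asList("foo/bar", "baz")));
    }
}
```

Note that the caveat in #1 still applies: any later CoreNLP components would see this modified token stream and may behave unpredictably.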
(There is a similar question on customizing apostrophe tokenization behavior: Stanford coreNLP - split words ignoring apostrophe.)