How to set delimiters for the PTB Tokenizer?

Problem description

I'm using the Stanford CoreNLP library for my project. It uses the PTB Tokenizer for tokenization. Consider a statement like this:

go to room no. #2145

or

go to room no. *2145

The tokenizer splits #2145 into two tokens: # and 2145. Is there any way to configure the tokenizer so that it doesn't treat # and * as delimiters?

Recommended answer

A quick solution is to use this option:

(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");

This causes the tokenizer to split only on whitespace, so characters such as # and * stay attached to the token they appear in. Do you need it to do anything other than tokenize on whitespace?
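To get a feel for what whitespace-only tokenization would mean for the example sentences, here is a minimal plain-Java sketch. It does not call CoreNLP at all; the class WhitespaceTokenizeSketch and its tokenize method are hypothetical names made up for illustration, and the code only approximates the behaviour by splitting on runs of whitespace.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of CoreNLP: approximates whitespace-only tokenization.
final class WhitespaceTokenizeSketch {

    // Split on runs of whitespace; every other character stays inside its token.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // "#2145" is kept as a single token.
        System.out.println(tokenize("go to room no. #2145"));
        // The trade-off: trailing punctuation is kept on the token as well.
        System.out.println(tokenize("go to room no. *2145."));
    }
}
```

The second call shows the cost of this approach: a sentence-final period remains glued to *2145, which is the kind of case the question at the end of the answer is probing for.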


10-27 05:27