Question
Java's StreamTokenizer seems to be too greedy in identifying numbers. It is relatively light on configuration options, and I haven't found a way to make it do what I want. The following test passes, IMO showing a bug in the implementation; what I'd really like is for the second token to be identified as a word "20001_to_30000". Any ideas?
public void testBrokenTokenizer()
        throws Exception
{
    final String query = "foo_bah 20001_to_30000";
    StreamTokenizer tok = new StreamTokenizer(new StringReader(query));
    tok.wordChars('_', '_');

    // First token: a plain word, as expected
    assertEquals(StreamTokenizer.TT_WORD, tok.nextToken());
    assertEquals("foo_bah", tok.sval);

    // parseNumbers() is enabled by default, so the leading digits are consumed as a number
    assertEquals(StreamTokenizer.TT_NUMBER, tok.nextToken());
    assertEquals(20001.0, tok.nval, 0.0);

    // The rest of the intended word comes back as a separate token
    assertEquals(StreamTokenizer.TT_WORD, tok.nextToken());
    assertEquals("_to_30000", tok.sval);
}
FWIW I could use a StringTokenizer instead, but it would require a lot of refactoring.
Recommended answer
IMO, the best solution is using a Scanner, but if you want to force the venerable StreamTokenizer to work for you, try the following:
import java.io.*;
import java.util.regex.*;
...
final String query = "foo_bah 20001_to_30000\n2.001 this is line number 2 blargh";
StreamTokenizer tok = new StreamTokenizer(new StringReader(query));

// recreate standard syntax table
tok.resetSyntax();
tok.whitespaceChars('\u0000', '\u0020');
tok.wordChars('a', 'z');
tok.wordChars('A', 'Z');
tok.wordChars('\u00A0', '\u00FF');
tok.commentChar('/');
tok.quoteChar('\'');
tok.quoteChar('"');
tok.eolIsSignificant(false);
tok.slashSlashComments(false);
tok.slashStarComments(false);
//tok.parseNumbers();  // this WOULD be part of the standard syntax

// syntax additions: digits, '.' and '_' are now ordinary word characters
tok.wordChars('0', '9');
tok.wordChars('.', '.');
tok.wordChars('_', '_');

// create regex to verify numeric conversion in order to avoid having
// to catch NumberFormatException errors from Double.parseDouble()
Pattern double_regex = Pattern.compile("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?");

try {
    int type = StreamTokenizer.TT_WORD;

    while (type != StreamTokenizer.TT_EOF) {
        type = tok.nextToken();

        if (type == StreamTokenizer.TT_WORD) {
            String str = tok.sval;
            Matcher regex_match = double_regex.matcher(str);

            if (regex_match.matches()) {  // NUMBER
                double val = Double.parseDouble(str);
                System.out.println("double = " + val);
            }
            else {  // WORD
                System.out.println("string = " + str);
            }
        }
    }
}
catch (IOException err) {
    err.printStackTrace();
}
Essentially, you're taking the tokenizing of numeric values away from StreamTokenizer and doing it yourself: with parseNumbers() left out, digits are just word characters, so "20001_to_30000" arrives as a single TT_WORD token. The regex match is there so you don't have to rely on a NumberFormatException to tell you that Double.parseDouble() can't handle a given token.
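For comparison, here is a minimal sketch of the Scanner route mentioned at the top of this answer. The class name ScannerTokens and the helper classify() are made up for illustration, and the locale is pinned to Locale.US on the assumption that '.' should be read as the decimal separator. Scanner splits on whitespace by default and only reports a double when the whole token parses as one, so "20001_to_30000" stays in one piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper class, not part of the original answer.
class ScannerTokens {

    // Splits the input on whitespace and labels each token as a double or a string.
    static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            // Assumption: numbers use '.' as the decimal separator.
            sc.useLocale(Locale.US);
            while (sc.hasNext()) {
                if (sc.hasNextDouble()) {
                    // The whole token is a valid double
                    out.add("double = " + sc.nextDouble());
                } else {
                    // Anything else, including "20001_to_30000", is kept as one word
                    out.add("string = " + sc.next());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "foo_bah 20001_to_30000\n2.001 this is line number 2 blargh";
        for (String line : classify(query)) {
            System.out.println(line);
        }
    }
}
```

Note that Scanner knows nothing about quoted strings or comments, so if your input relies on those StreamTokenizer features, the resetSyntax() approach above is probably the closer fit.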