Question
I'm using a java.util.Scanner to scan for all occurrences of a given regex in a large string.
Scanner sc = new Scanner(body);
sc.useDelimiter("");
String match = "";
while (match != null) {
    match = sc.findWithinHorizon(pattern, 0);
    if (match == null) break;
    MatchResult mr = sc.match();
    System.out.println("Match string: " + mr.group());
    System.out.println("Match string using indexes: " + body.substring(mr.start(), mr.end()));
}
The strange thing is that after a certain number of scans, the group() method still returns the correct occurrence, while the start() and end() methods return wrong indexes, as if the scan had restarted from the beginning of the file. The regex is multiline (I use this regex to detect a line change: "\r\n|[\n\r\u2028\u2029\u0085]").
Do you have any hints? Could it be related to the "horizon" parameter? (I've tried different combinations for that value.)
For more detail: it seems to be related to the size of the file (more than 1000 chars). After about 1000 characters, the counter restarts from 0 (e.g. the first wrong index occurrence after 1003:1020 becomes 3:120).
Recommended answer
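Scanner uses an internal buffer with 1024 characters, which would explain why the indexes wrap around after roughly 1000 characters: start() and end() appear to be reported relative to that buffer, not to the original string. Use Pattern instead: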
Matcher matcher = Pattern.compile(...).matcher(body);
while (matcher.find()) {
    // start() is an offset into body itself
    int start = matcher.start();
}
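To make the suggestion above concrete, here is a minimal, self-contained sketch. It assumes the pattern is the line-break regex quoted in the question; the class name LineBreakOffsets, the helper findOffsets, and the generated test input are hypothetical and only there for illustration. Because the Matcher works directly on the full string, the offsets it reports can be passed straight to body.substring(), even when the input is well past 1024 characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBreakOffsets {
    // The line-break regex quoted in the question, escaped for a Java string literal.
    static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|[\\n\\r\\u2028\\u2029\\u0085]");

    // Hypothetical helper: returns {start, end} for every match.
    // The offsets are relative to body itself, not to any internal buffer.
    static List<int[]> findOffsets(String body, Pattern pattern) {
        List<int[]> offsets = new ArrayList<>();
        Matcher matcher = pattern.matcher(body);
        while (matcher.find()) {
            offsets.add(new int[] {matcher.start(), matcher.end()});
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Build an input well past 1024 characters (the size is arbitrary).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 300; i++) {
            sb.append("line ").append(i).append(i % 2 == 0 ? "\r\n" : "\n");
        }
        String body = sb.toString();

        for (int[] span : findOffsets(body, LINE_BREAK)) {
            // substring() with these offsets yields the matched line break itself.
            String viaIndexes = body.substring(span[0], span[1]);
            System.out.println(span[0] + ":" + span[1] + " length=" + viaIndexes.length());
        }
    }
}
```

If a MatchResult object is needed, as in the original loop, matcher.toMatchResult() can be called inside the while loop; its start() and end() are likewise offsets into body.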