Problem description
I'd like to search for a pattern in a very large file (e.g. above 1 GB) that consists of a single line. It is not possible to load it into memory. Currently, I use BufferedReader to read into buffers (1024 chars). The main steps:
- Read data into two buffers
- Search for the pattern in those buffers
- Increment a counter if the pattern was found
- Copy the second buffer into the first
- Load new data into the second buffer
- Search for the pattern in both buffers
- Increment the counter if the pattern was found
- Repeat the above steps (starting from step 4) until EOF
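For reference, the steps above might be sketched roughly like this. It is only an illustration, not the code I actually run: the class name TwoBufferSearch, the chunkSize parameter and the readFully helper are made up for the example, and it only answers yes/no instead of counting, because overlapping windows would count the same match twice.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Pattern;

// Hypothetical sketch of the two-buffer approach described above.
final class TwoBufferSearch {

    /** Returns true if the pattern matches inside any window of two adjacent chunks. */
    static boolean find(Reader in, Pattern pattern, int chunkSize) throws IOException {
        char[] chunk = new char[chunkSize];
        String previous = "";                  // plays the role of the "first buffer"
        int n;
        while ((n = readFully(in, chunk)) > 0) {
            String current = new String(chunk, 0, n);   // the "second buffer"
            // Search the concatenation so a match split across the boundary is still seen.
            if (pattern.matcher(previous + current).find()) {
                return true;
            }
            previous = current;                // copy second buffer into first
        }
        return false;
    }

    /** Fills buf as far as possible; returns the number of chars read (0 at EOF). */
    private static int readFully(Reader in, char[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

With a chunk size of 4, the input "xxbaab" is found for ba*b because the match fits inside two chunks, while "xbaaaaaaaaab" is not found because the match spans three chunks.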
That algorithm (two buffers) lets me avoid the situation where the searched piece of text is split across chunks. It works like a charm as long as the match is shorter than the combined length of the two buffers. But I can't handle the case where the match is longer, say as long as 3 buffers (I only have data in two buffers, so the match will fail!). What's more, I can construct such a case:
- Prepare a 1 GB single-line file that consists of "baaaaaaa(....)aaaaab"
- Search for the pattern ba*b
- The whole file matches the pattern!
- I don't have to print the result; I only have to be able to say "Yes, I was able to find the pattern" or "No, I wasn't able to find it".
Is it possible with Java? I mean:
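- Being able to determine whether the pattern exists in the file (without loading the whole line into memory; see the case above)
- Finding a way to handle the case where the match is longer than a chunk.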
I hope my explanation is pretty clear.
Recommended answer
I think the solution for you would be to implement CharSequence as a wrapper over very large text files.
Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.
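To make that concrete, here is a tiny sketch showing that Pattern.matcher() is declared against CharSequence, so a String, a StringBuilder or a CharBuffer all work as input. The class and method names (MatcherInputDemo, found) are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the regex engine only needs a CharSequence, not a String.
final class MatcherInputDemo {

    /** Returns true if ba*b occurs anywhere in the given character sequence. */
    static boolean found(CharSequence input) {
        return Pattern.compile("ba*b").matcher(input).find();
    }
}
```

Calling found("xbaab"), found(new StringBuilder("xbaab")) or found(java.nio.CharBuffer.wrap("xbaab")) all go through the same code path, which is what makes a file-backed CharSequence plausible.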
Of course, easier said than done... But then you only have three methods to implement, so that shouldn't be too hard...
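A minimal sketch of what such a wrapper could look like follows. It is not the implementation mentioned in the edit below, and it leans on assumptions that are mine: the file uses a single-byte encoding such as ISO-8859-1 (so byte offset equals char index), it is smaller than 2 GB (length() returns an int), and only a 64 KB window is kept in memory. The names FileCharSequence, WINDOW, load and contains are all made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Pattern;

// Hypothetical file-backed CharSequence; assumes a single-byte encoding.
final class FileCharSequence implements CharSequence, AutoCloseable {
    private static final int WINDOW = 1 << 16;        // bytes cached at a time
    private final FileChannel channel;
    private final int length;
    private final ByteBuffer window = ByteBuffer.allocate(WINDOW);
    private long windowStart = -1;                    // file offset of window byte 0
    private int windowLen = 0;                        // valid bytes in the window

    FileCharSequence(Path path) throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.READ);
        long size = channel.size();
        if (size > Integer.MAX_VALUE) {
            channel.close();
            throw new IOException("file too large for an int-indexed CharSequence");
        }
        length = (int) size;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (windowStart < 0 || index < windowStart || index >= windowStart + windowLen) {
            load(index);
        }
        return (char) (window.get((int) (index - windowStart)) & 0xFF);
    }

    // Copies the requested range into memory, so keep the ranges small.
    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb;
    }

    // Reads the WINDOW-aligned block of the file that contains the given index.
    private void load(int index) {
        long start = ((long) index / WINDOW) * WINDOW;
        window.clear();
        try {
            long pos = start;
            while (window.hasRemaining()) {
                int n = channel.read(window, pos);
                if (n < 0) {
                    break;                            // end of file
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        windowStart = start;
        windowLen = window.position();
    }

    @Override public void close() throws IOException { channel.close(); }

    /** Convenience helper: does the regex match anywhere in the file? */
    static boolean contains(Path path, String regex) throws IOException {
        try (FileCharSequence seq = new FileCharSequence(path)) {
            return Pattern.compile(regex).matcher(seq).find();
        }
    }
}
```

Something like FileCharSequence.contains(path, "ba*b") would then give the yes/no answer without holding the whole line on the heap. Patterns that backtrack heavily may force the window to be reloaded often, so performance depends a lot on the regex.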
EDIT: I took the plunge and ate my own dog food. The "worst part" is that it actually works!