Regex pattern search in a very large file


Problem description

I'd like to search for a pattern in a very large file (e.g. over 1 GB) that consists of a single line. It is not possible to load it into memory. Currently, I use BufferedReader to read into buffers (1024 chars). The main steps:

1. Read data into two buffers.
2. Search for the pattern in those buffers.
3. Increment a variable if the pattern was found.
4. Copy the second buffer into the first.
5. Load data into the second buffer.
6. Search for the pattern in both buffers.
7. Increment the variable if the pattern was found.
8. Repeat the steps above (starting from step 4) until EOF.
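The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the names TwoBufferSearch, fill and containsMatch and the chunkSize parameter are hypothetical, and the sketch returns a yes/no answer instead of incrementing a counter, because counting across overlapping windows would need extra bookkeeping to avoid counting the same match twice. Patterns that use anchors or lookaround may also behave differently at the window edges.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

final class TwoBufferSearch {

    /** Reads until buf is full or EOF is reached; returns the number of chars stored (0 at EOF). */
    static int fill(Reader in, char[] buf) throws IOException {
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break; // EOF
            }
            n += r;
        }
        return n;
    }

    /**
     * Returns true if some window made of two consecutive chunks contains a match.
     * Matches that do not fit into a single two-chunk window are NOT detected.
     */
    static boolean containsMatch(Reader in, Pattern pattern, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] first = new char[chunkSize];
        char[] second = new char[chunkSize];
        int firstLen = fill(in, first);              // step 1 (first buffer)
        while (firstLen > 0) {
            int secondLen = fill(in, second);        // steps 1 and 5 (second buffer)
            // The window is the first chunk followed by the second chunk.
            StringBuilder window = new StringBuilder(firstLen + secondLen);
            window.append(first, 0, firstLen).append(second, 0, secondLen);
            if (pattern.matcher(window).find()) {    // steps 2 and 6
                return true;
            }
            // Step 4: the second buffer becomes the first (swap instead of copy).
            char[] tmp = first;
            first = second;
            second = tmp;
            firstLen = secondLen;
        }
        return false;
    }
}
```

With a chunk size of c, a match is guaranteed to fit into some window only if it is at most c + 1 characters long. A longer match can be missed: with a chunk size of 4, the input "b", twenty "a" characters, then "b" is reported as not matching ba*b.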

That algorithm (two buffers) lets me avoid the situation where the searched piece of text is split across chunks. It works like a charm as long as the match is shorter than the combined length of the two buffers. For example, I can't handle the case where the match is longer, say as long as 3 buffers (I only have data in two buffers, so the match will fail!). What's more, such a case is easy to construct:

Prepare a 1 GB single-line file that consists of "baaaaaaa(....)aaaaab".
Search for the pattern ba*b.
The whole file matches the pattern!
I don't have to print the result; I only have to be able to say "Yes, I was able to find the pattern" or "No, I wasn't able to find it".
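For this particular pattern, a yes/no answer does not need the matched text at all. The sketch below is a hand-written state machine for ba*b only, not a general regex engine; the names StreamingBaStarB and containsMatch are hypothetical. It reads one character at a time and keeps a single boolean of state, so memory use stays constant no matter how long the match is.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class StreamingBaStarB {

    /** Returns true if the stream contains a substring of the form b, zero or more a, b. */
    static boolean containsMatch(Reader in) throws IOException {
        // True when the text read so far ends with 'b' followed by zero or more 'a'.
        boolean afterB = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == 'b') {
                if (afterB) {
                    return true;   // closing 'b' of b a* b
                }
                afterB = true;     // possible opening 'b'
            } else if (c != 'a') {
                afterB = false;    // any other character breaks the run
            }
            // 'a' leaves the state unchanged
        }
        return false;
    }
}
```

Because the method calls read() once per character, it is best given a buffered reader, such as the BufferedReader mentioned above. Generalizing this to arbitrary patterns would require compiling the regex to an automaton, which java.util.regex does not expose.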

Is it possible with Java? I mean:

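The ability to determine whether the pattern exists in the file (without loading the whole line into memory; see the case above).
A way to handle the case where the match is longer than a chunk.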
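One possible direction, sketched here under several assumptions, is to let java.util.regex work on a CharSequence view of a memory-mapped file, so that the file contents stay outside the Java heap and the operating system pages them in on demand. The names ByteBufferCharSequence and fileContains are hypothetical. The sketch assumes a single-byte encoding (ASCII or ISO-8859-1), and a single FileChannel.map call can cover at most Integer.MAX_VALUE bytes, so it would fit a 1 GB file but not arbitrarily large ones. java.util.regex is a backtracking engine, so some patterns may still be slow or exhaust the stack on a very long match.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Pattern;

/** Read-only CharSequence view of a ByteBuffer, treating each byte as one ISO-8859-1 char. */
final class ByteBufferCharSequence implements CharSequence {
    private final ByteBuffer buf;
    private final int start; // inclusive offset into buf
    private final int end;   // exclusive offset into buf

    ByteBufferCharSequence(ByteBuffer buf) {
        this(buf, 0, buf.limit());
    }

    private ByteBufferCharSequence(ByteBuffer buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        // Absolute get: does not change the buffer's position.
        return (char) (buf.get(start + index) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        return new ByteBufferCharSequence(buf, start + from, start + to);
    }

    @Override
    public String toString() {
        // Copies the viewed range onto the heap; avoid calling this on a huge range.
        byte[] bytes = new byte[length()];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = buf.get(start + i);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Maps the file read-only and reports whether the pattern occurs anywhere in it. */
    static boolean fileContains(Path file, Pattern pattern) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return pattern.matcher(new ByteBufferCharSequence(mapped)).find();
        }
    }
}
```

Only find() is used here, which keeps the sketch to a yes/no answer; asking the matcher for the matched text would copy that text onto the heap.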

I hope my explanation is pretty clear.


Recommended answer