




I would like your opinion on my "technique".

There are 2 text files (file_1 and file_2) that need to be compared to each other. Both are huge (3-4 gigabytes, 30,000,000 to 45,000,000 lines each). My idea is to read several lines (as many as possible) of file_1 into memory, then compare those to all lines of file_2. If there's a match, the matching lines from both files are written to a new file. Then I go on with the next 1000 lines of file_1 and compare those to all lines of file_2 as well, until I have gone through file_1 completely.

But this sounds really, really time-consuming and complicated to me. Can you think of any other method to compare those two files?

How long do you think the comparison could take? For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...

Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it? I want to read as many as possible (because I think that's faster), but I've run out of memory quite often.

Thanks in advance.

EDIT: I think I have to explain my problem a bit more.

The purpose is not to see if the two files in general are identical (they are not). There are some lines in each file that share the same "characteristic". Here's an example.

file_1 looks somewhat like this:

mat1 1000 2000 TEXT      //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT

file_2 looks like this:

mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT

TEXT refers to characters and digits that are of no interest to me. mat can go from mat1 to mat50, and the lines are in no particular order; there can also be 1000 lines with mat2 (but with different numbers in the next column). I need to find the fitting lines such that matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1. So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010). I hope this makes it clear to you!
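The matching rule described above can be written as a small predicate. This is only a sketch: the class name `LineMatcher` is made up for illustration, columns are assumed to be separated by whitespace, and both range bounds are assumed to be inclusive.

```java
final class LineMatcher {
    /**
     * Returns true if both lines carry the same matX key and the number
     * from the file_2 line lies inside the range from the file_1 line.
     */
    static boolean matches(String rangeLine, String pointLine) {
        String[] r = rangeLine.trim().split("\\s+", 4); // key, lower, upper, rest
        String[] p = pointLine.trim().split("\\s+", 3); // key, number, rest
        if (!r[0].equals(p[0])) {
            return false; // different matX, cannot match
        }
        long lower = Long.parseLong(r[1]);
        long upper = Long.parseLong(r[2]);
        long number = Long.parseLong(p[1]);
        return lower <= number && number <= upper; // inclusive bounds: an assumption
    }
}
```

With the sample data, `matches("mat3 10000 10010 TEXT", "mat3 10009 TEXT")` is true, while `matches("mat1 1000 2000 TEXT", "mat1 999 TEXT")` is false.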

So my question is: how would you search for the matching lines?

Yes, I use Java as my programming language.

EDIT: I have now divided the huge files first so that I have no problems with running out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-) Nonetheless, all your approaches were very helpful to me, thank you for your replies!


Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.

This should eliminate a substantial number of comparisons that the naive, brute-force approach would make even though they could never match. For the sake of argument, let's peg both files at 40 million lines each.

Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
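A minimal Java sketch of the partitioning pass might look like the following. The class name `Partitioner`, the output naming scheme `<prefix>_<key>`, and the UTF-8 encoding are assumptions made for illustration; it also assumes the first column is always a short key such as mat1 that is safe to use in a file name.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class Partitioner {
    /** Streams input once and appends each line to outDir/<prefix>_<firstColumn>. */
    static void partition(Path input, Path outDir, String prefix) throws IOException {
        // One open writer per key; with about 50 keys this stays well within OS limits.
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String key = line.trim().split("\\s+", 2)[0]; // e.g. "mat1"
                BufferedWriter w = writers.get(key);
                if (w == null) {
                    w = Files.newBufferedWriter(outDir.resolve(prefix + "_" + key),
                            StandardCharsets.UTF_8);
                    writers.put(key, w);
                }
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) {
                w.close(); // flushes the buffered output
            }
        }
    }
}
```

Calling `Partitioner.partition(file1, outDir, "file_1")` and then the same for file_2 with the prefix `"file_2"` would produce the per-key files in a single pass over each input, holding only one line in memory at a time.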

That's one pass through two files for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.

Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
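For a partition that does fit into memory, the sort can be sketched as below. This is the simple in-memory case only, not the external merge sort mentioned above; the class name `PartitionSorter` is hypothetical, and the second column is assumed to always parse as a `long`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PartitionSorter {
    /** Numeric value of the second whitespace-separated column. */
    static long secondColumn(String line) {
        return Long.parseLong(line.trim().split("\\s+", 3)[1]);
    }

    /** Loads one partition, sorts it by the second column, and writes the result. */
    static void sortInMemory(Path in, Path out) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(in, StandardCharsets.UTF_8));
        lines.removeIf(l -> l.trim().isEmpty()); // blank lines carry no key
        // Re-parses the key on every comparison; fine for a sketch, though
        // caching the parsed keys would be faster for large partitions.
        lines.sort(Comparator.comparingLong(PartitionSorter::secondColumn));
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}
```

If a partition turns out to be too large for the heap, the same comparator could be reused to sort fixed-size chunks that are then merged from disk.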

Comparison: Now you just have to iterate once through each pair of sorted files, such as file_1_mat1 and file_2_mat1, without needing to hold either file in memory, writing matches to your output file. Repeat for the rest of the partitions in turn. There is no need for a final 'merge' step (unless you're processing partitions in parallel).
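One way to sketch this single pass in Java is a sweep over the two sorted files. The names `SweepMatcher` and `Range` are invented for illustration, the bounds are assumed to be inclusive, the inputs are assumed to contain no blank lines, and the tab-separated output format is an arbitrary choice. Because ranges in file_1 might overlap, the sketch keeps the ranges that are still "open" in a small priority queue; if ranges never overlap, that queue holds at most one entry.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.PriorityQueue;

final class SweepMatcher {
    private static final class Range {
        final long lo, hi;
        final String line;
        Range(String line) {
            String[] f = line.trim().split("\\s+", 4);
            this.lo = Long.parseLong(f[1]);
            this.hi = Long.parseLong(f[2]);
            this.line = line;
        }
    }

    private static Range readRange(BufferedReader in) throws IOException {
        String line = in.readLine();
        return line == null ? null : new Range(line);
    }

    /** Both inputs must be sorted ascending by their second column. */
    static void match(Path sortedRanges, Path sortedPoints, Path out) throws IOException {
        // Ranges whose lower bound has been passed, ordered by upper bound.
        PriorityQueue<Range> active =
                new PriorityQueue<>(Comparator.comparingLong((Range r) -> r.hi));
        try (BufferedReader ranges = Files.newBufferedReader(sortedRanges, StandardCharsets.UTF_8);
             BufferedReader points = Files.newBufferedReader(sortedPoints, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            Range next = readRange(ranges);
            String point;
            while ((point = points.readLine()) != null) {
                long n = Long.parseLong(point.trim().split("\\s+", 3)[1]);
                while (next != null && next.lo <= n) { // ranges that start at or before n
                    active.add(next);
                    next = readRange(ranges);
                }
                while (!active.isEmpty() && active.peek().hi < n) {
                    active.poll(); // ended before n; cannot match later points either
                }
                for (Range r : active) { // every remaining range contains n
                    w.write(r.line + "\t" + point);
                    w.newLine();
                }
            }
        }
    }
}
```

Each input line is read exactly once, so the cost of this step is linear in the size of the two partition files plus the number of matches written.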

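Even without the sorting stage, the naive comparison you're already doing should run faster across 50 pairs of files with 800,000 lines each than across two files with 40 million lines each. Assuming the lines are spread evenly over the 50 keys, that is roughly 50 × 800,000² = 3.2 × 10^13 line comparisons instead of 40,000,000² = 1.6 × 10^15, about 50 times fewer.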

