Question
Is there a faster way to search each line of one text file for occurrence in another text file, than by going line by line in both files?
I have two text files - one has ~2500 lines (let's call it TxtA), the other has ~86000 lines(TxtB). I want to search TxtB for each line in TxtA, and return the line in TxtB for each match found.
I currently have this setup as: for each line in TxtA, search TxtB line by line for a match. However this is taking a really long time to process. It seems like it would take 1-3 hours to find all the matches.
Here is my code...
private static void getGUIDAndType()
{
    try
    {
        Console.WriteLine("Begin.");
        System.Threading.Thread.Sleep(4000);
        String dbFilePath = @"C:\WindowsApps\CRM\crm_interface\data\";
        StreamReader dbsr = new StreamReader(dbFilePath + "newdbcontents.txt");
        List<string> dblines = new List<string>();
        String newDataPath = @"C:\WindowsApps\CRM\crm_interface\data\";
        StreamReader nsr = new StreamReader(newDataPath + "HolidayList1.txt");
        List<string> new1 = new List<string>();
        string dbline;
        string newline;
        List<string> results = new List<string>();
        while ((newline = nsr.ReadLine()) != null)
        {
            //Reset
            dbsr.BaseStream.Position = 0;
            dbsr.DiscardBufferedData();
            while ((dbline = dbsr.ReadLine()) != null)
            {
                newline = newline.Trim();
                if (dbline.IndexOf(newline) != -1)
                {//if found... get all info for now
                    Console.WriteLine("FOUND: " + newline);
                    System.Threading.Thread.Sleep(1000);
                    new1.Add(newline);
                    break;
                }
                else
                {//the first line of db does not contain this line...
                    //go to next dbline.
                    Console.WriteLine("Lines do not match - continuing");
                    continue;
                }
            }
            Console.WriteLine("Going to next new Line");
            System.Threading.Thread.Sleep(1000);
            //continue;
        }
        nsr.Close();
        Console.WriteLine("Writing to dbc3.txt");
        System.IO.File.WriteAllLines(@"C:\WindowsApps\CRM\crm_interface\data\dbc3.txt", results.ToArray());
        Console.WriteLine("Finished. Press ENTER to continue.");
        Console.WriteLine("End.");
        Console.ReadLine();
    }
    catch (Exception ex)
    {
        Console.WriteLine("Error: " + ex);
        Console.ReadLine();
    }
}
Please let me know if there is a faster way. Preferably something that would take 5-10 minutes... I've heard of indexing but didn't find much on this for txt files. I've tested regex and it's no faster than indexof. Contains won't work because the lines will never be exactly the same.
Thanks
Recommended answer
Edit: Note that I'm assuming it's reasonable to read at least one file into memory. You may want to swap the queries below around to avoid loading the "big" file into memory, but even 86,000 lines at (say) 1K per line is going to be less than 2G of memory, which is relatively little for doing something significant.
You're reading the "inner" file each time. There's no need for that. Load both files into memory and go from there. Heck, for exact matches you can do the whole thing in LINQ easily:
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
            join line2 in File.ReadLines(dbFilePath + "newdbcontents.txt")
            on line1 equals line2
            select line1;
var commonLines = query.ToList();
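For exact matches, the same result can also be sketched with an explicit hash lookup. This is a swapped-in technique (a HashSet<string>) rather than the join query itself, and it assumes lines should be compared after trimming, using ordinal comparison. The names ExactLineMatcher and FindCommonLines are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExactLineMatcher
{
    // Returns every line of 'candidates' whose trimmed text also occurs
    // (after trimming) somewhere in 'wanted'. Matching lines are returned
    // exactly as they appear in 'candidates'.
    public static List<string> FindCommonLines(
        IEnumerable<string> wanted, IEnumerable<string> candidates)
    {
        // Build the lookup once; each membership test is then O(1) on average.
        var lookup = new HashSet<string>(
            wanted.Select(line => line.Trim()), StringComparer.Ordinal);

        return candidates.Where(line => lookup.Contains(line.Trim())).ToList();
    }
}
```

It could be called as ExactLineMatcher.FindCommonLines(File.ReadLines(pathA), File.ReadLines(pathB)), where pathA and pathB stand in for the two file paths. Because it only finds whole-line matches, it does not cover the substring case described in the question.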
But for non-joins it's still simple; just read one file completely first (explicitly) and then stream the other:
// Eagerly read the "inner" file
var lines2 = File.ReadAllLines(dbFilePath + "newdbcontents.txt");
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
            from line2 in lines2
            where line2.Contains(line1)
            select line1;
var commonLines = query.ToList();
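Since the question asks for the matching line from TxtB and stops at the first hit, a variant along the following lines may be closer to what is wanted. It is only a sketch under those assumptions: it trims each search line as the original code does, skips empty lines, and returns the first containing line. SubstringLineMatcher, FindContainingLines, and Run are hypothetical names, and the three paths are supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SubstringLineMatcher
{
    // For each needle (a TxtA line), returns the first haystack line
    // (a TxtB line) that contains it as an ordinal substring.
    public static List<string> FindContainingLines(
        IEnumerable<string> needles, IReadOnlyList<string> haystack)
    {
        var results = new List<string>();
        foreach (var raw in needles)
        {
            var needle = raw.Trim();
            if (needle.Length == 0)
                continue; // an empty string is contained in every line

            var hit = haystack.FirstOrDefault(line => line.Contains(needle));
            if (hit != null)
                results.Add(hit);
        }
        return results;
    }

    // Reads the large file once into memory, streams the small one,
    // and writes the matching lines to the output file.
    public static void Run(string needlePath, string haystackPath, string outputPath)
    {
        var haystack = File.ReadAllLines(haystackPath);
        var matches = FindContainingLines(File.ReadLines(needlePath), haystack);
        File.WriteAllLines(outputPath, matches);
    }
}
```

With about 2,500 search lines and 86,000 candidate lines, the worst case is roughly 215 million in-memory substring checks, with no repeated file reads, console output, or sleeps in the loop.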
There's nothing clever here - it's just a really simple way of writing code to read all the lines in one file, then iterate over the lines in the other file and for each line check against all the lines in the first file. But even without anything clever, I strongly suspect it would perform well enough for you. Concentrate on simplicity, eliminate unnecessary IO, and see whether that's good enough before trying to do anything fancier.
Note that in your original code, you should be using "using" statements for your StreamReader variables, to ensure they get disposed properly. Using the above code makes it simple to not even need that though...
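If you do keep a StreamReader, the disposal pattern mentioned above might look like the following. This is a minimal sketch, not a rewrite of the original method; ReaderDisposal and ReadTrimmedLines are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReaderDisposal
{
    // Reads every line of the file at 'path', trimmed, into a list.
    public static List<string> ReadTrimmedLines(string path)
    {
        var lines = new List<string>();

        // The using statement disposes the reader (and closes the file)
        // when the block exits, even if ReadLine throws.
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line.Trim());
            }
        }

        return lines;
    }
}
```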