问题描述
这是一个资源分配的问题。我的目标是运行一个查询,以获取最优先转变为任何时隙。
该数据集是非常大的。在这个例子中,假设有各100班1000家公司(尽管真实数据还要大)。它们都加载到内存中,我需要对他们运行一个单一的LINQ to Objects查询:
VAR topShifts =
(从s轮班
其中,(从S2轮班
其中,s2.CompanyId == s.CompanyId和放大器;&安培; s.TimeSlot == s2.TimeSlot
排序依据s2.Priority
选择S2)。首先()。等于(S)
选择S).ToList();
现在的问题是,如果没有优化的LINQ to Objects将比较每一个对象在这两个集合,做一个交叉连接的所有1000×100对1000×100,这相当于10个十亿(百亿)的比较。我想是每一个公司内部只比较对象(如如果公司被收录在SQL表)。这应导致1000组100×100个对象,总共1000万美元(10,000,000)的比较。稍后,将线性扩展,而不是指数的公司数量的增长。
如 I4o技术会允许我做这样的事情,但不幸的是,我不吨有在这我执行这个查询的环境中使用自定义集合的奢侈品。另外,我只期望在任何给定的数据集运行此查询一次,所以一个持久索引的值是有限的。我期待使用扩展方法,将集团由公司的数据,然后运行对每个组的前pression。
全样本code:
公共结构转变
{
公共静态长迭代;
私人诠释companyId;
公众诠释CompanyId
{
获得{迭代++;返回companyId; }
集合{companyId =价值; }
}
公众诠释标识;
公众诠释时隙;
公众诠释优先;
}
类节目
{
静态无效的主要(字串[] args)
{
const int的公司= 1000;
const int的班次= 100;
Console.WriteLine(的String.Format({0}连×{1}班次,公司,班次));
VAR定时器= Stopwatch.StartNew();
Console.WriteLine(填充数据);
VAR的变化=新的名单,其中,移>();
对于(INT companyId = 0; companyId<公司; companyId ++)
{
对于(INT shiftId = 0; shiftId<班次; shiftId ++)
{
shifts.Add(新Shift键(){CompanyId = companyId,ID = shiftId,时隙= shiftId / 3,优先= shiftId%5});
}
}
Console.WriteLine(的String.Format(已完成在{0:N}毫秒,timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine(计算顶档);
VAR topShifts =
(从s轮班
其中,(从S2轮班
其中,s2.CompanyId == s.CompanyId和放大器;&安培; s.TimeSlot == s2.TimeSlot
排序依据s2.Priority
选择S2)。首先()。等于(S)
选择S).ToList();
Console.WriteLine(的String.Format(已完成在{0:N}毫秒,timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine(\ nShifts:);
的foreach(在shifts.Take变种移(20))
{
Console.WriteLine(的String.Format(C {0}标识{1} T {2} p {3},shift.CompanyId,shift.Id,shift.TimeSlot,shift.Priority));
}
Console.WriteLine(\ NTOP班次:);
的foreach(在topShifts.Take变种移(10))
{
Console.WriteLine(的String.Format(C {0}标识{1} T {2} p {3},shift.CompanyId,shift.Id,shift.TimeSlot,shift.Priority));
}
Console.WriteLine(的String.Format(\ n总计比较:{0:N},Shift.Iterations / 2));
Console.WriteLine(任意键继续);
Console.ReadKey();
}
}
示例输出:
问题:
- 如何分区查询(同时仍然执行作为一个单一的LINQ查询),以获得比较下降10十亿10万?
- 是否有解决一个子查询的问题,而不是一种更有效的方式是什么?
如何
VAR topShifts =从S在shifts.GroupBy(S => s.CompanyId)
从在s.GroupBy(B => b.TimeSlot)
选择a.OrderBy(P => p.Priority)。首先();
似乎得到了相同的输出,但100015的比较
与@杰夫的编辑,他只是我的一半比较: - )
This is a resource-allocation problem. My goal is to run a query to fetch the top-priority shift for any time-slot.
The dataset is very large. For this example, let’s say there are 100 shifts each for 1000 companies (though the real dataset is even larger). They are all loaded into memory, and I need to run a single LINQ to Objects query against them:
var topShifts =
(from s in shifts
where (from s2 in shifts
where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
orderby s2.Priority
select s2).First().Equals(s)
select s).ToList();
The problem is that without optimization, LINQ to Objects will compare each and every object in both sets, doing a cross-join of all 1,000 x 100 against 1,000 x 100, which amounts to 10 billion (10,000,000,000) comparisons. What I want is to compare only objects within each company (as if Company were indexed in a SQL table). This should result in 1000 sets of 100 x 100 objects for a total of 10 million (10,000,000) comparisons. The later would scale linearly rather than exponentially as the number of companies grows.
Technologies like I4o would allow me to do something like this, but unfortunately, I don’t have the luxury of using a custom collection in the environment in which I’m executing this query. Also, I only expect to run this query once on any given dataset, so the value of a persistent index is limited. I’m expecting to use an extension method which would group the data by company, then run the expression on each group.
Full Sample code:
public struct Shift
{
public static long Iterations;
private int companyId;
public int CompanyId
{
get { Iterations++; return companyId; }
set { companyId = value; }
}
public int Id;
public int TimeSlot;
public int Priority;
}
class Program
{
static void Main(string[] args)
{
const int Companies = 1000;
const int Shifts = 100;
Console.WriteLine(string.Format("{0} Companies x {1} Shifts", Companies, Shifts));
var timer = Stopwatch.StartNew();
Console.WriteLine("Populating data");
var shifts = new List<Shift>();
for (int companyId = 0; companyId < Companies; companyId++)
{
for (int shiftId = 0; shiftId < Shifts; shiftId++)
{
shifts.Add(new Shift() { CompanyId = companyId, Id = shiftId, TimeSlot = shiftId / 3, Priority = shiftId % 5 });
}
}
Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine("Computing Top Shifts");
var topShifts =
(from s in shifts
where (from s2 in shifts
where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
orderby s2.Priority
select s2).First().Equals(s)
select s).ToList();
Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine("\nShifts:");
foreach (var shift in shifts.Take(20))
{
Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
}
Console.WriteLine("\nTop Shifts:");
foreach (var shift in topShifts.Take(10))
{
Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
}
Console.WriteLine(string.Format("\nTotal Comparisons: {0:n}", Shift.Iterations/2));
Console.WriteLine("Any key to continue");
Console.ReadKey();
}
}
Sample output:
Questions:
- How can I partition the query (while still executing as a single LinQ query) in order to get the comparisons down from 10 billion to 10 million?
- Is there a more efficient way of solving the problem instead of a sub-query?
How about
var topShifts = from s in shifts.GroupBy(s => s.CompanyId)
from a in s.GroupBy(b => b.TimeSlot)
select a.OrderBy(p => p.Priority).First();
Seems to get the same output but 100015 comparisons
with @Geoff's edit he just halved my comparisons :-)
这篇关于如何划分一个LINQ到对象查询?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!