本文介绍了不得已的性能优化策略的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此站点上已经存在很多性能问题,但据我所知,几乎所有问题都是针对特定问题的,而且范围很窄.几乎所有人都重复了该建议以避免过早的优化.

There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.

我们假设:

  • 代码已经正常运行
  • 选择的算法已经针对问题的情况进行了优化
  • 已对代码进行了评估,并且已将违规程序隔离了
  • 还将对所有优化尝试进行衡量,以确保它们不会使事情变得更糟

在这里我要寻找的是策略和技巧,这些技巧和技巧可以使关键算法中的最后几个部分都挤到最后几个百分点,而无所不用其极.

What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.

理想情况下,尝试使答案与语言无关,并在适用的情况下指出建议策略的任何缺点.

Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.

我将添加带有我自己的初步建议的回复,并期待Stack Overflow社区可以想到的其他事情.

I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.

推荐答案

好的,您正在将问题定义为似乎没有太多改进空间的地方.以我的经验,那是相当罕见的.我试图在1993年11月的Dobbs博士的一篇文章中对此进行解释,从一个常规设计良好,无明显浪费的简单程序开始,并对其进行了一系列优化,直到其挂钟时间从48秒减少到到1.1秒,并且源代码的大小减少了4倍.我的诊断工具.更改的顺序是这样的:

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:

  • 发现的第一个问题是使用列表集群(现在称为迭代器"和容器类")占了一半以上的时间.那些被替换为相当简单的代码,使时间减少到20秒.

  • The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.

现在,最大的时间获取者是更多的列表建立者.以百分比计,它以前没那么大,但是现在是因为消除了更大的问题.我找到了一种加快速度的方法,时间降到了17秒.

Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.

现在很难找到明显的罪魁祸首,但是有一些我可以做的较小的罪魁祸首,时间降到了13秒.

Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

现在我似乎撞墙了.这些样本准确地告诉了我它在做什么,但是我似乎找不到任何可以改进的地方.然后,我对程序的基本设计,事务驱动的结构进行了反思,并询问该程序正在执行的所有列表搜索是否实际上都是由问题的要求所强制执行的.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

然后我进行了重新设计,在该程序中,程序代码实际上是通过较小的一组源(通过预处理器宏)生成的,并且在该程序中,程序并不会不断地找出程序员知道的可以预料的事情.换句话说,不要解释"要做的事情的顺序,而要编译"它.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.

  • 重新设计完成,将源代码缩小4倍,时间减少到10秒.

现在,因为它变得如此之快,很难进行采样,所以我给它做10倍的工作,但接下来的时间是基于原始工作量的.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

  • 更多诊断表明,它花时间在队列管理上.内联这些可以将时间减少到7秒.

  • More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.

现在,我一直在做的诊断打印非常耗时.冲洗-4秒.

Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.

现在花费最大的时间是调用 malloc free .回收对象-2.6秒.

Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.

继续进行示例,我仍然发现并非严格必要的操作-1.1秒.

Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

总加速因子:43.6

Total speedup factor: 43.6

现在没有两个程序是相同的,但是在非玩具软件中,我总是看到这样的进展.首先,您会获得一些简单的东西,然后获得更多的困难,直到收益递减.然后,您获得的洞察力很可能会导致重新设计,开始新一轮的提速,直到您再次获得递减的收益.现在,在这一点上,我们可能想知道++ii++for(;;)while(1)是否更快:我在Stack Overflow上经常看到的这类问题.

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

P.S.可能想知道为什么我不使用探查器.答案是这些问题"中的几乎每个都是一个函数调用站点,这些站点会精确地采样.探查器,即使在今天,也几乎没有意识到这样的想法,即语句和调用指令比整个功能更重要的是定位和更容易修复.

P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

我实际上构建了一个探查器来执行此操作,但是对于与代码正在做的事情真正的亲密接触,没有什么可以替代您的手指.样本数量很少不是问题,因为发现的所有问题都没有很小到很容易被忽略的问题.

I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

已添加:jerryjvl请求了一些示例.这是第一个问题.它由少量单独的代码行组成,总共花费了一半以上的时间:

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

 /* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task)

这些正在使用列表集群ILST(类似于列表类).它们以通常的方式实现,信息隐藏"意味着该类的用户不必关心如何实现它们.编写这些行时(大约800行代码中),并未想到它们可能是瓶颈"(我讨厌那个词).他们只是推荐的做事方式.事后看来,很容易说 ,但是根据我的经验,所有性能问题都是这样.通常,最好避免产生性能问题.最好查找并修复所创建的对象,即使它们应该避免"(事后看来).我希望能带来一点味道.

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

这是第二个问题,分为两行:

Here is the second problem, in two separate lines:

 /* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask)
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn)

这些是通过在项目末尾附加项目来构建列表. (解决方法是将项目收集到数组中,并一次构建所有列表.)有趣的是,这些语句仅花费(即在调用堆栈上)原始时间的3/48,因此它们不在事实在一开始是一个大问题.但是,在消除第一个问题之后,它们花费了3/20的时间,因此成为了大鱼".一般来说,就是这样.

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.

我可能还会补充说,这个项目是从我帮助过的一个真实项目中提炼出来的.在该项目中,性能问题(加速方面)更加严重,例如在内部循环中调用数据库访问例程以查看任务是否完成.

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

参考添加:原始代码和重新设计的源代码都可以在 www中找到. ddj.com ,用于1993年,在文件9311.zip中,文件slug.asc和slug.zip.

REFERENCE ADDED:The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.

编辑2011/11/26:现在有一个 SourceForge项目,其中包含Visual C ++中的源代码以及有关如何操作的详尽描述它被调整了.它只经历了上述场景的前半部分,并没有遵循完全相同的顺序,但是仍然获得了2-3个数量级的加速比.

EDIT 2011/11/26:There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.

这篇关于不得已的性能优化策略的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-05 00:26