问题描述
我开发了一些工程模拟。这包括实现一些长方程,例如这个方程来计算橡胶材料中的应力:
T = mu *(
pow(l1 * pow(l1 * l2 * l3,-0.1e1 / 0.3e1)a)* a
*(
pow(l1 * l2 * l3, 0.1e1 / 0.3e1)
-l1 * l2 * l3 * pow(l1 * l2 * l3,-0.4e1 / 0.3e1)/ 0.3e1
)* pow(l1 * l2 * l3,0.1 e1 / 0.3e1)/ l1
-pow(l2 * pow(l1 * l2 * l3,-0.1e1 / 0.3e1),a)* a / l1 / 0.3e1
- pow(l1 * l2 * l3,-0.1e1 / 0.3e1),a)* a / l1 / 0.3e1
)/ a
+ K *(l1 * l2 * l3 - 0.1e1)* l2 * l3
)* N1 / l2 / l3
+(
mu *(
- pow(l1 * pow(l1 * l2 * l3,-0.1 e1 / 0.3e1)a)* a / l2 / 0.3e1
+ pow(l2 * pow(l1 * l2 * l3,-0.1e1 / 0.3e1),a)* a
*
pow(l1 * l2 * l3,-0.1e1 / 0.3e1)
-l1 * l2 * l3 * pow(l1 * l2 * l3,-0.4e1 / 0.3e1)/ 0.3e1
)* pow(l1 * l2 * l3,0.1e1 / 0.3e1)/ l2
-pow(l3 * pow(l1 * l2 * l3,-0.1e1 / 0.3e1),a)* a / l2 / 0.3e1
)/ a
+ K *(l1 * l2 * l3 - 0.1e1)* l1 * l3
)* N2 / l1 / l3
$ b b +(
mu *(
- pow(l1 * pow(l1 * l2 * l3,-0.1e1 / 0.3e1),a)* a / l3 / 0.3e1
- pow (l2 * pow(l1 * l2 * l3,-0.1e1 / 0.3e1),a)* a / l3 / 0.3e1
+ pow(l3 * pow(l1 * l2 * l3,-0.1e1 / 0.3 e1),a)* a
*(
pow(l1 * l2 * l3,-0.1e1 / 0.3e1)
-l1 * l2 * l3 * pow(l1 * l2 * l3 ,-0.4e1 / 0.3e1)/ 0.3e1
)* pow(l1 * l2 * l3,0.1e1 / 0.3e1)/ l3
)/ a
+ K * l2 * l3 - 0.1e1)* l1 * l2
)* N3 / l1 / l2;
我使用Maple生成C ++代码以避免错误(并节省时间与繁琐的代数)。由于此代码执行数千(如果不是百万)次,性能是一个问题。不幸的是,数学迄今为止简化了;长方程是不可避免的。
我可以采取什么方法来优化此实现?我正在寻找高级策略,
我使用g ++和进行编译 - enable-optimize = -O3
。
更新:
我知道有很多重复的表达式,将处理这些;我的测试到目前为止建议它。
l1,l2,l3,mu,a,K
都是正实数/ p>
我用一个等效变量替换了 l1 * l2 * l3
: J
。
将 pow(x,0.1e1 / 0.3e1)
替换为 cbrt(x)
是一个很好的建议。
这将在CPU上运行,在不久的将来,
编辑摘要
- 我的原始答案只是指出,代码包含了很多复制计算,并且许多权力涉及到1/3的因素。例如,
pow(x,0.1e1 / 0.3e1)
与cbrt(x)
li>
- 我的第二次编辑是错误的,我的第三次外推这个错误。这是什么让人们害怕改变从字母'M'开始的符号数学程序的类似oracle的结果。我已经勾出(即,
I am developing some engineering simulations. This involves implementing some long equations such as this equation to calculate stress in a rubber like material:
T = ( mu * ( pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a * ( pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1 ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l1 - pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1 - pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1 ) / a + K * (l1 * l2 * l3 - 0.1e1) * l2 * l3 ) * N1 / l2 / l3 + ( mu * ( - pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1 + pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a * ( pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1 ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l2 - pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1 ) / a + K * (l1 * l2 * l3 - 0.1e1) * l1 * l3 ) * N2 / l1 / l3 + ( mu * ( - pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1 - pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1 + pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a * ( pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1 ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l3 ) / a + K * (l1 * l2 * l3 - 0.1e1) * l1 * l2 ) * N3 / l1 / l2;
I use Maple to generate the C++ code to avoid mistakes (and save time with tedious algebra). As this code is executed thousands (if not millions) of times, the performance is a concern. Unfortunately the math only simplifies so far; the long equations are unavoidable.
What approach can I take to optimize this implementation? I'm looking for high-level strategies that I should be applying when implementing such equations, not necessarily specific optimizations for the example shown above.
I'm compiling using g++ with
--enable-optimize=-O3
.Update:
I know there are a lot of repeated expressions, I am using the assumption that the compiler would handle these; my tests so far suggest it does.
l1, l2, l3, mu, a, K
are all positive real numbers (not zero).I have replaced
l1*l2*l3
with an equivalent variable:J
. This did help improve performance.Replacing
pow(x, 0.1e1/0.3e1)
withcbrt(x)
was a good suggestion.This will be run on CPUs, In the near future this would likely run better on GPUs, but for now that option is not available.
解决方案Edit summary
- My original answer merely noted that the code contained a lot of replicated computations and that many of the powers involved factors of 1/3. For example,
pow(x, 0.1e1/0.3e1)
is the same ascbrt(x)
. - My second edit was just wrong, and and my third extrapolated on this wrongness. This is what makes people afraid to change the oracle-like results from symbolic math programs that start with the letter 'M'. I've stricken out (i.e., ) those edits and pushed them to the bottom of the current revision of this answer. However, I did not delete them. I'm human. It's easy for us to make a mistake.
- My fourth edit developed a very compact expression that correctly represents the convoluted expression in the question IF the parameters
l1
,l2
, andl3
are positive real numbers and ifa
is a non-zero real number. (We have yet to hear from the OP regarding the specific nature of these coefficients. Given the nature of the problem, these are reasonable assumptions.) - This edit attempts to answer the generic problem of how to simplify these expressions.
First things first
Maple and Mathematica sometimes miss the obvious. Even more importantly, the users of Maple and Mathematica sometimes make mistakes. Substituting "oftentimes", or maybe even "almost always", in lieu of "sometimes is probably closer to the mark.
You could have helped Maple simplify that expression by telling it about the parameters in question. In the example at hand, I suspect that
l1
,l2
, andl3
are positive real numbers and thata
is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions that are not valid in the complex numbers.How to simplify those big messes from symbolic math programs (this edit)
Symbolic math programs typically provide the ability to provide information about the various parameters. Use that ability, particularly if your problem involves division or exponentiation. In the example at hand, you could have helped Maple simplify that expression by telling it that
l1
,l2
, andl3
are positive real numbers and thata
is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions such as ab=(ab). This is only ifa
andb
are positive real numbers and ifx
is real. It is not valid in the complex numbers.Ultimately, those symbolic math programs follow algorithms. Help it along. Try playing with expanding, collecting, and simplifying before you generate code. In this case, you could have collected those terms involving a factor of
mu
and those involving a factor ofK
. Reducing an expression to its "simplest form" remains a bit of an art.When you get an ugly mess of generated code, don't accept it as a truth that you must not touch. Try to simplify it yourself. Look at what the symbolic math program had before it generated code. Look at how I reduced your expression to something much simpler and much faster, and how Walter's answer took mine several steps further. There is no magic recipe. If there was a magical recipe, Maple would have applied it and given the answer that Walter gave.
About the specific question
You are doing a lot of addition and subtraction in that calculation. You can get in deep trouble if you have terms that nearly cancel one another. You are wasting a lot of CPU if you have one term that dominates over the others.
Next, you are wasting a lot of CPU by performing repeated calculations. Unless you have enabled
-ffast-math
, which lets the compiler break some of the rules of IEEE floating point, the compiler will not (in fact, must not) simplify that expression for you. It will instead do exactly what you told it to do. At a minimum, you should calculatel1 * l2 * l3
prior to computing that mess.Finally, you are making a lot of calls to
pow
, which is extremely slow. Note that several of those calls are of the form (l1*l2*l3). Many of those calls topow
could be performed with a single call tostd::cbrt
:l123 = l1 * l2 * l3; l123_pow_1_3 = std::cbrt(l123); l123_pow_4_3 = l123 * l123_pow_1_3;
With this,
X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1)
becomesX * l123_pow_1_3
.X * pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
becomesX / l123_pow_1_3
.X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1)
becomesX * l123_pow_4_3
.X * pow(l1 * l2 * l3, -0.4e1 / 0.3e1)
becomesX / l123_pow_4_3
.
Maple did miss the obvious.
For example, there's a much easier way to write(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)
Assuming that
l1
,l2
, andl3
are real rather than complex numbers, and that the real cube root (rather than the principle complex root) are to be extracted, the above reduces to2.0/(3.0 * pow(l1 * l2 * l3, 1.0/3.0))
or
2.0/(3.0 * l123_pow_1_3)
Using
cbrt_l123
instead ofl123_pow_1_3
, the nasty expression in the question reduces tol123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = mu/(3.0*l123)*( pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3) + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1) + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2)) +K*(l123-1.0)*(N1+N2+N3);
Always double check, but always simplify as well.
Here are some of my steps in arriving at the above:
// Step 0: Trim all whitespace. T=(mu*(pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1+pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l2-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1+pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l3)/a+K*(l1*l2*l3-0.1e1)*l1*l2)*N3/l1/l2; // Step 1: // l1*l2*l3 -> l123 // 0.1e1 -> 1.0 // 0.4e1 -> 4.0 // 0.3e1 -> 3 l123 = l1 * l2 * l3; T=(mu*(pow(l1*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l1-pow(l2*pow(l123,-1.0/3),a)*a/l1/3-pow(l3*pow(l123,-1.0/3),a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l2/3+pow(l2*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l2-pow(l3*pow(l123,-1.0/3),a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l3/3-pow(l2*pow(l123,-1.0/3),a)*a/l3/3+pow(l3*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2; // Step 2: // pow(l123,1.0/3) -> cbrt_l123 // l123*pow(l123,-4.0/3) -> pow(l123,-1.0/3) // (pow(l123,-1.0/3)-pow(l123,-1.0/3)/3) -> 2.0/(3.0*cbrt_l123) // *pow(l123,-1.0/3) -> /cbrt_l123 l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T=(mu*(pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1-pow(l2/cbrt_l123,a)*a/l1/3-pow(l3/cbrt_l123,a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l2/3+pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2-pow(l3/cbrt_l123,a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l3/3-pow(l2/cbrt_l123,a)*a/l3/3+pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2; // Step 3: // Whitespace is nice. l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = (mu*( pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1 -pow(l2/cbrt_l123,a)*a/l1/3 -pow(l3/cbrt_l123,a)*a/l1/3)/a +K*(l123-1.0)*l2*l3)*N1/l2/l3 +(mu*(-pow(l1/cbrt_l123,a)*a/l2/3 +pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2 -pow(l3/cbrt_l123,a)*a/l2/3)/a +K*(l123-1.0)*l1*l3)*N2/l1/l3 +(mu*(-pow(l1/cbrt_l123,a)*a/l3/3 -pow(l2/cbrt_l123,a)*a/l3/3 +pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a +K*(l123-1.0)*l1*l2)*N3/l1/l2; // Step 4: // Eliminate the 'a' in (term1*a + term2*a + term3*a)/a // Expand (mu_term + K_term)*something to mu_term*something + K_term*something l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = (mu*( pow(l1/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l1 -pow(l2/cbrt_l123,a)/l1/3 -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3 +K*(l123-1.0)*l2*l3*N1/l2/l3 +(mu*(-pow(l1/cbrt_l123,a)/l2/3 +pow(l2/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l2 -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3 +K*(l123-1.0)*l1*l3*N2/l1/l3 +(mu*(-pow(l1/cbrt_l123,a)/l3/3 -pow(l2/cbrt_l123,a)/l3/3 +pow(l3/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l3))*N3/l1/l2 +K*(l123-1.0)*l1*l2*N3/l1/l2; // Step 5: // Rearrange // Reduce l2*l3*N1/l2/l3 to N1 (and similar) // Reduce 2.0/(3.0*cbrt_l123)*cbrt_l123/l1 to 2.0/3.0/l1 (and similar) l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = (mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1 -pow(l2/cbrt_l123,a)/l1/3 -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3 +(mu*(-pow(l1/cbrt_l123,a)/l2/3 +pow(l2/cbrt_l123,a)*2.0/3.0/l2 -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3 +(mu*(-pow(l1/cbrt_l123,a)/l3/3 -pow(l2/cbrt_l123,a)/l3/3 +pow(l3/cbrt_l123,a)*2.0/3.0/l3))*N3/l1/l2 +K*(l123-1.0)*N1 +K*(l123-1.0)*N2 +K*(l123-1.0)*N3; // Step 6: // Factor out mu and K*(l123-1.0) l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = mu*( ( pow(l1/cbrt_l123,a)*2.0/3.0/l1 -pow(l2/cbrt_l123,a)/l1/3 -pow(l3/cbrt_l123,a)/l1/3)*N1/l2/l3 + (-pow(l1/cbrt_l123,a)/l2/3 +pow(l2/cbrt_l123,a)*2.0/3.0/l2 -pow(l3/cbrt_l123,a)/l2/3)*N2/l1/l3 + (-pow(l1/cbrt_l123,a)/l3/3 -pow(l2/cbrt_l123,a)/l3/3 +pow(l3/cbrt_l123,a)*2.0/3.0/l3)*N3/l1/l2) +K*(l123-1.0)*(N1+N2+N3); // Step 7: // Expand l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1*N1/l2/l3 -pow(l2/cbrt_l123,a)/l1/3*N1/l2/l3 -pow(l3/cbrt_l123,a)/l1/3*N1/l2/l3 -pow(l1/cbrt_l123,a)/l2/3*N2/l1/l3 +pow(l2/cbrt_l123,a)*2.0/3.0/l2*N2/l1/l3 -pow(l3/cbrt_l123,a)/l2/3*N2/l1/l3 -pow(l1/cbrt_l123,a)/l3/3*N3/l1/l2 -pow(l2/cbrt_l123,a)/l3/3*N3/l1/l2 +pow(l3/cbrt_l123,a)*2.0/3.0/l3*N3/l1/l2) +K*(l123-1.0)*(N1+N2+N3); // Step 8: // Simplify. l123 = l1 * l2 * l3; cbrt_l123 = cbrt(l123); T = mu/(3.0*l123)*( pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3) + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1) + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2)) +K*(l123-1.0)*(N1+N2+N3);
Wrong answer, intentionally kept for humility
Note that this is stricken. It's wrong.
这篇关于当在C ++中实现长方程时,如何通过高级方法提高性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!
- My original answer merely noted that the code contained a lot of replicated computations and that many of the powers involved factors of 1/3. For example,