Problem description
We set up two identical HP Z840 workstations with the following specs:
- 2 x Xeon E5-2690 v4 @ 2.60GHz (Turbo Boost ON, HT OFF, total 28 logical CPUs)
- 32GB DDR4 2400 Memory, Quad-channel
and installed Windows 7 SP1 (x64) on one and Windows 10 Creators Update (x64) on the other.
Then we ran a small memory benchmark (code below, built with VS2015 Update 3, 64-bit architecture) which performs memory allocate-fill-free cycles simultaneously from multiple threads.
#include <Windows.h>
#include <vector>
#include <ppl.h>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, malloc, free
#include <cstring>  // memset

unsigned __int64 ZQueryPerformanceCounter()
{
    unsigned __int64 c;
    ::QueryPerformanceCounter((LARGE_INTEGER *)&c);
    return c;
}

unsigned __int64 ZQueryPerformanceFrequency()
{
    unsigned __int64 c;
    ::QueryPerformanceFrequency((LARGE_INTEGER *)&c);
    return c;
}

// Simple stopwatch built on the high-resolution performance counter.
class CZPerfCounter {
public:
    CZPerfCounter() : m_st(ZQueryPerformanceCounter()) {}
    void reset() { m_st = ZQueryPerformanceCounter(); }
    unsigned __int64 elapsedCount() { return ZQueryPerformanceCounter() - m_st; }
    unsigned long elapsedMS() { return (unsigned long)(elapsedCount() * 1000 / m_freq); }
    unsigned long elapsedMicroSec() { return (unsigned long)(elapsedCount() * 1000 * 1000 / m_freq); }
    static unsigned __int64 frequency() { return m_freq; }
private:
    unsigned __int64 m_st;
    static unsigned __int64 m_freq;
};

unsigned __int64 CZPerfCounter::m_freq = ZQueryPerformanceFrequency();

int main(int argc, char ** argv)
{
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    int ncpu = sysinfo.dwNumberOfProcessors;

    // Optional first argument overrides the number of threads.
    if (argc == 2) {
        ncpu = atoi(argv[1]);
    }

    {
        printf("No of threads %d\n", ncpu);

        try {
            concurrency::Scheduler::ResetDefaultSchedulerPolicy();
            int min_threads = 1;
            int max_threads = ncpu;
            concurrency::SchedulerPolicy policy
            (2 // two entries of policy settings
                , concurrency::MinConcurrency, min_threads
                , concurrency::MaxConcurrency, max_threads
            );
            concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
        }
        catch (concurrency::default_scheduler_exists &) {
            printf("Cannot set concurrency runtime scheduler policy (Default scheduler already exists).\n");
        }

        static int cnt = 100;
        static int num_fills = 1;
        CZPerfCounter pcTotal;

        // malloc/free
        printf("malloc/free\n");
        {
            CZPerfCounter pc;
            // Chunk sizes: 1 MB, 2 MB, 4 MB, 8 MB
            for (int i = 1 * 1024 * 1024; i <= 8 * 1024 * 1024; i *= 2) {
                // 50 tasks, each allocating, filling and freeing cnt chunks
                concurrency::parallel_for(0, 50, [i](size_t x) {
                    std::vector<void *> ptrs;
                    ptrs.reserve(cnt);
                    for (int n = 0; n < cnt; n++) {
                        auto p = malloc(i);
                        ptrs.emplace_back(p);
                    }
                    for (int f = 0; f < num_fills; f++) {
                        for (auto p : ptrs) {
                            memset(p, num_fills, i);
                        }
                    }
                    for (auto p : ptrs) {
                        free(p);
                    }
                });
                printf("size %4d MB, elapsed %8.2f s, \n", i / (1024 * 1024), pc.elapsedMS() / 1000.0);
                pc.reset();
            }
        }
        printf("\n");
        printf("Total %6.2f s\n", pcTotal.elapsedMS() / 1000.0);
    }
    return 0;
}
Surprisingly, the results on Windows 10 CU are much worse than on Windows 7. I plotted the results below for 1 MB and 8 MB chunk sizes, varying the number of threads from 2, 4, ... up to 28. Windows 7 slowed down only slightly as we increased the number of threads, whereas Windows 10 scaled far worse.
We have tried making sure all Windows updates are applied, updating drivers, and tweaking BIOS settings, without success. We also ran the same benchmark on several other hardware platforms, and all gave a similar curve for Windows 10. So it seems to be a problem with Windows 10.
Does anyone have similar experience, or know something about this (maybe we missed something)? This behavior has caused a significant performance hit in our multithreaded application.
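For anyone who wants to compare against a system without PPL, here is a minimal portable sketch of the same allocate-fill-free pattern. It is not the benchmark that produced the numbers in this post: it swaps `concurrency::parallel_for` for plain `std::thread` workers pulling task indices from a shared counter, and the names `alloc_fill_free` and `run_step` are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// One task: allocate `count` chunks, fill each once, then free them all.
// Reads back the last byte of every chunk so the fill stays observable;
// with a fill value of 1 the result equals the number of chunks filled.
std::size_t alloc_fill_free(std::size_t chunk_bytes, int count) {
    assert(chunk_bytes > 0 && count >= 0);
    std::vector<void*> ptrs;
    ptrs.reserve(static_cast<std::size_t>(count));
    for (int n = 0; n < count; ++n) {
        if (void* p = std::malloc(chunk_bytes)) ptrs.push_back(p);
    }
    std::size_t sum = 0;
    for (void* p : ptrs) {
        std::memset(p, 1, chunk_bytes);
        sum += static_cast<unsigned char*>(p)[chunk_bytes - 1];
    }
    for (void* p : ptrs) std::free(p);
    return sum;
}

// Runs `tasks` tasks spread over `threads` worker threads (hypothetical
// stand-in for parallel_for(0, 50, ...)) and returns elapsed seconds.
double run_step(int threads, int tasks, std::size_t chunk_bytes, int count) {
    assert(threads > 0);
    std::atomic<int> next{0};
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            // Each worker claims the next unclaimed task index.
            while (next.fetch_add(1) < tasks) alloc_fill_free(chunk_bytes, count);
        });
    }
    for (auto& th : pool) th.join();
    const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
    return dt.count();
}
```

Calling `run_step(n, 50, size, 100)` for n = 2, 4, ..., 28 and size = 1 MB to 8 MB should roughly mirror the sweep described above, though scheduling differs from the PPL runtime, so timings are not directly comparable.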
*** EDITED
Using https://github.com/google/UIforETW (thanks to Bruce Dawson) to analyze the benchmark, we found that most of the time is spent inside the kernel's KiPageFault. Digging further down the call tree, everything leads to ExpWaitForSpinLockExclusiveAndAcquire. It seems that lock contention is causing this issue.
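To get a feel for why KiPageFault dominates, a back-of-the-envelope count of first-touch page faults per size step helps. The sketch below is an estimate, not a measurement: it assumes 4 KB pages and that every chunk is backed by freshly committed pages that fault on first write. The constants 50 and 100 come from the benchmark (`parallel_for(0, 50, ...)` and `cnt`); the name `first_touch_faults` is made up.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageBytes     = 4096;  // assumed small-page size on x64
constexpr std::uint64_t kTasks         = 50;    // parallel_for(0, 50, ...)
constexpr std::uint64_t kChunksPerTask = 100;   // cnt in the benchmark

// Estimated page faults for one size step, if every page of every chunk
// faults exactly once when memset first writes to it.
constexpr std::uint64_t first_touch_faults(std::uint64_t chunk_bytes) {
    return kTasks * kChunksPerTask *
           ((chunk_bytes + kPageBytes - 1) / kPageBytes);
}

static_assert(first_touch_faults(1ull << 20) == 1280000,  "1 MB step");
static_assert(first_touch_faults(8ull << 20) == 10240000, "8 MB step");
```

Under these assumptions the 1 MB step alone takes about 1.3 million faults and the 8 MB step about 10 million, so any lock taken on the page-fault path is hit very frequently from every thread at once.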
*** EDITED
Collected Server 2012 R2 data on the same hardware. Server 2012 R2 is also worse than Win7, but still a lot better than Win10 CU.
*** EDITED
It happens in Server 2016 as well. I added the tag windows-server-2016.
*** EDITED
Using info from @Ext3h, I modified the benchmark to use VirtualAlloc and VirtualLock. I can confirm a significant improvement compared to when VirtualLock is not used. Overall, Win10 is still 30% to 40% slower than Win7 when both use VirtualAlloc and VirtualLock.
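For reference, a minimal sketch of what a VirtualAlloc + VirtualLock allocation could look like. This is not the modified benchmark itself; the helper names `reserve_working_set`, `alloc_locked` and `free_locked` are hypothetical. VirtualLock can only pin as many pages as the process's minimum working set allows, so the sketch raises the working set limits first. The non-Windows branch is a plain `malloc` fallback that exists only so the sketch compiles elsewhere; it locks nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#ifdef _WIN32
#include <Windows.h>
#endif

struct LockedBlock {
    void*       ptr;
    std::size_t size;
    bool        locked;  // true only if VirtualLock succeeded
};

// Grow the working set limits by `bytes` so VirtualLock has room to pin pages.
bool reserve_working_set(std::size_t bytes) {
#ifdef _WIN32
    SIZE_T min_ws = 0, max_ws = 0;
    HANDLE self = GetCurrentProcess();
    if (!GetProcessWorkingSetSize(self, &min_ws, &max_ws)) return false;
    return SetProcessWorkingSetSize(self, min_ws + bytes, max_ws + bytes) != 0;
#else
    (void)bytes;  // fallback: nothing to reserve
    return true;
#endif
}

// Commit `bytes` of memory and try to lock it into physical memory up front.
LockedBlock alloc_locked(std::size_t bytes) {
    LockedBlock b{nullptr, bytes, false};
#ifdef _WIN32
    b.ptr = VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (b.ptr) b.locked = VirtualLock(b.ptr, bytes) != 0;
#else
    b.ptr = std::malloc(bytes);  // fallback: plain allocation, not locked
#endif
    return b;
}

// Unlock (if locked) and release the block, then reset the handle.
void free_locked(LockedBlock& b) {
    if (!b.ptr) return;
#ifdef _WIN32
    if (b.locked) VirtualUnlock(b.ptr, b.size);
    VirtualFree(b.ptr, 0, MEM_RELEASE);
#else
    std::free(b.ptr);
#endif
    b = LockedBlock{nullptr, 0, false};
}
```

In the benchmark loop this would presumably replace the `malloc`/`free` pair, with `reserve_working_set` called once up front for the total amount to be locked. The idea, as far as I understand it, is that VirtualLock brings the whole range into the working set in one call, so the later `memset` does not fault page by page.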
Microsoft seems to have fixed this issue with Windows 10 Fall Creators Update and Windows 10 Pro for Workstation.
Here is the updated graph.
Win 10 FCU and WKS have lower overhead than Win 7. In exchange, VirtualLock seems to have higher overhead.