How to eagerly commit allocated memory in C++?
The General Situation
An application that is extremely intensive in bandwidth, CPU usage, and GPU usage needs to transfer about 10-15 GB per second from one GPU to another. It's using the DX11 API to access the GPUs, so uploads to the GPU can only happen through buffers that require mapping for each single upload. The upload happens in chunks of 25 MB at a time, and 16 threads are writing to the mapped buffers concurrently. There's not much that can be done about any of this. The actual concurrency level of the writes should be lower, if it weren't for the following bug.
It's a beefy workstation with 3 Pascal GPUs, a high-end Haswell processor, and quad-channel RAM. Not much can be improved on the hardware. It's running a desktop edition of Windows 10.
The Actual Problem
Once I pass ~50% CPU load, something in MmPageFault() (inside the Windows kernel, called when accessing memory which has been mapped into your address space but was not yet committed by the OS) breaks horribly, and the remaining 50% of CPU load is wasted on a spin-lock inside MmPageFault(). The CPU becomes 100% utilized, and application performance completely degrades.
I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second, and which is also completely unmapped from the process every time a DX11 buffer is unmapped. Correspondingly, there are actually thousands of calls to MmPageFault() per second, happening sequentially as memcpy() writes sequentially through the buffer, one for each uncommitted page encountered.
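Back-of-the-envelope arithmetic (assuming standard 4 KiB pages; the chunk size and transfer rate are the figures quoted above) illustrates the scale:

#include <cstddef>

// One MmPageFault() trip per uncommitted 4 KiB page touched by memcpy():
constexpr std::size_t kPageSize       = 4 * 1024;               // 4 KiB
constexpr std::size_t kChunkSize      = 25 * 1024 * 1024;       // 25 MB upload chunk
constexpr std::size_t kFaultsPerChunk = kChunkSize / kPageSize; // = 6400 soft faults
// At 10-15 GB/s that is roughly 400-600 chunks per second, each paying
// thousands of sequential soft faults under the same kernel lock.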
Once the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel protecting the page management degrades performance completely.
Considerations
The buffer is allocated by the DX11 driver. Nothing can be tweaked about the allocation strategy. Use of a different memory API, and especially re-use, is not possible.
Calls to the DX11 API (mapping/unmapping the buffers) all happen from a single thread. The actual copy operations potentially happen multi-threaded, across more threads than there are virtual processors in the system.
Reducing the memory bandwidth requirements is not possible. It's a real-time application. In fact, the hard limit is currently the PCIe 3.0 16x bandwidth of the primary GPU. If I could, I would already need to push further.
Avoiding multi-threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.
The spin-lock performance degradation appears to be so rare (because the use case is pushing it that far) that on Google, you won't find a single result for the name of the spin-lock function.
Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short-term fix. Switching to a better OS kernel is currently not an option for the same reason.
Reducing the CPU load doesn't work either; there is too much work which needs to be done other than the (usually trivial and inexpensive) buffer copy.
The Question
What can be done?
I need to reduce the number of individual page faults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.
How can I ensure that the memory is committed with the least amount of transactions possible?
Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.
The Current State
// In the processing threads
{
    DX11DeferredContext->Map(..., &buffer);
    std::memcpy(buffer, source, size);
    DX11DeferredContext->Unmap(...);
}
Current workaround, simplified pseudo code:
// During startup
{
    // 64-bit literal: a plain int expression would overflow at 2 GiB.
    SetProcessWorkingSetSize(GetCurrentProcess(), 2ull * 1024 * 1024 * 1024, (SIZE_T)-1);
}
// In the DX11 render loop thread
{
    DX11context->Map(..., &resource);
    VirtualLock(resource.pData, resource.size);
    notify();
    wait();
    DX11context->Unmap(...);
}
// In the processing threads
{
    wait();
    std::memcpy(buffer, source, size);
    signal();
}
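The notify() / wait() / signal() calls above are only placeholders. Purely as an illustration, here is a minimal sketch of one way that hand-off could look using std::condition_variable; the BufferHandoff type and all its names are hypothetical, and it assumes every processing thread performs exactly one copy per mapped buffer:

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical hand-off object; the real application uses independent
// producer-consumer queues that cannot be merged like this.
struct BufferHandoff {
    std::mutex m;
    std::condition_variable cv;
    std::uint64_t generation = 0;  // bumped once per newly mapped buffer
    int pending = 0;               // copy threads still writing

    // Render-loop thread: buffer is mapped and VirtualLock()ed, wake the copiers.
    void notify(int copyThreads) {
        {
            std::lock_guard<std::mutex> lock(m);
            ++generation;
            pending = copyThreads;
        }
        cv.notify_all();
    }
    // Render-loop thread: block until every copy finished; Unmap() is safe afterwards.
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return pending == 0; });
    }
    // Processing thread: block until the next buffer generation is published.
    void wait_for_buffer(std::uint64_t& seen) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return generation != seen; });
        seen = generation;
    }
    // Processing thread: report one finished memcpy().
    void signal() {
        std::lock_guard<std::mutex> lock(m);
        if (--pending == 0)
            cv.notify_all();
    }
};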
VirtualLock() forces the kernel to back the specified address range with RAM immediately. The call to the complementary VirtualUnlock() function is optional; it happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about one third of the locking cost.)
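As a sketch with error handling added for illustration (pData and size stand for the fields returned by Map(); the function name is made up):

#include <windows.h>

// Commit a freshly mapped range with a single call instead of paying one
// soft fault per page. If the configured minimum working set is too small,
// VirtualLock() fails, typically with ERROR_WORKING_SET_QUOTA.
bool CommitMappedRange(void* pData, SIZE_T size)
{
    if (!VirtualLock(pData, size))
        return false;  // check GetLastError(); see SetProcessWorkingSetSize() below
    // ... processing threads memcpy() into the committed range ...
    // VirtualUnlock(pData, size);  // optional; happens implicitly on Unmap()
    return true;
}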
In order for VirtualLock() to work at all, SetProcessWorkingSetSize() needs to be called first, as the sum of all memory regions locked by VirtualLock() cannot exceed the minimum working set size configured for the process. Setting the "minimum" working set size to something higher than the baseline memory footprint of your process has no side effects unless your system is actually swapping; your process will still not consume more RAM than its actual working set size.
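A hedged sketch of the corresponding startup call (the 2 GiB figure mirrors the pseudocode above, and the 64-bit literal assumes a 64-bit build):

#include <windows.h>

// Raise the minimum working set so that later VirtualLock() calls have
// enough quota; the maximum is passed as (SIZE_T)-1, as in the workaround.
bool RaiseMinimumWorkingSet()
{
    const SIZE_T minimum = 2ull * 1024 * 1024 * 1024;  // 2 GiB
    const SIZE_T maximum = (SIZE_T)-1;
    return SetProcessWorkingSetSize(GetCurrentProcess(), minimum, maximum) != FALSE;
}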
Just the use of VirtualLock(), albeit in individual threads and using deferred DX11 contexts for the Map/Unmap calls, instantly decreased the performance penalty from 40-50% to a slightly more acceptable 15%.
Discarding the use of a deferred context, and exclusively triggering all soft faults, as well as the corresponding de-allocations when unmapping, on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.
Summary?
When you expect soft faults on Windows, try what you can to keep them all on the same thread. Performing the memcpy itself in parallel is unproblematic, and in some situations even necessary to fully utilize the memory bandwidth. However, that only holds if the memory is already committed to RAM. VirtualLock() is the most efficient way to ensure that.
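For comparison, the more common trick of pre-faulting a range by touching it still pays exactly the per-page cost that VirtualLock() avoids. A sketch (assuming 4 KiB pages and a write-only upload buffer, as DX11 mapped buffers typically are):

#include <cstddef>

// Manual pre-fault: commits the range, but still takes one soft fault
// (one MmPageFault() trip) per 4 KiB page, so it does not reduce the
// number of transactions the way a single VirtualLock() call does.
void TouchPages(volatile unsigned char* base, std::size_t size)
{
    for (std::size_t offset = 0; offset < size; offset += 4096)
        base[offset] = 0;  // one write per page, before the real memcpy()
}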
(Unless you are working with an API like DirectX which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc, your memory is pooled and recycled inside your process anyway, so soft faults are rare.)
Just make sure to avoid any form of concurrent page faults when working with Windows.