Problem description
My question is this: how can I determine when it is safe to disable cache snooping when I am correctly using [pci_]dma_sync_single_for_{cpu,device} in my device driver?
I'm working on a device driver for a device which writes directly to RAM over PCI Express (DMA), and I am concerned about managing cache coherence. There is a control bit I can set when initiating DMA to enable or disable cache snooping during the DMA; clearly, for performance, I would like to leave cache snooping disabled if at all possible.
In the interrupt routine I call pci_dma_sync_single_for_cpu() and ..._for_device() as appropriate when switching DMA buffers, but on 32-bit Linux 2.6.18 (RHEL 5) it turns out that these calls are macros which expand to nothing ... which explains why my device returns garbage when cache snooping is disabled on this kernel!
I've trawled through the history of the kernel sources, and it seems that up until 2.6.25 only 64-bit x86 had hooks for DMA synchronisation. From 2.6.26 there seems to be a generic, unified indirection mechanism for DMA synchronisation (currently in include/asm-generic/dma-mapping-common.h) via the sync_single_for_{cpu,device} fields of dma_map_ops, but so far I've failed to find any definitions of these operations.
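For reference, the generic wrapper in question has roughly the following shape (an abridged sketch of include/asm-generic/dma-mapping-common.h from later 2.6 kernels, not a verbatim copy). If the architecture's dma_map_ops does not supply a sync_single_for_cpu hook, the call degenerates into a no-op, which would be consistent with the behaviour seen on cache-coherent x86:

/* Abridged sketch: the per-architecture dma_map_ops decides what,
 * if anything, a sync actually does.                                */
static inline void dma_sync_single_for_cpu(struct device *dev,
                                           dma_addr_t addr, size_t size,
                                           enum dma_data_direction dir)
{
        struct dma_map_ops *ops = get_dma_ops(dev);

        if (ops->sync_single_for_cpu)   /* often absent on coherent arches */
                ops->sync_single_for_cpu(dev, addr, size, dir);
}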
Recommended answer
I'm really surprised no one has answered this, so here we go with a non-Linux-specific answer (I have insufficient knowledge of the Linux kernel itself to be more specific) ...
Cache snooping simply tells the DMA controller to send cache invalidation requests to all CPUs for the memory being DMAed into. This obviously adds load to the cache coherency bus, and it scales particularly badly with additional processors, as not all CPUs will have a single-hop connection to the DMA controller issuing the snoop. Therefore, the simple answer to "when is it safe to disable cache snooping" is: when the memory being DMAed into either does not exist in any CPU cache OR its cache lines are marked as invalid. In other words, any attempt to read from the DMAed region will always result in a read from main memory.
So how do you ensure reads from a DMAed region will always go to main memory?
Back in the day, before we had fancy features like DMA cache snooping, what we used to do was pipeline DMA memory by feeding it through a series of broken-up stages as follows (a rough code sketch of the pipeline follows the list of stages):
Stage 1: Add the "dirty" DMA memory region to the "dirty and needs to be cleaned" DMA memory list.
Stage 2: The next time the device interrupts with fresh DMAed data, issue an async local CPU cache invalidate for the DMA segments in the "dirty and needs to be cleaned" list, for all CPUs which might access those blocks (often each CPU runs its own list made up of local memory blocks). Move said segments onto a "clean" list.
Stage 3: On the next DMA interrupt (which of course you're sure will not occur before the previous cache invalidate has completed), take a fresh region from the "clean" list and tell the device that its next DMA should go into it. Recycle any dirty blocks.
Stage 4: Repeat.
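A rough, non-authoritative sketch of that pipeline, written against the Linux list helpers purely for illustration. Here async_cache_invalidate() is a hypothetical stand-in for whatever platform-specific mechanism actually performs the asynchronous cache-line invalidation, and the other names are placeholders too:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/dma-mapping.h>

struct dma_region {                     /* one DMA-able buffer */
        struct list_head node;
        dma_addr_t bus_addr;
        void *cpu_addr;
        size_t len;
};

static LIST_HEAD(dirty_list);           /* CPU finished with it, caches stale    */
static LIST_HEAD(clean_list);           /* invalidate issued on a past interrupt */
static DEFINE_SPINLOCK(pipe_lock);

/* Hypothetical: kick off an asynchronous invalidate of the cache lines
 * covering [cpu_addr, cpu_addr + len) on every CPU that may hold them. */
void async_cache_invalidate(void *cpu_addr, size_t len);

/* Stage 1: the CPU is done reading a previously DMAed region, so its
 * cache lines are now populated and must be invalidated before reuse.  */
void dma_region_retire(struct dma_region *r)
{
        unsigned long flags;

        spin_lock_irqsave(&pipe_lock, flags);
        list_add_tail(&r->node, &dirty_list);
        spin_unlock_irqrestore(&pipe_lock, flags);
}

/* Called from the DMA-complete interrupt; returns the device's next target. */
struct dma_region *dma_pipeline_advance(void)
{
        struct dma_region *next, *r, *tmp;
        unsigned long flags;

        spin_lock_irqsave(&pipe_lock, flags);

        /* Stage 3: regions on the clean list had their invalidate issued
         * on an earlier interrupt, so it is assumed to have completed.   */
        next = list_first_entry_or_null(&clean_list, struct dma_region, node);
        if (next)
                list_del(&next->node);

        /* Stage 2: issue async invalidates for everything dirty and move
         * it to the clean list, to be handed out on a later interrupt.   */
        list_for_each_entry_safe(r, tmp, &dirty_list, node) {
                async_cache_invalidate(r->cpu_addr, r->len);
                list_move_tail(&r->node, &clean_list);
        }

        spin_unlock_irqrestore(&pipe_lock, flags);
        return next;    /* Stage 4: caller programs the device and repeats */
}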
As much as this is more work, it has several major advantages. Firstly, you can pin DMA handling to a single CPU (typically the primary CPU0) or to a single SMP node, which means only a single CPU/node need worry about cache invalidation. Secondly, you give the memory subsystem much more opportunity to hide memory latencies for you by spacing operations out over time and spreading the load on the cache coherency bus. The key for performance is generally to try to make any DMA occur on a CPU as close to the relevant DMA controller as possible, and into memory as close to that CPU as possible.
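As a hedged illustration of that last point, one way a Linux driver might keep the interrupt, the handling CPU and the buffers on the NUMA node nearest the device is sketched below; irq and pdev are assumed to come from the driver's probe routine, and this is a sketch of the idea rather than a recipe:

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/gfp.h>
#include <linux/topology.h>
#include <linux/mm.h>

static void *alloc_dma_buffer_near_device(struct pci_dev *pdev,
                                          unsigned int irq,
                                          unsigned int order)
{
        int node = dev_to_node(&pdev->dev);
        struct page *page;

        /* Suggest that the device's interrupt be serviced on the node
         * closest to the device (and hence to its DMA controller).    */
        if (node != NUMA_NO_NODE)
                irq_set_affinity_hint(irq, cpumask_of_node(node));

        /* Allocate the buffer from memory local to that same node.    */
        page = alloc_pages_node(node, GFP_KERNEL, order);
        return page ? page_address(page) : NULL;
}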
If you always hand newly DMAed-into memory off to user space and/or other CPUs, simply inject freshly acquired memory at the front of the async cache-invalidation pipeline. Some OSs (not sure about Linux) have an optimised routine for pre-ordering zeroed memory, so the OS basically zeroes memory in the background and keeps a quick-to-satisfy cache around - it will pay you to keep new memory requests below that cached amount, because zeroing memory is extremely slow. I'm not aware of any platform produced in the past ten years which uses hardware-offloaded memory zeroing, so you must assume that all fresh memory may contain valid cache lines which need invalidating.
I appreciate this only answers half your question, but it's better than nothing. Good luck!
Niall