Problem description
Suppose I have a single C/C++ app running on the host. There are a few threads running on the host CPU and 50 threads running on the Xeon Phi cores.
How can I make sure that each of these 50 threads runs on its own Xeon Phi core and is never purged from the core's cache (given the code is small enough)?
Could someone please outline a very general idea of how to do this and which tool/API would be most suitable (for C/C++ code)?
What is the fastest way to exchange data between the host thread-aggregator and the 50 Phi threads?
Given that the actual parallelism will be very limited, this application is going to be more like a plain 51-thread application with some basic multithreaded data synchronization.
Can I use a conventional C/C++ compiler to create an app like this?
You have raised several questions:
Yes, you can take a conventional C program and compile it using the regular Intel C/C++/Fortran compilers (known as Intel Composer XE) to generate a binary able to run on the Intel Xeon Phi co-processor in either "native"/"symmetric" or "offload" mode. In the simplest case you just recompile your C/C++ program with -mmic and run it "natively" on the Phi "as is".
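As a minimal native-mode sketch (the file and host names here are hypothetical; it assumes Intel Composer XE's icc and a card reachable over ssh as mic0):

```c
/* hello_phi.c - minimal native-mode check.
 * Hypothetical build/run steps, assuming Intel Composer XE:
 *   icc -mmic -openmp hello_phi.c -o hello_phi.mic
 *   scp hello_phi.mic mic0:/tmp/ && ssh mic0 /tmp/hello_phi.mic
 * (For native runs the MIC build of the OpenMP runtime, libiomp5.so,
 *  must also be available on the card.) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* On the coprocessor this reports the card's thread count
     * (e.g. ~240 hardware threads on a 60-core Phi). */
    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d OpenMP threads\n", omp_get_num_threads());
    }
    return 0;
}
```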
Which API to use? Use the OpenMP 4.0 standard or the Intel Cilk Plus programming model (essentially a set of pragmas or keywords applicable to C/C++). OpenCL, Intel TBB and possibly OpenACC are options as well, but OpenMP and Cilk Plus are capable of expressing threading, vectorization and offload (i.e. the 3 things essential for Xeon Phi programming) without re-factoring or rewriting a "conventional" C/C++/Fortran program.
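For instance, a short OpenMP 4.0 offload sketch (array names and sizes are arbitrary placeholders; the exact OpenMP flag spelling varies with compiler version):

```c
/* offload_sum.c - OpenMP 4.0 "target" offload of a parallel loop.
 * Hypothetical build: icc -openmp offload_sum.c -o offload_sum */
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Copy a and b to the coprocessor, run the loop there across its
     * threads, and copy c back to the host when the region ends. */
    #pragma omp target map(to: a, b) map(from: c)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}
```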
Thread pinning: it can be achieved via OpenMP affinity (see more details on MIC_KMP_AFFINITY below) or via Intel TBB's affinity facilities.
The fastest way to exchange data between the host and the target Phi is to avoid any exchange at all, for example by using the MPI symmetric approach. However, you seem to be asking about the "offload" programming model specifically: there, asynchronous offload gives the best achievable performance, while synchronous offload is theoretically simpler in terms of programming but worse in terms of achievable performance.
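As a sketch of the asynchronous variant, using the Intel compiler's offload pragmas with a signal/wait pair (the function and buffer names are hypothetical, and the exact clauses may differ between compiler versions):

```c
/* async_offload.c - start an offload, overlap host work, then wait.
 * Intel-compiler-specific pragmas; hypothetical build: icc async_offload.c */
#include <stdio.h>

#define N 1000000

/* Mark the buffers and function as available on the coprocessor side. */
__attribute__((target(mic))) static float in_buf[N], out_buf[N];

__attribute__((target(mic))) void process(float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

int main(void)
{
    char sig;  /* tag identifying this particular offload */

    for (int i = 0; i < N; i++) in_buf[i] = (float)i;

    /* Launch the offload and return to the host thread immediately. */
    #pragma offload target(mic:0) signal(&sig) \
            in(in_buf : length(N)) out(out_buf : length(N))
    process(in_buf, out_buf, N);

    /* ... host threads can do useful work here, overlapped with the Phi ... */

    /* Block until the offload tagged with &sig completes. */
    #pragma offload_wait target(mic:0) wait(&sig)

    printf("out_buf[10] = %f\n", out_buf[10]);
    return 0;
}
```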
Overall, you are asking several fairly general questions, so I would recommend starting from the very beginning - i.e. looking at the ~10-page Dr. Dobbs manual or Intel's intro document.
Thread pinning is a more advanced topic, and at the same time it seems to be the "most interesting" one for you, so I will explain it in more detail:
- If your code is parallelized using the OpenMP 4.0 standard, then you can achieve the desired behavior using MIC_KMP_AFFINITY / MIC_KMP_PLACE_THREADS for the Xeon Phi and KMP_AFFINITY / KMP_PLACE_THREADS for the host CPU.
- Quite likely you're looking for this specific setting: MIC_KMP_PLACE_THREADS=50c,1t (a small verification sketch follows after this list).
- I've seen people mention PHI_KMP_AFFINITY instead of MIC_KMP_AFFINITY. I believe they are aliases, but I haven't tried it myself.
- Using only 50 threads on a Xeon Phi is usually not the best idea; it's better to try around 120 threads or so.
- More details about affinity on Xeon Phi are explained in these 3 articles:
http://www.prace-project.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML#id-1.6.2.3
https://software.intel.com/en-us/articles/best-known-methods-for-using-openmp-on-intel-many-integrated-core-intel-mic-architecture
https://software.intel.com/en-us/articles/openmp-thread-affinity-control
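The verification sketch mentioned above: each offloaded thread reports the logical CPU it landed on, so you can check the 50c,1t placement in practice (the launch lines are hypothetical, and assume the MIC_-prefixed variables are forwarded to the card as described above):

```c
/* pinning_check.c - print where each offloaded OpenMP thread runs.
 * Hypothetical build/run, assuming the Intel compiler:
 *   icc -openmp pinning_check.c -o pinning_check
 *   MIC_KMP_PLACE_THREADS=50c,1t ./pinning_check
 * With 50c,1t the runtime should start 50 threads, one per core,
 * and keep each pinned to its own core. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    /* Offload a parallel region to the card; with the setting above,
     * consecutive runs should report a stable thread->CPU mapping. */
    #pragma omp target
    #pragma omp parallel
    printf("thread %3d on logical CPU %3d\n",
           omp_get_thread_num(), sched_getcpu());
    return 0;
}
```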