Problem description
Because of Python's GIL, we cannot use threads to parallelize CPU-bound work, so my question is: how does Apache Spark make use of Python in a multi-core environment?
Recommended answer
Python's multi-threading limitations are separate from Apache Spark's internals: parallelism in Spark is handled inside the JVM.
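To see this in practice, here is a minimal sketch (an illustration, not the answer's own code: the master URL, app name, function and partition count are all made up) of a CPU-bound PySpark job. The JVM schedules one task per partition across the local cores, and each task runs the Python function in its own worker process, so no single interpreter's GIL becomes a bottleneck:

from pyspark import SparkContext

def cpu_bound(n):
    # deliberately CPU-heavy pure-Python work
    total = 0
    for i in range(200_000):
        total += (n * i) % 7
    return total

if __name__ == "__main__":
    # local[*] asks the JVM scheduler to use all available local cores
    sc = SparkContext(master="local[*]", appName="gil-demo")
    # 16 partitions -> up to 16 tasks that the JVM can run in parallel
    results = sc.parallelize(range(16), numSlices=16).map(cpu_bound).collect()
    print(results)
    sc.stop()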
The reason this works is that in the Python driver program, the SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
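This wiring is visible from the Python side, though only through private attributes. The sketch below pokes at sc._gateway and sc._jsc, which are internal implementation details of pyspark.SparkContext (not public API) and may change between versions; it is shown purely for illustration:

from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="py4j-demo")

# The Py4J gateway the Python driver uses to talk to the JVM it launched
print(type(sc._gateway))   # internal attribute: a py4j JavaGateway

# The JavaSparkContext living on the JVM side, reached through a Py4J proxy
print(type(sc._jsc))       # internal attribute: a Py4J proxy object

sc.stop()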
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java. On remote worker machines, PythonRDD objects launch Python sub-processes and communicate with them using pipes, sending the user's code and the data to be processed.
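A small experiment makes the worker-side picture concrete. Assuming a SparkContext named sc created with local[*] (as in the earlier sketch; the function name and partition count are illustrative), tagging each partition with the process ID of the Python worker that handled it usually shows several distinct PIDs; that is, the CPU-bound work is spread over separate processes, so no single interpreter's GIL serializes it:

import os

def tag_with_pid(iterator):
    # report which Python worker process handled this partition
    yield (os.getpid(), sum(iterator))

pairs = (
    sc.parallelize(range(1_000_000), numSlices=8)
      .mapPartitions(tag_with_pid)
      .collect()
)
# usually prints more than one distinct worker PID
print(sorted({pid for pid, _ in pairs}))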
PS: I'm not sure if this actually answers your question completely.