Problem description
I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully, I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here
I'm working on Ubuntu and the various component versions I have are
- Spark spark-1.5.1-bin-hadoop2.6
- Hadoop hadoop-2.6.1
- Mongo 2.6.10
- Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
- Python 2.7.10
I had some difficulty following the various steps, such as which jars to add to which path, so this is what I have added (a quick sanity check follows the list):
- in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
- the following environment variables
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
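To sanity-check this setup before involving Spark at all, a quick verification along these lines can help (a minimal sketch; the paths simply mirror the exports and jar location above):

# Confirm the exports are visible in the current shell
echo $HADOOP_HOME
echo $SPARK_HOME
# Confirm the mongo-hadoop Python bindings are importable via PYTHONPATH
python -c "import pymongo_spark; print(pymongo_spark.__file__)"
# Confirm the core jar ended up in the mapreduce directory
ls /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce | grep mongo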
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()


def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')


if __name__ == '__main__':
    main()
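As written, main() builds the RDD but never runs an action, so nothing is actually read from MongoDB. Once the classpath issue is resolved, a minimal extension along these lines (take(5) is just an arbitrary choice) would force a read:

def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')
    # Run a single action so documents are actually fetched from MongoDB
    print(rdd.take(5))
    sc.stop()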
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
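For reference, --driver-class-path above points at a directory rather than at specific jars; a hedged variant that names both jars explicitly (the snapshot file names and the build/libs location are assumptions, check your actual build output) might look like:

CORE_JAR=/usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce/mongo-hadoop-core-1.5.0-SNAPSHOT.jar
SPARK_JAR=/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar

# --jars takes a comma-separated list, --driver-class-path a colon-separated classpath
$SPARK_HOME/bin/spark-submit \
  --jars ${CORE_JAR},${SPARK_JAR} \
  --driver-class-path ${CORE_JAR}:${SPARK_JAR} \
  --master local[4] \
  ~/sparkPythonExample/SparkPythonExample.py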
and I get the following output:
Traceback (most recent call last):
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
    main()
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
    rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
    return self.mongoPairRDD(connection_string, config).values()
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
    _ensure_pickles(self)
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
    orig_tb)
py4j.protocol.Py4JError
There are two different ways to connect to MongoDB from Spark:
- mongodb/mongo-spark
- Stratio/Spark-MongoDB
While the former one seems to be relatively immature, the latter one looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
    .format("com.stratio.datasource.mongodb")
    .options(host="mongo:27017", database="foo", collection="bar")
    .load())
df.show()

## +---+----+--------------------+
## |  x|   y|                 _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
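As noted just below, this data source supports predicate pushdown, so filters can be expressed in the same DataFrame API; a minimal sketch continuing the example above (column names match the sample documents):

# With pushdown, the predicate can be evaluated by MongoDB itself
# rather than after loading the whole collection into Spark.
df.filter(df.x > 0.0).show()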
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration, and simply works.

Original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches the described configuration (I've omitted the Hadoop libraries for brevity, though). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:

git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub so you can simply docker pull zero323/mongo-spark.

Start the images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
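If anything misbehaves at this point, it is worth confirming that both containers are actually up before blaming Spark (a small sketch, nothing image-specific):

# The mongo container should be listed as running
docker ps --filter name=mongo
# Tail the MongoDB log to confirm it is accepting connections on 27017
docker logs --tail 20 mongo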
Start a PySpark shell passing --jars and --driver-class-path:

pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
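Here ${JARS} and ${SPARK_DRIVER_EXTRA_CLASSPATH} are environment variables defined inside the image; conceptually they expand to something like the following (illustrative placeholder paths, the real values come from the image's Dockerfile):

# Comma-separated list consumed by --jars (illustrative paths)
JARS=/path/to/mongo-hadoop-core-1.5.0-SNAPSHOT.jar,/path/to/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar
# Colon-separated classpath consumed by --driver-class-path (illustrative paths)
SPARK_DRIVER_EXTRA_CLASSPATH=/path/to/mongo-hadoop-core-1.5.0-SNAPSHOT.jar:/path/to/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar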
And finally see how it works:
import pymongo
import pymongo_spark

mongo_url = 'mongodb://mongo:27017/'

client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()

pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()

## [(1.0, -1.0), (0.0, 4.0)]
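Writing back goes through the same bindings: after activate(), pymongo_spark also patches RDDs with a saveToMongoDB method. A hedged sketch (the output collection name is arbitrary, and because of the connection caveat below it should be the first action run on that RDD):

# Convert the (x, y) tuples back into documents and write them out;
# 'foo.results' is just an example target collection.
(rdd.map(lambda xy: {"x": xy[0], "y": xy[1]})
    .saveToMongoDB('{0}foo.results'.format(mongo_url)))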
Please note that mongo-hadoop seems to close the connection after the first action, so calling for example rdd.count() after the collect will throw an exception.

Passing mongo-hadoop-core-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path appears to be the only hard requirement.

Notes:
- This image is loosely based on jaceklaskowski/docker-spark so please be sure to send some good karma to @jacek-laskowski if it helps.
- If you don't require a development version including the new API, then using --packages is most likely a better option (as sketched below).
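For example, the Stratio data source shown earlier is pulled exactly that way; a released mongo-hadoop artifact would be referenced similarly (the coordinates below are an assumption to verify against Maven Central before use):

# Dependencies are resolved at launch time instead of building SNAPSHOT jars locally;
# check the group/artifact/version before relying on them.
pyspark --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.4.0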