Getting Spark, Python and MongoDB to work together

Problem description

I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully; I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here

I'm working on Ubuntu, and the various component versions I have are:


  • Spark spark-1.5.1-bin-hadoop2.6
  • Hadoop hadoop-2.6.1
  • Mongo 2.6.10
  • Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
  • Python 2.7.10

I had some difficulty following the various steps, such as which jars to add to which path, so this is what I have added:


    • in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
    • the following environment variables
      • export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
      • export PATH=$PATH:$HADOOP_HOME/bin
      • export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
      • export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
      • export PATH=$PATH:$SPARK_HOME/bin
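
      As a quick sanity check (a sketch of mine, not part of the original setup), you can verify that the exports are visible in the current shell and that pymongo_spark can be found; note that importing it also needs pyspark and pymongo reachable on PYTHONPATH:

          # Confirm the exports above have been sourced into this shell
          echo "$HADOOP_HOME"
          echo "$SPARK_HOME"
          echo "$PYTHONPATH"

          # pymongo_spark pulls in pyspark and pymongo, so a failure here
          # usually means PYTHONPATH is incomplete
          python -c "import pymongo_spark; print('pymongo_spark found')"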

      My Python program is basic:

      from pyspark import SparkContext, SparkConf
      import pymongo_spark
      pymongo_spark.activate()
      
      def main():
          conf = SparkConf().setAppName("pyspark test")
          sc = SparkContext(conf=conf)
          rdd = sc.mongoRDD(
              'mongodb://username:password@localhost:27017/mydb.mycollection')
      
      if __name__ == '__main__':
          main()
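
      As written, main() creates the RDD but never consumes it; a minimal, hypothetical continuation (a sketch, not part of the original program) that forces Spark to actually read from MongoDB would be:

          # Inside main(), after building the RDD: trigger an action so the
          # MongoDB read actually happens (Spark evaluates RDDs lazily).
          print(rdd.first())   # fetch a single document as a dict
          print(rdd.count())   # count the documents in the collection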
      

      I am running it using the command:

      $SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
      

      And I get the following output:

      Traceback (most recent call last):
        File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
          main()
        File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
          rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
        File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
          return self.mongoPairRDD(connection_string, config).values()
        File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
          _ensure_pickles(self)
        File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
          orig_tb)
      py4j.protocol.Py4JError
      

      According to here



    Since the original answer was posted, I have found two different ways to connect to MongoDB from Spark:

    • mongodb/mongo-spark
    • Stratio/Spark-MongoDB

    While the former seems to be relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.

    # Adjust Scala and package version according to your setup
    # although officially 0.11 supports only Spark 1.5
    # I haven't encountered any issues on 1.6.1
    bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
    
    df = (sqlContext.read
      .format("com.stratio.datasource.mongodb")
      .options(host="mongo:27017", database="foo", collection="bar")
      .load())
    
    df.show()
    
    ## +---+----+--------------------+
    ## |  x|   y|                 _id|
    ## +---+----+--------------------+
    ## |1.0|-1.0|56fbe6f6e4120712c...|
    ## |0.0| 4.0|56fbe701e4120712c...|
    ## +---+----+--------------------+
    

    It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
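
    For example (a sketch using the foo.bar collection loaded above), a filter on the DataFrame is pushed down to MongoDB as a query, so only the matching documents leave the database:

    # Predicate pushdown sketch: the filter below is translated into a
    # MongoDB query rather than being applied after a full collection scan.
    df.filter(df.x > 0.5).select("x", "y").show()

    ## Expected with the sample data above:
    ## +---+----+
    ## |  x|   y|
    ## +---+----+
    ## |1.0|-1.0|
    ## +---+----+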

    Original answer

    Indeed, there are quite a few moving parts here. I tried to make it a little more manageable by building a simple Docker image which roughly matches the described configuration (though I've omitted the Hadoop libraries for brevity). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:

    git clone https://github.com/zero323/docker-mongo-spark.git
    cd docker-mongo-spark
    docker build -t zero323/mongo-spark .
    

    or download an image I've pushed to Docker Hub, so you can simply docker pull zero323/mongo-spark:

    Start the images:

    docker run -d --name mongo mongo:2.6
    docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
    

    Start a PySpark shell passing --jars and --driver-class-path:

    pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
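
    Here ${JARS} and ${SPARK_DRIVER_EXTRA_CLASSPATH} are assumed to be set by the image to the mongo-hadoop build output; illustrative values (check the image's Dockerfile for the exact paths) would be:

    # Illustrative values only (assumed, not taken from the Dockerfile):
    # the jar built from the mongo-hadoop sources, used for both options.
    export JARS=/usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar
    export SPARK_DRIVER_EXTRA_CLASSPATH=$JARS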
    

    And finally see how it works:

    import pymongo
    import pymongo_spark
    
    mongo_url = 'mongodb://mongo:27017/'
    
    client = pymongo.MongoClient(mongo_url)
    client.foo.bar.insert_many([
        {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
    client.close()
    
    pymongo_spark.activate()
    rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
        .map(lambda doc: (doc.get('x'), doc.get('y'))))
    rdd.collect()
    
    ## [(1.0, -1.0), (0.0, 4.0)]
    

    Please note that mongo-hadoop seems to close the connection after the first action, so calling, for example, rdd.count() after the collect will throw an exception.
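
    One possible workaround (a sketch on my part, not from the original answer) is to cache the RDD before the first action, so later actions are served from the cached partitions instead of reopening the connection:

    # Mark the RDD for caching before any action runs.
    rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
        .map(lambda doc: (doc.get('x'), doc.get('y'))))
    rdd.cache()

    rdd.collect()   # first action computes and caches the partitions
    rdd.count()     # answered from the cache, no second MongoDB read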

    …and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar passed to both --jars and --driver-class-path is the only hard requirement.
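
    Applied to the spark-submit invocation from the question, that would look roughly like this (pointing both options at the actual jar rather than the bare directory; the path assumes the default Gradle build output):

    $SPARK_HOME/bin/spark-submit \
      --jars /usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
      --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
      --master local[4] ~/sparkPythonExample/SparkPythonExample.py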

    Notes


    • This image is loosely based on jaceklaskowski/docker-spark so please be sure to send some good karma to @jacek-laskowski if it helps.
    • If you don't require a development version including the new API, then using --packages is most likely a better option; see the sketch below.
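
    A sketch of that route; the Maven coordinates below are an assumption on my part, so verify the released mongo-hadoop artifacts on Maven Central before relying on them:

    # Pull a released connector via --packages instead of a locally
    # built SNAPSHOT (coordinates assumed; check Maven Central).
    bin/pyspark --packages org.mongodb.mongo-hadoop:mongo-hadoop-spark:1.5.2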
