This article explains how to read Parquet data from S3 into a Spark dataframe in Python. It should be a useful reference for anyone facing the same problem; read on to learn more.

Problem Description


I am new to Spark and I have not been able to find an answer to this... I have a lot of Parquet files uploaded into S3 at this location:

s3://a-dps/d-l/sco/alpha/20160930/parquet/

The total size of this folder is 20+ GB. How can I chunk this and read it into a dataframe? How do I load all these files into a dataframe?

The memory allocated to the Spark cluster is 6 GB.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    import pandas
    # SparkConf().set("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
    sc = SparkContext.getOrCreate()

    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

    sqlContext = SQLContext(sc)
    df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")

Error :


    Py4JJavaError: An error occurred while calling o33.parquet
    : java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)

 
Solution

The file system scheme (s3) that you are using is not correct. You'll need to use the s3n scheme, or s3a (for bigger S3 objects):

// use sqlContext instead of spark for Spark < 2
val df = spark.read 
              .load("s3n://bucket-name/object-path")
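In the asker's PySpark code, the fix is the same: change the scheme in the path string from `s3://` to `s3a://` (and configure the matching `fs.s3a.*` credential keys). As an illustration only, a small hypothetical helper (`to_s3a` is not a Spark API; it is just string rewriting with the standard library) could normalize existing paths:

```python
from urllib.parse import urlparse, urlunparse

def to_s3a(path: str) -> str:
    """Rewrite an s3:// or s3n:// URI to the s3a:// scheme; leave other schemes untouched."""
    parts = urlparse(path)
    if parts.scheme in ("s3", "s3n"):
        # ParseResult is a named tuple, so _replace swaps the scheme field
        parts = parts._replace(scheme="s3a")
    return urlunparse(parts)

print(to_s3a("s3://a-dps/d-l/sco/alpha/20160930/parquet/"))
# s3a://a-dps/d-l/sco/alpha/20160930/parquet/
```

The rewritten path can then be passed straight to `sqlContext.read.parquet(...)`, provided the s3a connector is on the classpath.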

I suggest that you read more about the Hadoop-AWS module: Integration with Amazon Web Services Overview.
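The commented-out `spark.jars.packages` line in the question hints at how the s3a connector is usually pulled in. A rough sketch of the equivalent `spark-defaults.conf` entries follows; the `hadoop-aws` version shown is a placeholder and must match the Hadoop build your Spark ships with, and the key values are dummies:

```
# spark-defaults.conf (illustrative; pin hadoop-aws to your Hadoop version)
spark.jars.packages              org.apache.hadoop:hadoop-aws:2.7.3
spark.hadoop.fs.s3a.access.key   YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key   YOUR_SECRET_KEY
```

With these set, the "No FileSystem for scheme" error should not occur for `s3a://` paths, since the connector class and credentials are available when the session starts.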

This concludes the article on how to read Parquet data from S3 into a Spark dataframe in Python. We hope the answer above is helpful; thank you for your continued support!
