Spark SQL removing white spaces


Problem description

I have a simple Spark program that reads a JSON file and writes a CSV file. In the JSON data, the values contain leading and trailing whitespace, but when I write the CSV, that whitespace is gone. Is there a way to retain it? I tried several options, such as ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but had no luck.

input.json

{"key" : "k1", "value1": "Good String", "value2": "Good String"}
{"key" : "k1", "value1": "With Spaces      ", "value2": "With Spaces      "}
{"key" : "k1", "value1": "with tab\t", "value2": "with tab\t"}

output.csv

_corrupt_record,key,value1,value2
,k1,Good String,Good String
,k1,With Spaces,With Spaces
,k1,with tab,with tab

expected.csv

_corrupt_record,key,value1,value2
,k1,Good String,Good String
,k1,With Spaces      ,With Spaces
,k1,with tab\t,with tab\t

My code:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TestSpark {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession
                .builder()
                .appName(TestSpark.class.getName())
                .master("local[1]").getOrCreate();

        SparkContext context = sparkSession.sparkContext();
        context.setLogLevel("ERROR");
        SQLContext sqlCtx = sparkSession.sqlContext();
        System.out.println("Spark context established");

        // Explicit schema: every column is a nullable string
        List<StructField> kvFields = new ArrayList<>();
        kvFields.add(DataTypes.createStructField("_corrupt_record", DataTypes.StringType, true));
        kvFields.add(DataTypes.createStructField("key", DataTypes.StringType, true));
        kvFields.add(DataTypes.createStructField("value1", DataTypes.StringType, true));
        kvFields.add(DataTypes.createStructField("value2", DataTypes.StringType, true));
        StructType employeeSchema = DataTypes.createStructType(kvFields);

        // Read the JSON input using the schema above
        Dataset<Row> dataset =
                sparkSession.read()
                        .option("inferSchema", false)
                        .format("json")
                        .schema(employeeSchema)
                        .load("D:\\dev\\workspace\\java\\simple-kafka\\key_value.json");

        // Write the rows as CSV into a fresh, randomly named output directory
        dataset.createOrReplaceTempView("sourceView");
        sqlCtx.sql("select * from sourceView")
                .write()
                .option("header", true)
                .format("csv")
                .save("D:\\dev\\workspace\\java\\simple-kafka\\output\\" + UUID.randomUUID().toString());
        sparkSession.close();
    }
}

Update

Adding the POM dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.22</version>
    </dependency>
</dependencies>

Recommended answer

The CSV writer trims leading and trailing whitespace by default. You can turn that off with:

   sqlCtx.sql("select * from sourceView")
       .write()
       .option("header", true)
       .option("ignoreLeadingWhiteSpace", false)  // you need this
       .option("ignoreTrailingWhiteSpace", false) // and this
       .format("csv")
       .save("/my/file/location");

This works for me. If it doesn't work for you, can you post what you tried, and which Spark version you are using? If I remember correctly, this feature was only introduced last year.
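To make the effect of the two flags concrete, here is a minimal plain-Java sketch that models what they do to a single field. It is not Spark's implementation: the class name and the `formatField` helper are hypothetical, and it assumes that every character at or below a space (including tabs) counts as whitespace, which matches the output.csv in the question, where the trailing tab vanished too. As far as I recall, the CSV writer only started honoring these options in Spark 2.2.0, so with the 2.1.0 artifacts from the POM above they may simply have no effect.

```java
public class WhitespaceOptionsSketch {

    // Hypothetical helper: models the effect of the two writer flags on one field.
    // Assumption: any char <= ' ' (space, tab, ...) is whitespace, as in String.trim().
    static String formatField(String value, boolean ignoreLeading, boolean ignoreTrailing) {
        int start = 0;
        int end = value.length();
        if (ignoreLeading) {
            while (start < end && value.charAt(start) <= ' ') {
                start++;
            }
        }
        if (ignoreTrailing) {
            while (end > start && value.charAt(end - 1) <= ' ') {
                end--;
            }
        }
        return value.substring(start, end);
    }

    public static void main(String[] args) {
        // The three value1 strings from input.json
        String[] samples = {"Good String", "With Spaces      ", "with tab\t"};
        for (String s : samples) {
            String trimmed = formatField(s, true, true);   // both flags true: whitespace dropped
            String kept = formatField(s, false, false);    // both flags false: whitespace kept
            // Brackets make the otherwise invisible whitespace visible
            System.out.println("[" + trimmed + "] vs [" + kept + "]");
        }
    }
}
```

With both flags set to false the field passes through untouched, which is the behavior expected.csv asks for.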
