How to dismantle a CLOB in PySpark?
Problem description
I used Sqoop to import data from Oracle. The table has a column with the CLOB data type, which I mapped to String to get the data into HDFS. Now I have to dismantle the CLOB data and create a separate table for it in Hive.
I have the HDFS file in txt format. I can segregate the CLOB data and am hoping to build a DataFrame from it.
The CLOB is in the following format:
[name] Bob [Age] 21 [City] London [work] No,
[name] Steve [Age] 51 [City] London [work] Yes,
.....
around a million rows like this
from pyspark.sql.types import IntegerType, TimestampType
sc.setLogLevel("WARN")
log_txt=sc.textFile("/path/to/data/sample_data.txt")
header = log_txt.first()
log_txt = log_txt.filter(lambda line: line != header)
log_txt.take(10)
[u'0\\tdog\\t20160906182001\\tgoogle.com', u'1\\tcat\\t20151231120504\\tgoogle.com']
temp_var = log_txt.map(lambda k: k.split("\\t"))
log_df=temp_var.toDF(header.split("\\t"))
log_df = log_df.withColumn("field1Int", log_df["field1"].cast(IntegerType()))
log_df = log_df.withColumn("field3TimeStamp", log_df["field3"].cast(TimestampType()))
log_df.schema
StructType(List(StructField(field1,StringType,true),StructField(field2,StringType,true),StructField(field3,StringType,true),StructField(field4,StringType,true),StructField(field1Int,IntegerType,true),StructField(field3TimeStamp,TimestampType,true)))
This is how I created the DataFrame.
I need your help figuring out how to dismantle the CLOB, which is stored as a String, and create a table on top of it.
After dismantling, I expect the table to have columns like the following:
+---------+---------------+----------+-----+
|Name     |Age            |City      |Work |
+---------+---------------+----------+-----+
|Bob      |21             |London    |No   |
|Steve    |51             |London    |Yes  |
+---------+---------------+----------+-----+
Any help would be appreciated.
Recommended answer
Here it is:
import re
from pyspark.sql import Row

rdd = sc.parallelize(["[name] Bob [Age] 21 [City] London [work] No",
                      "[name] Steve [Age] 51 [City] London [work] Yes",
                      "[name] Steve [Age] [City] London [work] Yes"])

def clob_to_table(line):
    # Capture the text that follows each bracketed tag
    m = re.search(r"\[name\](.*)?\[Age\](.*)?\[City\](.*)?\[work\](.*)?", line)
    return Row(name=m.group(1), age=m.group(2), city=m.group(3), work=m.group(4))

rdd = rdd.map(clob_to_table)
df = spark.createDataFrame(rdd)
df.show()
+----+--------+-------+----+
| age|    city|   name|work|
+----+--------+-------+----+
| 21 | London |   Bob |  No|
| 51 | London | Steve | Yes|
|    | London | Steve | Yes|
+----+--------+-------+----+
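The output above still carries the spaces around each value, and the function in the answer would raise an error on any line that does not match the pattern (for example the `.....` filler line from the question). Below is a minimal sketch of one way to tighten the parsing step, written in plain Python so it can be tried without a cluster. The names `CLOB_PATTERN` and `parse_clob` are hypothetical, and the tag order and the trailing comma are assumptions taken from the sample rows in the question.

```python
import re

# Assumed layout, based on the sample rows: [name] .. [Age] .. [City] .. [work] ..
CLOB_PATTERN = re.compile(
    r"\[name\](.*?)\[Age\](.*?)\[City\](.*?)\[work\](.*)"
)

def parse_clob(line):
    """Return (name, age, city, work), or None if the line does not match."""
    m = CLOB_PATTERN.search(line)
    if m is None:
        return None
    # Trim surrounding spaces and the trailing comma seen in the sample data;
    # an empty field becomes None so that it can load as NULL.
    return tuple(g.strip(" ,") or None for g in m.groups())

sample = [
    "[name] Bob [Age] 21 [City] London [work] No,",
    "[name] Steve [Age] [City] London [work] Yes,",
    ".....",
]
parsed = [parse_clob(line) for line in sample]
print(parsed)
```

In Spark, a function like this could probably be applied with `rdd.map(parse_clob).filter(lambda t: t is not None)`, and the result passed to `spark.createDataFrame` together with a list of column names such as `["name", "age", "city", "work"]`. Whether dropping unmatched lines is acceptable depends on the data, so it may be worth counting them first.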