Problem description
I need to extract a table from Teradata (read-only access) to Parquet with Scala (2.11) / Spark (2.1.0). I'm building a DataFrame that loads successfully:
val df = spark.read.format("jdbc").options(options).load()
But df.show
gives me a NullPointerException:
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
I did a df.printSchema
and found out that the reason for this NPE is that the dataset contains null
values for (nullable = false)
columns (it looks like Teradata is giving me wrong information). Indeed, I can achieve a df.show
if I drop the problematic columns.
So, I tried specifying a new schema with all columns set to (nullable = true)
:
val new_schema = StructType(df.schema.map {
  case StructField(n, d, nu, m) => StructField(n, d, true, m)
})
val new_df = spark.read.format("jdbc").schema(new_schema).options(options).load()
But then I got:
org.apache.spark.sql.AnalysisException: JDBC does not allow user-specified schemas.;
I also tried to create a new DataFrame from the previous one, specifying the wanted schema:
val new_df = df.sqlContext.createDataFrame(df.rdd, new_schema)
But I still got an NPE when taking an action on the DataFrame.
Any idea how I could fix this?
Recommended answer
I think this is resolved in the latest Teradata JDBC driver jars. After all my research, I updated my Teradata jars (terajdbc4.jar and tdgssconfig.jar) to version 16.20.00.04 and changed the Teradata URL to:
teradata.connection.url=jdbc:teradata://hostname.some.com/TMODE=ANSI,CHARSET=UTF8,TYPE=FASTEXPORT,COLUMN_NAME=ON,MAYBENULL=ON
It worked after I added the Teradata URL properties COLUMN_NAME=ON and MAYBENULL=ON.
Now everything works fine.
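Putting the pieces together, here is a minimal sketch of the full read with the updated URL. The hostname, database, table name, credentials, and output path are placeholders, and the option names assume Spark's standard JDBC data source:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TeradataExtract").getOrCreate()

// COLUMN_NAME=ON and MAYBENULL=ON make the driver report column nullability
// correctly, so Spark no longer marks actually-nullable columns as
// (nullable = false), which was the cause of the NPE in df.show.
val url = "jdbc:teradata://hostname.some.com/" +
  "TMODE=ANSI,CHARSET=UTF8,TYPE=FASTEXPORT,COLUMN_NAME=ON,MAYBENULL=ON"

val df = spark.read.format("jdbc")
  .option("url", url)
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("dbtable", "mydb.mytable")   // placeholder table
  .option("user", "user")              // placeholder credentials
  .option("password", "password")
  .load()

df.write.parquet("/path/to/output")    // placeholder output path
```

This is a connection sketch and requires a live Teradata instance plus terajdbc4.jar and tdgssconfig.jar (16.20.00.04 or later) on the classpath to actually run.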
You can check the reference document here.