Problem description
I tried a simple example like:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
It works well. But when I try something very similar, like:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)
It raises an error: AnalysisException: u"cannot resolve 'timedelta' given input columns: [ data_channel_is_tech,...
Of course, I imported LabeledPoint and LinearRegression.
What could be the problem?
Even a simpler case:
df_cleaned = df_cleaned.select("shares")
raises the same AnalysisException.
*Please note: df_cleaned.printSchema() works well.
Recommended answer
I found the issue: some of the column names contain white space before the name itself. So
data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
works. I could check for this with:
assert " " not in ''.join(df.columns)
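Note that the assert above also fails for names with internal spaces, such as "2014 Population estimate" from the first example. A slightly more targeted check, as a minimal sketch (the helper name `padded_columns` is made up for illustration), lists only the names that have leading or trailing white space:

```python
def padded_columns(columns):
    """Return the column names that have leading or trailing whitespace."""
    return [c for c in columns if c != c.strip()]

# Example with plain strings; with a DataFrame, pass df.columns instead.
print(padded_columns([" timedelta", " shares", "url"]))
```

An empty list means no column name is padded.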
Now I am looking for a way to remove the white spaces. Any idea is much appreciated!
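One possible approach, as a sketch rather than a tested answer: `DataFrame.toDF(*cols)` returns a new DataFrame with the given column names, so the stripped names can be passed straight to it. The helper name `strip_column_names` is hypothetical, and the sketch assumes that stripping does not make two names collide.

```python
def strip_column_names(df):
    """Return df with leading/trailing whitespace removed from every column name.

    Assumes df is a Spark DataFrame, i.e. it exposes .columns and .toDF(*names).
    """
    cleaned = [c.strip() for c in df.columns]
    # Guard against two names collapsing into one after stripping.
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("stripping whitespace would create duplicate column names")
    return df.toDF(*cleaned)
```

With this, `data = strip_column_names(data)` followed by `data.select("timedelta", "shares")` should resolve the columns without the leading space.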