What is the equivalent of SQL OFFSET in Pyspark HiveContext?

Problem description
Or, a more specific question: how can I process large amounts of data that do not fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to fetch all the data, but OFFSET does not seem to be valid in HiveContext. What is the alternative usually used to achieve this goal?
For some context, the PySpark code starts with:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()
Your code will look like:
from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3,
               ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
        FROM tablename
    )
    SELECT column1, column2, column3
    FROM result
    WHERE RowNum >= OFFSETvalue AND RowNum < (OFFSETvalue + limitvalue)
""").show()
Note: Update the placeholders column1, column2, column3, columnname, tablename, OFFSETvalue, and limitvalue according to your requirement.
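For completeness, here is a minimal sketch of the paging loop the question was aiming for, built on the ROW_NUMBER query above. The names tablename, columnname, column1, column2, and column3 are the same placeholders as in the answer; page_size and offset are illustrative variables, not part of any API:

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)  # sc is the existing SparkContext

page_size = 10   # rows fetched per iteration
offset = 0       # rows already processed
while True:
    page = hiveContext.sql("""
        WITH result AS (
            SELECT column1, column2, column3,
                   ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
            FROM tablename
        )
        SELECT column1, column2, column3
        FROM result
        WHERE RowNum >= {0} AND RowNum < {1}
    """.format(offset + 1, offset + 1 + page_size))  # RowNum is 1-based
    rows = page.collect()  # only one page is pulled to the driver at a time
    if not rows:
        break              # no rows left: all pages processed
    for row in rows:
        pass               # process each row here
    offset += page_size

Note that every iteration re-runs the window query over the whole table, so this keeps driver memory bounded at the cost of repeated scans; materializing the numbered result once (for example into a temporary table) would avoid recomputing ROW_NUMBER for every page.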