问题描述
所以我的问题是我有这个数据集:
so the problem I have is I have this dataset:
并且它显示企业在特定日期开展业务.我想要实现的是报告在哪一天添加了哪些业务.也许我在寻找一些答案:
and it shows the businesses are doing business in the specific days. what i want to achieve is to report which businesses are added on what day. Perhaps Im lookign for some answer as:
我设法使用这个 sql 整理了所有记录:
I managed to tide up all the records using this sql:
select [Date]
,Mnemonic
,securityDesc
,sum(cast(TradedVolume as money)) as TradedVolumSum
FROM SomeTable
group by [Date],Mnemonic,securityDesc
但我不知道如何将每一天的记录与前一天的记录进行比较,并将第二天的不存在记录导出到另一个表中.我厌倦了 sql 分区条款,但它使它变得复杂.我可以使用 sql 或 Pyspark sql python 组合.
but I dont know how to compare each days record with the other day and export the non existence record on the following day to another table. I tired sql over partition cluase but it makes it complex. I can either use sql or Pyspark sql python combination.
你能告诉我如何解决这个问题吗?
could you let me how I can resolve this problem?
推荐答案
以下是针对您的问题的数据框操作,您可能需要稍微调整一下,因为我没有它的示例数据,通过查看您的数据编写了代码,请告诉我这是否能解决您的问题:
Below is the dataframe operation for your question you might need to tweak a little bit as I dont have the sample data for it, written the code by seeing your data, please let me know if that solves your problem:
import pyspark.sql.functions as F
from pyspark.sql import Window
some_win = Window.partitionBy("securityDesc").orderBy(F.col("[date]").asc())
some_table.withColumn(
"buisness_added_day",
F.first(F.col("id")).over(some_win)
).select(
"buisness_added_day",
"securityDesc",
"TradedVolumSum",
"Mnemonic"
).distinct().orderBy("buisness_added_day").show()
这篇关于PySpark sql 比较每天的记录并报告差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!