这是我的桌子
我希望它是以下内容,即通过将其他差异Quantity
的(shopID, productID)
Quantity
的(shopID, productID)
复制为新列Quantity_shopID_productID
。
以下是我的代码:
from datetime import date
import pandas as pd
df=pd.DataFrame({"Date":[date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2)],
"ShopID":[1,1,1,1,2,2,2,2],
"ProductID":[1,1,2,2,1,1,2,2],
"Quantity":[3,3,4,4,5,5,6,6]})
for sid in df.ShopID.unique():
for pid in df.ProductID.unique():
col_name='Quantity{}_{}'.format(sid,pid)
print(col_name)
df1=df[(df.ShopID==sid) & (df.ProductID==pid)][['Date','Quantity']]
df1.rename(columns={'Quantity':col_name}, inplace=True)
display(df1)
df=df.merge(df1, how="left",on="Date")
df.loc[(df.ShopID==sid) & (df.ProductID==pid),col_name]=None
print(df)
问题是,由于我在3年中有108种以上的
(shopID, productID)
组合,因此它的工作速度非常慢。是否有提高效率的方法? 最佳答案
这是一个pivot
和merge
问题,还有一些额外的问题:
# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])
# define the new column names
df['new_col'] = 'Quantity_'+df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)
# new data to merge:
pivot = df.pivot_table(index='Date',
columns='new_col',
values='Quantity')
# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)
# mask
mask = new_df['new_col'].values[:,None] == pivot.columns.values
# adding the None the values:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
输出:
Date ShopID ProductID Quantity new_col Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
-- ------------------- -------- ----------- ---------- ------------ -------------- -------------- -------------- --------------
0 2019-10-01 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
1 2019-10-02 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
2 2019-10-01 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
3 2019-10-02 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
4 2019-10-01 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
5 2019-10-02 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
6 2019-10-01 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
7 2019-10-02 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
测试大小与实际数据相似的数据:
# 3 years dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')
# 12 Shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1,13), range(1,10)),
names=('Date','ShopID', 'ProductID'))
# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity':np.random.randint(0,10, len(idx))},
index=idx).reset_index()
上面的代码在i5笔记本电脑上花费了大约10秒:-)