这是我的桌子

python - 数量重复作为新列-LMLPHP

我希望它是以下内容,即通过将其他差异Quantity(shopID, productID) Quantity(shopID, productID)复制为新列Quantity_shopID_productID
python - 数量重复作为新列-LMLPHP

以下是我的代码:

from datetime import date
import pandas as pd
df=pd.DataFrame({"Date":[date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2)],
                "ShopID":[1,1,1,1,2,2,2,2],
                "ProductID":[1,1,2,2,1,1,2,2],
                "Quantity":[3,3,4,4,5,5,6,6]})


for sid in df.ShopID.unique():
    for pid in df.ProductID.unique():
        col_name='Quantity{}_{}'.format(sid,pid)
        print(col_name)
        df1=df[(df.ShopID==sid) & (df.ProductID==pid)][['Date','Quantity']]
        df1.rename(columns={'Quantity':col_name}, inplace=True)
        display(df1)
        df=df.merge(df1, how="left",on="Date")
        df.loc[(df.ShopID==sid) & (df.ProductID==pid),col_name]=None

print(df)


问题是,由于我在3年中有108种以上的(shopID, productID)组合,因此它的工作速度非常慢。是否有提高效率的方法?

最佳答案

这是一个pivotmerge问题,还有一些额外的问题:

# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])

# define the new column names
df['new_col'] = 'Quantity_'+df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)

# new data to merge:
pivot = df.pivot_table(index='Date',
                        columns='new_col',
                        values='Quantity')

# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)

# mask
mask = new_df['new_col'].values[:,None] == pivot.columns.values

# adding the None the values:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)


输出:

    Date                   ShopID    ProductID    Quantity  new_col         Quantity_1_1    Quantity_1_2    Quantity_2_1    Quantity_2_2
--  -------------------  --------  -----------  ----------  ------------  --------------  --------------  --------------  --------------
 0  2019-10-01 00:00:00         1            1           3  Quantity_1_1             nan               4               5               6
 1  2019-10-02 00:00:00         1            1           3  Quantity_1_1             nan               4               5               6
 2  2019-10-01 00:00:00         1            2           4  Quantity_1_2               3             nan               5               6
 3  2019-10-02 00:00:00         1            2           4  Quantity_1_2               3             nan               5               6
 4  2019-10-01 00:00:00         2            1           5  Quantity_2_1               3               4             nan               6
 5  2019-10-02 00:00:00         2            1           5  Quantity_2_1               3               4             nan               6
 6  2019-10-01 00:00:00         2            2           6  Quantity_2_2               3               4               5             nan
 7  2019-10-02 00:00:00         2            2           6  Quantity_2_2               3               4               5             nan




测试大小与实际数据相似的数据:

# 3 years dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')

# 12 Shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1,13), range(1,10)),
                                 names=('Date','ShopID', 'ProductID'))

# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity':np.random.randint(0,10, len(idx))},
                  index=idx).reset_index()


上面的代码在i5笔记本电脑上花费了大约10秒:-)

10-05 20:30