我下面总结了许多csv文件中的数据。
+----------+-----+------+------+------+
| Date | Loc | Hour | Rain | Wind |
+----------+-----+------+------+------+
| 1-Sep-19 | 1 | 1 | 184 | 65 |
| 1-Sep-19 | 1 | 3 | 126 | 64 |
| 2-Sep-19 | 1 | 1 | 112 | 63 |
| 2-Sep-19 | 1 | 2 | 155 | 0 |
| 2-Sep-19 | 1 | 3 | 186 | 50 |
| 3-Sep-19 | 1 | 2 | 154 | 79 |
| 3-Sep-19 | 1 | 3 | 143 | 61 |
| 1-Sep-19 | 2 | 1 | 187 | 73 |
| 1-Sep-19 | 2 | 2 | 173 | 63 |
| 1-Sep-19 | 2 | 3 | 186 | 63 |
| 3-Sep-19 | 2 | 1 | 172 | 56 |
| 3-Sep-19 | 2 | 2 | 156 | 56 |
| 3-Sep-19 | 2 | 3 | 176 | 79 |
+----------+-----+------+------+------+
收集特定
Rain
,Wind
和Day
的Location
和Hour
详细信息的随机观测值。请注意
Day
,Location
和Hour
中缺少的行。我想用
0
值填充丢失的数据。我想要的最终输出如下。+----------+-----+------+------+------+
| Date | Loc | Hour | Rain | Wind |
+----------+-----+------+------+------+
| 1-Sep-19 | 1 | 1 | 184 | 65 |
| 1-Sep-19 | 1 | 2 | 0 | 0 |
| 1-Sep-19 | 1 | 3 | 126 | 64 |
| 2-Sep-19 | 1 | 1 | 112 | 63 |
| 2-Sep-19 | 1 | 2 | 155 | 0 |
| 2-Sep-19 | 1 | 3 | 186 | 50 |
| 3-Sep-19 | 1 | 1 | 0 | 0 |
| 3-Sep-19 | 1 | 2 | 154 | 79 |
| 3-Sep-19 | 1 | 3 | 143 | 61 |
| 1-Sep-19 | 2 | 1 | 187 | 73 |
| 1-Sep-19 | 2 | 2 | 173 | 63 |
| 1-Sep-19 | 2 | 3 | 186 | 63 |
| 2-Sep-19 | 2 | 1 | 0 | 0 |
| 2-Sep-19 | 2 | 2 | 0 | 0 |
| 2-Sep-19 | 2 | 3 | 0 | 0 |
| 3-Sep-19 | 2 | 1 | 172 | 56 |
| 3-Sep-19 | 2 | 2 | 156 | 56 |
| 3-Sep-19 | 2 | 3 | 176 | 79 |
+----------+-----+------+------+------+
到目前为止,我已经尝试了以下代码
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
d1 = date(2019, 9, 1)
d2 = date(2019, 9, 3)
delta = d2 - d1
# below indexes were created to make the code more dynamic & scalable easily
idx_date = [(d1 + timedelta(days=i)) for i in range(delta.days + 1)]
idx_loc = np.arange(1,3)
idx_hour = np.arange(1,4)
cols =['Rain', 'Wind']
df_filled = df.reindex(index=[idx_date,idx_loc,idx_hour], columns=cols, fill_value=0)
但是,我得到一个错误。如何解决这个问题
最佳答案
首先有必要创建MultiIndex
,因此可以通过所有可能的值使用reindex
:
df = df.set_index(['Date','Loc','Hour'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux, fill_value=0).reset_index()
print (df)
Date Loc Hour Rain Wind
0 1Sep19 1 1 184 65
1 1Sep19 1 2 0 0
2 1Sep19 1 3 126 64
3 1Sep19 2 1 187 73
4 1Sep19 2 2 173 63
5 1Sep19 2 3 186 63
6 2Sep19 1 1 112 63
7 2Sep19 1 2 155 0
8 2Sep19 1 3 186 50
9 2Sep19 2 1 0 0
10 2Sep19 2 2 0 0
11 2Sep19 2 3 0 0
12 3Sep19 1 1 0 0
13 3Sep19 1 2 154 79
14 3Sep19 1 3 143 61
15 3Sep19 2 1 172 56
16 3Sep19 2 2 156 56
17 3Sep19 2 3 176 79
您的解决方案应更改:
df['Date'] = pd.to_datetime(df['Date'])
idx_date = pd.date_range('2019-09-01','2019-09-03')
idx_loc = np.arange(1,3)
idx_hour = np.arange(1,4)
cols =['Rain', 'Wind']
mux = pd.MultiIndex.from_product([idx_date,idx_loc,idx_hour],
names=['Date','Loc','Hour'])
df_filled = (df.set_index(['Date','Loc','Hour'])
.reindex(index=mux, columns=cols, fill_value=0)
.reset_index())
print (df_filled)
Date Loc Hour Rain Wind
0 2019-09-01 1 1 184 65
1 2019-09-01 1 2 0 0
2 2019-09-01 1 3 126 64
3 2019-09-01 2 1 187 73
4 2019-09-01 2 2 173 63
5 2019-09-01 2 3 186 63
6 2019-09-02 1 1 112 63
7 2019-09-02 1 2 155 0
8 2019-09-02 1 3 186 50
9 2019-09-02 2 1 0 0
10 2019-09-02 2 2 0 0
11 2019-09-02 2 3 0 0
12 2019-09-03 1 1 0 0
13 2019-09-03 1 2 154 79
14 2019-09-03 1 3 143 61
15 2019-09-03 2 1 172 56
16 2019-09-03 2 2 156 56
17 2019-09-03 2 3 176 79
关于python - Pandas 重新索引多列,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/57883965/