Suppose I have a table like this:
+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country | 1 | NAN | NAN | NAN | NAN | 100 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 20 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 80 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 4 | sex | 6 | 40 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 5 | sex | 7 | 60 |
+----------+------------+----------+------------+----------+------------+-------+
I want to LEFT JOIN it in pandas with the table below on the a_name column:
+----+---------+-------+-------+-------+
| id | a_name | c01 | c02 | c03 |
+----+---------+-------+-------+-------+
| 1 | country | dtr1 | dtr2 | dtr3 |
+----+---------+-------+-------+-------+
| 2 | region | dtc1 | dtc2 | dtc3 |
+----+---------+-------+-------+-------+
| 3 | city | dta1 | dta2 | dta3 |
+----+---------+-------+-------+-------+
| 4 | age | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5 | sex | dts1 | dts2 | dts3 |
+----+---------+-------+-------+-------+
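I want to add the columns c01, c02 and c03 to the first table for every value (country, region, city, age, sex) that appears in the columns a_name_0, a_name_1 and a_name_2. Obviously, I need to add three new columns for each of a_name_0, a_name_1 and a_name_2, otherwise my table would end up with a different number of rows. The remaining row values should be empty, NA, NaN, or whatever. Expected output: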
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c02_0 | c03_0 | id_qname_0 | a_name_1 | c01_1 | c02_1 | c03_1 | id_qname_1 | a_name_2 | c01_2 | c02_2 | c03_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country | dtCo1 | dtCo2 | dtCo3 | 1 | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | 70 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | dtr1 | dtr2 | dtr2 | 2 | city | dtc1 | dtc2 | dtc3 | NAN | NAN | NAN | NAN | NAN | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 2 | city | | | | NAN | NAN | | | | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 4 | sex | | | | 6 | 40 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 5 | sex | | | | 7 | 60 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
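Explanation:
I am building a data warehouse table that will be used for data analysis. The quotes table (the first table) should be populated with the information on the various quoted items (table 2) that needs to be represented visually.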
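To make the question reproducible, the two tables above can be written out as DataFrames. This is only a sketch: the names df1 and df2 are the ones the answer below relies on, and reading the NAN cells as np.nan is an assumption about how the data was loaded.

```python
import numpy as np
import pandas as pd

# First table from the question; NAN cells are assumed to be missing values.
df1 = pd.DataFrame({
    "a_name_0": ["country", "region", "region", "region", "region"],
    "id_qname_0": [1, 2, 2, 3, 3],
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "id_qname_1": [np.nan, np.nan, np.nan, 4, 5],
    "a_name_2": [np.nan, np.nan, np.nan, "sex", "sex"],
    "id_qname_2": [np.nan, np.nan, np.nan, 6, 7],
    "count": [100, 20, 80, 40, 60],
})

# Second (lookup) table, one row per a_name.
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a_name": ["country", "region", "city", "age", "sex"],
    "c01": ["dtr1", "dtc1", "dta1", "dtCo1", "dts1"],
    "c02": ["dtr2", "dtc2", "dta2", "dtCo2", "dts2"],
    "c03": ["dtr3", "dtc3", "dta3", "dtCo3", "dts3"],
})
```

Because a_name is unique in df2, a left join on it cannot add rows to df1.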
Best answer
Use:
import numpy as np
import pandas as pd

# df1 is the first table and df2 the lookup table from the question,
# with the NAN cells read in as np.nan
# move count into the index so only the a_name_*/id_qname_* columns are grouped
df1 = df1.set_index('count')
# group key for each column: the suffix after the last underscore ('0', '1', '2')
c = df1.columns.str.rsplit('_').str[-1]
# the id column of df2 is not needed for the join
df2 = df2.drop('id', axis=1)
# collect one joined DataFrame per suffix
dfs = []
# iterate over the suffixes (avoids groupby(..., axis=1), deprecated in pandas 2.1)
for i in c.unique():
    x = df1.loc[:, c == i]
    # rename a fresh copy of df2 on every pass so the suffixes do not accumulate
    # (c01_0, c01_1, ... rather than c01_0_1, c01_0_1_2)
    right = df2.copy()
    right.columns = [f'a_name_{i}'] + (df2.columns[1:] + f'_{i}').tolist()
    # left join keeps every row of df1
    x = x.merge(right, on=f'a_name_{i}', how='left')
    # set the joined values to NaN on repeated a_name values
    m = x.duplicated(subset=[x.columns[0]])
    x.iloc[m.to_numpy(), 2:] = np.nan
    # move the id_qname column to the end of its group
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    dfs.append(x)
# concatenate the groups side by side and restore count from the index
df = pd.concat(dfs, axis=1).assign(count=df1.index)
print(df)
  a_name_0 c01_0 c02_0 c03_0  id_qname_0 a_name_1  c01_1  c02_1  c03_1  \
0  country  dtr1  dtr2  dtr3           1      NaN    NaN    NaN    NaN
1   region  dtc1  dtc2  dtc3           2     city   dta1   dta2   dta3
2   region   NaN   NaN   NaN           2     city    NaN    NaN    NaN
3   region   NaN   NaN   NaN           3      age  dtCo1  dtCo2  dtCo3
4   region   NaN   NaN   NaN           3      age    NaN    NaN    NaN

   id_qname_1 a_name_2 c01_2 c02_2 c03_2  id_qname_2  count
0         NaN      NaN   NaN   NaN   NaN         NaN    100
1         NaN      NaN   NaN   NaN   NaN         NaN     20
2         NaN      NaN   NaN   NaN   NaN         NaN     80
3         4.0      sex  dts1  dts2  dts3         6.0     40
4         5.0      sex   NaN   NaN   NaN         7.0     60
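As a possible alternative to the merge loop, the same layout can be sketched with Series.map, which looks each name up in the second table directly. This is not the answer's method: the helper name add_lookup_columns and its parameters are hypothetical, and the sample frames are abbreviated versions of the tables in the question, with NAN assumed to mean a missing value.

```python
import numpy as np
import pandas as pd


def add_lookup_columns(wide, lookup, key="a_name", value_cols=("c01", "c02", "c03")):
    """Hypothetical helper: insert suffixed lookup columns after each `<key>_<n>` column."""
    indexed = lookup.set_index(key)  # requires unique key values in the lookup table
    out = {}
    for col in wide.columns:
        out[col] = wide[col]
        if not col.startswith(f"{key}_"):
            continue
        suffix = col.rsplit("_", 1)[-1]
        # True for every repeat of a name already seen higher up in the column
        is_repeat = wide[col].duplicated()
        for value_col in value_cols:
            # names absent from the lookup table (including NaN) map to NaN
            mapped = wide[col].map(indexed[value_col])
            # keep the value only on the first occurrence of each name
            out[f"{value_col}_{suffix}"] = mapped.mask(is_repeat)
    return pd.DataFrame(out)


# Abbreviated sample data (assumption: NAN in the question means missing).
wide = pd.DataFrame({
    "a_name_0": ["country", "region", "region", "region", "region"],
    "id_qname_0": [1, 2, 2, 3, 3],
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "id_qname_1": [np.nan, np.nan, np.nan, 4, 5],
    "count": [100, 20, 80, 40, 60],
})
lookup = pd.DataFrame({
    "a_name": ["country", "region", "city", "age"],
    "c01": ["dtr1", "dtc1", "dta1", "dtCo1"],
    "c02": ["dtr2", "dtc2", "dta2", "dtCo2"],
    "c03": ["dtr3", "dtc3", "dta3", "dtCo3"],
})

result = add_lookup_columns(wide, lookup)
print(result)
```

Because map never changes the number of rows, count can stay an ordinary column and no final concat is needed. The trade-off is one lookup per value column instead of one join per group.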
Regarding python - Python Pandas, merging multiple columns of one table with a single column of another table, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59444318/