Suppose I have a table like this:

+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country  | 1          | NAN      | NAN        | NAN      | NAN        | 100   |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 20    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 2          | city     | NAN        | NAN      | NAN        | 80    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 4          | sex      | 6          | 40    |
+----------+------------+----------+------------+----------+------------+-------+
| region   | 3          | age      | 5          | sex      | 7          | 60    |
+----------+------------+----------+------------+----------+------------+-------+


I want to LEFT JOIN it in pandas with the following table, on the a_name column:

+----+---------+-------+-------+-------+
| id | a_name  | c01   | c02   | c03   |
+----+---------+-------+-------+-------+
| 1  | country | dtr1  | dtr2  | dtr3  |
+----+---------+-------+-------+-------+
| 2  | region  | dtc1  | dtc2  | dtc3  |
+----+---------+-------+-------+-------+
| 3  | city    | dta1  | dta2  | dta3  |
+----+---------+-------+-------+-------+
| 4  | age     | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5  | sex     | dts1  | dts2  | dts3  |
+----+---------+-------+-------+-------+
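
To make the question reproducible, the two tables above could be built in pandas roughly as follows. This is only a sketch: the names df1 and df2 match the ones used in the answer further down, and treating the NAN cells as missing values (np.nan) is an assumption about how the data is actually stored.

```python
import numpy as np
import pandas as pd

# First table; NAN cells are assumed to be real missing values
df1 = pd.DataFrame({
    "a_name_0": ["country", "region", "region", "region", "region"],
    "id_qname_0": [1, 2, 2, 3, 3],
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "id_qname_1": [np.nan, np.nan, np.nan, 4, 5],
    "a_name_2": [np.nan, np.nan, np.nan, "sex", "sex"],
    "id_qname_2": [np.nan, np.nan, np.nan, 6, 7],
    "count": [100, 20, 80, 40, 60],
})

# Second (lookup) table, one row per a_name value
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "a_name": ["country", "region", "city", "age", "sex"],
    "c01": ["dtr1", "dtc1", "dta1", "dtCo1", "dts1"],
    "c02": ["dtr2", "dtc2", "dta2", "dtCo2", "dts2"],
    "c03": ["dtr3", "dtc3", "dta3", "dtCo3", "dts3"],
})
```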


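I want to add the columns c01, c02 and c03 for every value (country, region, city, age, sex) that appears in the a_name_0, a_name_1 and a_name_2 columns of the first table.

Obviously, I need to add three new columns for each of the a_name_0, a_name_1 and a_name_2 columns, otherwise my table would end up with a different number of rows. The remaining row values should be empty, NA or NaN, whichever.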
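
The row-count concern can be checked on a single column first. Below is a minimal sketch, assuming two small stand-in frames named left and lookup (hypothetical names holding only part of the data above): a left join keeps every row of the left table as long as a_name is unique in the lookup table, and rows without a match get NaN.

```python
import numpy as np
import pandas as pd

# Stand-in for one a_name_* column of the first table (hypothetical subset)
left = pd.DataFrame({
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "count": [100, 20, 80, 40, 60],
})
# Stand-in for the lookup table (hypothetical subset)
lookup = pd.DataFrame({
    "a_name": ["city", "age"],
    "c01": ["dta1", "dtCo1"],
})

# Suffix the lookup columns so the key becomes a_name_1 and c01 becomes c01_1
right = lookup.add_suffix("_1")

# validate="m:1" raises an error if a_name_1 is not unique on the right side,
# which is the situation that would multiply rows
joined = left.merge(right, on="a_name_1", how="left", validate="m:1")

print(len(joined) == len(left))   # row count is preserved
print(joined["c01_1"].isna().tolist())
```

Repeating this once per a_name_* column, each time with its own suffix, gives three new columns per join without changing the number of rows.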

Expected output:

+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c02_0 | c03_0 | id_qname_0 | a_name_1 | c01_1 | c02_1 | c03_1 | id_qname_1 | a_name_2 | c01_2 | c02_2 | c03_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country  | dtCo1 | dtCo2 | dtCo3 | 1          | NAN      | NAN   | NAN   | NAN   | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 70    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   | dtr1  | dtr2  | dtr2  | 2          | city     | dtc1  | dtc2  | dtc3  | NAN        | NAN      | NAN   | NAN   | NAN   | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 2          | city     |       |       |       | NAN        | NAN      |       |       |       | NAN        | 20    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 4          | sex      |       |       |       | 6          | 40    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region   |       |       |       | 3          | age      |       |       |       | 5          | sex      |       |       |       | 7          | 60    |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+


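Explanation:

I am building a data warehouse table that will be used for data analysis. The offer table (the first table) should be populated with the information (table 2) about the various item offers that needs to be represented visually.

Best answer

Use: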

import numpy as np
import pandas as pd

# move count to the index so that only the a_name_*/id_qname_* columns remain
df1 = df1.set_index('count')
# group label per column: the suffix after the last underscore ('0', '1', '2')
c = df1.columns.str.rsplit('_', n=1).str[-1]
# the id column of df2 is not needed for the join
df2 = df2.drop('id', axis=1)

# collect one joined block per suffix
dfs = []
for i in c.unique():
    # columns of this group: a_name_i and id_qname_i
    x = df1.loc[:, c == i]
    # rename from the original df2 every time, so the suffixes do not
    # accumulate (c01_0_1, c01_0_1_2, ...) and column names stay unique
    right = df2.rename(columns=lambda col: f'{col}_{i}')
    # left join on a_name_i; the result gets a fresh 0..n-1 index
    x = x.merge(right, on=f'a_name_{i}', how='left')
    # keep the joined values only on the first occurrence of each a_name value
    m = x.duplicated(subset=[f'a_name_{i}'])
    x.loc[m, right.columns.drop(f'a_name_{i}')] = np.nan
    # move id_qname_i to the end of the block
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    dfs.append(x)

# put the blocks side by side and restore count from the index
df = pd.concat(dfs, axis=1).assign(count=df1.index)

print(df)
  a_name_0 c01_0 c02_0 c03_0  id_qname_0 a_name_1  c01_1  c02_1  c03_1  \
0  country  dtr1  dtr2  dtr3           1      NaN    NaN    NaN    NaN
1   region  dtc1  dtc2  dtc3           2     city   dta1   dta2   dta3
2   region   NaN   NaN   NaN           2     city    NaN    NaN    NaN
3   region   NaN   NaN   NaN           3      age  dtCo1  dtCo2  dtCo3
4   region   NaN   NaN   NaN           3      age    NaN    NaN    NaN

   id_qname_1 a_name_2 c01_2 c02_2 c03_2  id_qname_2  count
0         NaN      NaN   NaN   NaN   NaN         NaN    100
1         NaN      NaN   NaN   NaN   NaN         NaN     20
2         NaN      NaN   NaN   NaN   NaN         NaN     80
3         4.0      sex  dts1  dts2  dts3         6.0     40
4         5.0      sex   NaN   NaN   NaN         7.0     60

Regarding "python - Python Pandas, merging multiple columns from one table with a single column from another table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59444318/
