I am trying to parse an Excel file that has many sheets. Each sheet has a single column, with data like the following (3 sheets = 3 columns):

ReceivedEmail    OpenedEmail    ClickedURL
[email protected]     [email protected]   [email protected]
[email protected]     [email protected]   [email protected]
[email protected]     [email protected]
[email protected]     [email protected]
[email protected]     [email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]


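What I want is a single table that keeps the first column of the first sheet, i.e. the sheet holding all the ReceivedEmail data (the people we mass-emailed). Each following column should correspond to the first column of a subsequent sheet, but instead of repeating the email addresses I want to use a list comprehension to check whether each ReceivedEmail address also appears in OpenedEmail, giving 1 if it does and 0 otherwise.

This is what I have done so far: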

import pandas as pd
xl = pd.ExcelFile(path_to_file)
xl.sheet_names
['ReceivedEmail', 'OpenedEmail', 'ClickedURL']
df = xl.parse(sheet_name=xl.sheet_names[0], header=None)
df.rename(columns={df.columns[0]:xl.sheet_names[0]}, inplace=True);
df.columns[0]
['ReceivedEmail']
# then I created a buffer dataframe to check next columns
df_buffer = xl.parse(sheet_name=xl.sheet_names[1], header=None)
df_buffer.rename(columns={df_buffer.columns[0]:xl.sheet_names[1]}, inplace=True);


But then, when I run the list comprehension like this:

df[df_buffer.columns[0]] = [1 if x in df[df.columns[0]] else 0 for x in df_buffer[df_buffer.columns[0]]]


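I get an error:

ValueError: Length of values does not match length of index

Any clues on how to fix this error, or a smarter way to approach the problem? I am doing it manually first to see whether it works, and then I can turn it into a loop later, but I am stuck on this error.

The final result should be: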

ReceivedEmail    OpenedEmail    ClickedURL
[email protected]         1              1
[email protected]         0              0
[email protected]         0              0
[email protected]         1              0
[email protected]         0              0
[email protected]         0              0
[email protected]         1              0
[email protected]         1              1
[email protected]         0              0
[email protected]         1              0

Best answer

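You can use read_excel with the parameter sheet_name=None (spelled sheetname in older pandas versions) to return all sheets as an ordered dictionary of DataFrames:

Note:

Each sheet has one column.

dfs = pd.read_excel('file.xlsx', sheet_name=None)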
print (dfs)
OrderedDict([('ReceivedEmail',               a
0  [email protected]
1  [email protected]
2  [email protected]
3  [email protected]
4  [email protected]
5  [email protected]
6  [email protected]
7  [email protected]
8  [email protected]
9  [email protected]), ('OpenedEmail',               a
0  [email protected]
1  [email protected]
2  [email protected]
3  [email protected]
4  [email protected]), ('ClickedURL',               a
0  [email protected]
1  [email protected])])
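
To try the remaining steps without the original workbook, a dictionary of the same shape can be built in memory. This is only a sketch: the addresses are hypothetical placeholders (the real ones are not shown in the post), and the single column is named a as in the printed output above.

```python
import pandas as pd

# Hypothetical placeholder addresses; the real ones are not shown in the post.
received = [f"user{i}@example.com" for i in range(10)]
opened = [received[i] for i in (0, 3, 6, 7, 9)]
clicked = [received[i] for i in (0, 7)]

# Same shape as the read_excel result: sheet name -> one-column DataFrame.
dfs = {
    "ReceivedEmail": pd.DataFrame({"a": received}),
    "OpenedEmail": pd.DataFrame({"a": opened}),
    "ClickedURL": pd.DataFrame({"a": clicked}),
}

# Show the sheet names and the (rows, columns) shape of each frame.
for name, frame in dfs.items():
    print(name, frame.shape)
```

The positions chosen for opened and clicked are picked so that the expected 1/0 pattern from the question comes out when the steps below are applied.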


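Then concatenate them together and restore the column order by subsetting with [], check membership with isin for each column from the second one onward, and finally convert the boolean mask to integers.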

cols = list(dfs.keys())
df = pd.concat({k: v.iloc[:, 0] for k, v in dfs.items()}, axis=1)[cols]

df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.iloc[:, 0].isin(x)).astype(int)
print (df)
  ReceivedEmail OpenedEmail ClickedURL
0  [email protected]           1          1
1  [email protected]           0          0
2  [email protected]           0          0
3  [email protected]           1          0
4  [email protected]           0          0
5  [email protected]           0          0
6  [email protected]           1          0
7  [email protected]           1          1
8  [email protected]           0          0
9  [email protected]           1          0
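
As for the original ValueError, it most likely comes from the comprehension iterating over the shorter column (df_buffer, 5 values) while the result is assigned to df, which has 10 rows. A second pitfall is that `x in some_series` tests the index labels of the Series, not its values. The sketch below shows a list-comprehension variant that avoids both problems; the addresses and the `sheets` dictionary are hypothetical stand-ins for the three sheets.

```python
import pandas as pd

# Hypothetical placeholder data standing in for the three sheets.
sheets = {
    "ReceivedEmail": ["u0@example.com", "u1@example.com",
                      "u2@example.com", "u3@example.com"],
    "OpenedEmail": ["u0@example.com", "u3@example.com"],
    "ClickedURL": ["u3@example.com"],
}

names = list(sheets)
df = pd.DataFrame({names[0]: sheets[names[0]]})

# Pitfall: `in` on a Series looks at the index labels, not the values.
label_hit = 0 in df[names[0]]                 # 0 is an index label
value_hit = "u0@example.com" in df[names[0]]  # the address is a value, not a label

# Iterate over the long column so each result has exactly len(df) entries,
# and test membership against a set of the shorter column's values.
for name in names[1:]:
    seen = set(sheets[name])
    df[name] = [1 if x in seen else 0 for x in df[names[0]]]

print(df)
```

Using a set for the lookup keeps each membership test cheap; isin, as in the answer above, does the same job in vectorized form.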

This is based on a similar question found on Stack Overflow, "python - Parsing from Excel multi-sheet file: list comprehension between columns": https://stackoverflow.com/questions/53664902/
