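I am trying to parse an Excel file that has many worksheets. Each sheet has a single column, with information like the following (3 sheets = 3 columns):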
ReceivedEmail OpenedEmail ClickedURL
[email protected] [email protected] [email protected]
[email protected] [email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
[email protected] [email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
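What I want is a single table that keeps the first column of the first worksheet, i.e. the one holding all the ReceivedEmail data (everyone we sent the bulk email to). The following columns should correspond to the first column of each subsequent sheet, but instead of repeating the email addresses I want to use a list comprehension to check whether the OpenedEmail address exists in ReceivedEmail, giving 1 if it does and 0 otherwise.
This is what I have done so far: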
import pandas as pd
xl = pd.ExcelFile(path_to_file)
xl.sheet_names
['ReceivedEmail', 'OpenedEmail', 'ClickedURL']
df = xl.parse(sheet_name=xl.sheet_names[0], header=None)
df.rename(columns={df.columns[0]:xl.sheet_names[0]}, inplace=True);
df.columns[0]
['ReceivedEmail']
# then I created a buffer dataframe to check next columns
df_buffer = xl.parse(sheet_name=xl.sheet_names[1], header=None)
df_buffer.rename(columns={df_buffer.columns[0]:xl.sheet_names[1]}, inplace=True);
But then when I run a list comprehension like this:
df[df_buffer.columns[0]] = [1 if x in df[df.columns[0]] else 0 for x in df_buffer[df_buffer.columns[0]]]
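I get an error:
ValueError: Length of values does not match length of index
Any clue how to fix this error, or a smarter way to approach the problem? I am doing it manually to see whether it works, and I can turn it into a loop later, but I am stuck on this error.
The end result should be: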
ReceivedEmail OpenedEmail ClickedURL
[email protected] 1 1
[email protected] 0 0
[email protected] 0 0
[email protected] 1 0
[email protected] 0 0
[email protected] 0 0
[email protected] 1 0
[email protected] 1 1
[email protected] 0 0
[email protected] 1 0
Best answer
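You can use read_excel with the parameter sheet_name=None (spelled sheetname in pandas versions before 0.21) to return all sheets as an ordered dictionary of DataFrames:
Note: each sheet has one column.
dfs = pd.read_excel('file.xlsx', sheet_name=None)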
print (dfs)
OrderedDict([('ReceivedEmail', a
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
5 [email protected]
6 [email protected]
7 [email protected]
8 [email protected]
9 [email protected]), ('OpenedEmail', a
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]), ('ClickedURL', a
0 [email protected]
1 [email protected])])
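To experiment without a workbook, a stand-in for that mapping can be built by hand, since the later steps only need sheet names mapped to one-column DataFrames. This is a minimal sketch: the addresses are hypothetical placeholders (the real ones are redacted), and the column name a simply mirrors the printed output above.

```python
import pandas as pd

# Hypothetical stand-in for the result of read_excel(..., sheet_name=None):
# a mapping of sheet name -> one-column DataFrame. Addresses are made up.
dfs = {
    "ReceivedEmail": pd.DataFrame(
        {"a": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]}
    ),
    "OpenedEmail": pd.DataFrame({"a": ["a@example.com", "d@example.com"]}),
    "ClickedURL": pd.DataFrame({"a": ["d@example.com"]}),
}

# The sheets have different lengths, which is why they cannot simply be
# assigned to each other as columns.
row_counts = {name: len(frame) for name, frame in dfs.items()}

print(list(dfs))
print(row_counts)
```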
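Then concatenate them together and restore the sheet order by subsetting with []. For each column from the second onward, check membership with isin, and finally convert the boolean mask to integers.
cols = list(dfs.keys())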
df = pd.concat({k: v.iloc[:, 0] for k, v in dfs.items()}, axis=1)[cols]
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.iloc[:, 0].isin(x)).astype(int)
print (df)
ReceivedEmail OpenedEmail ClickedURL
0 [email protected] 1 1
1 [email protected] 0 0
2 [email protected] 0 0
3 [email protected] 1 0
4 [email protected] 0 0
5 [email protected] 0 0
6 [email protected] 1 0
7 [email protected] 1 1
8 [email protected] 0 0
9 [email protected] 1 0
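As for the ValueError in the question, a likely cause is that the comprehension iterates over OpenedEmail, so it yields one value per opened address (5) while df has 10 rows. In addition, `x in some_series` tests the index labels of a Series rather than its values. The following is a minimal sketch of a corrected comprehension, looped over the remaining sheets. The addresses and the names sheets, received, seen and result are hypothetical stand-ins, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for the parsed sheets; the real addresses are redacted.
sheets = {
    "ReceivedEmail": pd.DataFrame(
        {"a": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]}
    ),
    "OpenedEmail": pd.DataFrame({"a": ["a@example.com", "d@example.com"]}),
    "ClickedURL": pd.DataFrame({"a": ["d@example.com"]}),
}

names = list(sheets)
received = sheets[names[0]].iloc[:, 0]
result = pd.DataFrame({names[0]: received})

for name in names[1:]:
    # Compare against the values: `x in some_series` would test index labels.
    seen = set(sheets[name].iloc[:, 0])
    # Iterate over ReceivedEmail so there is exactly one flag per row of result.
    result[name] = [1 if x in seen else 0 for x in received]

print(result)
```

The same flags can be computed without a comprehension as received.isin(sheets[name].iloc[:, 0]).astype(int), which is the vectorised check the answer relies on.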
Regarding "python - Parsing a multi-sheet Excel file: list comprehension between columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53664902/