问题描述
我有一个2列的DataFrame,第1列对应于客户,第2列对应于该客户访问过的城市. DataFrame如下所示:
I have a 2-column DataFrame, column-1 corresponds to customer, column-2 corresponds to the city this customer has visited. The DataFrame looks like the following:
print(df)
customer visited_city
0 John London
1 Mary Melbourne
2 Steve Paris
3 John New_York
4 Peter New_York
5 Mary London
6 John Melbourne
7 John New_York
我想将上述DataFrame转换为行向量格式,以便每行代表一个唯一的用户,行向量表示访问的城市.
I would like to convert the above DataFrame into a row-vector format, such that each row represents a unique user with the row vector indicating the cities visited.
print(wide_format_df)
London Melbourne New_York Paris
John 1.0 1.0 1.0 0.0
Mary 1.0 1.0 0.0 0.0
Steve 0.0 0.0 0.0 1.0
Peter 0.0 0.0 1.0 0.0
下面是我用来生成宽格式的代码.逐一遍历每个用户.我想知道有没有更有效的方法?
Below is the code I used to generate the wide format. It iterates through each user one by one. I was wondering is there any more efficient way to do so?
import pandas as pd
import numpy as np
UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()
X = []
for customer in unique_customers:
x = np.zeros(p)
city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
x[visited_idx] = 1
X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df
推荐答案
请注意,您的问题已经过编辑,因此提供的答案不再回答您的问题.他们必须进行调整,以使New York
中的John
仅返回1
,尽管他已经来过两次.
Please note that your question has been edited such that the answers provided no longer answer your question. They must adjust to only return 1
for John
in New York
despite the fact he's been there twice.
选项1 pir1
我喜欢这个答案,因为我认为它很优雅.
Option 1 pir1
I like this answer because I think it's elegant.
pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1)
London Melbourne New_York Paris
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
选项2 pir2
这个答案应该很快.
Option 2 pir2
This answer should be fast.
i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1
pd.DataFrame(b, r, c).sort_index().sort_index(1)
London Melbourne New_York Paris
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
选项3 pir3
实用且快捷
Option 3 pir3
Practical and pretty quick
df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)
visited_city London Melbourne New_York Paris
customer
John 1 1 1 0
Mary 1 1 0 0
Peter 0 0 1 0
Steve 0 0 0 1
计时
下面的代码
Timing
Code Below
# Multiples of Minimum time
#
pir1 pir2 pir3 wen vai
10 1.392237 1.0 1.521555 4.337469 5.569029
30 1.445762 1.0 1.821047 5.977978 7.204843
100 1.679956 1.0 1.901502 6.685429 7.296454
300 1.568407 1.0 1.825047 5.556880 7.210672
1000 1.622137 1.0 1.613983 5.815970 5.396008
3000 1.808637 1.0 1.852953 4.159305 4.224724
10000 1.654354 1.0 1.502092 3.145032 2.950560
30000 1.555574 1.0 1.413612 2.404061 2.299856
wen = lambda d: d.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)
def pir2(d):
i, r = pd.factorize(d.customer.values)
j, c = pd.factorize(d.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1
return pd.DataFrame(b, r, c).sort_index().sort_index(1)
results = pd.DataFrame(
index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
columns='pir1 pir2 pir3 wen vai'.split(),
dtype=float
)
for i in results.index:
d = pd.concat([df] * i, ignore_index=True)
for j in results.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
results.at[i, j] = timeit(stmt, setp, number=10)
print((lambda r: r.div(r.min(1), 0))(results))
results.plot(loglog=True)
这篇关于Python Pandas:如何将成对映射列表转换为行向量格式?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!