The DataFrame looks like the following:
print(df)
  customer visited_city
0     John       London
1     Mary    Melbourne
2    Steve        Paris
3     John     New_York
4    Peter     New_York
5     Mary       London
6     John    Melbourne
7     John     New_York
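I want to convert the DataFrame above into a wide, row-vector format, so that each row represents a unique customer and the row vector indicates which cities that customer visited.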
print(wide_format_df)
       London  Melbourne  New_York  Paris
John      1.0        1.0       1.0    0.0
Mary      1.0        1.0       0.0    0.0
Steve     0.0        0.0       0.0    1.0
Peter     0.0        0.0       1.0    0.0
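Below is the code I used to generate the wide format. It iterates over the customers one at a time. Is there a more efficient way to do this?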
import pandas as pd
import numpy as np
# Sorted array of all distinct cities; these become the output columns.
UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()
X = []
for customer in unique_customers:
    # Build one 0/1 row vector per customer.
    x = np.zeros(p)
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    # Positions of the visited cities within the sorted city array.
    visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
    x[visited_idx] = 1
    X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df
Best answer
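Note that your question has been edited, so the answers already provided no longer answer it. They have to be adjusted to return only 1 for John in New York, even though he visited there twice.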
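To make the required adjustment concrete, here is a minimal sketch using the sample data from the question. It assumes a plain `pd.crosstab` as the counting step (the same call the `vai` timing entry uses) and shows that clipping turns visit counts into visited / not-visited flags. The variable names `counts` and `indicators` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# A plain cross-tabulation counts visits, so a repeat visit yields a value above 1.
counts = pd.crosstab(df["customer"], df["visited_city"])

# Clipping to the range [0, 1] keeps zeros and caps every positive count at 1.
indicators = counts.clip(0, 1)

print(counts.loc["John", "New_York"])      # 2, John visited twice
print(indicators.loc["John", "New_York"])  # 1 after clipping
```

This is why each option below ends with `.clip(0, 1)` or assigns 1 directly instead of accumulating.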
Option 1
pir1
I like this answer because I find it elegant.
pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1)
       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
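The one-liner above is a product of two one-hot matrices. The sketch below splits it into steps to show what each factor holds. One assumption that is not in the original answer: `dtype=int` is passed to `pd.get_dummies`, because recent pandas versions return boolean dummies by default and integer dummies keep the product a visit count. The names `cust`, `city`, `visit_counts` and `wide` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# One-hot encode each column: one row per visit, one column per distinct value.
cust = pd.get_dummies(df["customer"], dtype=int)      # shape (8, 4)
city = pd.get_dummies(df["visited_city"], dtype=int)  # shape (8, 4)

# (customers x visits) dot (visits x cities): each cell sums the visits
# where that customer and that city occur on the same row.
visit_counts = cust.T.dot(city)

# Cap counts at 1 to get visited / not-visited flags.
wide = visit_counts.clip(0, 1)
print(wide)
```

The intermediate `visit_counts` holds 2 for John in New_York, which is the value `.clip(0, 1)` reduces to 1.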
Option 2
pir2
i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1
pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)
       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
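As a rough illustration of what Option 2 does internally, the sketch below prints the integer codes that `pd.factorize` produces and the indicator matrix before any sorting. It uses `.to_numpy()` where the answer uses `.values`, which is an assumption made for portability across pandas versions, and the longer variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# factorize maps each value to an integer code in order of first appearance.
row_codes, customers = pd.factorize(df["customer"].to_numpy())
col_codes, cities = pd.factorize(df["visited_city"].to_numpy())
print(row_codes.tolist())  # [0, 1, 2, 0, 3, 1, 0, 0]
print(col_codes.tolist())  # [0, 1, 2, 3, 3, 0, 1, 3]

# Writing 1 at every (customer code, city code) pair is idempotent,
# so John's second New_York visit does not increase the value.
b = np.zeros((len(customers), len(cities)), dtype=int)
b[row_codes, col_codes] = 1
print(b)
```

Rows and columns of `b` follow first-appearance order (John, Mary, Steve, Peter and London, Melbourne, Paris, New_York), which is why the answer sorts both axes afterwards.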
Option 3
pir3
Practical and fast
df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)
visited_city  London  Melbourne  New_York  Paris
customer
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1
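Option 3 chains three steps, and it may help to look at each intermediate result. The sketch below is the same chain split apart on the sample data; the names `pair_counts`, `wide_counts` and `wide` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# Step 1: count rows per (customer, city) pair; the result is a Series
# with a two-level index holding only the pairs that actually occur.
pair_counts = df.groupby(["customer", "visited_city"]).size()

# Step 2: pivot the inner index level (city) into columns,
# filling pairs that never occur with 0.
wide_counts = pair_counts.unstack(fill_value=0)

# Step 3: cap counts at 1 to get visited / not-visited flags.
wide = wide_counts.clip(0, 1)
print(wide)
```

Only the seven observed pairs are counted in step 1, and `fill_value=0` supplies the rest when the table is widened.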
Timing
Code below
# Multiples of Minimum time
#
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856
from timeit import timeit
# Candidate implementations; wen and vai are the other answers being compared.
wen = lambda d: d.pivot_table(index='customer', columns='visited_city', aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)
def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    n, m = r.size, c.size
    b = np.zeros((n, m), dtype=int)
    b[i, j] = 1
    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)
# One row per replication factor, one column per implementation.
results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float
)
for i in results.index:
    # Repeat the sample frame i times to scale the input.
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)
# Express each timing as a multiple of the fastest time in its row.
print(results.div(results.min(axis=1), axis=0))
results.plot(loglog=True)
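Before comparing speed it can be worth confirming that the three options return the same table. The following is a minimal sketch of such a check on the sample data. It is not part of the original answer: the helper `same_table` is hypothetical, `dtype=int` in `pir1` and `.to_numpy()` in `pir2` are assumptions made so the sketch behaves the same on recent pandas versions, and `axis=1` is passed by keyword.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

def pir1(d):
    # Product of one-hot matrices, capped at 1.
    return (pd.get_dummies(d["customer"], dtype=int).T
            .dot(pd.get_dummies(d["visited_city"], dtype=int))
            .clip(0, 1))

def pir2(d):
    # Integer codes index directly into a zero matrix.
    i, r = pd.factorize(d["customer"].to_numpy())
    j, c = pd.factorize(d["visited_city"].to_numpy())
    b = np.zeros((r.size, c.size), dtype=int)
    b[i, j] = 1
    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

def pir3(d):
    # Pair counts pivoted to wide form, capped at 1.
    return d.groupby(["customer", "visited_city"]).size().unstack(fill_value=0).clip(0, 1)

def same_table(a, b):
    # Hypothetical helper: compares labels and values, ignoring axis names and dtypes.
    return (list(a.index) == list(b.index)
            and list(a.columns) == list(b.columns)
            and np.array_equal(a.to_numpy(), b.to_numpy()))

reference = pir1(df)
all_agree = all(same_table(reference, f(df)) for f in (pir2, pir3))
print(all_agree)
```

Comparing labels and values separately sidesteps the axis names that `pir3` carries (`customer`, `visited_city`) and the other two options do not.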