The DataFrame looks like the following:

print(df)

    customer    visited_city
0   John        London
1   Mary        Melbourne
2   Steve       Paris
3   John        New_York
4   Peter       New_York
5   Mary        London
6   John        Melbourne
7   John        New_York
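
For anyone who wants to reproduce this, the frame can be rebuilt roughly as follows. This is only a sketch that retypes the values from the printout above; the original construction code was not part of the question.

```python
import pandas as pd

# Values copied by hand from the printed DataFrame above
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})
print(df)
```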

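I want to convert the DataFrame above into row-vector (wide) format, so that each row represents a unique customer and the row vector indicates which cities that customer visited.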
print(wide_format_df)

          London  Melbourne  New_York  Paris
John      1.0        1.0       1.0      0.0
Mary      1.0        1.0       0.0      0.0
Steve     0.0        0.0       0.0      1.0
Peter     0.0        0.0       1.0      0.0

Below is the code I used to generate the wide format. It loops over the customers one at a time. Is there a more efficient way?
import pandas as pd
import numpy as np

# Sorted array of every distinct city; its length is the vector dimension
UNIQUE_CITIES = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIES)
unique_customers = df['customer'].unique().tolist()

X = []
for customer in unique_customers:
    x = np.zeros(p)
    # Cities this customer visited, mapped to their positions in UNIQUE_CITIES
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    visited_idx = np.searchsorted(UNIQUE_CITIES, city_visited)
    x[visited_idx] = 1
    X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIES, index=unique_customers)
wide_format_df

Best answer

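Note that your question has been edited, so the answers already provided no longer answer it. They need to be adjusted to return only 1 for John in New York, even though he went there twice.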
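
To illustrate the point about duplicates, here is a minimal sketch. The sample frame is retyped from the question, and the names `counts` and `indicator` are only illustrative. A plain count such as `pd.crosstab` reports 2 for John in New_York, and capping the counts at 1 gives the indicator the edited question asks for.

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# Raw visit counts: John appears twice with New_York
counts = pd.crosstab(df["customer"], df["visited_city"])

# Cap every count at 1 to turn counts into visited / not visited
indicator = counts.clip(upper=1)

print(counts.loc["John", "New_York"], indicator.loc["John", "New_York"])
```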
Option 1: pir1
I like this answer because I find it elegant.

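# dtype=int keeps the dummies as 0/1 integers (pandas 2.0+ returns booleans by default)
pd.get_dummies(df.customer, dtype=int).T.dot(pd.get_dummies(df.visited_city, dtype=int)).clip(0, 1)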

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1

Option 2: pir2

i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1

pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
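
In case the indexing trick is not obvious, the sketch below spells out the intermediate values on the question's data. The names `cust_codes`, `city_codes`, and so on are mine, not part of the answer. `pd.factorize` returns integer codes in order of first appearance, and assigning through `b[i, j] = 1` writes the same cell again for a repeated pair instead of adding to it, which is why John's second New_York visit is not counted twice.

```python
import numpy as np
import pandas as pd

# Columns retyped from the question as plain NumPy arrays
customers = np.array(["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"])
cities = np.array(["London", "Melbourne", "Paris", "New_York",
                   "New_York", "London", "Melbourne", "New_York"])

# Integer code per row, plus the unique labels in order of first appearance
cust_codes, cust_labels = pd.factorize(customers)
city_codes, city_labels = pd.factorize(cities)

# One cell per (customer, city) pair; repeated pairs overwrite with the same 1
b = np.zeros((cust_labels.size, city_labels.size), dtype=int)
b[cust_codes, city_codes] = 1

print(cust_codes, list(cust_labels))
print(city_codes, list(city_labels))
print(b)
```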

Option 3: pir3
Practical and fast.
df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

visited_city  London  Melbourne  New_York  Paris
customer
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1
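
As a sanity check, the following sketch runs all three options on the question's data and compares the results. It assumes a reasonably recent pandas, so it passes `dtype=int` to `get_dummies` and uses keyword arguments for `sort_index` and `clip`; the names `opt1`, `opt2`, and `opt3` are just for illustration.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# Option 1: product of the two dummy matrices, capped at 1
opt1 = (
    pd.get_dummies(df["customer"], dtype=int).T
    .dot(pd.get_dummies(df["visited_city"], dtype=int))
    .clip(upper=1)
)

# Option 2: factorize both columns and fill a zero matrix by position
i, r = pd.factorize(df["customer"].to_numpy())
j, c = pd.factorize(df["visited_city"].to_numpy())
b = np.zeros((r.size, c.size), dtype=int)
b[i, j] = 1
opt2 = pd.DataFrame(b, index=r, columns=c).sort_index().sort_index(axis=1)

# Option 3: count pairs with groupby, pivot to wide form, cap at 1
opt3 = (
    df.groupby(["customer", "visited_city"]).size()
    .unstack(fill_value=0)
    .clip(upper=1)
)

same_values = (
    np.array_equal(opt1.to_numpy(), opt2.to_numpy())
    and np.array_equal(opt2.to_numpy(), opt3.to_numpy())
)
same_labels = list(opt1.index) == list(opt2.index) == list(opt3.index)
print(same_values, same_labels)
```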

Timing
The code that produced these numbers follows the table.
# Multiples of Minimum time
#
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856

wen = lambda d: d.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer, dtype=int).T.dot(pd.get_dummies(d.visited_city, dtype=int)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    n, m = r.size, c.size
    b = np.zeros((n, m), dtype=int)
    b[i, j] = 1

    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

from timeit import timeit

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

# Express each timing as a multiple of the fastest method in its row
print(results.div(results.min(axis=1), axis=0))

results.plot(loglog=True)
