Python Pandas: How to convert a list of pair mappings to a row-vector format?


I have a two-column DataFrame: the first column is the customer and the second is a city that customer has visited. The DataFrame looks like this:


    customer    visited_city
0   John        London
1   Mary        Melbourne
2   Steve       Paris
3   John        New_York
4   Peter       New_York
5   Mary        London
6   John        Melbourne
7   John        New_York


I would like to convert the above DataFrame into a row-vector (wide) format, in which each row represents a unique customer and the row vector indicates which cities that customer has visited.


          London  Melbourne  New_York  Paris
John      1.0        1.0       1.0      0.0
Mary      1.0        1.0       0.0      0.0
Steve     0.0        0.0       0.0      1.0
Peter     0.0        0.0       1.0      0.0


Below is the code I used to generate the wide format. It iterates through the customers one by one. Is there a more efficient way to do this?

import pandas as pd
import numpy as np

# Sorted array of distinct cities (np.searchsorted needs sorted input) and list of distinct customers.
unique_cities = np.sort(df['visited_city'].unique())
unique_customers = df['customer'].unique().tolist()

X = []
for customer in unique_customers:
    x = np.zeros(len(unique_cities))  # one slot per city
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    visited_idx = np.searchsorted(unique_cities, city_visited)
    x[visited_idx] = 1
    X.append(x)  # collect this customer's row vector
wide_format_df = pd.DataFrame(np.array(X), columns=unique_cities, index=unique_customers)


Please note that your question has been edited, so the answers provided earlier no longer answer it. They must be adjusted to return only 1 for John in New_York, even though he has been there twice.
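To make the repeat-visit issue concrete, here is a minimal sketch that rebuilds the sample data from the question and shows why a plain count is not enough. The variable names `counts` and `flags` are only illustrative; `pd.crosstab` is used here as one example of a counting approach.

```python
import pandas as pd

# Sample data from the question; John visits New_York twice (rows 3 and 7).
df = pd.DataFrame({
    'customer': ['John', 'Mary', 'Steve', 'John', 'Peter', 'Mary', 'John', 'John'],
    'visited_city': ['London', 'Melbourne', 'Paris', 'New_York',
                     'New_York', 'London', 'Melbourne', 'New_York'],
})

# A plain cross-tabulation counts visits, so the repeated pair is counted twice.
counts = pd.crosstab(df['customer'], df['visited_city'])
print(counts.loc['John', 'New_York'])

# Capping every cell at 1 turns visit counts into visited / not-visited flags.
flags = counts.clip(upper=1)
print(flags.loc['John', 'New_York'])
```

Any approach that counts rows needs a final step like this clip (or a de-duplication up front) to produce the 0/1 table asked for.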

Option 1 pir1
I like this answer because I think it's elegant.

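# dtype=int keeps the indicators numeric; recent pandas versions default to bool dummies.
pd.get_dummies(df.customer, dtype=int).T.dot(pd.get_dummies(df.visited_city, dtype=int)).clip(0, 1)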

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
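For readers wondering why the matrix product works, the one-liner can be unpacked into its intermediate steps. This is only a sketch of the same idea; the names `cust`, `city`, `counts` and `wide` are introduced here for illustration, and `dtype=int` is passed on the assumption that a recent pandas version would otherwise return boolean dummies.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['John', 'Mary', 'Steve', 'John', 'Peter', 'Mary', 'John', 'John'],
    'visited_city': ['London', 'Melbourne', 'Paris', 'New_York',
                     'New_York', 'London', 'Melbourne', 'New_York'],
})

# One indicator column per customer and per city, one row per original record.
cust = pd.get_dummies(df['customer'], dtype=int)      # 8 rows x 4 customers
city = pd.get_dummies(df['visited_city'], dtype=int)  # 8 rows x 4 cities

# (customers x rows) dot (rows x cities): each cell counts the records
# in which that customer and that city appear together.
counts = cust.T.dot(city)

# Collapse repeat visits so every cell is 0 or 1.
wide = counts.clip(0, 1)
print(wide)
```

The product sums over the original rows, which is why a repeated pair gives a count above 1 before the clip.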

Option 2 pir2
This answer should be fast.

i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1

pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
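It may help to look at what `pd.factorize` actually hands back, since the whole option rests on it. The sketch below prints the integer codes and the unique labels for the sample data, then repeats the indexed assignment; it is an illustration of the mechanism rather than a replacement for the answer's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer': ['John', 'Mary', 'Steve', 'John', 'Peter', 'Mary', 'John', 'John'],
    'visited_city': ['London', 'Melbourne', 'Paris', 'New_York',
                     'New_York', 'London', 'Melbourne', 'New_York'],
})

# factorize returns (integer codes, unique labels in order of first appearance).
i, r = pd.factorize(df['customer'].values)
j, c = pd.factorize(df['visited_city'].values)
print(i, r)
print(j, c)

# Each (i[k], j[k]) pair addresses one cell. Assigning 1 to a cell that was
# already set leaves it at 1, so repeat visits need no extra handling.
b = np.zeros((r.size, c.size), dtype=int)
b[i, j] = 1
print(b)
```

Because the labels come out in order of first appearance rather than sorted, the two `sort_index` calls in the answer are what bring the rows and columns into alphabetical order.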

Option 3 pir3
Practical and pretty quick.

df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

visited_city  London  Melbourne  New_York  Paris
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1


Timing (code below)

# Multiples of Minimum time
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856

# Benchmark code used to produce the table above
from timeit import timeit

wen = lambda d: d.pivot_table(index='customer', columns='visited_city', aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
# dtype=int keeps the dummies numeric on recent pandas versions, which default to bool.
pir1 = lambda d: pd.get_dummies(d.customer, dtype=int).T.dot(pd.get_dummies(d.visited_city, dtype=int)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    n, m = r.size, c.size
    b = np.zeros((n, m), dtype=int)
    b[i, j] = 1

    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

# Rows: how many times the sample frame is repeated; columns: the functions being timed.
results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float,
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

# Divide each row by its fastest time so the quickest method reads 1.0.
print((lambda r: r.div(r.min(axis=1), axis=0))(results))
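Before comparing speed, it is worth confirming that the three options return the same table on the sample data. The following is a minimal sketch of such a check: `same_table` is a hypothetical helper written for this example, and the keyword forms `dtype=int` and `axis=1` are used on the assumption of a recent pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer': ['John', 'Mary', 'Steve', 'John', 'Peter', 'Mary', 'John', 'John'],
    'visited_city': ['London', 'Melbourne', 'Paris', 'New_York',
                     'New_York', 'London', 'Melbourne', 'New_York'],
})

def pir1(d):
    return (pd.get_dummies(d['customer'], dtype=int).T
            .dot(pd.get_dummies(d['visited_city'], dtype=int))
            .clip(0, 1))

def pir2(d):
    i, r = pd.factorize(d['customer'].values)
    j, c = pd.factorize(d['visited_city'].values)
    b = np.zeros((r.size, c.size), dtype=int)
    b[i, j] = 1
    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

def pir3(d):
    return d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

def same_table(a, b):
    """Hypothetical helper: compare labels and values, ignoring axis names and dtypes."""
    return (list(a.index) == list(b.index)
            and list(a.columns) == list(b.columns)
            and np.array_equal(a.to_numpy(), b.to_numpy()))

reference = pir1(df)
print(same_table(reference, pir2(df)), same_table(reference, pir3(df)))
```

The comparison deliberately ignores the `customer` and `visited_city` axis names that only the groupby version carries, since those do not affect the 0/1 content.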

