



I have a pandas DataFrame in which some columns hold lists of labels, and I would like to encode the labels inside those lists.


import pandas as pd
from sklearn.preprocessing import OneHotEncoder
mins = pd.read_csv('recipes.csv')

enc = OneHotEncoder(handle_unknown='ignore')

X = mins['Ingredients']

[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
[flour, tomatoes, vodka, vodka, mustard]]


I hope to get a column of lists that holds the correctly encoded labels:

[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
[flour, tomatoes, vodka, vodka, mustard]]

[[0, 1, 2, 3, 1]
[0, 1, 4, 3, 1]
[4, 1, 3, 3, 9]]


To label-encode a list of lists in a DataFrame column, first fit the encoder on the unique text labels, then use apply to transform each list into the integer labels the encoder learned. Note that LabelEncoder numbers the labels in sorted order, not in order of first appearance. Here is an example:

In [2]: import pandas as pd

In [3]: from sklearn import preprocessing

In [4]: df = pd.DataFrame({
   ...:     "Day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
   ...:     "Veggies&Drinks": [
   ...:         ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
   ...:         ["flour", "vodka", "mustard", "lettuce", "ginger"],
   ...:         ["mustard", "tomatoes", "ginger", "vodka", "tomatoes"],
   ...:         ["ginger", "vodka", "lettuce", "tomatoes", "flour"],
   ...:         ["mustard", "lettuce", "ginger", "flour", "tomatoes"],
   ...:     ],
   ...: })

In [5]: df
         Day                                Veggies&Drinks
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]

In [9]: label_encoder = preprocessing.LabelEncoder()

In [19]: list_of_veggies_drinks = ["lettuce","tomatoes","ginger","vodka","flour","mustard"]

In [20]: label_encoder.fit(list_of_veggies_drinks)
Out[20]: LabelEncoder()

In [21]: integer_encoded = df["Veggies&Drinks"].apply(lambda x:label_encoder.transform(x))

In [22]: integer_encoded
0    [2, 4, 1, 5, 4]
1    [0, 5, 3, 2, 1]
2    [3, 4, 1, 5, 4]
3    [1, 5, 2, 4, 0]
4    [3, 2, 1, 0, 4]
Name: Veggies&Drinks, dtype: object

In [23]: df["Encoded"] = integer_encoded

In [24]: df
         Day                                Veggies&Drinks          Encoded
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]  [2, 4, 1, 5, 4]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]  [0, 5, 3, 2, 1]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]  [3, 4, 1, 5, 4]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]  [1, 5, 2, 4, 0]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]  [3, 2, 1, 0, 4]
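
The session above types the list of labels by hand. As a rough sketch, the same fit-then-transform steps can take the labels from the column itself by flattening it with explode. The two-row frame here is only a cut-down stand-in for the example data, and the .tolist() call is an optional extra that stores plain Python lists instead of NumPy arrays:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Cut-down stand-in for the example frame (assumption: two rows are enough to illustrate).
df = pd.DataFrame({
    "Day": ["Monday", "Tuesday"],
    "Veggies&Drinks": [
        ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
        ["flour", "vodka", "mustard", "lettuce", "ginger"],
    ],
})

# Flatten the lists into one long Series and fit on every distinct label in it.
label_encoder = LabelEncoder()
label_encoder.fit(df["Veggies&Drinks"].explode().unique())

# transform returns a NumPy array; tolist() turns it back into a plain list.
df["Encoded"] = df["Veggies&Drinks"].apply(
    lambda labels: label_encoder.transform(labels).tolist()
)

# inverse_transform maps the integers back to the original text labels.
decoded = label_encoder.inverse_transform(df["Encoded"].iloc[0]).tolist()
```

Fitting on the flattened column means the encoder always knows every label that actually occurs, so transform should not raise on an unseen label for this data.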

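If the integers should follow the order in which labels first appear, as in the question's expected output (lettuce 0, tomatoes 1, ginger 2, ...), one possible alternative that does not use scikit-learn is pd.factorize. This is only a sketch: the Series below is rebuilt from the question's three sample rows, and the numbering is contiguous, so mustard comes out as 5.

```python
import pandas as pd

# The three sample rows from the question, rebuilt by hand for illustration.
ingredients = pd.Series([
    ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
    ["lettuce", "tomatoes", "flour", "vodka", "tomatoes"],
    ["flour", "tomatoes", "vodka", "vodka", "mustard"],
])

# Flatten, then number each distinct label in order of first appearance.
flat = ingredients.explode()
codes, uniques = pd.factorize(flat)

# Regroup the flat codes by the original row index to rebuild the lists.
encoded = pd.Series(codes, index=flat.index).groupby(level=0).agg(list)
```

Here uniques holds the labels in code order, so uniques[code] recovers the text for any integer. Empty lists would need separate handling, since explode turns them into NaN.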
