How to one-hot encode a column whose values are lists of strings?


Problem description

I'm basically trying to one-hot encode a column with values like this:

  tickers
1 [DIS]
2 [AAPL,AMZN,BABA,BAY]
3 [MCDO,PEP]
4 [ABT,ADBE,AMGN,CVS]
5 [ABT,CVS,DIS,ECL,EMR,FAST,GE,GOOGL]
...

First I built the set of all tickers (about 467 of them):

all_tickers = list()
for tickers in df.tickers:
    for ticker in tickers:
        all_tickers.append(ticker)
all_tickers = set(all_tickers)

Then I implemented one-hot encoding this way:

for i in range(len(df.index)):
    for ticker in all_tickers:
        if ticker in df.iloc[i]['tickers']:
            df.at[i+1, ticker] = 1
        else:
            df.at[i+1, ticker] = 0

The problem is that the script runs incredibly slowly when processing 5000+ rows. How can I improve my algorithm?

Recommended answer

I think you need str.join with str.get_dummies:

df = df['tickers'].str.join('|').str.get_dummies()
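As a minimal, self-contained sketch of this approach: the sample frame below is made up from the first rows shown in the question, and the names dummies and encoded are only illustrative. str.join('|') turns each list into one delimited string, and str.get_dummies() splits on '|' by default to build one 0/1 column per ticker. This assumes no ticker contains the '|' character.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's first three rows
df = pd.DataFrame(
    {"tickers": [["DIS"], ["AAPL", "AMZN", "BABA", "BAY"], ["MCDO", "PEP"]]},
    index=[1, 2, 3],
)

# Join each list into a single '|'-separated string, then expand into
# indicator columns (get_dummies splits on '|' by default)
dummies = df["tickers"].str.join("|").str.get_dummies()

# Attach the indicator columns to the original frame instead of replacing it
encoded = df.join(dummies)
print(encoded)
```

Joining the result back with df.join keeps the original columns, whereas the one-liner above replaces df with the encoded frame.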

Or:

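import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# fit_transform returns a 0/1 matrix; mlb.classes_ holds the sorted ticker labels
df = pd.DataFrame(mlb.fit_transform(df['tickers']), columns=mlb.classes_, index=df.index)
print(df)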
   AAPL  ABT  ADBE  AMGN  AMZN  BABA  BAY  CVS  DIS  ECL  EMR  FAST  GE  \
1     0    0     0     0     0     0    0    0    1    0    0     0   0
2     1    0     0     0     1     1    1    0    0    0    0     0   0
3     0    0     0     0     0     0    0    0    0    0    0     0   0
4     0    1     1     1     0     0    0    1    0    0    0     0   0
5     0    1     0     0     0     0    0    1    1    1    1     1   1

   GOOGL  MCDO  PEP
1      0     0    0
2      0     0    0
3      0     1    1
4      0     0    0
5      1     0    0
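The MultiLabelBinarizer variant can be tried end to end with a small frame. This is only a sketch: the sample data is made up from the question's first rows, and matrix, dummies and encoded are illustrative names, not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data modelled on the question's first three rows
df = pd.DataFrame(
    {"tickers": [["DIS"], ["AAPL", "AMZN", "BABA", "BAY"], ["MCDO", "PEP"]]},
    index=[1, 2, 3],
)

mlb = MultiLabelBinarizer()

# One row per input list, one column per distinct ticker (0/1 values)
matrix = mlb.fit_transform(df["tickers"])

# Reuse the original index so the rows line up with df
dummies = pd.DataFrame(matrix, columns=mlb.classes_, index=df.index)

# Keep the original 'tickers' column next to the encoded columns
encoded = df.join(dummies)
print(encoded)
```

Both approaches avoid the per-cell df.at assignments from the question, which is the main reason they scale better to thousands of rows.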
