How to one-hot encode a column whose values are lists of strings?
Problem description
I'm basically trying to one-hot encode a column with values like this:
tickers
1 [DIS]
2 [AAPL,AMZN,BABA,BAY]
3 [MCDO,PEP]
4 [ABT,ADBE,AMGN,CVS]
5 [ABT,CVS,DIS,ECL,EMR,FAST,GE,GOOGL]
...
First I collected the set of all tickers (about 467 of them):
all_tickers = list()
for tickers in df.tickers:
    for ticker in tickers:
        all_tickers.append(ticker)
all_tickers = set(all_tickers)
Then I implemented one-hot encoding this way:
for i in range(len(df.index)):
    for ticker in all_tickers:
        if ticker in df.iloc[i]['tickers']:
            df.at[i+1, ticker] = 1
        else:
            df.at[i+1, ticker] = 0
The problem is that the script runs incredibly slowly when processing 5000+ rows. How can I improve my algorithm?
Recommended answer
I think you need str.join with str.get_dummies:
df = df['tickers'].str.join('|').str.get_dummies()
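As a rough illustration of what this one-liner does, the sketch below rebuilds the first three sample rows from the question. The frame construction is an assumption about how the data is stored (each cell holding a Python list of strings), and the names `joined` and `encoded` are only illustrative. str.join('|') collapses each list into one '|'-delimited string, and str.get_dummies() splits on '|' by default and builds one indicator column per distinct ticker.

```python
import pandas as pd

# Hypothetical reconstruction of the question's data: each cell is a list of strings.
df = pd.DataFrame(
    {"tickers": [["DIS"], ["AAPL", "AMZN", "BABA", "BAY"], ["MCDO", "PEP"]]},
    index=[1, 2, 3],
)

# Step 1: collapse each list into one '|'-delimited string, e.g. "AAPL|AMZN|BABA|BAY".
joined = df["tickers"].str.join("|")

# Step 2: split on '|' (the default separator) and build 0/1 indicator columns.
encoded = joined.str.get_dummies()

print(encoded)
```

This relies on no ticker symbol containing the '|' character; if one could, a different separator would have to be passed to both calls.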
Or:
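import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix; mlb.classes_ holds the sorted labels used as column names
df = pd.DataFrame(mlb.fit_transform(df['tickers']), columns=mlb.classes_, index=df.index)
print(df)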
AAPL ABT ADBE AMGN AMZN BABA BAY CVS DIS ECL EMR FAST GE \
1 0 0 0 0 0 0 0 0 1 0 0 0 0
2 1 0 0 0 1 1 1 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 1 1 1 0 0 0 1 0 0 0 0 0
5 0 1 0 0 0 0 0 1 1 1 1 1 1
GOOGL MCDO PEP
1 0 0 0
2 0 0 0
3 0 1 1
4 0 0 0
5 1 0 0
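Both snippets in this answer overwrite df with the encoded frame, whereas the loop in the question added the indicator columns to the existing frame. If the original tickers column should be kept, one option is to join the indicator columns back on the index. This is a minimal sketch under the same assumption of a list-valued column; the sample rows and the names `encoded` and `result` are illustrative, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the question's data.
df = pd.DataFrame(
    {"tickers": [["DIS"], ["AAPL", "AMZN", "BABA", "BAY"], ["MCDO", "PEP"]]},
    index=[1, 2, 3],
)

# Build the indicator columns without overwriting df.
encoded = df["tickers"].str.join("|").str.get_dummies()

# Align on the shared index so each row keeps its original list plus the 0/1 columns.
result = df.join(encoded)

print(result)
```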