本文介绍了如何在字符级别对句子进行单热编码?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我想将一个句子转换为一个单热向量数组.这些向量将是字母表的 one-hot 表示.它看起来像下面这样:
I would like to convert a sentence to an array of one-hot vector.These vector would be the one-hot representation of the alphabet.It would look like the following:
"hello" # h=7, e=4 l=11 o=14
会变成
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
不幸的是,来自 sklearn 的 OneHotEncoder 没有作为输入字符串.
Unfortunately OneHotEncoder from sklearn does not take as input string.
推荐答案
只需将传递的字符串中的字母与给定的字母表进行比较:
Just compare the letters in your passed string to a given alphabet:
def string_vectorizer(strng, alphabet=string.ascii_lowercase):
vector = [[0 if char != letter else 1 for char in alphabet]
for letter in strng]
return vector
请注意,使用自定义字母表(例如defbcazk",列将在每个元素出现在原始列表中时进行排序).
Note that, with a custom alphabet (e.g. "defbcazk", the columns will be ordered as each element appears in the original list).
string_vectorizer('hello')
的输出:
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
这篇关于如何在字符级别对句子进行单热编码?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!