Is Word2Vec only for words in a sentence, or also for features?

Problem description

I would like to ask more about Word2Vec:

I am currently trying to build a program that checks the embedding vectors for a sentence. At the same time, I am also building a feature extractor using scikit-learn to extract lemma 0, lemma 1, and lemma 2 from the sentence.

As far as I understand:

1) Feature extraction: lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded to each character (this can be achieved by using gensim word2vec, which I have tried)

More explanation:

Sentence = "I have a pen". Word = a token of the sentence, for example, "have".

1) Feature extraction

"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen 等等..然后当尝试使用 one_hot 提取特征,然后将产生:

"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:

[[0,0,1],
[1,0,0],
[0,1,0]]
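
For illustration, a minimal sketch of what I mean by this one_hot step, assuming scikit-learn's LabelBinarizer (the column order is alphabetical over the vocabulary, so it differs slightly from the matrix above):

from sklearn.preprocessing import LabelBinarizer

tokens = ['i', 'have', 'a']   # lemma features drawn from the sentence
lb = LabelBinarizer()
print(lb.fit_transform(tokens))
# [[0 0 1]    <- 'i'   (columns are sorted: 'a', 'have', 'i')
#  [0 1 0]    <- 'have'
#  [1 0 0]]   <- 'a'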

2) Word embedding (Word2Vec)

"I have a pen" ---> "I", "have", "a", "pen"(tokenized) 然后来自 gensim 的 word2vec 将生成矩阵,例如,如果使用 window_size = 2 生成:

"I have a pen" ---> "I", "have", "a", "pen"(tokenized) then word2vec from gensim will produced matrices for example if using window_size = 2 produced:

[[0.31235, 0.31345],
 [0.31235, 0.31345],
 [0.31235, 0.31345],
 [0.31235, 0.31345],
 [0.31235, 0.31345]]

The floats and integers are for explanation purposes only; the real values will differ from sentence to sentence. These are just dummy data for illustration.

Questions:

1) Is my understanding of Word2Vec correct? If yes, what is the difference between feature extraction and word2vec?
2) I am curious whether I can use word2vec to get the feature-extraction embedding too, since from my understanding, word2vec only finds an embedding for each word, not for the features.

I hope someone can help me.

Answer

It's not completely clear what you're asking, as you seem to have many concepts mixed up together. (Word2Vec gives vectors per word, not per character; word embeddings are a kind of feature extraction on words, rather than an alternative to 'feature extraction'; etc. So: I doubt your understanding is yet correct.)

特征提取"是一个非常笼统的术语,意思是获取原始数据(例如句子)并创建适合其他类型计算或下游机器学习的数字表示的任何和所有方式.

"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.

One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...

['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']

...then you have 7 unique case-flattened words...

['a', 'pen', 'will', 'need', 'ink', 'i', 'have']

...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:

 [1, 1, 1, 1, 1, 0, 0]  # A pen will need ink
 [1, 1, 0, 0, 0, 1, 1]  # I have a pen

Even with this simple encoding, you can now compare sentences mathematically: a Euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', while those with many shared words will have a small 'distance'.
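
As a minimal sketch of this encoding-and-comparison step, assuming a recent scikit-learn (the custom token_pattern keeps single-letter words like 'a' and 'i', which the default pattern would drop):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

sentences = ["A pen will need ink", "I have a pen"]

# binary=True gives a 1-or-0 per word instead of raw counts
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the 7 case-flattened words
print(X.toarray())                         # one 7-dimensional vector per sentence
print(cosine_distances(X)[0, 1])           # a summary distance between the two sentences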

Other very similar feature encodings of these sentences might involve counts of each word (if a word appeared more than once, a number higher than 1 could appear), or weighted counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, and thus values scaled anywhere from 0.0 to above 1.0).
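
A sketch of the TF/IDF variant, again assuming scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["A pen will need ink", "I have a pen"]

tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = tfidf.fit_transform(sentences)
print(X.toarray())  # words shared by both sentences ('a', 'pen') get lower weights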

Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector. That then isn't comparable to any other sentence. They all need to be converted to the same-dimensional-size vector, and in "one hot" (or other simple "bag of words") encodings, that vector is of dimensionality equal to the total vocabulary known among all sentences.

Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:

[0, 1, 0, 0, 0, 0, 0]  # 'pen'

If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:

[0.236, -0.711]  # 'pen'

All 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):

[-0.101, 0.271]   # 'a'
[0.236, -0.711]   # 'pen'
[0.302, 0.293]    # 'will'
[0.672, -0.026]   # 'need'
[-0.198, -0.203]  # 'ink'
[0.734, -0.345]   # 'i'
[0.288, -0.549]   # 'have'
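
For reference, a minimal sketch of training such a 2-dimensional model, assuming gensim 4.x (where the dimensionality parameter is vector_size); two sentences is far too little data for meaningful vectors, so the resulting numbers will be essentially arbitrary:

from gensim.models import Word2Vec

corpus = [['a', 'pen', 'will', 'need', 'ink'],
          ['i', 'have', 'a', 'pen']]

model = Word2Vec(corpus, vector_size=2, window=2, min_count=1, epochs=50)
print(model.wv['pen'])  # a 2-dimensional dense vector for 'pen'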

If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:

[1, 1, 0, 0, 0, 1, 1]  # I have a pen

...you'd get a single 2-dimensional dense vector like:

[ 0.28925, -0.3335 ]  # I have a pen
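
That average is easy to verify with the made-up vectors above (a sketch using plain numpy; real code would pull the vectors from a trained model's model.wv instead):

import numpy as np

# the made-up 2-dimensional word vectors from above
wv = {'i': [0.734, -0.345], 'have': [0.288, -0.549],
      'a': [-0.101, 0.271], 'pen': [0.236, -0.711]}

sentence = ['i', 'have', 'a', 'pen']
sentence_vector = np.mean([wv[w] for w in sentence], axis=0)
print(sentence_vector)  # [ 0.28925 -0.3335 ]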

And again, different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.

So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".

Which works best for your needs will depend on your data and ultimate goals. Often the simplest techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.

