Problem Description
I'd like to compare the differences between mentions of the same word in different sentences, for example "travel". What I would like to do is:
- Take the sentences mentioning the term "travel" as plain text;
- In each sentence, replace 'travel' with travel_sent_x (a rough sketch of this step is shown after the list).
- Train a word2vec model on these sentences.
- Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel".

So each sentence's "travel" gets its own vector, which is used for comparison.
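For illustration, the replacement step might look something like this rough sketch (the sentences list below is just a stand-in for the actual data):

import re

# Hypothetical input: the plain-text sentences mentioning "travel"
sentences = [
    "Hawaii makes a move to boost domestic travel and support local tourism",
    "Honolulu makes a move to boost travel and support local tourism",
]

# Give each mention of "travel" a sentence-specific token such as travel_sent_1
relabelled = [
    re.sub(r"\btravel\b", f"travel_sent_{i + 1}", s.lower()).split()
    for i, s in enumerate(sentences)
]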
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets of billions of words, but I do not have anywhere near that number in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
The way I am building the vectors is:
from gensim.models import Word2Vec

# Tokenise each sentence into a list of words before training
vocab = df['Sentences'].str.split().tolist()
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
# Look up the learned vector for each in-vocabulary word of each sentence
df['Vectors'] = df['Sentences'].apply(
    lambda s: [model.wv[w] for w in s.split() if w in model.wv])
However, I do not know how to visualise the results to see their similarity and get some useful insight. Any help and advice will be welcome.
Update: I would use the Principal Component Analysis algorithm to visualise the embeddings in 3-dimensional space. I know how to do that for each individual word, but I do not know how to do it for sentences.
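As a rough sketch, the word-level PCA projection might look like this, assuming the trained gensim 3.x model from above (sentence vectors, once built, could be projected the same way):

import numpy as np
from sklearn.decomposition import PCA

words = list(model.wv.vocab)                      # vocabulary of the trained model (gensim 3.x)
vectors = np.array([model.wv[w] for w in words])

pca = PCA(n_components=3)
coords = pca.fit_transform(vectors)               # one 3-D point per word, ready to plot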
Recommended Answer
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does ok on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
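A rough sketch of that averaging approach, assuming the trained model from the question and a hypothetical sent_tokens list of tokenised sentences:

import numpy as np

def sentence_vector(tokens, wv):
    # Average the vectors of the tokens that made it into the vocabulary
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sent_vecs = [sentence_vector(toks, model.wv) for toks in sent_tokens]
similarity = cosine(sent_vecs[0], sent_vecs[1])   # pairwise similarity between two sentences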
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
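A minimal Doc2Vec sketch, assuming docs is a hypothetical list of tokenised sentences:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=40, workers=4)

vec_first = d2v.docvecs[0]        # paragraph vector of the first sentence (d2v.dv[0] in gensim 4+)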
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
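A rough sketch of that mitigation, where corpus and mangled are hypothetical lists of tokenised sentences (the originals and the travel_sent_x-relabelled copies, respectively):

import random

augmented = list(corpus)              # keep the originals so a unified 'travel' vector still exists
for toks in mangled:
    augmented.extend([toks] * 5)      # repeat each relabelled sentence a few times
random.shuffle(augmented)             # spread the synthetic contexts throughout the corpus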
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example, if you have 100 examples of the word travel, create 100 vectors that are each a summary of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context-vector approach to influence/choose among alternate word-senses.)
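A sketch of that context-vector idea, assuming a trained model and a hypothetical corpus of tokenised sentences:

import numpy as np

def context_vector(tokens, idx, wv, n=5):
    # Average the vectors of up to n in-vocabulary words on each side of position idx
    window = tokens[max(0, idx - n):idx] + tokens[idx + 1:idx + 1 + n]
    vecs = [wv[w] for w in window if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# One summary vector per occurrence of "travel", ready for clustering or projection
travel_contexts = [
    context_vector(toks, i, model.wv)
    for toks in corpus
    for i, w in enumerate(toks)
    if w == "travel"
]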
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVectors classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover's Distance
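For instance, a minimal Word Mover's Distance comparison of two of the headlines above, assuming a model whose vocabulary covers these words (gensim 3.x also needs the pyemd package installed):

doc_a = "hawaii makes a move to boost domestic travel and support local tourism".split()
doc_b = "honolulu makes a move to boost travel and support local tourism".split()

distance = model.wv.wmdistance(doc_a, doc_b)   # smaller distance means more-similar texts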