A Beginner’s Guide to Word2Vec and Neural Word Embeddings

  • Introduction to Word2Vec

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. Deeplearning4j implements a distributed form of Word2vec for Java and Scala, which works on Spark with GPUs.

Word2vec's applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

Why? Because words, like the other data mentioned above, are simply discrete states, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.

The purpose and usefulness of Word2vec is to group the vectors of similar words together in vectorspace. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word's meaning based on past appearances. Those guesses can be used to establish a word's association with other words (e.g. "man" is to "boy" what "woman" is to "girl"), or to cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in fields as diverse as scientific research, legal discovery, e-commerce and customer relationship management.

The output of the Word2vec neural network is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.

Similarity is measured as the cosine of the angle between two word vectors: no similarity is expressed as a 90-degree angle, while total similarity of 1 is a 0-degree angle, complete overlap; i.e. Sweden equals Sweden, while Norway has a cosine distance of 0.760124 from Sweden, the highest of any other country.
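
To make that measurement concrete, here is a minimal sketch of cosine similarity in Python with NumPy; the two toy vectors are invented for illustration and are far shorter than the vectors a trained model would produce.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors:
    1.0 = identical direction, 0.0 = orthogonal (no similarity)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, purely illustrative -- real Word2vec vectors
# typically have hundreds of dimensions.
sweden = np.array([0.9, 0.3, 0.1])
norway = np.array([0.8, 0.4, 0.2])

print(cosine_similarity(sweden, sweden))  # 1.0: complete overlap
print(cosine_similarity(sweden, norway))  # close to 1.0: very similar
```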

Here is a list of the words most closely associated with "Sweden" using Word2vec, in order of proximity:

[Image: the words closest to "Sweden", ranked by cosine distance]

The nations of Scandinavia and several wealthy, northern European, Germanic countries are among the top nine.

  • Neural Word Embeddings
    The vectors we use to represent words are called neural word embeddings, and representations are strange. One thing describes another, even though those two things are radically different. As Elvis Costello said: “Writing about music is like dancing about architecture.” Word2vec “vectorizes” about words, and by doing so it makes natural language computer-readable – we can start to perform powerful mathematical operations on words to detect their similarities.

So a neural word embedding represents a word with numbers. It’s a simple, yet unlikely, translation.

Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than training against the input words through reconstruction, as a restricted Boltzmann machine does, word2vec trains words against other words that neighbor them in the input corpus.

It does so in one of two ways, either using context to predict a target word (a method known as continuous bag of words, or CBOW), or using a word to predict a target context, which is called skip-gram. We use the latter method because it produces more accurate results on large datasets.
[Image: diagrams of the CBOW and skip-gram model architectures]
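
To make the two training set-ups concrete, the sketch below (plain Python, using an invented toy sentence) shows how the same sliding window yields CBOW examples, where the context predicts the target, and skip-gram examples, where the target predicts each context word.

```python
# Toy illustration of how one sliding window produces training examples
# for CBOW (context -> target) and skip-gram (target -> context).
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # how many words on each side count as context

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    context = [sentence[j] for j in range(max(0, i - window),
                                          min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))                  # many-to-one
    skipgram_pairs.extend((target, c) for c in context)   # one-to-many

print(cbow_pairs[3])       # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```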

When the feature vector assigned to a word cannot be used to accurately predict that word’s context, the components of the vector are adjusted. Each word’s context in the corpus is the teacher sending error signals back to adjust the feature vector. The vectors of words judged similar by their context are nudged closer together by adjusting the numbers in the vector.
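
That nudging can be pictured as a small gradient step. The following is a schematic NumPy sketch of one skip-gram-with-negative-sampling update, not the DL4J implementation; the learning rate and the in-place update style are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One stochastic update: pull the center word's vector toward a word
    seen in its context, and push it away from randomly drawn 'negative'
    words that were not in the context. Float arrays are modified in place."""
    grad_center = np.zeros_like(center)

    # True (center, context) pair: drive its score toward 1.
    g = sigmoid(center @ context) - 1.0
    grad_center += g * context
    context -= lr * g * center

    # Negative samples: drive their scores toward 0.
    for neg in negatives:
        g = sigmoid(center @ neg)
        grad_center += g * neg
        neg -= lr * g * center

    center -= lr * grad_center
```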

Just as Van Gogh’s painting of sunflowers is a two-dimensional mixture of oil on canvas that represents vegetable matter in a three-dimensional space in Paris in the late 1880s, so 500 numbers arranged in a vector can represent a word or group of words.

Those numbers locate each word as a point in 500-dimensional vectorspace. Spaces of more than three dimensions are difficult to visualize. (Geoff Hinton, teaching people to imagine 13-dimensional space, suggests that students first picture 3-dimensional space and then say to themselves: "Thirteen, thirteen, thirteen.")

A well trained set of word vectors will place similar words close to each other in that space. The words oak, elm and birch might cluster in one corner, while war, conflict and strife huddle together in another.
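
Finding those neighbours is just a nearest-neighbour query over the vectors. A minimal sketch, assuming a hypothetical "embeddings" dictionary that maps words to NumPy arrays taken from a trained model:

```python
import numpy as np

def nearest(word, embeddings, k=5):
    """Return the k words whose vectors point in the most similar
    direction to the query word's vector (cosine similarity)."""
    q = embeddings[word]
    q = q / np.linalg.norm(q)
    scores = {
        w: float(v @ q / np.linalg.norm(v))
        for w, v in embeddings.items() if w != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# With well-trained vectors, nearest("oak", embeddings) should return
# words such as "elm" and "birch" long before "war" or "strife".
```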

Similar things and ideas are shown to be “close”. Their relative meanings have been translated to measurable distances. Qualities become quantities, and algorithms can do their work. But similarity is just the basis of many associations that Word2vec can learn. For example, it can gauge relations between words of one language, and map them to another.

[Image: word vectors from two languages mapped into a shared vectorspace]
These vectors are the basis of a more comprehensive geometry of words. Not only will Rome, Paris, Berlin and Beijing cluster near each other, but they will each have similar distances in vectorspace to the countries whose capitals they are; i.e. Rome - Italy = Beijing - China. And if you only knew that Rome was the capital of Italy, and were wondering about the capital of China, then the equation Rome - Italy + China would return Beijing. No kidding.

[Image: capital cities and their countries, illustrating Rome - Italy = Beijing - China in vectorspace]
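
The capital-country equation above is literal vector arithmetic. A sketch, again assuming a hypothetical "embeddings" dictionary from a trained model:

```python
import numpy as np

def analogy(a, b, c, embeddings, k=3):
    """Solve a - b + c ~= ?, e.g. Rome - Italy + China ~= Beijing.
    Returns the k nearest words to the resulting point, excluding the inputs."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    target = target / np.linalg.norm(target)
    scores = {
        w: float(v @ target / np.linalg.norm(v))
        for w, v in embeddings.items() if w not in (a, b, c)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# analogy("Rome", "Italy", "China", embeddings)  ->  ["Beijing", ...] (ideally)
```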

  • Amusing Word2Vec Results
    Let’s look at some other associations Word2vec can produce.

Instead of the plus, minus and equals signs, we’ll give you the results in the notation of logical analogies, where : means “is to” and :: means “as”; e.g. “Rome is to Italy as Beijing is to China” = Rome:Italy::Beijing:China. In the last spot, rather than supplying the “answer”, we’ll give you the list of words that a Word2vec model proposes, when given the first three elements:

king:queen::man:[woman, Attempted abduction, teenager, girl]
//Weird, but you can kind of see it

China:Taiwan::Russia:[Ukraine, Moscow, Moldova, Armenia]
//Two large countries and their small, estranged neighbors

house:roof::castle:[dome, bell_tower, spire, crenellations, turrets]

knee:leg::elbow:[forearm, arm, ulna_bone]

New York Times:Sulzberger::Fox:[Murdoch, Chernin, Bancroft, Ailes]
//The Sulzberger-Ochs family owns and runs the NYT.
//The Murdoch family owns News Corp., which owns Fox News.
//Peter Chernin was News Corp.'s COO for 13 yrs.
//Roger Ailes is president of Fox News.
//The Bancroft family sold the Wall St. Journal to News Corp.

love:indifference::fear:[apathy, callousness, timidity, helplessness, inaction]
//the poetry of this single array is simply amazing…

Donald Trump:Republican::Barack Obama:[Democratic, GOP, Democrats, McCain]
//It’s interesting to note that, just as Obama and McCain were rivals,
//so too, Word2vec thinks Trump has a rivalry with the idea Republican.

monkey:human::dinosaur:[fossil, fossilized, Ice_Age_mammals, fossilization]
//Humans are fossilized monkeys? Humans are what’s left
//over from monkeys? Humans are the species that beat monkeys
//just as Ice Age mammals beat dinosaurs? Plausible.

building:architect::software:[programmer, SecurityCenter, WinPcap]

This model was trained on the Google News vocab, which you can import and play with. Contemplate, for a moment, that the Word2vec algorithm has never been taught a single rule of English syntax. It knows nothing about the world, and is unassociated with any rules-based symbolic logic or knowledge graph. And yet it learns more, in a flexible and automated fashion, than most knowledge graphs will learn after many years of human labor. It comes to the Google News documents as a blank slate, and by the end of training, it can compute complex analogies that mean something to humans.
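
One way to import and play with a comparable model is sketched below. It assumes the gensim library and the commonly distributed 300-dimensional Google News vectors file, rather than the DL4J tooling this guide describes, so treat the file name and the exact results as illustrative.

```python
from gensim.models import KeyedVectors

# Adjust the path to wherever you downloaded the Google News vectors.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# a : b :: c : ?   is expressed as  positive=[b, c], negative=[a]
print(vectors.most_similar(positive=["queen", "man"], negative=["king"], topn=4))

# Queries like "Iraq - Violence" below use the same positive/negative pattern.
print(vectors.most_similar(positive=["Iraq"], negative=["violence"], topn=4))
```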

You can also query a Word2vec model for other associations. Not everything has to be two analogies that mirror each other.

Geopolitics: Iraq - Violence = Jordan
Distinction: Human - Animal = Ethics
President - Power = Prime Minister
Library - Books = Hall
Analogy: Stock Market ≈ Thermometer

By building a sense of one word’s proximity to other similar words, which do not necessarily contain the same letters, we have moved beyond hard tokens to a smoother and more general sense of meaning.

  • N-grams & Skip-grams

Words are read into the vector one at a time, and scanned back and forth within a certain range. Those ranges are n-grams, and an n-gram is a contiguous sequence of n items from a given linguistic sequence; it is the nth version of unigram, bigram, trigram, four-gram or five-gram. A skip-gram simply drops items from the n-gram.
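
For concreteness, a short plain-Python sketch of both ideas, using a toy sentence; the skipgrams() helper is a hypothetical illustration of “dropping items”, not part of any particular library:

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous sequences of n items, e.g. bigrams for n=2."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, k):
    """n-grams that may drop (skip) up to k items between the words they keep."""
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i:i + n + k]
        for combo in combinations(range(len(window)), n):
            if combo[0] == 0:  # keep the leading word anchored
                grams.add(tuple(window[j] for j in combo))
    return sorted(grams)

tokens = "insurgents killed in ongoing fighting".split()
print(ngrams(tokens, 2))        # ('insurgents', 'killed'), ('killed', 'in'), ...
print(skipgrams(tokens, 2, 2))  # also includes ('insurgents', 'in'), ('insurgents', 'ongoing'), ...
```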

The skip-gram representation popularized by Mikolov and used in the DL4J implementation has proven to be more accurate than other models, such as continuous bag of words, due to the more generalizable contexts generated.

This n-gram is then fed into a neural network to learn the significance of a given word vector; i.e. significance is defined as its usefulness as an indicator of certain larger meanings, or labels.

  • Advances in NLP: ELMo, BERT and GPT-2
    Word vectors form the basis of most recent advances in natural-language processing, including language models such as ELMo, ULMFit and BERT. But those language models change how they represent words; that is, what the vectors represent changes.

Word2vec is an algorithm used to produce distributed representations of words, and by that we mean word types; i.e. any given word in a vocabulary, such as get or grab or go, has its own word vector, and those vectors are effectively stored in a lookup table or dictionary. Unfortunately, this approach to word representation does not address polysemy, or the co-existence of many possible meanings for a given word or phrase. For example, go is a verb and it is also a board game; get is a verb and it is also an animal’s offspring. The meaning of a given word type such as go or get varies according to its context; i.e. the words that surround it.

One thing that ELMo and BERT demonstrate is that by encoding the context of a given word, by including information about preceding and succeeding words in the vector that represents a given instance of a word, we can obtain much better results in natural language processing tasks. BERT owes its performance to the attention mechanism.

Tested on the SWAG benchmark, which measures commonsense reasoning, ELMo was found to produce a 5% error reduction relative to non-contextual word vectors, while BERT showed an additional 66% error reduction past ELMo. More recently, OpenAI’s work with GPT-2 showed surprisingly good results in generating natural language in response to a prompt.
