Problem description
H2O recently added word2vec to its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself.
However, even greater possibilities come from using big data and big computers of the kind that software vendors like Google or H2O.ai have access to. Few end users of H2O have that access, because of network bandwidth and compute power limitations.
Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science pipeline by using pretrained word vectors, built on a very large corpus, as infrastructure in specific applications. Using general-purpose pretrained word vectors can be seen as a form of transfer learning. Reusing word vectors is analogous to reusing the generic lowest layers of a computer vision deep learning model, which learn to detect edges in photographs. Higher layers detect specific kinds of objects composed from the edge layers below them.
For example, Google provides some pretrained word vectors with their word2vec package. With unsupervised learning, more examples are often better. Further, it is sometimes practically difficult for an individual data scientist to download a giant corpus of text on which to train their own word vectors. And there is no good reason for every user to reinvent the same wheel by training word vectors themselves on the same general-purpose corpora, such as Wikipedia.
Word embeddings are very important and have the potential to be the bricks and mortar of a galaxy of possible applications. TF-IDF, the old basis for many natural language data science applications, stands to be made obsolete by using word embeddings instead.
Three questions:
1 - Does H2O currently provide any general-purpose pretrained word embeddings (word vectors), for example trained on text from legal or other publicly owned (government) websites, Wikipedia, Twitter, Craigslist, or other free or Open Commons sources of human-written text?
2 - Is there a community site where H2O users can share their trained word2vec word vectors that are built on more specialized corpora, such as medicine and law?
3 - Can H2O import Google's pretrained word vectors from their word2vec package?
Answer
Thank you for your questions.
You are absolutely right: there are many situations where you don't need a custom model and a pre-trained model will work well. I assume people will mostly build their own models on smaller problems in their specific domain and use pre-trained models to complement the custom model.
You can import 3rd party pre-trained models into H2O as long as they are in a CSV-like format. This is true for many available GloVe models.
To do that, import the model into a Frame (just like any other dataset):
w2v.frame <- h2o.importFile("pretrained.glove.txt")
And then convert it to a regular H2O word2vec model:
w2v.model <- h2o.word2vec(pre_trained = w2v.frame, vec_size = 100)
Please note that you need to provide the size of the embeddings.
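One way to find the embedding size without guessing is to count the fields on the first line of the text file. The sketch below uses only base R. It assumes a GloVe-style layout (one word followed by space-separated numbers per line, with no header line), and the helper name `infer_vec_size` is made up for illustration, not part of H2O.

```r
# Hypothetical helper (not an H2O function): infer the embedding size of a
# GloVe-style text file.
# Assumes each line is "<word> <v1> <v2> ... <vN>" separated by single spaces
# and that the file has no header line.
infer_vec_size <- function(path) {
  first_line <- readLines(path, n = 1, warn = FALSE)
  fields <- strsplit(first_line, " ", fixed = TRUE)[[1]]
  length(fields) - 1L  # the first field is the word itself
}

# Tiny demo file holding two 3-dimensional vectors.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"), demo_path)
vec_size <- infer_vec_size(demo_path)
```

The resulting value could then be passed as `vec_size` in the `h2o.word2vec` call above instead of a hard-coded 100.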
As far as I know, H2O doesn't plan to provide a model exchange or model market for w2v models. You can use models that are available online: https://github.com/3Top/word2vec-api
We currently do not support importing Google's binary format of word embeddings; however, support for it is on our roadmap, as it makes a lot of sense for our users.
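Until then, one possible workaround is to convert a binary file to text outside H2O. The base R sketch below is not an H2O feature, and `w2v_bin_to_text` is a hypothetical name. It assumes the layout written by the original word2vec C tool: an ASCII header `<vocab_size> <vec_size>`, then for each word its bytes terminated by a space, followed by `vec_size` little-endian 4-byte floats. It reads one byte at a time for clarity, so it would be slow on a very large file.

```r
# Sketch only: convert a word2vec binary file to space-separated text.
# The file layout is an assumption (see the paragraph above).
w2v_bin_to_text <- function(bin_path, txt_path) {
  con <- file(bin_path, "rb")
  on.exit(close(con))
  out <- file(txt_path, "w")
  on.exit(close(out), add = TRUE)

  # Read bytes up to (and consuming) a stop byte; returns the bytes before it.
  read_until <- function(stop_byte) {
    bytes <- raw(0)
    repeat {
      b <- readBin(con, "raw", n = 1L)
      if (length(b) == 0L || b == stop_byte) break
      bytes <- c(bytes, b)
    }
    bytes
  }

  newline <- charToRaw("\n")
  space <- charToRaw(" ")

  header <- rawToChar(read_until(newline))
  dims <- as.integer(strsplit(header, " ", fixed = TRUE)[[1]])
  vocab_size <- dims[1]
  vec_size <- dims[2]

  for (i in seq_len(vocab_size)) {
    word_bytes <- read_until(space)
    # Drop the newline that the writer puts after the previous vector.
    word_bytes <- word_bytes[word_bytes != newline]
    word <- rawToChar(word_bytes)
    vec <- readBin(con, "numeric", n = vec_size, size = 4L, endian = "little")
    # formatC does not pad, so fields stay separated by single spaces.
    line <- paste(c(word, formatC(vec, digits = 8, format = "g")), collapse = " ")
    writeLines(line, out, useBytes = TRUE)
  }
  invisible(vec_size)
}

# Demo: write a tiny two-word, three-dimensional binary file and convert it.
demo_bin <- tempfile(fileext = ".bin")
bin_con <- file(demo_bin, "wb")
writeBin(charToRaw("2 3\n"), bin_con)
writeBin(charToRaw("the "), bin_con)
writeBin(c(0.5, -0.25, 1.5), bin_con, size = 4L, endian = "little")
writeBin(charToRaw("\ncat "), bin_con)
writeBin(c(2, 0.125, -1), bin_con, size = 4L, endian = "little")
writeBin(charToRaw("\n"), bin_con)
close(bin_con)

demo_txt <- tempfile(fileext = ".txt")
demo_size <- w2v_bin_to_text(demo_bin, demo_txt)
demo_lines <- readLines(demo_txt)
```

The text file produced this way has the same word-plus-numbers layout as the GloVe files mentioned above, so it could presumably be loaded with `h2o.importFile` in the same way. That step is untested here.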