本文介绍了什么是“未知"?在预训练的 GloVe 矢量文件(例如 glove.6B.50d.txt)中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 从 https 下载的手套矢量文件 glove.6B.50d.txt 中发现了unk"标记://nlp.stanford.edu/projects/glove/.其值如下:

UNK -0.79149 0.86617 0.11998 0.00092287 0.2776 -0.49185 0.50195 0.00060792 -0.25845 0.17865 0.2535 0.76572 0.50664 0.4025 -0.0021388 -0.28397 -0.50324 0.30449 0.51779 0.01509 -0.35031 -1.1278 0.33253 -0.3525 0.041326 1.0863 0.03391 0.33564 0.49745 -0.070131 -1.2192 -0.48512 -0.038512 -0.13554 -0.1638 0.52321 -0.31318 -0.1655 0.11909 -0.15115 -0.15621 -0.62655 -0.62336 -0.4215 0.41873 -0.92472 -0.29996 1.1049 0.3954 -0.0063003

它是用于表示未知单词的标记还是某种缩写?

解决方案

预训练的 GloVe 文件中的 unk 标记不是未知标记!

请参阅此 google 群组线程 Jeffrey Pennington(手套作者)写道:

预训练的向量没有未知的标记,目前代码在生成共现计数时只是忽略了词汇外的词.

这是在语料库中出现unk"时像任何其他嵌入一样学习的嵌入(这似乎偶尔发生!)

相反,Pennington 建议(在同一篇文章中):

...我发现仅取所有词向量或其中一部分词向量的平均值就可以生成一个很好的未知向量.

您可以使用以下代码执行此操作(应该适用于任何预训练的 GloVe 文件):

将 numpy 导入为 npGLOVE_FILE = 'glove.6B.50d.txt'# 获取向量的数量和隐藏的暗淡使用 open(GLOVE_FILE, 'r') 作为 f:对于 i,enumerate(f) 中的行:经过n_vec = i + 1hidden_​​dim = len(line.split(' ')) - 1vecs = np.zeros((n_vec, hidden_​​dim), dtype=np.float32)使用 open(GLOVE_FILE, 'r') 作为 f:对于 i,enumerate(f) 中的行:vecs[i] = np.array([float(n) for n in line.split(' ')[1:]], dtype=np.float32)average_vec = np.mean(vecs,axis=0)打印(average_vec)

对于 glove.6B.50d.txt 这给出:

[-0.12920076 -0.28866628 -0.01224866 -0.05676644 -0.20210965 -0.083890110.33359843 0.16045167 0.03867431 0.17833012 0.04696583 -0.002858020.29099807 0.04613704 -0.20923874 -0.06613114 -0.06822549 0.076659120.3134014 0.17848536 -0.1225775 -0.09916984 -0.07495987 0.064132270.14441176 0.60894334 0.17463093 0.05335403 -0.01273871 0.03474107-0.8123879 -0.04688699 0.20193407 0.2031118 -0.03935686 0.06967544-0.015536​​38 -0.03405238 -0.06528071 0.12250231 0.13991883 -0.17446303-0.08011883 0.0849521 -0.01041659 -0.13705009 0.20127155 0.100694080.00653003 0.01685157]

而且由于对较大的手套文件执行此操作需要大量计算,因此我继续为您计算了 glove.840B.300d.txt 的向量:

0.22418134 -0.28881392 0.13854356 0.00365387 -0.12870757 0.10243822 0.061626635 0.07318011 -0.061350107 -1.3477012 0.42037755 -0.063593924 -0.09683349 0.18086134 0.23704372 0.014126852 0.170096 -1.1491593 0.31497982 0.06622181 0.024687296 0.076693475 0.13851812 0.021302193 -0.06640582 -0.010336159 0.13523154 -0.042144544 -0.11938788 0.006948221 0.13333307 -0.182763790.052385733 0.008943111 -0.23957317 0.08500333 -0.006894406 0.063391194 0.0015864656 0.19177166 -0.13113557 -0.11295479 -0.14276934 0.03413971 -0.034278486 -0.051366422 0.18891625 -0.16673574 -0.057783455 0.036823478 0.08078679 0.022949161 0.033298038 0.011784158 0.05643189 -0.042776518 0.011959623 0.011552498 -0.0007971594 0.11300405 -0.031369694 -0.0061559738 -0.009043574 -0.415336 -0.18870236 0.137088430.005911723 -0.113035575 -0.030096142 -0.23908928 -0.05354085 -0.044904727 -0.20228513 0.0065645403 -0.09578946 -0.07391877 -0.06487607 0.111740574 -0.048649278 -0.16565254 -0.052037314 -0.078968436 0.136849880.0757494 -0.006275573 0.28693774 0.52017444 -0.0877165 -0.33010918 -0.1359622 0.114895485 -0.09744406 0.06269521 0.12118575 -0.08026362 0.35256687 -0.060017522 -0.04889904 -0.06828978 0.088740796 0.003964443 -0.0766291 0.1263925 0.07809314 -0.023164088 -0.5680669 -0.037892066 -0.1350967 -0.11351585 -0.111434504 -0.0905027 0.25174105 -0.14841858 0.034635577 -0.07334565 0.06320108 -0.038343467 -0.05413284 0.042197507 -0.090380974 -0.070528865 -0.009174437 0.009069661 0.1405178 0.02958134 -0.036431845 -0.08625681 0.042951006 0.08230793 0.0903314 -0.12279937 -0.013899368 0.048119213 0.08678239 -0.14450377 -0.04424887 0.018319942 0.015026873 -0.100526 0.06021201 0.74059093 -0.0016333034 -0.24960588 -0.023739101 0.016396184 0.11928964 0.13950661 -0.031624354-0.01645025 0.14079992 -0.0002824564 -0.08052984 -0.0021310581 -0.025350995 0.086938225 0.14308536 0.17146006 -0.13943303 0.048792403 0.09274929 -0.053167373 0.031103406 0.012354865 0.21057427 0.32618305 0.18015954 -0.15881181 0.15322933 -0.22558987 -0.04200665 0.0084689725 0.038156632 0.15188617 0.13274793 0.113756925 -0.095273495 -0.049490947 -0.10265804 -0.27064866 -0.034567792 -0.018810693 -0.0010360252 0.10340131 0.13883452 0.21131058 -0.01981019 0.1833468 -0.10751636 -0.03128868 0.02518242 0.23232952 0.042052146 0.11731903 -0.15506615 0.0063580726 -0.15429358 0.1511722 0.12745973 0.2576985 -0.25486213 -0.07094630.17983761 0.054027 -0.09884228 -0.24595179 -0.093028545 -0.028203879 0.094398156 0.09233813 0.029291354 0.13110267 0.15682974 -0.016919162 0.23927948 -0.1343307 -0.22422817 0.14634751 -0.064993896 0.4703685 -0.027190214 0.06224946 -0.091360025 0.21490277 -0.19562101 -0.10032754 -0.09056772 -0.06203493 -0.18876675 -0.10963594 -0.27734384 0.12616494 -0.02217992 -0.16058226 -0.080475815 0.026953284 0.110732645 0.014894041 0.09416802 0.14299914 -0.1594008 -0.066080004 -0.007995227 -0.11668856 -0.13081996 -0.09237365 0.14741232 0.09180138 0.081735 0.3211204 -0.0036552632 -0.047030564 -0.02311798 0.048961394 0.08669574 -0.06766279 -0.50028914 -0.048515294 0.14144728 -0.032994404 -0.11954345 -0.14929578 -0.2388355 -0.019883996 -0.15917352 -0.052084364 0.2801028 -0.0029121689 -0.054581646 -0.47385484 0.17112483 -0.12066923 -0.042173345 0.1395337 0.26115036 0.012869649 0.009291686 -0.0026459037 -0.075331464 0.017840583 -0.26869613 -0.21820338 -0.17084768-0.1022808 -0.055290595 0.13513643 0.12362477 -0.10980586 0.13980341 -0.20233242 0.08813751 0.3849736 -0.10653763 -0.06199595 0.028849555 0.03230154 0.023856193 0.069950655 0.19310954 -0.077677034 -0.144811

I found "unk" token in the glove vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/. Its value is as follows:

unk -0.79149 0.86617 0.11998 0.00092287 0.2776 -0.49185 0.50195 0.00060792 -0.25845 0.17865 0.2535 0.76572 0.50664 0.4025 -0.0021388 -0.28397 -0.50324 0.30449 0.51779 0.01509 -0.35031 -1.1278 0.33253 -0.3525 0.041326 1.0863 0.03391 0.33564 0.49745 -0.070131 -1.2192 -0.48512 -0.038512 -0.13554 -0.1638 0.52321 -0.31318 -0.1655 0.11909 -0.15115 -0.15621 -0.62655 -0.62336 -0.4215 0.41873 -0.92472 1.1049 -0.29996 -0.0063003 0.3954

Is it a token to be used for unknown words or is it some kind of abbreviation?

解决方案

The unk token in the pretrained GloVe files is not an unknown token!

See this google groups thread where Jeffrey Pennington (GloVe author) writes:

It's an embedding learned like any other on occurrences of "unk" in the corpus (which appears to happen occasionally!)

Instead, Pennington suggests (in the same post):

You can do that with the following code (should work with any pretrained GloVe file):

import numpy as np

GLOVE_FILE = 'glove.6B.50d.txt'

# Get number of vectors and hidden dim
with open(GLOVE_FILE, 'r') as f:
    for i, line in enumerate(f):
        pass
n_vec = i + 1
hidden_dim = len(line.split(' ')) - 1

vecs = np.zeros((n_vec, hidden_dim), dtype=np.float32)

with open(GLOVE_FILE, 'r') as f:
    for i, line in enumerate(f):
        vecs[i] = np.array([float(n) for n in line.split(' ')[1:]], dtype=np.float32)

average_vec = np.mean(vecs, axis=0)
print(average_vec)

For glove.6B.50d.txt this gives:

[-0.12920076 -0.28866628 -0.01224866 -0.05676644 -0.20210965 -0.08389011
  0.33359843  0.16045167  0.03867431  0.17833012  0.04696583 -0.00285802
  0.29099807  0.04613704 -0.20923874 -0.06613114 -0.06822549  0.07665912
  0.3134014   0.17848536 -0.1225775  -0.09916984 -0.07495987  0.06413227
  0.14441176  0.60894334  0.17463093  0.05335403 -0.01273871  0.03474107
 -0.8123879  -0.04688699  0.20193407  0.2031118  -0.03935686  0.06967544
 -0.01553638 -0.03405238 -0.06528071  0.12250231  0.13991883 -0.17446303
 -0.08011883  0.0849521  -0.01041659 -0.13705009  0.20127155  0.10069408
  0.00653003  0.01685157]

And because it is fairly compute intensive to do this with the larger glove files, I went ahead and computed the vector for glove.840B.300d.txt for you:

0.22418134 -0.28881392 0.13854356 0.00365387 -0.12870757 0.10243822 0.061626635 0.07318011 -0.061350107 -1.3477012 0.42037755 -0.063593924 -0.09683349 0.18086134 0.23704372 0.014126852 0.170096 -1.1491593 0.31497982 0.06622181 0.024687296 0.076693475 0.13851812 0.021302193 -0.06640582 -0.010336159 0.13523154 -0.042144544 -0.11938788 0.006948221 0.13333307 -0.18276379 0.052385733 0.008943111 -0.23957317 0.08500333 -0.006894406 0.0015864656 0.063391194 0.19177166 -0.13113557 -0.11295479 -0.14276934 0.03413971 -0.034278486 -0.051366422 0.18891625 -0.16673574 -0.057783455 0.036823478 0.08078679 0.022949161 0.033298038 0.011784158 0.05643189 -0.042776518 0.011959623 0.011552498 -0.0007971594 0.11300405 -0.031369694 -0.0061559738 -0.009043574 -0.415336 -0.18870236 0.13708843 0.005911723 -0.113035575 -0.030096142 -0.23908928 -0.05354085 -0.044904727 -0.20228513 0.0065645403 -0.09578946 -0.07391877 -0.06487607 0.111740574 -0.048649278 -0.16565254 -0.052037314 -0.078968436 0.13684988 0.0757494 -0.006275573 0.28693774 0.52017444 -0.0877165 -0.33010918 -0.1359622 0.114895485 -0.09744406 0.06269521 0.12118575 -0.08026362 0.35256687 -0.060017522 -0.04889904 -0.06828978 0.088740796 0.003964443 -0.0766291 0.1263925 0.07809314 -0.023164088 -0.5680669 -0.037892066 -0.1350967 -0.11351585 -0.111434504 -0.0905027 0.25174105 -0.14841858 0.034635577 -0.07334565 0.06320108 -0.038343467 -0.05413284 0.042197507 -0.090380974 -0.070528865 -0.009174437 0.009069661 0.1405178 0.02958134 -0.036431845 -0.08625681 0.042951006 0.08230793 0.0903314 -0.12279937 -0.013899368 0.048119213 0.08678239 -0.14450377 -0.04424887 0.018319942 0.015026873 -0.100526 0.06021201 0.74059093 -0.0016333034 -0.24960588 -0.023739101 0.016396184 0.11928964 0.13950661 -0.031624354 -0.01645025 0.14079992 -0.0002824564 -0.08052984 -0.0021310581 -0.025350995 0.086938225 0.14308536 0.17146006 -0.13943303 0.048792403 0.09274929 -0.053167373 0.031103406 0.012354865 0.21057427 0.32618305 0.18015954 -0.15881181 0.15322933 -0.22558987 -0.04200665 0.0084689725 0.038156632 0.15188617 0.13274793 0.113756925 -0.095273495 -0.049490947 -0.10265804 -0.27064866 -0.034567792 -0.018810693 -0.0010360252 0.10340131 0.13883452 0.21131058 -0.01981019 0.1833468 -0.10751636 -0.03128868 0.02518242 0.23232952 0.042052146 0.11731903 -0.15506615 0.0063580726 -0.15429358 0.1511722 0.12745973 0.2576985 -0.25486213 -0.0709463 0.17983761 0.054027 -0.09884228 -0.24595179 -0.093028545 -0.028203879 0.094398156 0.09233813 0.029291354 0.13110267 0.15682974 -0.016919162 0.23927948 -0.1343307 -0.22422817 0.14634751 -0.064993896 0.4703685 -0.027190214 0.06224946 -0.091360025 0.21490277 -0.19562101 -0.10032754 -0.09056772 -0.06203493 -0.18876675 -0.10963594 -0.27734384 0.12616494 -0.02217992 -0.16058226 -0.080475815 0.026953284 0.110732645 0.014894041 0.09416802 0.14299914 -0.1594008 -0.066080004 -0.007995227 -0.11668856 -0.13081996 -0.09237365 0.14741232 0.09180138 0.081735 0.3211204 -0.0036552632 -0.047030564 -0.02311798 0.048961394 0.08669574 -0.06766279 -0.50028914 -0.048515294 0.14144728 -0.032994404 -0.11954345 -0.14929578 -0.2388355 -0.019883996 -0.15917352 -0.052084364 0.2801028 -0.0029121689 -0.054581646 -0.47385484 0.17112483 -0.12066923 -0.042173345 0.1395337 0.26115036 0.012869649 0.009291686 -0.0026459037 -0.075331464 0.017840583 -0.26869613 -0.21820338 -0.17084768 -0.1022808 -0.055290595 0.13513643 0.12362477 -0.10980586 0.13980341 -0.20233242 0.08813751 0.3849736 -0.10653763 -0.06199595 0.028849555 0.03230154 0.023856193 0.069950655 0.19310954 -0.077677034 -0.144811

这篇关于什么是“未知"?在预训练的 GloVe 矢量文件(例如 glove.6B.50d.txt)中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

06-16 15:13