Question
I found an "unk" token in the GloVe vector file glove.6B.50d.txt, downloaded from https://nlp.stanford.edu/projects/glove/. Its value is as follows:
unk -0.79149 0.86617 0.11998 0.00092287 0.2776 -0.49185 0.50195 0.00060792 -0.25845 0.17865 0.2535 0.76572 0.50664 0.4025 -0.0021388 -0.28397 -0.50324 0.30449 0.51779 0.01509 -0.35031 -1.1278 0.33253 -0.3525 0.041326 1.0863 0.03391 0.33564 0.49745 -0.070131 -1.2192 -0.48512 -0.038512 -0.13554 -0.1638 0.52321 -0.31318 -0.1655 0.11909 -0.15115 -0.15621 -0.62655 -0.62336 -0.4215 0.41873 -0.92472 1.1049 -0.29996 -0.0063003 0.3954
Is it a token to be used for unknown words, or is it some kind of abbreviation?
Recommended answer
The unk token in the pretrained GloVe files is not an unknown token!
See this Google Groups thread, where Jeffrey Pennington (GloVe author) writes:
It's an embedding learned like any other on occurrences of "unk" in the corpus (which appears to happen occasionally!)
Instead, Pennington suggests (in the same post) using the average of all, or a subset of, the word vectors as the unknown vector.
You can do that with the following code (it should work with any pretrained GloVe file):
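import numpy as np

GLOVE_FILE = 'glove.6B.50d.txt'

# Get number of vectors and hidden dim
with open(GLOVE_FILE, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        pass
n_vec = i + 1
hidden_dim = len(line.split(' ')) - 1

vecs = np.zeros((n_vec, hidden_dim), dtype=np.float32)

# Read every vector into one row of the matrix
with open(GLOVE_FILE, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        # Keep the last hidden_dim fields, which tolerates tokens that contain spaces
        values = line.rstrip().split(' ')[-hidden_dim:]
        vecs[i] = np.array([float(n) for n in values], dtype=np.float32)

# Average over all words, one value per dimension
average_vec = np.mean(vecs, axis=0)
print(average_vec)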
For glove.6B.50d.txt, this gives:
[-0.12920076 -0.28866628 -0.01224866 -0.05676644 -0.20210965 -0.08389011
0.33359843 0.16045167 0.03867431 0.17833012 0.04696583 -0.00285802
0.29099807 0.04613704 -0.20923874 -0.06613114 -0.06822549 0.07665912
0.3134014 0.17848536 -0.1225775 -0.09916984 -0.07495987 0.06413227
0.14441176 0.60894334 0.17463093 0.05335403 -0.01273871 0.03474107
-0.8123879 -0.04688699 0.20193407 0.2031118 -0.03935686 0.06967544
-0.01553638 -0.03405238 -0.06528071 0.12250231 0.13991883 -0.17446303
-0.08011883 0.0849521 -0.01041659 -0.13705009 0.20127155 0.10069408
0.00653003 0.01685157]
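The code above reads the file twice and holds every vector in memory. As a rough sketch of an alternative, the same average can be accumulated in a single pass with a running sum. The function name streaming_mean and the toy input are made up for illustration, and the dimensionality is assumed to be known in advance (50 for glove.6B.50d.txt). Because the sum is accumulated in float64, the result may differ from the output above in the last few digits.

```python
import io
import numpy as np

def streaming_mean(lines, dim):
    """Average the vectors in GloVe-format lines without storing them all.

    `lines` is any iterable of text lines; `dim` is the vector size (assumed known).
    """
    total = np.zeros(dim, dtype=np.float64)  # float64 accumulator limits rounding error
    count = 0
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Use the last `dim` fields so tokens that contain spaces do not break parsing.
        total += np.array([float(x) for x in parts[-dim:]], dtype=np.float64)
        count += 1
    if count == 0:
        raise ValueError("no vectors found")
    return (total / count).astype(np.float32)

# Toy stand-in for a GloVe file (made-up values, 3 dimensions).
toy_file = io.StringIO("the 1.0 2.0 3.0\ncat 3.0 2.0 1.0\n")
toy_mean = streaming_mean(toy_file, dim=3)
print(toy_mean)
```

For a real file, the same function could be called on an open file handle, for example with open('glove.6B.50d.txt', encoding='utf-8') as f: streaming_mean(f, dim=50).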
And because computing this average is fairly compute intensive for the larger GloVe files, I went ahead and computed the vector for glove.840B.300d.txt for you:
0.22418134 -0.28881392 0.13854356 0.00365387 -0.12870757 0.10243822 0.061626635 0.07318011 -0.061350107 -1.3477012 0.42037755 -0.063593924 -0.09683349 0.18086134 0.23704372 0.014126852 0.170096 -1.1491593 0.31497982 0.06622181 0.024687296 0.076693475 0.13851812 0.021302193 -0.06640582 -0.010336159 0.13523154 -0.042144544 -0.11938788 0.006948221 0.13333307 -0.18276379 0.052385733 0.008943111 -0.23957317 0.08500333 -0.006894406 0.0015864656 0.063391194 0.19177166 -0.13113557 -0.11295479 -0.14276934 0.03413971 -0.034278486 -0.051366422 0.18891625 -0.16673574 -0.057783455 0.036823478 0.08078679 0.022949161 0.033298038 0.011784158 0.05643189 -0.042776518 0.011959623 0.011552498 -0.0007971594 0.11300405 -0.031369694 -0.0061559738 -0.009043574 -0.415336 -0.18870236 0.13708843 0.005911723 -0.113035575 -0.030096142 -0.23908928 -0.05354085 -0.044904727 -0.20228513 0.0065645403 -0.09578946 -0.07391877 -0.06487607 0.111740574 -0.048649278 -0.16565254 -0.052037314 -0.078968436 0.13684988 0.0757494 -0.006275573 0.28693774 0.52017444 -0.0877165 -0.33010918 -0.1359622 0.114895485 -0.09744406 0.06269521 0.12118575 -0.08026362 0.35256687 -0.060017522 -0.04889904 -0.06828978 0.088740796 0.003964443 -0.0766291 0.1263925 0.07809314 -0.023164088 -0.5680669 -0.037892066 -0.1350967 -0.11351585 -0.111434504 -0.0905027 0.25174105 -0.14841858 0.034635577 -0.07334565 0.06320108 -0.038343467 -0.05413284 0.042197507 -0.090380974 -0.070528865 -0.009174437 0.009069661 0.1405178 0.02958134 -0.036431845 -0.08625681 0.042951006 0.08230793 0.0903314 -0.12279937 -0.013899368 0.048119213 0.08678239 -0.14450377 -0.04424887 0.018319942 0.015026873 -0.100526 0.06021201 0.74059093 -0.0016333034 -0.24960588 -0.023739101 0.016396184 0.11928964 0.13950661 -0.031624354 -0.01645025 0.14079992 -0.0002824564 -0.08052984 -0.0021310581 -0.025350995 0.086938225 0.14308536 0.17146006 -0.13943303 0.048792403 0.09274929 -0.053167373 0.031103406 0.012354865 0.21057427 0.32618305 0.18015954 -0.15881181 0.15322933 
-0.22558987 -0.04200665 0.0084689725 0.038156632 0.15188617 0.13274793 0.113756925 -0.095273495 -0.049490947 -0.10265804 -0.27064866 -0.034567792 -0.018810693 -0.0010360252 0.10340131 0.13883452 0.21131058 -0.01981019 0.1833468 -0.10751636 -0.03128868 0.02518242 0.23232952 0.042052146 0.11731903 -0.15506615 0.0063580726 -0.15429358 0.1511722 0.12745973 0.2576985 -0.25486213 -0.0709463 0.17983761 0.054027 -0.09884228 -0.24595179 -0.093028545 -0.028203879 0.094398156 0.09233813 0.029291354 0.13110267 0.15682974 -0.016919162 0.23927948 -0.1343307 -0.22422817 0.14634751 -0.064993896 0.4703685 -0.027190214 0.06224946 -0.091360025 0.21490277 -0.19562101 -0.10032754 -0.09056772 -0.06203493 -0.18876675 -0.10963594 -0.27734384 0.12616494 -0.02217992 -0.16058226 -0.080475815 0.026953284 0.110732645 0.014894041 0.09416802 0.14299914 -0.1594008 -0.066080004 -0.007995227 -0.11668856 -0.13081996 -0.09237365 0.14741232 0.09180138 0.081735 0.3211204 -0.0036552632 -0.047030564 -0.02311798 0.048961394 0.08669574 -0.06766279 -0.50028914 -0.048515294 0.14144728 -0.032994404 -0.11954345 -0.14929578 -0.2388355 -0.019883996 -0.15917352 -0.052084364 0.2801028 -0.0029121689 -0.054581646 -0.47385484 0.17112483 -0.12066923 -0.042173345 0.1395337 0.26115036 0.012869649 0.009291686 -0.0026459037 -0.075331464 0.017840583 -0.26869613 -0.21820338 -0.17084768 -0.1022808 -0.055290595 0.13513643 0.12362477 -0.10980586 0.13980341 -0.20233242 0.08813751 0.3849736 -0.10653763 -0.06199595 0.028849555 0.03230154 0.023856193 0.069950655 0.19310954 -0.077677034 -0.144811
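To show how such an average vector might be used, here is a minimal sketch of a lookup that falls back to it for out-of-vocabulary words. The embeddings dictionary, its two-dimensional values, and the lookup helper are hypothetical; in practice the table would be built from the GloVe file and the fallback would be one of the averaged vectors above.

```python
import numpy as np

# Hypothetical toy embedding table; a real one would be loaded from the GloVe file.
embeddings = {
    "cat": np.array([1.0, 0.0], dtype=np.float32),
    "dog": np.array([0.0, 1.0], dtype=np.float32),
}

# Average of all vectors, used as the stand-in for unknown words.
unk_vec = np.mean(list(embeddings.values()), axis=0)

def lookup(word):
    """Return the word's vector, or the average vector if the word is not in the table."""
    return embeddings.get(word, unk_vec)

print(lookup("cat"))    # in vocabulary: its own vector
print(lookup("zebra"))  # out of vocabulary: the average vector
```

Note that with a real GloVe table, looking up the literal string "unk" would return the embedding learned for that string, not this average.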