Imagine I have a dict / hash table that maps pairs of strings (the keys) to their respective probabilities (the values):
import numpy as np
import random
import uuid
# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(8000,10000)
m_vocab_size = random.randint(8000,10000)
def random_word():
    # uuid4().hex is available on both Python 2 and 3; get_hex() is Python 2 only.
    return uuid.uuid4().hex.upper()[0:random.randint(1, max_word_len)]
# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]
# Let's hallucinate probabilities for each word pair.
hashes = {(n, m): random.random() for n in n_vocab for m in m_vocab}
hashes
The hash table looks like this:
{('585F', 'B4867'): 0.7582038699473549,
('69', 'D98B23C5809A'): 0.7341569569849136,
('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
('DD8F8AFA5CF', 'CB'): 0.4609114677237601,
...
}
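Imagine that this is the input hash table that I will read from a CSV file, where the first and second columns are the word pairs (the keys) of the hash table and the third column is the probability.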
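For readers who want something concrete to run, here is a minimal sketch of loading such a file into the `(word, word) -> probability` dict. It assumes Python 3 and a tab-separated layout like the `test.txt` example further down; the name `read_pair_probs` and the in-memory sample are made up for illustration.

```python
import csv
import io

# Hypothetical tab-separated input: n_word, m_word, probability.
sample = io.StringIO("abc\txyz\t0.9\nefg\txyz\t0.3\nabc\tjkl\t0.5\n")

def read_pair_probs(handle):
    """Build {(n_word, m_word): probability} from a tab-separated file object."""
    reader = csv.reader(handle, delimiter="\t")
    # Each row is expected to have exactly three fields.
    return {(n, m): float(p) for n, m, p in reader}

pair_probs = read_pair_probs(sample)
print(pair_probs[("abc", "jkl")])
```

With a real file, the same function could be called on `open(path, newline="")` instead of the `StringIO` object.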
If I want to put the probabilities into some sort of numpy matrix, I have to build it from the hash table like this:
n_words, m_words = zip(*hashes.keys())
probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])
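Is there another way to get the probabilities from the hash table into the |N| * |M| matrix without a nested loop over m_vocab and n_vocab? (Note: I'm creating random words and random probabilities here, but imagine that I have read the hash table from a file and loaded it into this hash table structure.)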
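Assume two cases, one of which is that the hash table comes from a csv file (@bunji's answer addresses that case). It is important that the final matrix be queryable; the following is not desirable: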
$ echo -e 'abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t\0.23\nabc\tjkl\t0.5\n' > test.txt
$ cat test.txt
abc xyz 0.9
efg xyz 0.3
lmn opq .23
abc jkl 0.5
$ python
Python 2.7.10 (default, Jul 30 2016, 18:31:42)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pt = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack().as_matrix()
>>> pt
array([[ 0.5, nan, 0.9],
[ nan, nan, 0.3],
[ nan, nan, nan]])
>>> pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
2
1 jkl opq xyz
0
abc 0.5 NaN 0.9
efg NaN NaN 0.3
lmn NaN NaN NaN
>>> df = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
>>> df
2
1 jkl opq xyz
0
abc 0.5 NaN 0.9
efg NaN NaN 0.3
lmn NaN NaN NaN
>>> df['abc', 'jkl']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1617, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('abc', 'jkl')
>>> df['abc']['jkl']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
loc = level_index.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 163, in pandas.index.IndexEngine.get_loc (pandas/index.c:4090)
KeyError: 'abc'
>>> df[0][2]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
loc = level_index.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0
>>> df[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
return self._getitem_multilevel(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
loc = level_index.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
KeyError: 0
The resulting matrix / dataframe should be queryable, i.e. it should be possible to do something like:
probs[('585F', 'B4867')] = 0.7582038699473549
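As far as I can tell, the KeyErrors in the session above come from two things: `df[...]` selects columns rather than rows, and calling `unstack()` on the whole DataFrame leaves a two-level column index whose first level is the original column label `2`. A sketch of one way to get a queryable table, assuming Python 3 and a reasonably recent pandas, is to build a Series from the dict (tuple keys become a two-level index) and unstack that instead. The sample dict is made up for illustration.

```python
import pandas as pd

# Small stand-in for the real hash table.
hashes = {
    ("abc", "xyz"): 0.9,
    ("efg", "xyz"): 0.3,
    ("lmn", "opq"): 0.23,
    ("abc", "jkl"): 0.5,
}

# Tuple keys become a two-level index; unstack() pivots the second
# level into columns, leaving NaN where a pair is absent.
table = pd.Series(hashes).unstack()

# Label-based lookup: row label first, then column label.
print(table.loc["abc", "jkl"])

# Plain numpy matrix, if the labels are no longer needed.
matrix = table.to_numpy()
```

Row and column order follow the sorted labels, so `table.index` and `table.columns` are what map a position in `matrix` back to a word.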
Best answer
I'm not sure whether there is a way to avoid looping entirely, but I imagine it could be optimized using itertools:
import itertools
nested_loop_iter = itertools.product(n_vocab,m_vocab)
#note that because it iterates over n_vocab first we will need to transpose it at the end
probs = np.fromiter(map(hashes.get, nested_loop_iter),dtype=float)
probs.resize((len(n_vocab),len(m_vocab)))
probs = probs.T
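One caveat with the snippet above: `hashes.get` returns `None` for any pair that is missing from the dict, and I'd expect `np.fromiter` with `dtype=float` to reject that. A possible alternative, sketched here under the assumption of Python 3 and with made-up sample data, is a single flat pass over the dict that fills the matrix through integer index arrays. The helper names `n_index`, `m_index` and `lookup` are hypothetical. Note that this version is laid out as |N| rows by |M| columns, so it is not transposed.

```python
import numpy as np

# Small stand-in for the real hash table.
hashes = {("abc", "xyz"): 0.9, ("efg", "xyz"): 0.3, ("abc", "jkl"): 0.5}

# Derive the two vocabularies and map each word to a row / column position.
n_vocab = sorted({n for n, _ in hashes})
m_vocab = sorted({m for _, m in hashes})
n_index = {word: i for i, word in enumerate(n_vocab)}
m_index = {word: j for j, word in enumerate(m_vocab)}

# One flat pass over the dict per array; keys and values iterate in the same order.
size = len(hashes)
rows = np.fromiter((n_index[n] for n, _ in hashes), dtype=np.intp, count=size)
cols = np.fromiter((m_index[m] for _, m in hashes), dtype=np.intp, count=size)
vals = np.fromiter(hashes.values(), dtype=float, count=size)

# Missing pairs stay NaN; known pairs are written in one vectorized assignment.
probs = np.full((len(n_vocab), len(m_vocab)), np.nan)
probs[rows, cols] = vals

def lookup(n_word, m_word):
    """Return the probability for a word pair, or NaN if the pair was absent."""
    return probs[n_index[n_word], m_index[m_word]]

print(lookup("abc", "jkl"))
```

This still iterates over the dict in Python, but the work is proportional to the number of stored pairs rather than to |N| * |M|, and the two index dicts keep the matrix queryable by word.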
Regarding python - creating a |N| x |M| matrix from a hash table, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40209612/