Imagine I have a dictionary / hash table that maps pairs of strings (the keys) to their respective probabilities (the values):

import numpy as np
import random
import uuid

# Creating the N vocabulary and M vocabulary
max_word_len = 20
n_vocab_size = random.randint(8000,10000)
m_vocab_size = random.randint(8000,10000)

def random_word():
    # UUID.hex works on both Python 2 and 3; get_hex() exists only in Python 2.
    return uuid.uuid4().hex.upper()[0:random.randint(1, max_word_len)]

# Generate some random words.
n_vocab = [random_word() for i in range(n_vocab_size)]
m_vocab = [random_word() for i in range(m_vocab_size)]


# Let's hallucinate probabilities for each word pair.
hashes =  {(n, m): random.random() for n in n_vocab for m in m_vocab}
The hashes table looks like this:
{('585F', 'B4867'): 0.7582038699473549,
 ('69', 'D98B23C5809A'): 0.7341569569849136,
 ('4D30CB2BF4134', '82ED5FA3A00E4728AC'): 0.9106077161619021,
 ('DD8F8AFA5CF', 'CB'): 0.4609114677237601,
...
}

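Imagine this is the input hash table that I'll read from a CSV file, where the first and second columns are the word pairs (the keys) of the hash table and the third column is the probability.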
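To make the CSV case concrete, here is a minimal sketch of reading such a file into the same `{(n, m): probability}` structure. The function name `read_pair_probs`, the comma delimiter, and the in-memory sample are assumptions for illustration only; a real file may be tab-delimited or have a header row.

```python
import csv
import io

def read_pair_probs(lines, delimiter=","):
    """Read rows of (n_word, m_word, probability) into a dict keyed by word pair."""
    table = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        n_word, m_word, prob = row
        table[(n_word, m_word)] = float(prob)
    return table

# Hypothetical sample standing in for the file contents.
sample = io.StringIO("abc,xyz,0.9\nefg,xyz,0.3\nabc,jkl,0.5\n")
hashes_from_csv = read_pair_probs(sample)
print(hashes_from_csv)
```

With a real file, passing an open file object (for example `open(path, newline="")`) in place of the `StringIO` should work the same way.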

If I want to put the probabilities into some sort of numpy matrix, I have to do this from the hash table:
 n_words, m_words = zip(*hashes.keys())
 probs = np.array([[hashes[(n, m)] for n in n_vocab] for m in m_vocab])

Is there another way to get the probs into a |N| * |M| matrix from the hash table without a nested loop through m_vocab and n_vocab?

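(Note: I'm creating random words and random probabilities here, but imagine I have already read the hash table from a file and loaded it into that hash table structure.)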

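Assume both scenarios, where:
  • The hash table comes from a csv file (@bunji's answer resolves this one)
  • The hash table comes from a pickled dictionary, or it was computed in some other way before reaching the part where it needs to be converted into a matrix.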


  • It is important that the final matrix is queryable. The following is not desirable:
    $ echo -e 'abc\txyz\t0.9\nefg\txyz\t0.3\nlmn\topq\t\0.23\nabc\tjkl\t0.5\n' > test.txt
    
    $ cat test.txt
    abc xyz 0.9
    efg xyz 0.3
    lmn opq .23
    abc jkl 0.5
    
    
    $ python
    Python 2.7.10 (default, Jul 30 2016, 18:31:42)
    [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    >>> pt = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack().as_matrix()
    >>> pt
    array([[ 0.5,  nan,  0.9],
           [ nan,  nan,  0.3],
           [ nan,  nan,  nan]])
    >>> pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
           2
    1    jkl opq  xyz
    0
    abc  0.5 NaN  0.9
    efg  NaN NaN  0.3
    lmn  NaN NaN  NaN
    
    >>> df = pd.read_csv('test.txt', index_col=[0,1], header=None, delimiter='\t').unstack()
    
    >>> df
           2
    1    jkl opq  xyz
    0
    abc  0.5 NaN  0.9
    efg  NaN NaN  0.3
    lmn  NaN NaN  NaN
    
    >>> df['abc', 'jkl']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
        return self._getitem_multilevel(key)
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
        loc = self.columns.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1617, in get_loc
        return self._engine.get_loc(key)
      File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
      File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
      File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
      File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
    KeyError: ('abc', 'jkl')
    >>> df['abc']['jkl']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
        return self._getitem_multilevel(key)
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
        loc = self.columns.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
        loc = self._get_level_indexer(key, level=0)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
        loc = level_index.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
      File "pandas/index.pyx", line 163, in pandas.index.IndexEngine.get_loc (pandas/index.c:4090)
    KeyError: 'abc'
    
    >>> df[0][2]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
        return self._getitem_multilevel(key)
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
        loc = self.columns.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
        loc = self._get_level_indexer(key, level=0)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
        loc = level_index.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
      File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
      File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
      File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
    KeyError: 0
    
    >>> df[0]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2055, in __getitem__
        return self._getitem_multilevel(key)
      File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2099, in _getitem_multilevel
        loc = self.columns.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1597, in get_loc
        loc = self._get_level_indexer(key, level=0)
      File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 1859, in _get_level_indexer
        loc = level_index.get_loc(key)
      File "/Library/Python/2.7/site-packages/pandas/indexes/base.py", line 2106, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)
      File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
      File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8141)
      File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8085)
    KeyError: 0
    

    The resulting matrix/dataframe should be queryable, i.e. it should be possible to do something like:
    probs[('585F', 'B4867')] = 0.7582038699473549
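A plain numpy array only accepts integer positions, so one way to get this kind of lookup is to keep a word-to-index mapping next to the array. The sketch below is only an illustration: the class name `PairMatrix` and the sample values are hypothetical, and it assumes every word appears once in its vocabulary list.

```python
import numpy as np

class PairMatrix:
    """Hypothetical wrapper: a 2-D array addressed by (n_word, m_word) pairs."""

    def __init__(self, matrix, row_words, col_words):
        self.matrix = np.asarray(matrix, dtype=float)
        # Map each word to its row / column position in the array.
        self.row_index = {w: i for i, w in enumerate(row_words)}
        self.col_index = {w: j for j, w in enumerate(col_words)}

    def __getitem__(self, pair):
        n_word, m_word = pair
        # Raises KeyError if either word is not in the vocabulary.
        return self.matrix[self.row_index[n_word], self.col_index[m_word]]

# Illustrative data only; NaN marks pairs with no stored probability.
rows = ["abc", "efg", "lmn"]
cols = ["jkl", "opq", "xyz"]
values = np.array([[0.5, np.nan, 0.9],
                   [np.nan, np.nan, 0.3],
                   [np.nan, 0.23, np.nan]])
probs_by_word = PairMatrix(values, rows, cols)
print(probs_by_word[("abc", "jkl")])
```

The underlying array stays available as `probs_by_word.matrix` for any numeric work that does not need the labels.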
    

    Best answer

    I'm not sure there is a way to avoid the loop entirely, but I imagine it can be optimized using itertools:

    import itertools
    nested_loop_iter = itertools.product(n_vocab,m_vocab)
    # product() varies n_vocab in the outer loop, so the array is transposed at the end to put m_vocab on the rows
    probs = np.fromiter(map(hashes.get, nested_loop_iter),dtype=float)
    probs.resize((len(n_vocab),len(m_vocab)))
    probs = probs.T
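As a possible alternative when the table does not contain every (n, m) combination, the sketch below walks only the stored pairs and scatters them into a pre-filled array with a single fancy-indexing assignment. It still iterates over the dictionary once in Python, so it is not loop-free. The function name `dict_to_matrix`, the NaN fill value, and the sample data are assumptions, not part of the original answer.

```python
import numpy as np

def dict_to_matrix(pair_probs, row_words, col_words, fill=np.nan):
    """Build a len(row_words) x len(col_words) array from a {(row, col): prob} dict."""
    row_index = {w: i for i, w in enumerate(row_words)}
    col_index = {w: j for j, w in enumerate(col_words)}
    # One pass over the stored pairs only, not over the full N x M product.
    # Iterating a dict yields its keys in the same order as .values().
    rows = np.fromiter((row_index[n] for n, _ in pair_probs), dtype=np.intp)
    cols = np.fromiter((col_index[m] for _, m in pair_probs), dtype=np.intp)
    vals = np.fromiter(pair_probs.values(), dtype=float)
    matrix = np.full((len(row_words), len(col_words)), fill, dtype=float)
    matrix[rows, cols] = vals  # place every stored value in one assignment
    return matrix

# Hypothetical sample: ("efg", "jkl") is deliberately missing.
sample = {("abc", "xyz"): 0.9, ("efg", "xyz"): 0.3, ("abc", "jkl"): 0.5}
n_words = ["abc", "efg"]
m_words = ["jkl", "xyz"]
sparse_probs = dict_to_matrix(sample, n_words, m_words)
print(sparse_probs)
```

Here rows follow `n_words` and columns follow `m_words`; taking `.T` of the result gives the M x N orientation used above. Missing pairs are left as NaN instead of being looked up with `hashes.get`, which returns `None` for an absent key.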
    

    Regarding "python - creating a |N| x |M| matrix from a hash table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40209612/
