Problem description
NumPy seems to lack built-in support for 3-byte and 6-byte types, aka uint24 and uint48. I have a large data set using these types and want to feed it to NumPy. What I currently do (for uint24):
import numpy as np
dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
# I would like to be able to write
# dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
# dt = np.dtype([('head', '<u2'), ('data', '<u6')])
a = np.memmap("filename", mode='r', dtype=dt)
# convert 3 x 2byte data to 2 x 3byte
# w1 is LSB, w3 is MSB
w1, w2, w3 = a['data'].swapaxes(0,1)
a2 = np.ndarray((2,a.size), dtype='u4')
# 3 LSB
a2[0] = w2 % 256
a2[0] <<= 16
a2[0] += w1
# 3 MSB
a2[1] = w3
a2[1] <<= 8
a2[1] += w2 >> 8
# now a2 contains "uint24" matrix
While it works for a 100 MB input, it looks inefficient (think of hundreds of GB of data). Is there a more efficient way? For example, a special kind of read-only view that masks part of the data would be useful (a kind of "uint64 with the two most significant bytes always zero" type). I only need read-only access to the data.
Recommended answer
I don't believe there's a way to do what you're asking (it would require unaligned access, which is highly inefficient on some architectures). My solution from "Reading and storing arbitrary byte length integers from a file" might be more efficient at transferring the data to an in-process array:
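a = np.memmap("filename", mode='r', dtype=np.dtype('>u1'))
# one zero-filled 8-byte slot per 6-byte big-endian value
# (integer division: a float size would be rejected under Python 3)
e = np.zeros(a.size // 6, np.dtype('>u8'))
for i in range(3):
    # copy the i-th 16-bit word of every value; the top word of each slot stays zero
    e.view(dtype='>u2')[i + 1::4] = a.view(dtype='>u2')[i::3]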
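To check the word-copy approach above without a data file, here is a minimal self-contained sketch. It assumes the values are stored as big-endian uint48 with no header, and it substitutes an in-memory byte string for the memmap; the names `values`, `raw`, `e16` and `a16` are illustrative only and do not come from the original answer.

```python
import numpy as np

# Hypothetical sample values, each fitting in 48 bits.
values = [0, 1, 0x0123456789AB, 2**48 - 1]

# Pack each value as 6 big-endian bytes; this stands in for the file contents.
raw = b"".join(v.to_bytes(6, "big") for v in values)
a = np.frombuffer(raw, dtype=">u1")

# One zero-filled 8-byte slot per value.
e = np.zeros(a.size // 6, dtype=">u8")
e16 = e.view(">u2")  # 4 sixteen-bit words per destination slot
a16 = a.view(">u2")  # 3 sixteen-bit words per source value
for i in range(3):
    # Word i of each source value goes to word i + 1 of its slot,
    # so the most significant word of every slot stays zero.
    e16[i + 1::4] = a16[i::3]

print(e.tolist())
```

The loop runs only three times regardless of the data size; each iteration is a single strided copy, so the per-element work stays inside NumPy.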
You can get unaligned access using the strides constructor parameter:
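# the buffer is the byte array `a` from above; the last 6-byte value is left out
# so that no 8-byte read runs past the end of the buffer
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), a, strides=(6,))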
However, with this layout each element overlaps the next: the two high bytes of every 8-byte element are the first two bytes of the following value, so to actually use it you'd have to mask out the high bytes on access.
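The overlapping view and the masking step can be sketched as follows. This is only an illustration under stated assumptions: the data is little-endian uint48 with no header, an in-memory byte string replaces the memmap, and the names `values`, `raw`, `buf`, `n`, `masked` and `last` are hypothetical.

```python
import numpy as np

# Hypothetical sample values, each fitting in 48 bits.
values = [1, 0x0123456789AB, 2**48 - 1, 42]

# Pack each value as 6 little-endian bytes; this stands in for the file contents.
raw = b"".join(v.to_bytes(6, "little") for v in values)
buf = np.frombuffer(raw, dtype="u1")

# 8-byte elements placed every 6 bytes. The final value is excluded because
# reading 8 bytes at its offset would run past the end of the buffer.
n = (buf.size - 2) // 6
e = np.ndarray((n,), dtype="<u8", buffer=buf, strides=(6,))

# Each element also contains the low 2 bytes of the next value in its
# top 16 bits, so keep only the low 48 bits.
masked = e & np.uint64(0xFFFFFFFFFFFF)

# The excluded final value can be decoded separately.
last = int.from_bytes(raw[-6:], "little")

print(masked.tolist(), last)
```

Note that the masking expression allocates a new array, so on very large inputs it would likely need to be applied to slices rather than to the whole view at once.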