Question
Is it possible to load an array with a text field of unknown length?
I figured out how to pass dtype to get strings into it. However, without specifying a length I just get U0, a type which seems unable to hold any data. E.g.:
>>> import io, numpy
>>> data = io.StringIO("test data lololol\ntest2 d4t4 ololol")
>>> ar = numpy.loadtxt(data, dtype=[("1",str), ("2",'S'), ("3",'S')])
>>> ar
array([('', b'', b''), ('', b'', b'')],
dtype=[('1', '<U0'), ('2', '|S0'), ('3', '|S0')])
When I switch to a dtype with explicit sizes, the input is read:
>>> data.seek(0)
0
>>> numpy.loadtxt(data, dtype=[("1",(str,30)), ("2",(str,30)), ("3",('S',30))])
array([("b'test'", "b'data'", b'lololol'),
("b'test2'", "b'd4t4'", b'ololol')],
dtype=[('1', '<U30'), ('2', '<U30'), ('3', '|S30')])
I'd probably be fine with either S or U. The field in my case is supposed to hold a set of textual flags, something like Linux environment variables. Preallocating a large width just in case therefore seems like a big waste, especially when the number of rows goes into the millions.
I do understand, or at least have ideas about, where such a design comes from: constructing a struct-like object that holds a whole row in a contiguous block of memory. However, I thought there might be a way to make it keep something like a pointer in the case of strings.
Is that possible?
Recommended answer
The question "getting indices in numpy" uses np.recfromtxt, which can generate the dtype automatically. Effectively it calls np.genfromtxt with dtype=None.
Data like:
david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
produces:
array([('david', 'weight_2005', 50), ('david', 'weight_2012', 60),
('david', 'height_2005', 150), ('david', 'height_2012', 160),...],
dtype=[('f0', 'S5'), ('f1', 'S11'), ('f2', '<i4')])
The code in genfromtxt for determining the dtype looks complex. My guess is that it adjusts the Snn to accommodate the longest string it encounters in that field.
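That guess can be checked with a small experiment. The sketch below feeds the sample rows through np.genfromtxt with dtype=None and inspects the inferred field widths. The in-memory io.StringIO source and the encoding="utf-8" argument are assumptions made for this illustration (with that encoding the string fields come back as U rather than S), and the integer width of the last field depends on the platform.

```python
import io

import numpy as np

# Sample rows taken from the answer; the in-memory file is an assumption
# made so the example does not need a file on disk.
text = (
    "david weight_2005 50\n"
    "david weight_2012 60\n"
    "david height_2005 150\n"
)

# dtype=None asks genfromtxt to infer a type for each column.
# encoding="utf-8" makes the string columns unicode (U) instead of bytes (S).
arr = np.genfromtxt(io.StringIO(text), dtype=None, encoding="utf-8")

# Print each auto-named field with its inferred dtype.
for name in arr.dtype.names:
    print(name, arr.dtype[name])
```

On the sample data the two string fields should come out exactly as wide as their longest entries (5 and 11 characters), which is consistent with the guess above.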
One way to customize the dtype is to assign names in genfromtxt, and then recast the values afterwards with astype.
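import numpy as np

# Let genfromtxt infer the field types, but supply the field names.
x = np.genfromtxt('stack19944408.txt', dtype=None, names=['one', 'two', 'thr'])
# Recast to fixed-width strings and a float; strings longer than 10 are truncated.
x.astype(dtype=[('one', 'S10'), ('two', 'S10'), ('thr', 'f')])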
#array([('david', 'weight_200', 50.0), ('david', 'weight_201', 60.0),
# ...
# dtype=[('one', 'S10'), ('two', 'S10'), ('thr', '<f4')])
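The same recast can be tried without the file on disk. This is only a sketch under stated assumptions: the data comes from an in-memory io.StringIO, encoding="utf-8" is passed so the inferred string fields are unicode, and the target dtype uses U10 rather than S10 to match. It also shows that a width narrower than the data cuts the strings short without any warning.

```python
import io

import numpy as np

# Two of the sample rows, held in memory (an assumption for this sketch).
text = "david weight_2005 50\ndavid weight_2012 60\n"

# Infer the field types, but choose the field names up front.
x = np.genfromtxt(
    io.StringIO(text),
    dtype=None,
    names=["one", "two", "thr"],
    encoding="utf-8",
)

# Recast to a hand-picked dtype: 10-character strings and a 32-bit float.
# The second field was inferred as 11 characters wide, so it is truncated.
y = x.astype([("one", "U10"), ("two", "U10"), ("thr", "f4")])

print(x.dtype)
print(y.dtype)
print(y["two"])
```

Because the truncation is silent, it is worth checking the inferred widths in x.dtype before choosing the target widths.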