Problem description
Is it possible to initialise a numpy recarray that will hold strings, without knowing the length of the strings beforehand?
As a (contrived) example:
mydf = np.empty( (numrows,), dtype=[ ('file_name','STRING'), ('file_size_MB',float) ] )
The problem is that I'm constructing my recarray in advance of populating it with information, and I don't necessarily know the maximum length of file_name in advance.
All my attempts result in the string field being truncated:
>>> mydf = np.empty( (2,), dtype=[('file_name',str),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('', 6.9164002347457e-310), ('', 9.9413127e-317)],
dtype=[('file_name', 'S'), ('file_size_mb', '<f8')])
>>> mydf['file_name']
array(['f', 'a'],
dtype='|S1')
(As an aside, why does mydf['file_name'] show 'f' and 'a' whilst mydf shows '' and ''?)
Similarly, if I initialise with type (say) |S10 for file_name, then things get truncated at length 10.
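A minimal sketch of that truncation, assuming a 10-byte fixed-width field (the variable names here are illustrative only):

```python
import numpy as np

# Fixed-width byte-string field: every entry occupies exactly 10 bytes.
fixed = np.empty((1,), dtype=[('file_name', 'S10'), ('file_size_mb', float)])

# The assigned value is longer than 10 bytes, so numpy silently drops the tail.
fixed['file_name'][0] = b'arghtidlsarbda.jpg'
truncated = fixed['file_name'][0]

print(truncated)
```

Only the first 10 bytes survive, and numpy raises no error or warning when this happens.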
The only similar question I could find is this one, but it calculates the appropriate string length a priori and hence is not quite the same as mine (as I know nothing in advance).
Is there any alternative other than initialising the file_name with (e.g.) |S9999999999999 (i.e. some ridiculous upper limit)?
Recommended answer
Instead of using the STRING dtype, one can always use object as the dtype. That allows any object to be assigned to an array element, including Python variable-length strings. For example:
>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)],
dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])
It is against the spirit of the array concept to have variable-length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined and regularly spaced addresses, which prohibits variable-length elements. By storing pointers to the strings in the array, one can circumvent this limitation. (This is basically what the above example does.)
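To illustrate the pointer idea, here is a small sketch that checks the size of each record. It assumes the default packed (unaligned) layout of a structured dtype; the names record_size and pointer_size are illustrative only:

```python
import numpy as np

# Same layout as the example above: an object field plus a float64 field.
mydf = np.empty((2,), dtype=[('file_name', object), ('file_size_mb', float)])
mydf['file_name'][0] = 'foobarasdf.tif'
mydf['file_name'][1] = 'arghtidlsarbda.jpg' * 100  # a much longer string

# The object field stores only a pointer, so its size does not depend on
# how long the referenced string is.
pointer_size = mydf.dtype['file_name'].itemsize
record_size = mydf.dtype.itemsize  # pointer plus 8 bytes for the float64

print(pointer_size, record_size)
print(len(mydf['file_name'][1]))  # the full string is kept, untruncated
```

Each record keeps the same fixed size however long the strings are, because the string data itself lives outside the array. The trade-off is that operations on the object field go through ordinary Python objects rather than numpy's fixed-width string storage.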