Question
I wonder what the best way of normalizing/standardizing a numpy recarray is. To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).
a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print(a.shape)
> (150,)
As you can see, I cannot simply process a[:,:-1], because the shape is one-dimensional.
The best I found is to iterate over all columns:
for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())
Is there a more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?
Recommended answer
There are a number of ways to do it, but some are cleaner than others.
Usually, in numpy, you keep the string data in a separate array.
(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)
Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Things like pandas provide a better interface for "spreadsheet-like" data (and pandas is just a layer on top of numpy).
However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names (e.g. data[['col1', 'col2', 'col3']]).
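As a small illustration of that field-list indexing, here is a sketch on a hand-built structured array. The field names and the three sample rows are made up for the example and are not taken from iris.csv.

```python
import numpy as np

# Hypothetical three-row stand-in for the iris data (names are illustrative).
data = np.array(
    [(5.1, 3.5, "setosa"), (7.0, 3.2, "versicolor"), (6.3, 3.3, "virginica")],
    dtype=[("sepal_length", "f8"), ("sepal_width", "f8"), ("species", "U10")],
)

# Indexing with a list of field names keeps only those columns.
numeric = data[["sepal_length", "sepal_width"]]

print(numeric.dtype.names)  # names of the selected fields
print(numeric.shape)        # still one-dimensional: one record per row
```

Note that the result is still a structured array, so it has to be converted before it can be treated as an ordinary 2D float array.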
At any rate, one way is to do something like this:
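import numpy as np
from numpy.lib import recfunctions as rfn

# genfromtxt stands in for np.recfromcsv, which recent NumPy releases dropped.
data = np.genfromtxt('iris.csv', delimiter=',', dtype=None, encoding=None)
# In this case, it's just all but the last, but we could be more general.
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])
float_dat = data[float_fields]
# Now we just need to convert it to a "regular" 2D array. A plain
# .view(float) is not reliable on a multi-field selection in current NumPy,
# so use structured_to_unstructured (NumPy >= 1.16) instead.
float_dat = rfn.structured_to_unstructured(float_dat)
# And we can normalize columns as usual (np.ptp is max - min along the axis).
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)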
However, this is far from ideal. If you want to do the operation in-place (as you currently are), the easiest solution is the one you already have: just iterate over the field names.
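If the loop is needed in more than one place, it can be wrapped in a small helper. The following is only a sketch: rescale_fields and its method argument are hypothetical names introduced here (not a numpy API), and the sample array is made up. It covers both min-max normalization and z-score standardization.

```python
import numpy as np

def rescale_fields(arr, fields, method="minmax"):
    """Rescale the named numeric fields of a structured array in place.

    Hypothetical helper for illustration; not part of numpy.
    """
    for name in fields:
        col = arr[name]  # a view into arr; the assignment below updates arr
        if method == "minmax":
            arr[name] = (col - col.min()) / (col.max() - col.min())
        elif method == "zscore":
            arr[name] = (col - col.mean()) / col.std()
        else:
            raise ValueError(f"unknown method: {method!r}")
    return arr

# Made-up sample data: two float fields plus a text label.
data = np.array(
    [(1.0, 10.0, "a"), (2.0, 20.0, "b"), (3.0, 40.0, "c")],
    dtype=[("x", "f8"), ("y", "f8"), ("label", "U5")],
)
standardized = rescale_fields(data.copy(), ["x", "y"], method="zscore")
rescale_fields(data, ["x", "y"])  # min-max, modifies data in place
print(data["x"])
```

A constant column would divide by zero in either branch, so real data may need a guard for that case.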
Incidentally, using pandas, you'd do something like this:
import pandas
data = pandas.read_csv('iris.csv', header=None)
float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)
data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
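When the numeric columns are not simply "all but the last", one option is to select them by dtype. This is a sketch on a made-up DataFrame (the column names and values are assumptions for illustration), using the same min-max formula as above.

```python
import pandas as pd

# Hypothetical stand-in for iris.csv: two numeric columns plus a label column.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# Pick the numeric columns by dtype instead of by position.
num_cols = df.select_dtypes(include="number").columns
num = df[num_cols]

# Min-max scale each numeric column; the label column is left untouched.
df[num_cols] = (num - num.min()) / (num.max() - num.min())
print(df)
```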