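Background:

I am working with 8-band multispectral satellite imagery and estimating water depth from reflectance values. Using statsmodels, I have come up with an OLS model that predicts the depth of each pixel from that pixel's 8 reflectance values. To make the OLS model easy to work with, I have put all the pixel reflectance values into a pandas dataframe formatted like the example below, where each row represents a pixel and each column is a spectral band of the multispectral image.

Due to some preprocessing steps, all the onshore pixels have been converted to all zeros. I don't want to try to predict the 'depth' of those pixels, so I want to restrict my OLS model predictions to the rows that are NOT all zeros.

I will need to reshape my results back to the row x column dimensions of the original image, so I can't just drop the all-zero rows.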
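
The constraint above can be sketched as follows: predict only for the rows that are not all zeros, write those predictions into a NaN-filled array with one slot per pixel, and reshape that array to the image dimensions. This is only a sketch under assumptions: `predict_depth` is a hypothetical stand-in for the fitted OLS model's prediction step, and the 3 x 4 image size is made up for illustration.

```python
import numpy as np
import pandas as pd

def predict_depth(pixels):
    # Hypothetical stand-in for the fitted OLS model's predict step:
    # it just sums the bands so that the example runs on its own.
    return pixels.sum(axis=1).to_numpy(dtype=float)

n_rows, n_cols = 3, 4                      # assumed image size
cols = ['band%d' % i for i in range(1, 9)]
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(1, 10, (n_rows * n_cols, 8)), columns=cols)
df.iloc[[0, 5, 11]] = 0                    # pretend these pixels are on shore

keep = df.any(axis=1).to_numpy()           # True where a row is not all zeros

# One slot per pixel; on-shore pixels stay NaN.
depth = np.full(len(df), np.nan)
depth[keep] = predict_depth(df[keep])

# Every pixel kept its position, so the reshape lines up with the image.
depth_image = depth.reshape(n_rows, n_cols)
```

Because no row is dropped, pixel `i` of the flat array still maps to row `i // n_cols`, column `i % n_cols` of the image.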

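Specific question:

I have a pandas dataframe. Some rows contain all zeros. I want to mask those rows for some calculations, but I need to keep the rows in place. I can't figure out how to mask all the entries of the rows that are all zeros.

For example: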

In [1]: import pandas as pd
In [2]: import numpy as np
        # my actual data has about 16 million rows so
        # I'll simulate some data for the example.
In [3]: cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
In [4]: rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
In [5]: zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
In [6]: df = pd.concat((rdf,zdf)).reset_index(drop=True)
In [7]: df
Out[7]:
        band1  band2  band3  band4  band5  band6  band7  band8
    0       9      9      8      7      2      7      5      6
    1       7      7      5      6      3      0      9      8
    2       5      4      3      6      0      3      8      8
    3       6      4      5      0      5      7      4      5
    4       8      3      2      4      1      3      2      5
    5       9      7      6      3      8      7      8      4
    6       6      2      8      2      2      6      9      8
    7       9      4      0      2      7      6      4      8
    8       1      3      5      3      3      3      0      1
    9       4      2      9      7      3      5      5      0
    10      0      0      0      0      0      0      0      0
    11      0      0      0      0      0      0      0      0
    12      0      0      0      0      0      0      0      0

    [13 rows x 8 columns]


I know that I can get just the rows I'm interested in by doing this:

In [8]: df[df.any(axis=1)==True]
Out[8]:
       band1  band2  band3  band4  band5  band6  band7  band8
    0      9      9      8      7      2      7      5      6
    1      7      7      5      6      3      0      9      8
    2      5      4      3      6      0      3      8      8
    3      6      4      5      0      5      7      4      5
    4      8      3      2      4      1      3      2      5
    5      9      7      6      3      8      7      8      4
    6      6      2      8      2      2      6      9      8
    7      9      4      0      2      7      6      4      8
    8      1      3      5      3      3      3      0      1
    9      4      2      9      7      3      5      5      0

   [10 rows x 8 columns]


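But I need to reshape the data again later, so I need those rows to stay in the right place. I've tried all sorts of things, including df.where(df.any(axis=1)==True), but I can't find anything that works.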

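Fails:

1. df.any(axis=1)==True gives me True for the rows I'm interested in and False for the rows I want to mask, but when I try df.where(df.any(axis=1)==True) I just get back the whole dataframe, complete with all the zeros. I want the whole dataframe but with all the values in those zero rows masked, so as I understand it they should show up as NaN, right?
2. I tried getting the indexes of the rows that are all zeros and masking by row: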

mskidxs = df[df.any(axis=1)==False].index
df.mask(df.index.isin(mskidxs))


That didn't get me anywhere either:

ValueError: Array conditional must be same shape as self


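.index just gives back an Int64Index. I need a boolean array with the same dimensions as my dataframe, but I just can't figure out how to get one.


Thanks in advance for your help.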

-Jared

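Best answer

The process of clarifying my question led me, in a roundabout way, to the answer. This question also helped point me in the right direction. Here's what I found: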

import numpy as np
import pandas as pd
# Set up my fake test data again. My actual data is described
# in the question.
cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
df = pd.concat((zdf,rdf)).reset_index(drop=True)

# View the dataframe. (sorry about the alignment, I don't
# want to spend the time putting in all the spaces)
df

    band1   band2   band3   band4   band5   band6   band7   band8
0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0
3   6   3   7   0   1   7   1   8
4   9   2   6   8   7   1   4   3
5   4   2   1   1   3   2   1   9
6   5   3   8   7   3   7   5   2
7   8   2   6   0   7   2   0   7
8   1   3   5   0   7   3   3   5
9   1   8   6   0   1   5   7   7
10  4   2   6   2   2   2   4   9
11  8   7   8   0   9   3   3   0
12  6   1   6   8   2   0   2   5

13 rows × 8 columns

# This is essentially the same as item #2 under Fails
# in my question. It gives me the indexes of the rows
# I want unmasked as True and those I want masked as
# False. However, the result is not the right shape to
# use as a mask.
df.apply( lambda row: any([i != 0 for i in row]),axis=1 )
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
dtype: bool

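# This is what actually works. By setting broadcast to
# True, I get a result that's the right shape to use.
# (The broadcast keyword was removed in pandas 1.0; newer
# versions take result_type='broadcast' instead.)
land_rows = df.apply( lambda row: any([i != 0 for i in row]),axis=1,
                      broadcast=True )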

land_rows

Out[92]:
    band1   band2   band3   band4   band5   band6   band7   band8
0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0
3   1   1   1   1   1   1   1   1
4   1   1   1   1   1   1   1   1
5   1   1   1   1   1   1   1   1
6   1   1   1   1   1   1   1   1
7   1   1   1   1   1   1   1   1
8   1   1   1   1   1   1   1   1
9   1   1   1   1   1   1   1   1
10  1   1   1   1   1   1   1   1
11  1   1   1   1   1   1   1   1
12  1   1   1   1   1   1   1   1

13 rows × 8 columns

# This produces the result I was looking for:
df.where(land_rows)

Out[93]:
    band1   band2   band3   band4   band5   band6   band7   band8
0   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   6   3   7   0   1   7   1   8
4   9   2   6   8   7   1   4   3
5   4   2   1   1   3   2   1   9
6   5   3   8   7   3   7   5   2
7   8   2   6   0   7   2   0   7
8   1   3   5   0   7   3   3   5
9   1   8   6   0   1   5   7   7
10  4   2   6   2   2   2   4   9
11  8   7   8   0   9   3   3   0
12  6   1   6   8   2   0   2   5

13 rows × 8 columns
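
A possible shortcut, offered as a sketch rather than as part of the original answer: the row-wise flag can be computed without `apply` through `df.any(axis=1)`, and the flagged rows can be overwritten on a copy with `.loc`, which sidesteps the need for a condition shaped like the dataframe. The name `mask_all_zero_rows_loc` is made up for this example, and the test data is drawn from 1 to 9 so that no random row is accidentally all zeros.

```python
import numpy as np
import pandas as pd

def mask_all_zero_rows_loc(df):
    # Work on a float copy so the caller's dataframe is left untouched
    # and every column can hold NaN.
    out = df.astype(float)
    # True for rows in which every value is zero.
    all_zero = ~df.any(axis=1)
    # Overwrite those rows on the copy; row order and index are
    # preserved, so the result can still be reshaped later.
    out.loc[all_zero, :] = np.nan
    return out

cols = ['band%d' % i for i in range(1, 9)]
rdf = pd.DataFrame(np.random.randint(1, 10, 80).reshape(10, 8), columns=cols)
zdf = pd.DataFrame(np.zeros((3, 8)), columns=cols)
df = pd.concat((zdf, rdf)).reset_index(drop=True)

masked = mask_all_zero_rows_loc(df)
```

A zero inside a row that also holds non-zero values is left alone; only rows that are zero in every band become NaN.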


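Thanks again to those who helped. Hopefully the solution I found will be useful to somebody at some point.

I found another way to do the same thing. There are more steps involved, but according to %timeit it is about 9 times faster. Here it is: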

def mask_all_zero_rows_numpy(df):
    """
    Take a dataframe, find all the rows that contain only zeros
    and mask them. Return a dataframe of the same shape with all
    NaN rows in place of the all zero rows.
    """
    no_data = -99
    # astype returns a copy, so the original dataframe is not modified
    arr = df.values.astype(np.int16)
    # make a row full of the 'no data' value
    replacement_row = np.full(arr.shape[1], no_data, dtype=np.int16)
    # find out what rows are all zeros
    mask_rows = ~arr.any(axis=1)
    # replace those all zero rows with all 'no_data' rows
    arr[mask_rows] = replacement_row
    # create a masked array with the no_data value masked
    marr = np.ma.masked_where(arr==no_data,arr)
    # turn masked array into a data frame
    mdf = pd.DataFrame(marr,columns=df.columns)
    return mdf


The result of mask_all_zero_rows_numpy(df) should be the same as Out[93] above.
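
One caveat with the `no_data` sentinel is that a genuine value of -99 anywhere in the data would be masked as well. A sketch of a variant that writes NaN straight into a float array, assuming float output is acceptable (the name `mask_all_zero_rows_float` is hypothetical, and no timing is claimed for it):

```python
import numpy as np
import pandas as pd

def mask_all_zero_rows_float(df):
    # Copy the values into a float array so that NaN can be stored.
    arr = np.array(df.values, dtype=float)
    # Boolean vector, one entry per row: True where the row is all zeros.
    all_zero = ~arr.any(axis=1)
    # Boolean row indexing replaces whole rows at once.
    arr[all_zero] = np.nan
    return pd.DataFrame(arr, columns=df.columns, index=df.index)

cols = ['band1', 'band2', 'band3']
df = pd.DataFrame([[0, 0, 0], [1, 0, 2], [0, 0, 0], [3, 4, 5]], columns=cols)
masked = mask_all_zero_rows_float(df)
```

Here rows 0 and 2 come back as NaN, while the single zero in row 1 is kept because that row is not all zeros.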

Regarding python - masking pandas DataFrame rows based on the whole row, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23798961/
