python - Efficiently finding matching rows (based on content) in a pandas DataFrame

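I am writing some tests and I'm using Pandas DataFrames to house a large dataset of roughly 600,000 x 10. I have extracted 10 random rows from the source data (using Stata), and now I want to write a test to check whether those rows are in the DataFrame in my test suite.

As a small example: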

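import numpy as np
import pandas as pd

np.random.seed(2)
raw_data = pd.DataFrame(np.random.rand(5,3), columns=['one', 'two', 'three'])
random_sample = raw_data.iloc[1]  # .ix was removed in pandas 1.0; use .iloc for positional access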

Here raw_data is:

And random_sample, which is derived from raw_data so that a match is guaranteed, is:

Currently I have written:

for idx, row in raw_data.iterrows():
    if random_sample.equals(row):
        print("match")
        break

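This works, but it is very slow on large datasets. Is there a more efficient way to check whether an entire row is contained in the DataFrame?

BTW: my example also needs to treat np.NaN values as equal when comparing, which is why I'm using the equals() method.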
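To illustrate the NaN point, here is a minimal sketch. The Series a and b are made up for this example and are not part of my data.

```python
import numpy as np
import pandas as pd

# Two illustrative Series with a NaN in the same position.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# Elementwise == follows IEEE rules: NaN is not equal to anything, including itself.
elementwise_match = bool((a == b).all())

# Series.equals treats NaNs in the same location as equal.
equals_match = a.equals(b)

print(elementwise_match, equals_match)
```

So a plain == comparison would report no match for any row that contains a NaN.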

Best answer

equals doesn't seem to broadcast, but we can always do the equality comparison manually:

>>> df = pd.DataFrame(np.random.rand(600000, 10))
>>> sample = df.iloc[-1]
>>> %timeit df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
1 loops, best of 3: 231 ms per loop
>>> df[((df == sample) | (df.isnull() & sample.isnull())).all(1)]
              0         1         2         3         4         5         6  \
599999  0.07832  0.064828  0.502513  0.851816  0.976464  0.761231  0.275242

               7        8         9
599999  0.426393  0.91632  0.569807

For me, this is much faster than the iterative version (which takes more than 30 seconds).
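To see what the mask is doing, here is a small sketch on a made-up three-row frame (the column names and values are assumptions for illustration only). The `df == sample` part aligns the Series index with the column labels and compares every row, and the `isnull` part accepts positions where both sides are NaN.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; row 1 contains a NaN.
df = pd.DataFrame({
    "one": [1.0, np.nan, 3.0],
    "two": [4.0, 5.0, np.nan],
})
sample = df.iloc[1]

# Plain equality: the NaN in the sample never compares equal, so no row matches.
plain = (df == sample).all(axis=1)

# NaN-aware equality: also accept cells where both df and sample are NaN.
nan_aware = ((df == sample) | (df.isnull() & sample.isnull())).all(axis=1)

matches = df[nan_aware]
print(matches)
```

Only row 1 survives the NaN-aware mask, while the plain comparison finds nothing.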

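But since we have lots of rows and relatively few columns, we can loop over the columns instead, which in the typical case will probably cut down substantially on the number of rows left to look at. For example, something like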

def finder(df, row):
    for col in df:
        df = df.loc[(df[col] == row[col]) | (df[col].isnull() & pd.isnull(row[col]))]
    return df

gives me

>>> %timeit finder(df, sample)
10 loops, best of 3: 35.2 ms per loop

which is roughly an order of magnitude faster, because after the first column there is only one row left.
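If all the test needs is a yes/no answer, the same column-by-column idea can stop as soon as no candidate rows remain. The following is only a sketch under that assumption; contains_row is a hypothetical helper name, and the small frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def contains_row(df, row):
    """Hypothetical helper: True if some row of df equals `row`, with NaN == NaN."""
    candidates = df
    for col in df.columns:
        value = row[col]
        if pd.isnull(value):
            mask = candidates[col].isnull()   # NaN in the sample matches NaN in the column
        else:
            mask = candidates[col] == value
        candidates = candidates.loc[mask]
        if candidates.empty:
            return False  # nothing left, so the remaining columns need not be checked
    return True

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [4.0, 5.0, np.nan]})
found = contains_row(df, df.iloc[1])                             # row taken from df
missing = contains_row(df, pd.Series({"one": 1.0, "two": 5.0}))  # not a row of df
print(found, missing)
```

In a test suite this could be wrapped in an assertion for each of the sampled rows.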

(I think I once had a much snazzier way to do this, but for the life of me I can't remember it now.)