How to set all values of an existing Pandas DataFrame to zero

Question

I have an existing Pandas DataFrame with a date index and columns that each have a specific name.

The data cells are filled with various float values.

I would like to copy my DataFrame, but replace all these values with zero.

The objective is to reuse the structure of the DataFrame (dimensions, index, column names), but clear all the current values by replacing them with zeros.

This is how I currently do it:

df[df > 0] = 0

However, this does not replace any negative values in the DataFrame.
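
A small frame makes the gap visible. This is only a minimal sketch: the column names and values are made up for illustration, and the mask is applied to a copy so the original stays intact.

```python
import pandas as pd

# Hypothetical data: a date index and two float columns with mixed signs.
df = pd.DataFrame(
    {"a": [1.5, -2.0, 3.0], "b": [-0.5, 4.0, 0.0]},
    index=pd.date_range("2020-01-01", periods=3),
)

masked = df.copy()
masked[masked > 0] = 0  # only strictly positive cells are overwritten

print(masked)
# Column "a" becomes [0.0, -2.0, 0.0] and column "b" becomes [-0.5, 0.0, 0.0]:
# the negative values survive the mask.
```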

Isn't there a more general approach to filling an entire existing DataFrame with a single common value?

Thanks in advance.

Recommended answer

The absolute fastest way, which also preserves dtypes, is the following:

for col in df.columns:
    df[col].values[:] = 0

This writes directly to the underlying numpy array of each column. I doubt any other method will be faster than this, as it allocates no additional storage and doesn't pass through pandas' dtype handling. You can also use np.issubdtype to zero out only the numeric columns. This is probably what you want if you have a mixed-dtype DataFrame, but of course it's not necessary if your DataFrame is already entirely numeric.

for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
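
The loops above modify df in place, and they rely on .values returning a writable view of the column's data, which may not hold in every pandas version or configuration (for example with Copy-on-Write enabled). If you want a zeroed copy instead, as the question asks, the following is a rough sketch of a different technique: it rebuilds each numeric column from np.zeros with the original dtype. The helper name zeroed_copy is hypothetical, and the sketch assumes unique column names and numpy-backed numeric dtypes.

```python
import numpy as np
import pandas as pd


def zeroed_copy(df):
    """Return a copy of df with every numeric column replaced by zeros.

    Hypothetical helper, not part of pandas. Non-numeric columns are
    left as they are, and each zeroed column keeps its original dtype.
    """
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        # Only handle plain numpy numeric dtypes (int, float, ...).
        if isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.number):
            out[col] = np.zeros(len(out), dtype=dtype)
    return out


# Example data, made up for illustration.
df = pd.DataFrame(
    {"i": [1, -2], "f": [1.5, -0.5], "s": ["a", "b"]},
    index=pd.date_range("2020-01-01", periods=2),
)
z = zeroed_copy(df)
print(z)
print(z.dtypes)
```

This allocates new arrays, so it gives up the speed advantage of writing in place; it trades that for leaving the original DataFrame untouched.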

For small DataFrames, the subtype check is somewhat costly. However, the cost of zeroing a non-numeric column is substantial, so if you're not sure whether your DataFrame is entirely numeric, you should probably include the issubdtype check.

Timing comparison

Setup

import pandas as pd
import numpy as np

def make_df(n, only_numeric):
    series = [
        pd.Series(range(n), name="int", dtype=int),
        pd.Series(range(n), name="float", dtype=float),
    ]
    if only_numeric:
        series.extend(
            [
                pd.Series(range(n, 2 * n), name="int2", dtype=int),
                pd.Series(range(n, 2 * n), name="float2", dtype=float),
            ]
        )
    else:
        series.extend(
            [
                pd.date_range(start="1970-1-1", freq="min", periods=n, name="dt")
                .to_series()
                .reset_index(drop=True),
                pd.Series(
                    [chr((i % 26) + 65) for i in range(n)],
                    name="string",
                    dtype="object",
                ),
            ]
        )

    return pd.concat(series, axis=1)





>>> make_df(5, True)
   int  float  int2  float2
0    0    0.0     5     5.0
1    1    1.0     6     6.0
2    2    2.0     7     7.0
3    3    3.0     8     8.0
4    4    4.0     9     9.0

>>> make_df(5, False)
   int  float                  dt string
0    0    0.0 1970-01-01 00:00:00      A
1    1    1.0 1970-01-01 00:01:00      B
2    2    2.0 1970-01-01 00:02:00      C
3    3    3.0 1970-01-01 00:03:00      D
4    4    4.0 1970-01-01 00:04:00      E



Small DataFrame

n = 10_000

# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
36.1 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
53 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
113 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.4 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
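
The timings above use the IPython %%timeit magic. To get a rough feel for the cost of the subtype check alone outside IPython, something like the following could be used. It is only a sketch with the standard timeit module; the int64 dtype and the call count are arbitrary choices, and the result will vary by machine and numpy version.

```python
import timeit

import numpy as np

dtype = np.dtype("int64")  # arbitrary numeric dtype for the measurement
n_calls = 100_000          # arbitrary repetition count

# Time only the np.issubdtype call, without touching any DataFrame.
total = timeit.timeit(lambda: np.issubdtype(dtype, np.number), number=n_calls)
per_call_us = total / n_calls * 1e6

print(f"np.issubdtype: about {per_call_us:.2f} µs per call")
```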



Large DataFrame

n = 10_000_000

# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
38.7 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.1 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
99.5 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
17.8 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)






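I previously suggested the answer below, but I now consider it harmful: it is significantly slower than the approach above and harder to reason about. Its only advantage is that it is nicer to write.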

df[:] = 0

Unfortunately the dtype situation is a bit fuzzy, because every column in the resulting DataFrame will have the same dtype. If every column of df was originally float, the new dtypes will still be float. But if a single column was int or object, it seems that the new dtypes will all be int.
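
Since this behavior may depend on the pandas version, it is worth checking on your own installation. The sketch below uses made-up column names and values, and simply prints the dtypes before and after the assignment so they can be compared side by side.

```python
import pandas as pd

# Hypothetical frame with one int column and one float column.
df = pd.DataFrame(
    {"ints": [1, -2, 3], "floats": [0.5, -1.5, 2.5]},
    index=pd.date_range("2020-01-01", periods=3),
)

before = df.dtypes.copy()
df[:] = 0  # overwrite every cell, keeping index and column names
after = df.dtypes

# Compare the column dtypes before and after the assignment.
print(pd.DataFrame({"before": before, "after": after}))
```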
