目录
一、用法精讲
576、pandas.DataFrame.merge方法
576-1、语法
# 576、pandas.DataFrame.merge方法
pandas.DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
Parameters:
rightDataFrame or named Series
Object to merge with.
how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
onlabel or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_onlabel or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_onlabel or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_indexbool, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_indexbool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.
sortbool, default False
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixeslist-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
copybool, default True
If False, avoid copy if possible.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write pd.options.mode.copy_on_write = True
indicatorbool or str, default False
If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.
validatestr, optional
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
Returns:
DataFrame
A DataFrame of the two merged objects.
576-2、参数
576-2-1、right(必须):DataFrame,表示要与当前数据框合并的另一个数据框。
576-2-2、how(可选,默认值为'inner'):字符串,表示合并方式,可选值:
- 'left':保留左侧数据框的所有行。
- 'right':保留右侧数据框的所有行。
- 'outer':保留左右两侧数据框的所有行(取并集)。
- 'inner':仅保留左右两侧数据框中都有匹配值的行(取交集)。
576-2-3、on(可选,默认值为None):标签或列表,用于合并的列名称,如果两侧数据框有相同列名,可以直接用on参数指定,如果on为None,且未指定left_on和right_on,则尝试使用所有相同的列进行合并。
576-2-4、left_on(可选,默认值为None):标签或列表,表示左侧数据框中用于合并的列名。
576-2-5、right_on(可选,默认值为None):标签或列表,表示右侧数据框中用于合并的列名。
576-2-6、left_index(可选,默认值为False):布尔值,是否使用左侧数据框的索引作为合并键,如果为True,则on、left_on将被忽略。
576-2-7、right_index(可选,默认值为False):布尔值,是否使用右侧数据框的索引作为合并键,如果为True,则on、right_on将被忽略。
576-2-8、sort(可选,默认值为False):布尔值,是否对合并后的数据进行排序,默认不排序,合并后的数据会按照连接键的顺序保留。
576-2-9、suffixes(可选,默认值为('_x', '_y')):元组,用于指定合并后列名重复时的后缀。例如,如果两个数据框中都有名为'column1'的列,则在合并后,这两列会分别被命名为'column1_x'和'column1_y'。
576-2-10、copy(可选,默认值为None):布尔值,是否总是复制数据,如果为False,则可能直接引用原数据。
576-2-11、indicator(可选,默认值为False):布尔值或字符串,如果为True,合并结果中将添加一列_merge,用以指示每行来源于哪个数据框。它的值可以是'left_only' 、'right_only'或'both'。
576-2-12、validate(可选,默认值为None):字符串,用于验证合并的类型,确保输入数据框符合预期的合并规则,可选值:
- 'one_to_one':验证是否为一对一合并。
- 'one_to_many':验证左侧数据框是否为一对多合并。
- 'many_to_one':验证右侧数据框是否为多对一合并。
- 'many_to_many':验证是否为多对多合并。
576-3、功能
用于将两个数据框按照指定的列或索引进行合并,类似SQL中的JOIN
操作,根据指定的how参数,它可以实现内连接、左连接、右连接或外连接。
576-4、返回值
返回值是一个新的DataFrame,包含合并后的数据,并按照合并规则保留相关列和行。
576-5、说明
无
576-6、用法
576-6-1、数据准备
无
576-6-2、代码示例
# 576、pandas.DataFrame.merge方法
import pandas as pd
# 创建两个示例数据框
df1 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Myelse', 'Bryce', 'Jimmy']
})
df2 = pd.DataFrame({
'id': [2, 3, 4],
'age': [43, 6, 14]
})
# 合并数据框,使用内连接
merged_df = pd.merge(df1, df2, on='id', how='inner')
print(merged_df)
576-6-3、结果输出
# 576、pandas.DataFrame.merge方法
# id name age
# 0 2 Bryce 43
# 1 3 Jimmy 6
577、pandas.DataFrame.update方法
577-1、语法
# 577、pandas.DataFrame.update方法
pandas.DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore')
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
Parameters:
other
DataFrame, or object coercible into a DataFrame
Should have at least one matching index/column label with the original DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name to align with the original DataFrame.
join
{‘left’}, default ‘left’
Only left join is implemented, keeping the index and columns of the original object.
overwrite
bool, default True
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
filter_func
callable(1d-array) -> bool 1d-array, optional
Can choose to replace values other than NA. Return True for values that should be updated.
errors
{‘raise’, ‘ignore’}, default ‘ignore’
If ‘raise’, will raise a ValueError if the DataFrame and other both contain non-NA data in the same place.
Returns:
None
This method directly changes calling object.
Raises:
ValueError
When errors=’raise’ and there’s overlapping non-NA data.
When errors is not either ‘ignore’ or ‘raise’
NotImplementedError
If join != ‘left’
577-2、参数
577-2-1、other(必须):DataFrame或类字典结构(dict-like),用于更新当前DataFrame的数据,other的列标签需要与当前数据框匹配,更新是基于相同的行索引,类字典结构中的键被视为列,值被视为要更新的对应数据。
577-2-2、join(可选,默认值为'left'):字符串,定义如何对齐并更新数据框的索引,选项有:
- 'left':使用当前DataFrame的索引进行更新,对于other中的索引若不存在则忽略其内容。
- 'right':使用other的索引进行更新,可能添加新行。
- 'outer':结合两个数据框的索引,可能增加新行或列。
- 'inner':仅依据两个数据框的公共索引进行更新。
577-2-3、overwrite(可选,默认值为True):布尔值,决定是否用other中的值覆盖当前数据框中的现有值,若为True(默认),other中的缺失值(NaN)也会覆盖原数据框中的值;若为False,则只更新那些存在于other中且非NaN的值,而保留原数据框中已存在的非空值。
577-2-4、filter_func(可选,默认值为None):可迭代对象,接受一个过滤器函数,只更新返回True的元素,你可以使用函数,如lambda x : x > 0,来限制哪些元素可以被更新。
577-2-5、errors(可选,默认值为'ignore'):字符串,指定遇到无效索引或列标签时的处理方式,选项有:
- 'ignore'(默认):忽略错误,不抛出异常。
- 'raise':遇到错误时抛出异常。
577-3、功能
通过other的数据更新当前数据框的值,方法会直接修改调用者数据框,因此能够减少内存的使用和提升效率,它可以通过join的不同参数控制如何对齐和合并数据,使用overwrite参数控制是否允许用缺失值(NaN)覆盖原有数据,filter_func提供了一种灵活的方式,用于控制哪些数据可以被更新,errors参数能够帮助处理在更新过程中可能出现的错误,从而在运行时增强稳定性。
577-4、返回值
该方法在原始数据框上就地进行更新,而不返回新的数据框。
577-5、说明
无
577-6、用法
577-6-1、数据准备
无
577-6-2、代码示例
# 577、pandas.DataFrame.update方法
import pandas as pd
# 原始数据框
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# 用于更新的DataFrame
other = pd.DataFrame({
'A': [None, 0, None],
'B': [7, None, None]
})
# 更新df
df.update(other) # 默认参数使用
print(df)
577-6-3、结果输出
# 577、pandas.DataFrame.update方法
# A B
# 0 1 7
# 1 0 5
# 2 3 6
578、pandas.DataFrame.asfreq方法
578-1、语法
# 578、pandas.DataFrame.asfreq方法
pandas.DataFrame.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
Convert time series to specified frequency.
Returns the original data conformed to a new index with the specified frequency.
If the index of this Series/DataFrame is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.
Parameters:
freq
DateOffset or str
Frequency DateOffset or string.
method
{‘backfill’/’bfill’, ‘pad’/’ffill’}, default None
Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):
‘pad’ / ‘ffill’: propagate last valid observation forward to next valid
‘backfill’ / ‘bfill’: use NEXT valid observation to fill.
how
{‘start’, ‘end’}, default end
For PeriodIndex only (see PeriodIndex.asfreq).
normalize
bool, default False
Whether to reset output index to midnight.
fill_value
scalar, optional
Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).
Returns:
Series/DataFrame
Series/DataFrame object reindexed to the specified frequency.
578-2、参数
578-2-1、freq(必须):str或DateOffset,指定新的时间频率,支持多种频率字符串,如'D'(日)、'H'(小时)、'T'(分钟)等,该参数决定了输出数据框的时间间隔。
578-2-2、method(可选,默认值为None):字符串,该参数用于处理缺失值。
- 'pad'或'ffill':使用前一个有效值填充缺失值。
- 'backfill'或'bfill':使用后一个有效值填充缺失值。
- 'nearest':使用最近的有效值进行填充。
578-2-3、how(可选,默认值为None):字符串,该参数已不再推荐使用,建议使用method参数来进行相应操作,在未来版本中可能会删除。
578-2-4、normalize(可选,默认值为False):布尔值,指定是否将时间戳标准化到日期,即将时间部分归零。例如,对于日频率的数据,其时间将被设定为00:00:00。
578-2-5、fill_value(可选,默认值为None):单一值,指定在生成的新索引中缺失值的填充值,如果不设置且method参数设定为None,则会产生缺失值(NaN)。
578-3、功能
对数据进行重采样,通过制定的频率,用户能容易地创建一个有序的时间序列,对于不存在于原数据索引中的时间点,方法根据指定的填充方式(如使用前后值)来填补缺失的数值,该方法在处理时间序列时特别有效,例如使得某个数据集的时间间隔一致,且便于之后的分析与建模。
578-4、返回值
返回一个新的数据框,索引是按照指定的频率重新排列的,包含相应的数据,填补缺失值后形成的完整时间序列。
578-5、说明
无
578-6、用法
578-6-1、数据准备
无
578-6-2、代码示例
# 578、pandas.DataFrame.asfreq方法
import pandas as pd
# 创建一个带有时间索引的数据框
date_rng = pd.date_range(start='2024-01-01', end='2024-01-05', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df.set_index('date', inplace=True)
df['data'] = [1, 2, 3, 4, 5]
# 删除一些数据点以创建缺失值
df = df.drop(df.index[[1, 3]])
print("原始数据框:")
print(df)
# 使用asfreq进行重采样
result = df.asfreq(freq='D', method='ffill')
print("\n重采样后的数据框:")
print(result)
578-6-3、结果输出
# 578、pandas.DataFrame.asfreq方法
# 原始数据框:
# data
# date
# 2024-01-01 1
# 2024-01-03 3
# 2024-01-05 5
#
# 重采样后的数据框:
# data
# date
# 2024-01-01 1
# 2024-01-02 1
# 2024-01-03 3
# 2024-01-04 3
# 2024-01-05 5
579、pandas.DataFrame.asof方法
579-1、语法
# 579、pandas.DataFrame.asof方法
pandas.DataFrame.asof(where, subset=None)
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the last row without NaN considering only the subset of columns (if not None)
If there is no good value, NaN is returned for a Series or a Series of NaN values for a DataFrame
Parameters:
wheredate or array-like of dates
Date(s) before which the last row(s) are returned.
subsetstr or array-like of str, default None
For DataFrame, if not None, only use these columns to check for NaNs.
Returns:
scalar, Series, or DataFrame
The return can be:
scalar : when self is a Series and where is a scalar
Series: when self is a Series and where is an array-like, or when self is a DataFrame and where is a scalar
DataFrame : when self is a DataFrame and where is an array-like.
579-2、参数
579-2-1、where(必须):scalar或array-like,用于查询的值,可以是一个单一的标量值,也可以是数组,方法将返回所有列中离该值最近的前一个有效值。
579-2-2、subset(可选,默认值为None):列表,指定要考虑的列名列表,如果提供该参数,则asof方法仅在指定的列中查找最近的有效值。
579-3、功能
用于在数据框中查找相对于特定值或时间戳的最上面(最近的)有效值,该方法通常用于时间序列数据的向后填充,尤其是在处理具有时间戳的索引时。
579-4、返回值
返回值Series或DataFrame,如果where是单个值,返回对应行的Series;
579-5、说明
无
579-6、用法
579-6-1、数据准备
无
579-6-2、代码示例
# 579、pandas.DataFrame.asof方法
import pandas as pd
# 创建一个带有时间索引的数据框
date_rng = pd.date_range(start='2024-01-01', end='2024-01-05', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df.set_index('date', inplace=True)
df['data'] = [1, 2, 3, 4, 5]
# 假设在索引中有一些缺失值
df.loc['2024-01-02'] = None
print("原始数据框:")
print(df)
# 使用asof查找特定值
result = df.asof(pd.Timestamp('2024-01-03'))
print("\n查找2024-01-03的最近有效值:")
print(result)
579-6-3、结果输出
# 579、pandas.DataFrame.asof方法
# 原始数据框:
# data
# date
# 2024-01-01 1.0
# 2024-01-02 NaN
# 2024-01-03 3.0
# 2024-01-04 4.0
# 2024-01-05 5.0
#
# 查找2024-01-03的最近有效值:
# data 3.0
# Name: 2024-01-03 00:00:00, dtype: float64
580、pandas.DataFrame.shift方法
580-1、语法
# 580、pandas.DataFrame.shift方法
pandas.DataFrame.shift(periods=1, freq=None, axis=0, fill_value=_NoDefault.no_default, suffix=None)
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.
Parameters:
periods
int or Sequence
Number of periods to shift. Can be positive or negative. If an iterable of ints, the data will be shifted once by each int. This is equivalent to shifting by one value at a time and concatenating all resulting frames. The resulting columns will have the shift suffixed to their column names. For multiple periods, axis must not be 1.
freq
DateOffset, tseries.offsets, timedelta, or str, optional
Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
axis
{0 or ‘index’, 1 or ‘columns’, None}, default None
Shift direction. For Series this parameter is unused and defaults to 0.
fill_value
object, optional
The scalar value to use for newly introduced missing values. the default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.
suffix
str, optional
If str and periods is an iterable, this is added after the column name and before the shift value for each shifted column name.
Returns:
DataFrame
Copy of input object, shifted.
580-2、参数
580-2-1、periods(可选,默认值为1):整数,指定要移动的时期数,正值表示向下或向右移动,负值表示向上或向左移动。
580-2-2、freq(可选,默认值为None):str或DateOffset对象,指定移动的频率,通常用于时间序列数据,该参数主要在有时间索引时使用,可以用来按特定的时间单位进行移动。注意,只有在索引为时间序列数据时,freq参数才能有效。
580-2-3、axis(可选,默认值为0):整数或字符串,指定沿哪个轴移动数据,0表示行(向下移动数据),1表示列(向右移动数据),可以使用字符串'index'和'columns'代替。
580-2-4、fill_value(可选):任意类型,用于填充在移动后出现的缺失(NaN)值,如果没有指定填充值,缺失值将默认为NaN。
580-2-5、suffix(可选,默认值为None):字符串,如果提供,新的列名称将有该后缀,与移动后的数据列结合使用,以便于区分原始数据和移动后的数据。
580-3、功能
根据指定的时期将数据沿指定轴进行移动,常见用例包括生成滞后变量、观察时间序列变化、计算与前期的差异等,它在信号处理、时间序列分析和金融数据建模等领域广泛应用。
580-4、返回值
返回沿指定轴移动后的数据框,被移动位置的值将被填充为NaN(或指定的fill_value)。
580-5、说明
无
580-6、用法
580-6-1、数据准备
无
580-6-2、代码示例
# 580、pandas.DataFrame.shift方法
import pandas as pd
# 创建一个数据框
data = {
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
}
df = pd.DataFrame(data)
print("原始数据框:")
print(df)
# 使用shift方法移动数据
shifted_df = df.shift(periods=1)
print("\n向下移动1期的数据框:")
print(shifted_df)
580-6-3、结果输出
# 580、pandas.DataFrame.shift方法
# 原始数据框:
# A B
# 0 1 5
# 1 2 6
# 2 3 7
# 3 4 8
#
# 向下移动1期的数据框:
# A B
# 0 NaN NaN
# 1 1.0 5.0
# 2 2.0 6.0
# 3 3.0 7.0