Problem description
I have a pandas DataFrame:
import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}]
df = pd.DataFrame(inp)
print(df)
If a row's string value changes compared with the previous row, I want to flag it in a separate column, "Cng-Address", and if the row's numeric value changes, flag it in a "Cng-Year" column. If there is no change, mark it as zero.
The grouping key is "Name", meaning the calculations above should be done over all rows associated with one person's name. If "Name" changes (e.g. John to Steve), the calculations for "Cng-Address" and "Cng-Year" should reset. The "Year" column is sorted in ascending order.
As a final report I want to get:
- John changed year 1 time and changed location 2 times
- Steve changed year 2 times and changed location 2 times
- The total number of address changes in 2019 is 2
Current output:
+-------+------+---------------+
| Name | Year | Address |
+-------+------+---------------+
| John | 2018 | Beverly hills |
+-------+------+---------------+
| John | 2018 | Beverly hills |
+-------+------+---------------+
| John | 2019 | Beverly hills |
+-------+------+---------------+
| John | 2019 | Orange county |
+-------+------+---------------+
| John | 2019 | New York |
+-------+------+---------------+
| Steve | 2018 | Canada |
+-------+------+---------------+
| Steve | 2019 | Canada |
+-------+------+---------------+
| Steve | 2019 | Canada |
+-------+------+---------------+
| Steve | 2020 | California |
+-------+------+---------------+
| Steve | 2020 | Canada |
+-------+------+---------------+
Desired output:
+-------+------+---------------+----------+-------------+
| Name | Year | Address | Cng-Year | Cng-Address |
+-------+------+---------------+----------+-------------+
| John | 2018 | Beverly hills | 0 | 0 |
+-------+------+---------------+----------+-------------+
| John | 2018 | Beverly hills | 0 | 0 |
+-------+------+---------------+----------+-------------+
| John | 2019 | Beverly hills | 1 | 0 |
+-------+------+---------------+----------+-------------+
| John | 2019 | Orange county | 0 | 1 |
+-------+------+---------------+----------+-------------+
| John | 2019 | New York | 0 | 1 |
+-------+------+---------------+----------+-------------+
| Steve | 2018 | Canada | 0 | 0 |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada | 1 | 0 |
+-------+------+---------------+----------+-------------+
| Steve | 2019 | Canada | 0 | 0 |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | California | 1 | 1 |
+-------+------+---------------+----------+-------------+
| Steve | 2020 | Canada | 0 | 1 |
+-------+------+---------------+----------+-------------+
Recommended answer
You can do this with groupby:
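groups = df.groupby('Name')
for col in ['Year', 'Address']:
    # Shift within each Name group; the first row of a group has no predecessor,
    # so fill it with its own value so that it compares as "no change" (0).
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)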
Output:
Name Year Address cng-Year cng-Address
0 John 2018 Beverly hills 0 0
1 John 2018 Beverly hills 0 0
2 John 2019 Beverly hills 1 0
3 John 2019 Orange county 0 1
4 John 2019 New York 0 1
5 Steve 2018 Canada 0 0
6 Steve 2019 Canada 1 0
7 Steve 2019 Canada 0 0
8 Steve 2020 California 1 1
9 Steve 2020 Canada 0 1
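The answer stops at the flag columns, while the question also asks for a final report. One possible way to get it, sketched here as an assumption rather than as part of the original answer, is to sum the flags per "Name" and per "Year". The names `rows`, `per_name` and `address_by_year` are made up for this illustration; the column names `cng-Year` and `cng-Address` follow the output above.

```python
import pandas as pd

# Same data as in the question, written as tuples for brevity.
rows = [
    ('John', 2018, 'Beverly hills'),
    ('John', 2018, 'Beverly hills'),
    ('John', 2019, 'Beverly hills'),
    ('John', 2019, 'Orange county'),
    ('John', 2019, 'New York'),
    ('Steve', 2018, 'Canada'),
    ('Steve', 2019, 'Canada'),
    ('Steve', 2019, 'Canada'),
    ('Steve', 2020, 'California'),
    ('Steve', 2020, 'Canada'),
]
df = pd.DataFrame(rows, columns=['Name', 'Year', 'Address'])

# Flag changes relative to the previous row within each Name group.
groups = df.groupby('Name')
for col in ['Year', 'Address']:
    df[f'cng-{col}'] = groups[col].shift().fillna(df[col]).ne(df[col]).astype(int)

# Number of year and address changes per person.
per_name = df.groupby('Name')[['cng-Year', 'cng-Address']].sum()

# Number of address changes per year, across all people.
address_by_year = df.groupby('Year')['cng-Address'].sum()

print(per_name)
print(address_by_year)
```

For the sample data, `per_name` gives 1 year change and 2 address changes for John, 2 and 2 for Steve, and `address_by_year.loc[2019]` is 2, which matches the report described in the question.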