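I have a long, wide data frame. I want to create a new column in it whose value depends on many other columns in df. The calculation needed for each value in the new column also varies, depending on the value in one particular other column.
The answers to this question and this question come close, but don't quite work for me.
I will eventually have about 30 different calculations to apply, so I'm not keen on the np.where function, which isn't very readable with that many conditions.
I have also been strongly advised not to run a for loop over all rows of the data frame, because that would hurt performance (please correct me if I'm wrong).
What I tried is: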
import pandas as pd
import numpy as np
# Information in my columns looks something like this:
df = pd.DataFrame()
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3 , 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]
# lists to check against to decide upon which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']
conditions = [
(df['text'] is None),
(df['text'] in someList),
(df['text'] in someOtherList),
(df['text'] in someThirdList)]
choices = [0,
round(df['values2'] * 0.5 * df['values3'], 2),
df['values1'] + df['values2'] - df['values3'],
df['values1'] + 249]
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
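I expected that, based on the row values in df['text'], the right calculation would be applied to the same row of df['mynewvalue']. Instead, I get the error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I program this so that I can use this kind of condition to define the right calculation for the df['mynewvalue'] column?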
Best answer
The error comes from these conditions:
conditions = [
... ,
(df['text'] in someList),
(df['text'] in someOtherList),
(df['text'] in someThirdList)]
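To see why these conditions fail, here is a minimal sketch. It assumes a small stand-in Series called `text` and a list called `some_list`; both names are made up for illustration and are not from the question. It shows that `in` raises the error quoted above, while `isin` returns one boolean per row.

```python
import pandas as pd

# Hypothetical stand-ins for df['text'] and someList
text = pd.Series(['dab', 'def', 'bla'])
some_list = ['dab', 'bla']

# `in` needs a single True/False answer, but comparing the Series with a
# list item produces a whole Series of booleans, so pandas raises ValueError.
try:
    text in some_list
    raised = False
except ValueError:
    raised = True
print(raised)

# `isin` tests membership element by element and returns a boolean Series.
mask = text.isin(some_list)
print(mask.tolist())

# `any` and `all` collapse that Series to a single boolean.
print(mask.any(), mask.all())
```

The element-wise mask is the form np.select expects for each condition.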
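You are asking whether several elements are in a list. The answer is itself a list of results, one per element. As the error suggests, you have to decide whether the condition holds when at least one element satisfies the property (any) or when all elements satisfy it (all). Here neither is what you want: np.select needs one boolean per row, and that is what the pandas isin method (doc) returns, whereas any (doc) and all (doc) reduce the result to a single boolean. Here using isin:
import pandas as pd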
import numpy as np
# Information in my columns look something like this:
df = pd.DataFrame()
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3, 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]
# lists to check against to decide which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']
conditions = [
(df['text'].isna()),  # element-wise missing-value test; `is None` only checks the Series object itself
(df['text'].isin(someList)),
(df['text'].isin(someOtherList)),
(df['text'].isin(someThirdList))]
choices = [0,
round(df['values2'] * 0.5 * df['values3'], 2),
df['values1'] + df['values2'] - df['values3'],
df['values1'] + 249]
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
# text values1 values2 values3 mynewvalue
# 0 dab 3 6 103 309.0
# 1 def 4 3 444 -437.0
# 2 bla 2 21 33 346.5
# 3 zdag 5 44 425 -376.0
# 4 etc 2 22 200 251.0
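Since the question mentions about 30 different calculations, one possible way to keep this readable is to hold each list of labels next to its calculation in a single table and build the conditions and choices from it. The following is only a sketch under that assumption: the names `rules` and `apply_rules` are hypothetical, and an extra row with a missing text value is added to show the default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'text': ['dab', 'def', 'bla', 'zdag', 'etc', None],
    'values1': [3, 4, 2, 5, 2, 1],
    'values2': [6, 3, 21, 44, 22, 1],
    'values3': [103, 444, 33, 425, 200, 1],
})

# Hypothetical rule table: each entry pairs the labels it applies to with a
# function that computes the result for the whole frame in a vectorized way.
rules = [
    (['dab', 'bla'], lambda d: (d['values2'] * 0.5 * d['values3']).round(2)),
    (['def', 'zdag'], lambda d: d['values1'] + d['values2'] - d['values3']),
    (['etc'], lambda d: d['values1'] + 249),
]

def apply_rules(frame, rules, default=0):
    """Build np.select inputs from the rule table and evaluate them."""
    conditions = [frame['text'].isin(labels) for labels, _ in rules]
    choices = [calc(frame) for _, calc in rules]
    # Rows matching no rule (including missing text) get the default.
    return np.select(conditions, choices, default=default)

df['mynewvalue'] = apply_rules(df, rules)
print(df)
```

Adding a new calculation then means adding one entry to `rules`. Note that every calculation is still evaluated for every row, and np.select only picks which result to keep, so this trades some redundant arithmetic for avoiding a row-by-row loop.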