Given the following DataFrame:

  date        type       price
20150101       X           0.8
20150102       X           0.9
20150103       X           1.0
20150104       X           0.9
20150105       abc         12.3
20150106       abc         12.4
20150107       abc         12.4
20150108       X           0.7
20150109       X           0.6
20150110       X           0.9
20150111       abc         12.3
20150112       abc         12.4
20150113       X           0.5
20150114       X           0.6
20150115       abc         12.3
20150116       abc         12.4


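The data consists of clusters of prices for X and clusters of prices for abc. I want to compute a new column (called 'position') from the entries in 'type' and 'price', using the following rules: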

1. 'position' = 0  if 'type'=='X'
2. 'position' = 1  if 'type'=='abc' and max of price of X in the 'previous section' is >=1
3. 'position' = -1  if 'type'=='abc' and min of price of X in the 'previous section' is <=0.5
4. 'position' = 0  if 'type'=='abc' and otherwise
Note: the "previous section" is the cluster of 'X' prices that sits between two sections of 'abc' prices. For example:

  for 20150105-20150107  previous section is 20150101-20150104
  for 20150111-20150112  previous section is 20150108-20150110
  for 20150115-20150116  previous section is 20150113-20150114


so that I end up with the following DataFrame:

  date        type       price     position
20150101       X           0.8         0
20150102       X           0.9         0
20150103       X           1.0         0
20150104       X           0.9         0
20150105       abc         12.3        1
20150106       abc         12.4        1
20150107       abc         12.4        1
20150108       X           0.7         0
20150109       X           0.6         0
20150110       X           0.9         0
20150111       abc         12.3        0
20150112       abc         12.4        0
20150113       X           0.5         0
20150114       X           0.6         0
20150115       abc         12.3       -1
20150116       abc         12.4       -1


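The difficulty I'm facing is that I don't know how to define the "previous section". I tried using pivot_table, which seems easier to work with, and I would like to generate the same 'position' column, as follows: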

  date        X             abc    position
20150101      0.8           nan        0
20150102      0.9           nan        0
20150103      1.0           nan        0
20150104      0.9           nan        0
20150105      nan          12.3        1
20150106      nan          12.4        1
20150107      nan          12.4        1
20150108      0.7          nan         0
20150109      0.6          nan         0
20150110      0.9          nan         0
20150111      nan          12.3        0
20150112      nan          12.4        0
20150113      0.5          nan         0
20150114      0.6          nan         0
20150115      nan          12.3       -1
20150116      nan          12.4       -1


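But I still don't know how to define the "previous section" so that I can compute the max, the min, or any other value for each section of X prices.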

Best answer

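The general form of the problem is finding runs of repeated values. Your pandas instinct should be to reach for groupby, but a simple groupby on the actual series values won't work here, because it would combine non-contiguous runs of the same value. Instead, I like to use Series.diff and Series.cumsum for this.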

import pandas as pd
series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])


You can't use Series.diff on strings, so first create a mapping from string to int. The values only need to be unique.

# set order is arbitrary, so the exact codes may differ; they only need to be unique
mapping = {v: k for k, v in enumerate(set(series))}  # e.g. {"abc": 0, "x": 1}
int_series = series.map(mapping)  # e.g. pd.Series([0, 0, 1, 1, 1, 0, 0])
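
If you would rather have reproducible codes than rely on set ordering, one option (swapped in here, not part of the original answer) is pd.factorize, which numbers values in order of first appearance. A minimal sketch on the same toy series:

```python
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

# factorize returns integer codes plus the unique values they refer to;
# codes are assigned in order of first appearance, so "abc" -> 0 and "x" -> 1.
codes, uniques = pd.factorize(series)
int_series = pd.Series(codes, index=series.index)

print(int_series.tolist())  # [0, 0, 1, 1, 1, 0, 0]
print(list(uniques))        # ['abc', 'x']
```

The rest of the approach is unchanged, since diff only cares whether neighbouring codes differ.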


Now you can use Series.diff. Series.diff gives you series[n] - series[n - 1]. The first value has no previous row, so it is always NaN.

int_series.diff()  # [NaN, 0, 1, 0, 0, -1, 0]


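Using Series.diff, we can find the start of each group by testing != 0. The leading NaN also compares as not equal to 0, so the first row counts as a start.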

starts = int_series.diff() != 0  # [True, False, True, False, False, True, False]


Compare this with your original values to see how we found the start of each group:

starts  # [True, False, True, False, False, True, False]
series  # ["abc", "abc", "x", "x", "x", "abc", "abc"]
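
As an aside, the same start flags can be produced without the int mapping by comparing each value with the one above it. This is a different technique from the diff-based one described here, shown only as a sketch on the same toy series; the name starts_via_shift is made up for illustration.

```python
import pandas as pd

series = pd.Series(["abc", "abc", "x", "x", "x", "abc", "abc"])

# shift() moves every value down one row, leaving a missing value in row 0.
# A row starts a new run when it differs from the row above it; the first
# row is compared with the missing value and is therefore flagged as a start.
starts_via_shift = series != series.shift()

print(starts_via_shift.tolist())
# [True, False, True, False, False, True, False]
```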


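We don't just want to know where each group starts; we want to know which group each row belongs to. Easy-peasy: Series.cumsum replaces each row with the running total of everything up to and including it. Conveniently, if you try to add bools in Python, they are treated as instances of their superclass, int.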

True + True  # 2
True + False  # 1
groups = starts.cumsum()  # [1, 1, 2, 2, 2, 3, 3]


Now you can use groupby(groups) to act on each group independently.

for _, sequence in series.groupby(groups):
    print(sequence.tolist())
# ['abc', 'abc']
# ['x', 'x', 'x']
# ['abc', 'abc']
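
To make the groupby step more concrete, here is a small sketch that summarises a numeric column per run. The frame df, its prices, and the names run, run_type, low and high are all made up for illustration; the run ids are built by comparing each row with the one above it and taking a cumulative sum.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
df = pd.DataFrame({
    "type": ["abc", "abc", "x", "x", "x", "abc", "abc"],
    "price": [12.3, 12.4, 0.8, 1.0, 0.9, 12.3, 12.4],
})

# Label each contiguous run of equal "type" values with an increasing id.
groups = (df["type"] != df["type"].shift()).cumsum().rename("run")

# One row per run: which type it holds, plus the min and max price inside it.
summary = df.groupby(groups).agg(
    run_type=("type", "first"),
    low=("price", "min"),
    high=("price", "max"),
)
print(summary)
```

Note that the two "abc" runs stay separate here, which is exactly what a plain groupby on "type" would not give you.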


In your specific case:

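# prices is the 'price' column; groups holds the run ids built as above
group_mins = prices.groupby(groups).min()
# for every row, look up the minimum of the run just before its own
previous_group_below_min = (groups - 1).map(group_mins) < SOME_CONSTANT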
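
Putting the pieces together for the data in the question, a full version might look like the sketch below. It rests on two assumptions that the question does not spell out: the run just before an 'abc' run is always an 'X' run (true when only two types alternate), and rule 2 takes priority if a previous section happens to satisfy both rule 2 and rule 3. The names prev_max, prev_min and is_abc are illustrative.

```python
import numpy as np
import pandas as pd

# The data from the question.
df = pd.DataFrame({
    "date": [str(d) for d in range(20150101, 20150117)],
    "type": ["X"] * 4 + ["abc"] * 3 + ["X"] * 3 + ["abc"] * 2 + ["X"] * 2 + ["abc"] * 2,
    "price": [0.8, 0.9, 1.0, 0.9, 12.3, 12.4, 12.4, 0.7, 0.6, 0.9,
              12.3, 12.4, 0.5, 0.6, 12.3, 12.4],
})

# Run ids 1, 2, 3, ... for each contiguous block of equal "type".
groups = (df["type"] != df["type"].shift()).cumsum()

# Max and min price per run, looked up for the run just before each row.
# Rows in the first run have no predecessor and get NaN, which fails both tests.
prev_max = (groups - 1).map(df["price"].groupby(groups).max())
prev_min = (groups - 1).map(df["price"].groupby(groups).min())

is_abc = df["type"] == "abc"

# np.select picks the first matching condition, so rule 2 wins over rule 3
# when both hold (an assumption; the question does not cover that case).
df["position"] = np.select(
    [is_abc & (prev_max >= 1), is_abc & (prev_min <= 0.5)],
    [1, -1],
    default=0,
)

print(df["position"].tolist())
# [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, -1, -1]
```

This reproduces the 'position' column requested in the question.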

Regarding "python - Computing a value based on the data in 2 columns of a pandas DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29785621/
