I have:
- 30,000 data points
- each data point is a measurement of type float
- each measurement is associated with a date
- each date has only one measurement
- no dates are without measurements
- the data comes in the form of a text file: 30,000 lines in this form:
- YYYY-MM-DD I,F (e.g. 1977-02-08 20.74)
- measurements appearing in the source file are already sorted by date
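For reference, the file format described above could be read into two parallel lists with something like the following. This is only a minimal sketch: the name parse_measurements is hypothetical, and it assumes each line is a date and a decimal number separated by whitespace, as in the 1977-02-08 20.74 example.

```python
from datetime import datetime

def parse_measurements(lines):
    """Parse 'YYYY-MM-DD value' lines into parallel lists of dates and floats."""
    dates, values = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        day_text, value_text = line.split()
        dates.append(datetime.strptime(day_text, "%Y-%m-%d").date())
        values.append(float(value_text))
    return dates, values

# Hypothetical sample lines; a real run would pass an open file object instead.
sample = ["1977-02-08 20.74", "1977-02-09 21.10", "1977-02-10 19.95"]
dates, values = parse_measurements(sample)
print(dates[0], values[0])  # 1977-02-08 20.74
```

Because the lines are already sorted by date and no dates are missing, the list index alone identifies the day, so the search itself only needs the values list.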
I need:
- a time-interval T with boundaries (s,e) /* start, end */
- (e - s = 14 days) the time-interval *must* be 2 weeks
- define min as the lowest value in the interval T
- define max as the greatest value in the interval T
- the chosen T needs to have the greatest distance btwn max and min of all possible Ts
- break ties among intervals T by choosing the most recent (with the greatest s value)
- the chosen T must consider all jumps in the 14 days, not just the values @ s and e
- if the overall "variance" in the interval is great but the jump
|max-min| is not the greatest in absolute value, T is not the right choice,
even if it's an "exciting" interval
I'm asking:
- which algorithm to employ, considering algorithms are not my specialty
- which data structure to use to keep track of the subtotals
Notes:
- an answer in pseudo code would be preferred, "prose" is fine if pressured for time
- an answer in Python would be... splendid :)
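Dummy data can be generated and the proposed algorithm run on it as a test if needed, or I could share the actual data.
I'm not much concerned about performance here, beyond wanting to know the fastest way to do this so that I learn how to apply the right solution and the correct algorithm.
I think I can "prove" correctness with even the simplest iterative algorithm, because the dataset is small for today's computers.
So far I'm "traversing and carrying along 14 vectors of 14 measurements". If you could teach me how to do this incrementally with sub-sums, that would be much appreciated.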
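As a baseline for the "simplest iterative algorithm" mentioned above, a brute-force scan could look like the sketch below. The function name and the window parameter are assumptions for illustration; window counts measurements, so 14 means 14 consecutive days (use 15 if both boundary days of an e - s = 14 interval should be included).

```python
def brute_force_largest_range(values, window=14):
    """Return (start, end, diff) for the window with the largest max - min.

    Ties go to the latest window, per the question's tie-break rule.
    Returns None if there are fewer than `window` values.
    """
    best = None
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        diff = max(chunk) - min(chunk)
        if best is None or diff >= best[2]:  # >= prefers the most recent window
            best = (start, start + window - 1, diff)
    return best

print(brute_force_largest_range([0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3]))
# (0, 13, 2)
```

For 30,000 points this does roughly 30,000 x 14 comparisons, which is negligible, so it also serves as a reference to check a faster method against.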
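Best answer
A sliding window does work here, by keeping two stacks (this is perhaps a little misleading, as it is probably best implemented as a double-ended queue). Keep a stack called minstack and a stack called maxstack. The crux of the algorithm is that, at every point of the slide, minstack should be non-decreasing from bottom to top and maxstack should be non-increasing from bottom to top. So how do we do that?
First, add the first 14 points to the stacks. Let's define add(point) as:
Do this for minstack:
- while the point is smaller than the top element of minstack, remove the top element of minstack
- add the point to minstack
Similarly, for maxstack:
- while the point is larger than the top element of maxstack, remove the top element of maxstack
- add the point to maxstack
Because of the property above, the min and max of the first 14 elements are the bottom elements of minstack and maxstack. Now slide the window. We only need to note that if the left point (the one leaving the window) is still "alive" in either stack, it must now be the bottom element. So this is easy; it is simply: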
slide():
    add(new_point)
    if (left_point == bottom(minstack)) remove_bottom(minstack)
    if (left_point == bottom(maxstack)) remove_bottom(maxstack)
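Do this until you run out of points. The interval you are looking for is the one for which
bottom(maxstack) - bottom(minstack)
is largest. Note that any point enters minstack/maxstack at most once, and every point also leaves each stack at most once, so this performs at most 4 operations per point, no matter what the size of the desired interval is.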
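The operation bound can be checked empirically. The sketch below tracks only the minimum side and counts deque pushes and pops; the function name is hypothetical, and it stores indices rather than values (a variation on the description above) so that the expiry test does not rely on comparing values. With one deque the count stays within 2 per point, which gives 4 per point for the min and max deques together.

```python
from collections import deque

def sliding_min_with_op_count(values, window):
    """Return (minima, ops): each full window's minimum and the number of deque pushes/pops."""
    mins = deque()  # indices whose values are non-decreasing from left to right
    minima = []
    ops = 0
    for i, x in enumerate(values):
        # Larger values at the back can never be a window minimum again.
        while mins and x < values[mins[-1]]:
            mins.pop()
            ops += 1
        mins.append(i)
        ops += 1
        if mins[0] <= i - window:  # the oldest index has left the window
            mins.popleft()
            ops += 1
        if i >= window - 1:
            minima.append(values[mins[0]])
    return minima, ops

values = [5, 3, 4, 1, 2, 6, 0]
minima, ops = sliding_min_with_op_count(values, window=3)
print(minima)                  # [3, 1, 1, 1, 0]
print(ops <= 2 * len(values))  # True
```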
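Edit: I just noticed you wanted an implementation in Python. I didn't really want to parse the data, so the function takes a list of values as input and returns the indices (s, e) into that list: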
import collections

def add(x, minstack, maxstack):
    # Pop larger values off minstack and smaller values off maxstack, so that
    # minstack stays non-decreasing and maxstack non-increasing (bottom to top).
    while minstack and x < minstack[-1]: minstack.pop()
    while maxstack and x > maxstack[-1]: maxstack.pop()
    minstack.append(x)
    maxstack.append(x)

def get_largest_interval(points):
    minstack = collections.deque()
    maxstack = collections.deque()
    best_diff = -1
    best_interval = None
    for index, elem in enumerate(points):
        add(elem, minstack, maxstack)
        if index >= 14:
            # points[index-14] just left the window; drop it if it is still at the bottom
            if minstack[0] == points[index-14]: minstack.popleft()
            if maxstack[0] == points[index-14]: maxstack.popleft()
        if index >= 13:
            this_diff = maxstack[0] - minstack[0]
            # >= so that ties go to the most recent window
            if best_diff == -1 or this_diff >= best_diff:
                best_interval = (index-13, index)
                best_diff = this_diff
    return best_interval

print(get_largest_interval([0, 2, 2,2,2,2,2,2,2,2,2,2,2,2,3]))
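The function above hard-codes a window of 14 measurements. If the interval is instead meant to include both boundary days of e - s = 14, the window would hold 15 measurements. A possible variant with the window size as a parameter is sketched below; the name largest_range_window is hypothetical, and it keeps indices in the deques instead of values, which is a variation rather than the code above.

```python
from collections import deque

def largest_range_window(values, window=14):
    """Return (start, end, diff) of the latest window with the largest max - min, or None."""
    mins, maxs = deque(), deque()  # indices; values non-decreasing / non-increasing
    best = None
    for i, x in enumerate(values):
        while mins and x < values[mins[-1]]:
            mins.pop()
        while maxs and x > values[maxs[-1]]:
            maxs.pop()
        mins.append(i)
        maxs.append(i)
        # Drop the index that just slid out of the window, if it is still at the front.
        if mins[0] <= i - window:
            mins.popleft()
        if maxs[0] <= i - window:
            maxs.popleft()
        if i >= window - 1:
            diff = values[maxs[0]] - values[mins[0]]
            if best is None or diff >= best[2]:  # >= keeps the most recent on ties
                best = (i - window + 1, i, diff)
    return best

data = [0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3]
print(largest_range_window(data, window=14))  # (0, 13, 2)
print(largest_range_window(data, window=15))  # (0, 14, 3)
```

The returned start and end are list indices; with a parallel list of dates they map directly back to the boundary dates s and e.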
Regarding algorithm - 30,000 data points, finding the greatest change within a 2-week period, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/11043821/