Problem description
Count the number of messages within a date range per interval. I'm using Python 2.6.5 only.
For example:
Start date: 12/11/2014
End date: 12/12/2014
Start time: 02:00
End time: 02:05
Interval: per 1 min
So this translates to how many messages fall in each one-minute interval from start date 12/11 to end date 12/12. My output will look like this (it does not need to include the strings "min" and "messages"):
datetime(2014, 12, 11, 2, 0) min : 0 messages,
datetime(2014, 12, 11, 2, 1) min: 1 message,
datetime(2014, 12, 11, 2, 2) min: 2 messages,
datetime(2014, 12, 11, 2, 3) min: 1 message,
datetime(2014, 12, 11, 2, 4) min : 0 messages,
datetime(2014, 12, 11, 2, 5) min : 0 messages
I believe I have accomplished this, but it is very slow with large datasets. I think that is because it uses two loops: if the second loop is extremely large, it takes a very long time, and it runs for each iteration of the first loop. Is there a better procedure or algorithm to accomplish this?
Edit: I need to include zero for intervals that do not have messages. I'm also trying to find the peak, minimum, and average.
from datetime import date, datetime, timedelta, time

def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta

def rdata(table, fromDate, toDate, fromTime, toTime, interval):
    date_to_alert = {}
    start_date = datetime(fromDate.year, fromDate.month, fromDate.day, fromTime.hour, fromTime.minute)
    end_date = datetime(toDate.year, toDate.month, toDate.day, toTime.hour, toTime.minute)
    list_range_of_dates = []
    for date_range in perdelta(start_date, end_date, interval):
        list_range_of_dates.append(date_range)
    print list_range_of_dates
    index = 0
    for date_range in list_range_of_dates:
        for row in table:
            print('first_alerted_time 1: %s index: %s len: %s' % (row['first_alerted_time'], index, len(list_range_of_dates)-1))
            if row['first_alerted_time'] and row['first_alerted_time'] >= list_range_of_dates[index] and row['first_alerted_time'] < list_range_of_dates[index + 1]:
                print('Start date: %s' % list_range_of_dates[index])
                print('first_alerted_time: %s' % row['first_alerted_time'])
                print('end date: %s' % list_range_of_dates[index + 1])
                if list_range_of_dates[index] in date_to_alert:
                    date_to_alert[list_range_of_dates[index]].append(row)
                else:
                    date_to_alert[list_range_of_dates[index]] = [row]
            elif row['first_alerted_time']:
                print('first_alerted_time 2: %s' % row['first_alerted_time'])
        index = index + 1
    print date_to_alert
    for key, value in date_to_alert.items():
        date_to_alert[key] = len(value)
    print date_to_alert
    t1 = []
    if date_to_alert:
        avg = sum(date_to_alert.values())/len(date_to_alert.keys())
        for date_period, num_of_alerts in date_to_alert.items():
            #[date_period] = date_to_alert[date_period]
            t1.append([date_period, num_of_alerts, avg])
    print t1
    return t1

def main():
    example_table = [
        {'first_alerted_time': datetime(2014, 12, 11, 2, 1, 45)},
        {'first_alerted_time': datetime(2014, 12, 11, 2, 2, 33)},
        {'first_alerted_time': datetime(2014, 12, 11, 2, 2, 45)},
        {'first_alerted_time': datetime(2014, 12, 11, 2, 3, 45)},
    ]
    example_table.sort()
    print example_table
    print rdata(example_table, date(2014,12,11), date(2014,12,12), time(00,00,00), time(00,00,00), timedelta(minutes=1))
Update: First attempt at improvement:
Default Dictionary approach
def default_dict_approach(table, fromDate, toDate, fromTime, toTime, interval):
    from collections import defaultdict
    t1 = []
    start_date = datetime.combine(fromDate, fromTime)
    end_date = datetime.combine(toDate, toTime) + interval
    times = (d['first_alerted_time'] for d in table)
    counter = defaultdict(int)
    for dt in times:
        if start_date <= dt < end_date:
            counter[to_s(dt - start_date) // to_s(interval)] += 1
    date_to_alert = {}
    date_to_alert = dict((ts*interval + start_date, count) for ts, count in counter.iteritems())
    max_num, min_num, avg = 0, 0, 0
    list_of_dates = list(perdelta(start_date, end_date, interval))
    if date_to_alert:
        freq_values = date_to_alert.values()
        size_freq_values = len(freq_values)
        avg = sum(freq_values) / size_freq_values
        max_num = max(freq_values)
        if size_freq_values == len(list_of_dates):
            min_num = min(freq_values)
        else:
            min_num = 0
    for date_period in list_of_dates:
        if date_period in date_to_alert:
            t1.append([date_period.strftime("%Y-%m-%d %H:%M"), date_to_alert[date_period], avg, max_num, min_num])
        else:
            t1.append([date_period.strftime("%Y-%m-%d %H:%M"), 0, avg, max_num, min_num])
    return (t1, max_num, min_num, avg)
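The default_dict_approach above averages only over the non-empty intervals, because date_to_alert holds no zero entries. If the peak, minimum, and average are meant to cover every interval, zeros included, a minimal sketch could count into a pre-sized list instead. The function name interval_stats and the tuple it returns are illustrative assumptions, not part of the original code; to_s mirrors the helper used elsewhere in this thread.

```python
from datetime import datetime, timedelta


def to_s(td):
    # Whole seconds in a timedelta (total_seconds() needs Python 2.7+).
    return td.days * 86400 + td.seconds


def interval_stats(times, start, end, step):
    """Hypothetical helper: return (rows, peak, low, average) for [start, end).

    rows is a list of (interval_start, count) pairs, zeros included.
    """
    n_bins = -(-to_s(end - start) // to_s(step))  # ceiling division
    counts = [0] * n_bins  # an empty list if end <= start
    for dt in times:
        if start <= dt < end:
            counts[to_s(dt - start) // to_s(step)] += 1
    rows = [(start + i * step, c) for i, c in enumerate(counts)]
    if not counts:
        return rows, 0, 0, 0.0
    average = float(sum(counts)) / len(counts)  # over all bins, not just non-empty ones
    return rows, max(counts), min(counts), average


example_times = [
    datetime(2014, 12, 11, 2, 1, 45),
    datetime(2014, 12, 11, 2, 2, 33),
    datetime(2014, 12, 11, 2, 2, 45),
    datetime(2014, 12, 11, 2, 3, 45),
]
rows, peak, low, average = interval_stats(
    example_times,
    datetime(2014, 12, 11, 2, 0),
    datetime(2014, 12, 11, 2, 6),
    timedelta(minutes=1),
)
# counts per minute from 02:00 to 02:05: 0, 1, 2, 1, 0, 0
```

This trades the O(number_of_nonempty_bins) memory of the dictionary for O(number_of_bins), which is unavoidable anyway once every empty interval has to appear in the output.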
numpy approach
def numpy_approach(table, fromDate, toDate, fromTime, toTime, interval):
    # assumes `import numpy as np` at module level
    date_to_alert = {}
    start_date = datetime.combine(fromDate, fromTime)
    end_date = datetime.combine(toDate, toTime) + interval
    list_range_of_dates = []
    for date_range in perdelta(start_date, end_date, interval):
        list_range_of_dates.append(date_range)
    #print list_range_of_dates
    index = 0
    times = np.fromiter((d['first_alerted_time'] for d in table),
                        dtype='datetime64[us]', count=len(table))
    print times
    bins = np.fromiter(list_range_of_dates,
                       dtype=times.dtype)
    print bins
    a, bins = np.histogram(times, bins)
    print(dict(zip(bins[a.nonzero()].tolist(), a[a.nonzero()])))
You want to implement numpy.histogram() for dates:
import numpy as np

times = np.fromiter((d['first_alerted_time'] for d in example_table),
                    dtype='datetime64[us]', count=len(example_table))
bins = np.fromiter(date_range(start_date, end_date + step, step),
                   dtype=times.dtype)
a, bins = np.histogram(times, bins)
print(dict(zip(bins[a.nonzero()].tolist(), a[a.nonzero()])))
Output
{datetime.datetime(2014, 12, 11, 2, 0): 3,
datetime.datetime(2014, 12, 11, 2, 3): 1}
numpy.histogram() works even if the step is not constant and the times array is unsorted. Otherwise, the call can be optimized if you decide to use numpy.
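To illustrate what the histogram computes when the bin edges are unevenly spaced and the input is unsorted, here is a minimal pure-Python sketch built on the standard bisect module. It is not how numpy implements it, and the function name histogram_counts is hypothetical. One known difference: numpy.histogram() treats the last bin as closed on the right, while this sketch keeps every bin half-open.

```python
from bisect import bisect_right
from datetime import datetime


def histogram_counts(times, edges):
    """Hypothetical helper: count times per bin [edges[i], edges[i + 1]).

    edges must be sorted; times may arrive in any order.
    """
    counts = [0] * (len(edges) - 1)
    for dt in times:
        i = bisect_right(edges, dt) - 1  # index of the last edge <= dt
        if 0 <= i < len(counts):         # skip values outside all bins
            counts[i] += 1
    return counts


# Unevenly spaced edges: 1-minute, 2-minute and 7-minute bins.
edges = [
    datetime(2014, 12, 11, 2, 0),
    datetime(2014, 12, 11, 2, 1),
    datetime(2014, 12, 11, 2, 3),
    datetime(2014, 12, 11, 2, 10),
]
unsorted_times = [
    datetime(2014, 12, 11, 2, 3, 45),
    datetime(2014, 12, 11, 2, 1, 45),
    datetime(2014, 12, 11, 2, 2, 45),
    datetime(2014, 12, 11, 2, 2, 33),
]
counts = histogram_counts(unsorted_times, edges)  # [0, 3, 1]
```

Each lookup is a binary search, so the cost is O(n log m) for n timestamps and m edges. With a constant step, the integer-division approach shown later in this answer avoids the search entirely.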
There are two general approaches that you could use on Python 2.6 to implement numpy.histogram:
- itertools.groupby-based: the input should be sorted, but it allows a single-pass, constant-memory algorithm
- collections.defaultdict-based: the input may be unsorted and it is also a linear algorithm, but it is O(number_of_nonempty_bins) in memory
groupby()-based solution:
from itertools import groupby

times = (d['first_alerted_time'] for d in example_table)
bins = date_range(start_date, end_date + step, step)

def key(dt, end=[next(bins)]):
    while end[0] <= dt:
        end[0] = next(bins)
    return end[0]

print dict((end-step, sum(1 for _ in g)) for end, g in groupby(times, key=key))
It produces the same output as the histogram()-based approach.
Note: all dates that are less than start_date are put in the first bin.
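As the note says, the snippet above does not drop dates outside the range. If out-of-range timestamps should be excluded instead, one possible variation filters them before grouping. This is a sketch under assumptions: date_range is assumed to behave like the perdelta generator from the question, and count_sorted is a hypothetical wrapper name.

```python
from datetime import datetime, timedelta
from itertools import groupby


def date_range(start, end, step):
    # Assumed helper: yields start, start + step, ... while below end.
    curr = start
    while curr < end:
        yield curr
        curr += step


def count_sorted(times, start, end, step):
    """Hypothetical wrapper: (bin_start, count) pairs for sorted times in [start, end)."""
    if end <= start:
        return []
    in_range = (dt for dt in times if start <= dt < end)
    ends = date_range(start + step, end + step, step)  # right edges of the bins
    state = [next(ends)]  # right edge of the current bin

    def key(dt):
        # Advance to the first right edge that lies beyond dt.
        while state[0] <= dt:
            state[0] = next(ends)
        return state[0]

    return [(bin_end - step, sum(1 for _ in group))
            for bin_end, group in groupby(in_range, key=key)]


sorted_times = [
    datetime(2014, 12, 10, 23, 59),      # before start: filtered out
    datetime(2014, 12, 11, 2, 1, 45),
    datetime(2014, 12, 11, 2, 2, 33),
    datetime(2014, 12, 11, 2, 2, 45),
    datetime(2014, 12, 11, 2, 3, 45),
]
pairs = count_sorted(sorted_times,
                     datetime(2014, 12, 11, 2, 0),
                     datetime(2014, 12, 11, 2, 6),
                     timedelta(minutes=1))
# pairs holds 02:01 -> 1, 02:02 -> 2, 02:03 -> 1
```

Because the filter guarantees every grouped timestamp is below end, the edge generator always has a further edge to yield, so next(ends) inside key cannot run out. Only non-empty bins are reported, as in the original snippet.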
defaultdict()-based solution:
from collections import defaultdict

def to_s(td):  # for Python 2.6
    return td.days*86400 + td.seconds  # NOTE: ignore microseconds

times = (d['first_alerted_time'] for d in example_table)
counter = defaultdict(int)
for dt in times:
    if start_date <= dt < end_date:
        counter[to_s(dt - start_date) // to_s(step)] += 1

print dict((ts*step + start_date, count) for ts, count in counter.iteritems())
The output is the same as the other two solutions.
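Because to_s() ignores microseconds, a step shorter than one second would make to_s(step) equal 0 and the integer division would fail. If sub-second intervals matter, one possible variation does the same arithmetic in integer microseconds. The names to_us and count_by_bin below are hypothetical, and the code avoids timedelta.total_seconds() so that it still runs on Python 2.6.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def to_us(td):
    # Exact length of a timedelta in integer microseconds.
    return (td.days * 86400 + td.seconds) * 10**6 + td.microseconds


def count_by_bin(times, start, end, step):
    """Hypothetical helper: map bin start -> count for times in [start, end).

    The input may be unsorted; only non-empty bins appear in the result.
    """
    counter = defaultdict(int)
    step_us = to_us(step)
    for dt in times:
        if start <= dt < end:
            counter[to_us(dt - start) // step_us] += 1
    return dict((start + i * step, n) for i, n in counter.items())


start = datetime(2014, 12, 11, 2, 0)
half_second = timedelta(milliseconds=500)
sub_second_times = [
    start + timedelta(milliseconds=900),
    start + timedelta(milliseconds=250),
    start + timedelta(milliseconds=700),
]
result = count_by_bin(sub_second_times, start,
                      start + timedelta(seconds=2), half_second)
# result maps 02:00:00.0 -> 1 and 02:00:00.5 -> 2
```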