Problem description
I have time series data from three different sensors covering a period of one year. Each sensor produces a data point roughly every 3 minutes, but the sensors are not synchronized, so their readings arrive at different times relative to each other.
The data is stored in a SQLite database, in a single table of approximately half a million records. I intend to display it using the JavaScript charting library dygraphs. I have already produced a time series chart for each sensor individually by running an SQL query by sensor name and saving the result to CSV. I now want one chart that displays all the data points, with one line per sensor.
I have created a 2D numpy array of strings called 'minutes_array'. Its first column holds unix timestamps rounded to the nearest minute, one for every minute from the start of the sensor time series to the end. The remaining three columns are empty, to be filled with data from each of the three sensors where available.
minutes_array
[['1316275620' '' '' '']
['1316275680' '' '' '']
['1316275740' '' '' '']
...,
['1343206920' '' '' '']
['1343206980' '' '' '']
['1343207040' '' '' '']]
The sensor time series data is also rounded to the nearest minute. I then use numpy.in1d on the timestamps from 'minutes_array' above and from the 'sensor_data' array to create a mask of the records relating to that sensor.
sensor_data
[['1316275680' '215.2']
['1316275860' '227.0']
['1316276280' '212.2']
...,
['1343206380' '187.7']
['1343206620' '189.4']
['1343206980' '192.9']]
mask = np.in1d(minutes_array[:,0], sensor_data[:,0])
[False True False ..., False True False]
I then wish to modify the records in minutes_array for which the mask is True, placing the sensor_data value into the first column after the timestamp. From my attempts, it does not seem possible to alter the original 'minutes_array' when a mask is applied to it. Is there a way to achieve this in numpy without using for loops and matching timestamps individually?
Solved
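Based on the answer below from @eumiro, I used a solution from the pandas docs together with the 'sensor_data' numpy array described above:
import pandas as pd

sensors = ['s1', 's2', 's3']
sensor_results = {}
for sensor in sensors:
    # get_array is my own helper; it returns the two-column
    # [timestamp, value] array shown above for the given sensor
    sensor_data = get_array(db_cursor, sensor)
    sensor_results[sensor] = pd.Series(sensor_data[:, 1],
                                       index=sensor_data[:, 0])
# the Series are aligned on their timestamp index
df = pd.DataFrame(sensor_results)
df.to_csv("output.csv")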
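The snippet above depends on a `get_array` helper that is not shown. As a rough, self-contained sketch of what such a helper might look like, the example below builds a tiny in-memory SQLite database and runs the same loop against it. The table name `readings`, its columns, and the body of `get_array` are assumptions made for illustration, not the original poster's code.

```python
import sqlite3

import numpy as np
import pandas as pd


def get_array(db_cursor, sensor):
    """Hypothetical helper: return an (n, 2) float array of [timestamp, value].

    The table `readings` and its column names are assumed for this sketch.
    """
    db_cursor.execute(
        "SELECT timestamp, value FROM readings "
        "WHERE sensor = ? ORDER BY timestamp",
        (sensor,),
    )
    return np.array(db_cursor.fetchall(), dtype=float)


# Build a small in-memory database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE readings (sensor TEXT, timestamp INTEGER, value REAL)")
cur.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("s1", 1316275620, 1.0), ("s1", 1316275680, 2.0),
        ("s2", 1316275620, 10.0), ("s2", 1316275740, 20.0),
        ("s3", 1316275680, 100.0), ("s3", 1316275740, 200.0),
    ],
)

sensor_results = {}
for sensor in ["s1", "s2", "s3"]:
    sensor_data = get_array(cur, sensor)
    # Use integer timestamps as the index and the readings as the values.
    sensor_results[sensor] = pd.Series(
        sensor_data[:, 1], index=sensor_data[:, 0].astype(np.int64)
    )

# The DataFrame constructor aligns the Series on the union of their
# indexes; a sensor with no reading at a timestamp gets NaN there.
df = pd.DataFrame(sensor_results)
```

Because the alignment happens on the index, there is no need to pre-build a minute-by-minute grid: only timestamps that occur in at least one sensor's data end up as rows.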
Half a million records is well within what a Python dictionary can handle.
Read the data for all sensors from the database, fill a dictionary, and then build a numpy array or, even better, convert it to a pandas.DataFrame:
import pandas as pd

inp1 = [(1316275620, 1), (1316275680, 2)]
inp2 = [(1316275620, 10), (1316275740, 20)]
inp3 = [(1316275680, 100), (1316275740, 200)]
inps = [('s1', inp1), ('s2', inp2), ('s3', inp3)]

data = {}
for name, inp in inps:
    d = data.setdefault(name, {})
    for timestamp, value in inp:
        d[timestamp] = value
df = pd.DataFrame.from_dict(data)
df is now:
             s1   s2   s3
1316275620    1   10  NaN
1316275680    2  NaN  100
1316275740  NaN   20  200
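For completeness, the in-place mask assignment the question asks about can also be done in plain numpy. A likely cause of the difficulty described is chained indexing: `minutes_array[mask][:, 1] = ...` writes into a temporary copy, whereas a single indexing expression such as `minutes_array[mask, 1] = ...` writes into the original array. The sketch below uses a float array with NaN for missing values instead of the string array from the question, and a short made-up time range. It also assumes the sensor timestamps are sorted, unique, and all present in the minute grid, since otherwise the values would not line up with the mask.

```python
import numpy as np

# One row per minute: column 0 is the timestamp, columns 1-3 are the sensors.
start, end = 1316275620, 1316275920  # short illustrative range
minutes = np.arange(start, end + 60, 60)
minutes_array = np.full((minutes.size, 4), np.nan)
minutes_array[:, 0] = minutes

# Readings for one sensor, already rounded to the minute.
# Assumed sorted by timestamp, without duplicates, and inside the grid.
sensor_data = np.array([
    [1316275680, 215.2],
    [1316275860, 227.0],
])

# True for every grid row whose timestamp appears in sensor_data.
mask = np.isin(minutes_array[:, 0], sensor_data[:, 0])

# A single indexing expression assigns into the original array.
# (minutes_array[mask][:, 1] = ... would only modify a temporary copy.)
minutes_array[mask, 1] = sensor_data[:, 1]
```

If rounding can produce duplicate timestamps for a sensor, the duplicates would need to be dropped or averaged first; the pandas approach above leaves that decision explicit as well.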