Problem description
I've researched this question heavily for the past few days and I still cannot find a solution to my problem.
Below is an example of my dataframe, named 'dfs'. It has around 80 columns; only 4 are shown in the example below.
dfs is a large dataframe with rows of data reported every 15 minutes for over 12 months (2015-08-01 00:00:00 to 2016-09-30 23:45:00). The Datetime column is stored as datetime.
...
...
I want to export (or write) multiple monthly csv files, each a snippet of one month of data taken from the original large dataframe (dfs). For each month, I want three files: one with the raw data, one with the day data (6am-6pm) and one with the night data (6pm-6am). I also want each monthly file to be named automatically, so that it is called dfs_%Y%m, dfs_day_%Y%m or dfs_night_%Y%m depending on the data it contains.
At the moment I am writing out over 180 lines of code to export each csv file.
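For example:
I create each month's raw, day and night files by selecting the data between the datetimes listed below from the 'Datetime' index column: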
dfs201508 = dfs.loc['2015-08-01 00:00:00':'2015-08-31 23:45:00']
dfs201508Day = dfsDay.loc['2015-08-01 00:00:00':'2015-08-31 23:45:00']
dfs201508Night = dfsNight.loc['2015-08-01 00:00:00':'2015-08-31 23:45:00']
I then export these files to their respective output paths and give them file names:
dfs201508.to_csv(outputpath+"dfs201508.csv")
dfs201508Day.to_csv(outputpathDay+"dfs_day_201508.csv")
dfs201508Night.to_csv(outputpathNight+"dfs_night_201508.csv")
What I would like to write is something like this:
dfs_%Y%m = dfs.ix["%Y%m"]
dfs_day_%Y%m = dfs.ix["%Y%m(between 6am-6pm)"]
dfs_night_%Y%m = dfs.ix["%Y%m(between 6pm-6am)"]
dfs_%Y%m = dfs_%Y%m.to_csv(outputpath +"dfs_%Y%m.csv")
dfs_day_%Y%m = dfs_day_%Y%m.to_csv(outputpath%day +"dfs_day_%Y%m.csv")
dfs_night_%Y%m = dfs_night_%Y%m.to_csv(outputpath%night +"dfs_night_%Y%m.csv")
Any suggestions on code to automate this process would be greatly appreciated.
Here are some links to pages I researched:
https://www.youtube.com/watch?v=aeZKJGEfD7U
Recommended answer
You can use a for loop to iterate over the years and months contained in dfs. In the example below I created a dummy dataframe called DF, which contains just three sample columns:
dates                Egen1_kwh  Egen2_kwh
2016-01-01 00:00:00   15895880   15877364
2016-01-01 00:15:00   15895880   15877364
2016-01-01 00:30:00   15895880   15877364
2016-01-01 00:45:00   15895880   15877364
2016-01-01 01:00:00   15895880   15877364
The code below filters the main dataframe DF into smaller dataframes (NIGHT and DAY) for each month within each year and saves them as .csv files with names corresponding to their dates (e.g. 2016_1_DAY and 2016_1_NIGHT for January 2016 day and January 2016 night).
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016, 1, 1, 0, 0)
dates = [start + relativedelta(minutes=15*i) for i in range(0, 10000)]
Egen1_kwh = randint(15860938, 15898938)
Egen2_kwh = randint(15860938, 15898938)
DF = pd.DataFrame({
    'dates': dates,
    'Egen1_kwh': Egen1_kwh,
    'Egen2_kwh': Egen2_kwh,
})

# define when day starts and ends (MUST USE 24 CLOCK)
day = {
    'start': datetime.time(6, 0),   # starts at 6am (6:00)
    'end': datetime.time(18, 0)     # ends at 6pm (18:00)
}

# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year
yearRange = range(min_year, max_year + 1)

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1, 13):
        # half-open window [first of month, first of next month),
        # so every row of the month's last day is included
        month_start = datetime.datetime(year, month, 1)
        month_end = month_start + relativedelta(months=1)
        in_month = (DF.dates >= month_start) & (DF.dates < month_end)

        # DAY is strictly between day['start'] and day['end'];
        # rows exactly at 06:00 and 18:00 fall into NIGHT
        times = DF.dates.dt.time
        is_day = (times > day['start']) & (times < day['end'])

        # filter to show NIGHT and DAY dataframe for given month within given year
        NIGHT = DF[in_month & ~is_day]
        DAY = DF[in_month & is_day]

        # save to .csv with year and month in file name
        # specify the save path of your choice
        path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}_{1}_NIGHT.csv'.format(year, month)
        path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}_{1}_DAY.csv'.format(year, month)

        # some of the above NIGHT / DAY filtering will return no rows.
        # Check for this, and only save if the dataframe contains rows
        if NIGHT.shape[0] > 0:
            NIGHT.to_csv(path_night, index=False)
        if DAY.shape[0] > 0:
            DAY.to_csv(path_day, index=False)
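
If the dataframe is indexed by datetime, as dfs in the question appears to be, the same idea can probably be written more compactly by grouping on the calendar month. The following is only a sketch under some assumptions: the helper name split_monthly, its out_dir and prefix parameters, and the demo column 'value' are made up for illustration, a single output folder is used rather than separate raw/day/night folders, and 06:00 is counted as day while 18:00 is counted as night (adjust the comparison if you need a different convention).

```python
import os
import tempfile

import pandas as pd


def split_monthly(df, out_dir, prefix="dfs"):
    """Write raw, day and night CSV files for every month present in df.

    Assumes df has a DatetimeIndex. Returns the list of paths written.
    split_monthly, out_dir and prefix are hypothetical names for this sketch.
    """
    written = []
    # group rows by calendar month; each key is a pandas Period such as 2015-08
    for period, month_df in df.groupby(df.index.to_period("M")):
        # assumed convention: 06:00 <= time < 18:00 is day, everything else is night
        hours = month_df.index.hour
        is_day = (hours >= 6) & (hours < 18)

        stamp = period.strftime("%Y%m")  # e.g. "201508"
        parts = {
            f"{prefix}_{stamp}.csv": month_df,
            f"{prefix}_day_{stamp}.csv": month_df[is_day],
            f"{prefix}_night_{stamp}.csv": month_df[~is_day],
        }
        for name, part in parts.items():
            # skip empty selections so no header-only files are created
            if not part.empty:
                path = os.path.join(out_dir, name)
                part.to_csv(path)
                written.append(path)
    return written


if __name__ == "__main__":
    # small dummy frame: two months of 15-minute readings
    index = pd.date_range("2015-08-01 00:00", "2015-09-30 23:45", freq="15min")
    demo = pd.DataFrame({"value": range(len(index))}, index=index)
    demo.index.name = "Datetime"
    with tempfile.TemporaryDirectory() as tmp:
        for path in split_monthly(demo, tmp):
            print(os.path.basename(path))
```

Grouping by month means only months that actually occur in the data are visited, so there is no need to loop over all twelve months of every year and test for empty results month by month.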