Problem Description
I am working with a huge dataset with 5 columns and more than 90 million rows. The code works fine on part of the data, but when it comes to the whole dataset I get a MemoryError. I have read about generators, but it seems very complex to me. Can I get an explanation based on this code?
import pandas as pd

df = pd.read_csv('D:.../test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"])
df = df[:-1]  # drop the last (malformed) row
df.loc[:, 'timestamp'] = pd.to_datetime(df.loc[:, 'timestamp'])
pd.set_option('float_format', '{:f}'.format)
df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9
df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
del df['timestamp']

for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = df.loc[df['day_of_week'] == day]
    day_df.to_csv(f'{day}.csv', index=False)
The error appears in the last for loop operation. Sample rows from the dataset:
d4ace40905729245a5a0bc3fb748d2b3 1 2016-06-01T08:18:46.000Z 22.9484 56.7728
d4ace40905729245a5a0bc3fb748d2b3 2 2016-06-01T08:28:05.000Z 22.9503 56.7748
UPDATED

I did this:
# df_chunk is assumed to be the iterator returned by pd.read_csv(..., chunksize=...);
# the question does not show how it was created.
chunk_list = []
for chunk in df_chunk:
    chunk_list.append(chunk)

df_concat = pd.concat(chunk_list)
I have no idea how to proceed from here. How do I apply the rest of the code?
Recommended Answer
Summary of the improvements:

- To iterate through a (potentially very large) file lazily rather than reading the entire file into memory, specify chunksize in the read_csv call (the number of rows to read per iteration).
- The statement df = df[:-1] is not applicable in the iterator approach. Assuming the last line is in a bad format (99695386 [space] NaN NaN NaN NaN), we can handle and skip it by specifying the option error_bad_lines=False.
- The statement df.loc[:,'timestamp'] = pd.to_datetime(df.loc[:,'timestamp']) can also be eliminated by passing parse_dates=['timestamp'] as an option to the pd.read_csv call.
- We append to an existing target CSV file by specifying mode='a' (append to a file).
In practice:
import pandas as pd

n_rows = 10 ** 6  # rows per chunk, adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                     parse_dates=['timestamp'], chunksize=n_rows, error_bad_lines=False)

day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader:
    if not df.empty:
        df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9
        df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
        del df['timestamp']
        for day in day_names:
            day_df = df.loc[df['day_of_week'] == day]
            if not day_df.empty:
                day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')
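Once the loop has finished, the per-day files can be spot-checked by reading one of them back with the column names supplied explicitly (header=False was used when writing). This is just an illustrative snippet, assuming a Monday.csv file was produced:

import pandas as pd

# Column order as written above: the original columns minus 'timestamp',
# plus the derived 'epoch' and 'day_of_week'.
cols = ["id_easy", "ordinal", "latitude", "longitude", "epoch", "day_of_week"]
monday = pd.read_csv('Monday.csv', names=cols)
print(monday.shape)
print(monday.head())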
See: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
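Note that some options used above have since been renamed in pandas: error_bad_lines was replaced by on_bad_lines in pandas 1.3, and Series.dt.weekday_name was removed in favour of Series.dt.day_name(). A minimal sketch of the same approach under those newer APIs (this assumes a recent pandas version and is not part of the original answer):

import pandas as pd

n_rows = 10 ** 6  # rows per chunk, adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                     parse_dates=['timestamp'], chunksize=n_rows, on_bad_lines='skip')

day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader:
    if df.empty:
        continue
    df['epoch'] = df['timestamp'].astype('int64') // 1e9
    df['day_of_week'] = df['timestamp'].dt.day_name()  # replaces .dt.weekday_name
    del df['timestamp']
    for day in day_names:
        day_df = df.loc[df['day_of_week'] == day]
        if not day_df.empty:
            day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')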