Problem Description
I am working with a huge dataset with 5 columns and more than 90 million rows. The code works fine on part of the data, but when it comes to the whole dataset I get a MemoryError. I have read about generators, but it seems very complex to me. Can I get an explanation based on this code?
import pandas as pd

df = pd.read_csv('D:.../test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"])
df = df[:-1]  # drop the last (malformed) row
df.loc[:, 'timestamp'] = pd.to_datetime(df.loc[:, 'timestamp'])
pd.set_option('float_format', '{:f}'.format)
df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9
df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
del df['timestamp']

for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = df.loc[df['day_of_week'] == day]
    day_df.to_csv(f'{day}.csv', index=False)
The error appears in the last for loop operation. Sample rows from the dataset:
d4ace40905729245a5a0bc3fb748d2b3 1 2016-06-01T08:18:46.000Z 22.9484 56.7728
d4ace40905729245a5a0bc3fb748d2b3 2 2016-06-01T08:28:05.000Z 22.9503 56.7748
UPDATED

I did this:
# df_chunk is assumed to be the iterator returned by pd.read_csv(..., chunksize=...);
# the question does not show how it was created.
chunk_list = []
for chunk in df_chunk:
    chunk_list.append(chunk)

df_concat = pd.concat(chunk_list)
I have no idea how to proceed from here. How do I apply the rest of the code?
Recommended Answer
Summary of the improvements:

- To iterate through a (potentially very large) file lazily rather than reading the entire file into memory, specify chunksize in the read_csv call (the number of rows to read per iteration).
- The statement df = df[:-1] is not applicable in the iterator approach. Assuming the last line is in a bad format (99695386 [space] NaN NaN NaN NaN), we can handle and skip it by specifying the option error_bad_lines=False.
- The statement df.loc[:,'timestamp'] = pd.to_datetime(df.loc[:,'timestamp']) can also be eliminated by passing parse_dates=['timestamp'] as an option to the pd.read_csv call.
- We append to an existing target CSV file by specifying mode='a' (append to a file).
In practice:
import pandas as pd

n_rows = 10 ** 6  # rows per chunk, adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                     parse_dates=['timestamp'], chunksize=n_rows, error_bad_lines=False)

day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader:
    if not df.empty:
        df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9
        df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
        del df['timestamp']
        for day in day_names:
            day_df = df.loc[df['day_of_week'] == day]
            if not day_df.empty:
                day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')
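Once the loop has finished, the per-day files can be spot-checked by reading one of them back with the column names supplied explicitly (header=False was used when writing). This is just an illustrative snippet, assuming a Monday.csv file was produced:

import pandas as pd

# Column order as written above: the original columns minus 'timestamp',
# plus the derived 'epoch' and 'day_of_week'.
cols = ["id_easy", "ordinal", "latitude", "longitude", "epoch", "day_of_week"]
monday = pd.read_csv('Monday.csv', names=cols)
print(monday.shape)
print(monday.head())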
See: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
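Note that some options used above have since been renamed in pandas: error_bad_lines was replaced by on_bad_lines in pandas 1.3, and Series.dt.weekday_name was removed in favour of Series.dt.day_name(). A minimal sketch of the same approach under those newer APIs (this assumes a recent pandas version and is not part of the original answer):

import pandas as pd

n_rows = 10 ** 6  # rows per chunk, adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                     parse_dates=['timestamp'], chunksize=n_rows, on_bad_lines='skip')

day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader:
    if df.empty:
        continue
    df['epoch'] = df['timestamp'].astype('int64') // 1e9
    df['day_of_week'] = df['timestamp'].dt.day_name()  # replaces .dt.weekday_name
    del df['timestamp']
    for day in day_names:
        day_df = df.loc[df['day_of_week'] == day]
        if not day_df.empty:
            day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')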