Problem description
I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_gbq.html#pandas-dataframe-to-gbq. The problem is that to_gbq() takes 2.3 minutes, while uploading the same data directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know which is the faster alternative.
This is the script that I'm using:
dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )
dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB; this takes almost 2 seconds
# then manually upload the file via the GCS GUI
print(dataframe.shape)
# (363364, 21)
My question is, what is faster?
- Uploading the Dataframe using the pandas.DataFrame.to_gbq() function
- Saving the Dataframe as a CSV and then uploading it as a file to BigQuery using the Python API
- Saving the Dataframe as a CSV, uploading the file to Google Cloud Storage using this procedure, and then reading it from BigQuery
Update:
Alternative 1 seems faster than Alternative 2; using pd.DataFrame.to_csv() and load_data_from_file(), Alternative 2 takes on average 17.9 seconds more over 3 loops:
from google.cloud import bigquery

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = 'text/csv'
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
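For reference, a timed call to this helper might look like the sketch below; the dataset, table, and file names are placeholders rather than the ones actually used:

import time

# Hypothetical usage of load_data_from_file(); all names are placeholders.
start = time.time()
load_data_from_file('my_dataset', 'my_table', 'month_file.csv')
print('load took {:.1f}s'.format(time.time() - start))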
Recommended answer
I did the comparison for Alternatives 1 and 3 in Datalab using the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write
my_data = [{1, 2, 3}]
for i in range(0, 100000):
    my_data.append({1, 2, 3})
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False
                               )
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name).create(
    schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
and here are the results for n = {10000,100000,1000000}:
n        alternative_1  alternative_3
10000    30.72s         8.14s
100000   162.43s        70.64s
1000000  1473.57s       688.59s
Judging from the results, alternative 3 is faster than alternative 1.
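For readers who are not working in Datalab, a minimal sketch of Alternative 3 using the standalone google-cloud-storage and google-cloud-bigquery clients might look like the following; the function name, bucket, dataset, table, and file names are placeholders and not part of the original answer:

# A rough sketch of Alternative 3 outside Datalab: stage the CSV in GCS,
# then run a BigQuery load job from the staged object. All names are placeholders.
from google.cloud import bigquery, storage

def load_via_gcs(bucket_name, blob_name, source_file_name, dataset_id, table_name):
    # Step 1: upload the local CSV file to Google Cloud Storage
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(source_file_name)

    # Step 2: load the staged object into BigQuery
    bigquery_client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.autodetect = True
    job_config.skip_leading_rows = 1  # assumes the CSV was written with a header row
    uri = 'gs://{}/{}'.format(bucket_name, blob_name)
    job = bigquery_client.load_table_from_uri(
        uri, '{}.{}'.format(dataset_id, table_name), job_config=job_config)
    job.result()  # waits for the load job to finish
    print('Loaded {} rows into {}.{}.'.format(job.output_rows, dataset_id, table_name))

# Hypothetical usage; all names are placeholders:
# load_via_gcs('my-bucket', 'month_file.csv', 'month_file.csv', 'my_dataset', 'my_table')

The idea follows the spirit of Alternative 3: write the file to GCS once, then have BigQuery ingest it from there rather than pushing rows from the client.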