Problem Description
I have a Pandas dataframe that I'm writing out to Snowflake using a SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of a Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at 15k records per chunk it takes forever to finish writing to Snowflake.
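For reference, the slow path looks roughly like this (a sketch; the connection setup mirrors the full script further down, and the table name, sample data and chunk size are only illustrative):

import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

# Connection details come from environment variables, as in the full script below.
engine = create_engine(
    URL(
        account=os.environ.get("DB_ACCOUNT"),
        user=os.environ.get("DB_USERNAME"),
        password=os.environ.get("DB_PASSWORD"),
        database="dummy_db",
        schema="dummy_schema",
        warehouse=os.environ.get("DB_WAREHOUSE"),
    )
)

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

# Without method=pd_writer, each chunk is sent as ordinary INSERT statements,
# which is why 500k+ rows at 15k rows per chunk take a very long time.
df.to_sql("dummy_demo_table", engine, if_exists="append", index=False, chunksize=15000)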
I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster, and I see it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.
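As far as I understand, pd_writer is simply passed to to_sql through the method argument, roughly like this (a sketch reusing the engine and df from the snippet above; the full script I actually ran is in the EDIT below):

from snowflake.connector.pandas_tools import pd_writer

# pd_writer stages the dataframe as Parquet files and loads them with COPY INTO
# instead of issuing multi-row INSERTs, which is why it is so much faster.
df.to_sql("dummy_demo_table", engine, if_exists="append", index=False, method=pd_writer)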
I thought it was a NaN to NULL issue and tried everything possible to replace the NaNs with None, and while it does the replacement within the dataframe, by the time it gets to the table, everything becomes NULL.
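For example, one of the replacement attempts looked roughly like this (illustrative data; the exact replacement call is not the point, the outcome was always the same):

import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Mark", None], "balance": [10, np.nan]})

# Swap NaN for None inside the dataframe; locally this works as expected,
# but the table written through pd_writer still ends up all NULL.
df = df.where(pd.notnull(df), None)
print(df)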
How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?
EDIT: Following Chris' answer, I decided to try with the official example. Here's my code and the result set:
import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    # Build a SQLAlchemy engine for Snowflake from environment variables.
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()

    try:
        # Write the dataframe using Snowflake's pd_writer insert method.
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()

    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
print(df.head())

create_table(df, "dummy_demo_table")
The code runs without a hitch, but when I look at the table that gets created, it's all NULLs. Again.
Recommended Answer
Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.
My workaround is to simply do df.columns = map(str.upper, df.columns) before invoking to_sql.
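Applied to the example from the EDIT above, the fix is one extra line before calling create_table (a sketch; nothing else changes):

# Upper-casing the column names keeps them aligned with how pd_writer
# references the columns when it copies the staged data (see the issue above).
df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
df.columns = map(str.upper, df.columns)  # -> ["NAME", "BALANCE"]
create_table(df, "dummy_demo_table")     # the values now land correctly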