从 R 写表到 SAP HANA 的有效方法

本文介绍了从 R 写表到 SAP HANA 的有效方法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个大约 50,000 行和 12 列的表 (df) 可写入 SAP HANA.我使用RJDBC库，逐行写成如下:

I have a table (df) of about 50,000 rows and 12 columns to write to SAP HANA.I use the RJDBC library and write row by row as follows:

# Returns the sql statement to insert one row
 build_insert_string <- function(db_output, row) {
  row_string <- paste(row, collapse="','")
  statement <- paste('INSERT INTO "', db_output$SCHEMA, '"."',db_output$table_name,'" (',db_output$string_of_columns,') VALUES (\'', row_string, '\');', sep='')
  return(statement)
}

# Insert row by row
for(i in 1:nrow(df)){
    tryCatch({ dbGetQuery(jdbcConnection, build_insert_string(db_output, df[i,])) }, error = function(e) {handle_db_errors(e)})
  }

其中 db_output 是一个包含输出常量(架构、表和列)的列表.

where db_output is a list containing the output constants (schema, table and columns).

目前，写表几乎需要半天时间.貌似HANA不支持批量插入比如:

Currently, it takes almost half of day to write the table.It seems that HANA does not support batch inserts such as:

INSERT INTO example
  (example_id, name, value, other_value)
VALUES
  (100, 'Name 1', 'Value 1', 'Other 1'),
  (101, 'Name 2', 'Value 2', 'Other 2'),
  (102, 'Name 3', 'Value 3', 'Other 3'),
  (103, 'Name 4', 'Value 4', 'Other 4');

有没有人遇到过这个挑战，如果有，你有没有找到绕过它并提高写作效率的方法?

Did anyone encounter this challenge, and if so, did you find a way to circumvent it and improve the writing efficiency?

推荐答案

我会留在这里给后代:

虽然 dbGetQuery 是大表的干净解决方案 - 它执行查询，然后在每次插入后清除结果集，但速度也很慢.

While dbGetQuery is the clean solution for large tables – it executes the query and then clears the resultset after each insertion, it is also slow.

显然，多个 INSERT 从 SQL 编辑器发送到 SAP HANA 时成功，但从 R 发送时则不成功.

Apparently, multiple INSERTs into SAP HANA are successful when sent from the SQL editor but not when sent from R.

一个(真正)快速的解决方案将提供:

A (really) fast solution would be provided by:

dbWriteTable (
conn = jdbcConnection,
name= paste0(db_output$SCHEMA, ".",db_output$table_name),
value = df,
row.names = FALSE,
field.types = db_output$string_of_columns,
append=TRUE
)

然而，dbWriteTable() 不适用于大表(它会引发内存限制错误).可以通过修改Xmx Java 选项来增加内存分配池来规避此限制，例如:options(java.parameters="- Xmx5000m").使用它有你自己的危险，特别是如果你的目标是自动化编写越来越大的表格.

However, dbWriteTable() is not meant for large tables (it will throw a memory limit error). This limitation can be circumvented by increasing the memory allocation pool by modifying the Xmx Java option, such as: options(java.parameters="- Xmx5000m"). Use it at your own peril, especially if you aim to automate the writing of increasingly big tables.

我们探索的另一个潜在解决方案是将 R 输出导出为 .csv(或多个 .csv，以防超过100 万行)，然后发送查询以将 .csv 导入到 SAP HANA.大型 csv 导入到 SAP HANA 的速度非常快，但此解决方案需要一个额外的步骤(中间 .csv 输出)并且更容易出现不正确的数据导入.

Another potential solution we explored was to export the R ouput as .csv (or multiple .csvs in case of a more than 1 million rows), and then send a query to import the .csvs to SAP HANA. Large csvs get imported very fast to SAP HANA, but this solution entails an extra step (an intermediary .csv output) and it is more prone to incorrect data importation.

这篇关于从 R 写表到 SAP HANA 的有效方法的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！