问题描述
我正在尝试使用 R/RODBC 从 Microsoft SQL Server 数据库查询变量.RODBC 将字符串截断为 8000 个字符.
I am trying to query a variable from a Microsoft SQL Server database using R/RODBC. RODBC is truncating the character string at 8000 characters.
原始代码:截断 255 个字符(根据 RODBC 文档)库(RODBC)con_string <- odbcConnect("DSN")query_string <- "SELECT text_var FROM table_name"数据 <- sqlQuery(con_string,query_string,stringsAsFactors=FALSE)
Original code: truncates at 255 characters (as per RODBC documentation)library(RODBC)con_string <- odbcConnect("DSN")query_string <- "SELECT text_var FROM table_name"dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)
部分解决方案:修改查询字符串截断7999个字符后的文本.库(RODBC)con_string <- odbcConnect("DSN")query_string <- "SELECT [text_var]=CAST(text_var AS VARCHAR(8000)) FROM table_name"数据 <- sqlQuery(con_string,query_string,stringsAsFactors=FALSE)
Partial solution: modifying query string truncate text after 7999 characters.library(RODBC)con_string <- odbcConnect("DSN")query_string <- "SELECT [text_var]=CAST(text_var AS VARCHAR(8000)) FROM table_name"dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)
表/变量包含长达 250,000 个字符的文本字符串.我真的很想处理 R 中的所有文本.这可能吗?
The table/variable contains text strings at long as 250,000 characters. I really want to work with all the text in R. Is this possible?
@BrianRipley 在以下文档的第 18 页讨论了这个问题(但没有解决方案):https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf
@BrianRipley discusses the problem (but no solution) on page 18 of following document:https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf
@nutterb 在 GitHub 上讨论了 RODBCext
包的类似问题:https://github.com/zozlak/RODBCext/issues/6
@nutterb dicusses similar issues with RODBCext
package on GitHub:https://github.com/zozlak/RODBCext/issues/6
在 SO 上看到过类似的讨论,但没有使用 RODBC 和 VARCHAR>8000 的解决方案.
Have seen similar discussion on SO, but no solution using RODBC with VARCHAR>8000.
RODBC sqlQuery() 在应该返回 varchar(MAX) 时返回 varchar(255)
注意:
- R 3.3.2
- Microsoft SQL Server 2012
- Linux RHEL 7.1
- 用于 SQL Server 的 Microsoft ODBC 驱动程序
推荐答案
由于这是 Microsoft 提供的 ODBC 驱动程序的一个限制,因此在他们对驱动程序进行更改之前几乎没有什么可做的.@zozlak 在您链接到的 GitHub 问题中解释了原因.
Since this is a limitation of the ODBC driver provided by Microsoft, there's little to be done until they make a change to the driver. @zozlak explains why in the GitHub issue you linked to.
我倾向于在需要时使用存储过程来解决这个问题,但这通常需要为每个特定实例编写一个存储过程.在某些时候,我可能会想出一种更通用的方法在存储过程中执行此操作,但我发现在存储过程中构造查询的过程既乏味又令人沮丧.
I tend to use stored procedures to get around this when I need to, but that usually requires writing a stored procedure for each particular instance. At some point, I may come up with a way to do this in a stored procedure more generically, but I find the process of constructing queries in a stored procedure to be tedious and frustrating.
出于这个原因,我只是花了一些时间来构建一个函数,该函数将执行涉及 VARCHAR(MAX) 变量的有限查询.这是一种蛮力方法,对于 17000 个字符的变量将其导出为三个变量并将它们粘贴到 R 中.这很粗糙,可能不是很有效,但我已经提出了最好的解决方案.
For that reason, I just spent some time building a function that will do limited queries that involve VARCHAR(MAX) variables. It's a brute force approach that for a variable of 17000 characters will export it in three variables and paste them together in R. It's crude, probably not very efficient, but the best solution I've come up with yet.
另一个限制是它不允许您在查询中重命名变量.您将被数据库中命名的变量所困扰.如果您只涉及几张桌子,那可能不是问题.在非常复杂的数据库中,这可能是有问题的.但是,至少有了这个,你可以只查询 VARCHAR(MAX) 变量和一些在 R 中执行合并所需的 ID.
The other limitation is that it doesn't allow you to rename variables within the query. You'll be stuck with the variables as they are named in the database. If you're only involving a couple of tables, that may not be a problem. In very complex database, this can be problematic. But, at least with this you could query just the VARCHAR(MAX) variables with a handful of necessary IDs on which to perform a merge in R.
如 GitHub 问题中所述,最好尽可能避免使用 VARCHAR(MAX).如果确实需要未知长度,则 VARBINARY(MAX) 更容易查询.
As discussed in the GitHub issue, it's probably best to avoid VARCHAR(MAX) as much as possible. If an unknown length is truly necessary, VARBINARY(MAX) is a little easier to query.
channel <- odbcDriverConnect(...)
query_varchar_max(channel = channel,
id = c("idvar"),
varchar_max = c("varchar_max_var", "varchar_max_var2"),
from = "FROM dbo.table_name WHERE group = ?",
data = list(group = "A"))
功能代码
#' @name query_varchar_max
#' @title Query a VARCHAR(MAX) Variable from SQL Server
#'
#' @description The RODBC driver to SQL Server (SQL Server Native Client 11.0)
#' reports the lenght of a VARCHAR(MAX) variable to be zero. This presents
#' difficulties in extracting long text values from the database. Often, the
#' ODBC will assume a length of 255 characters and truncate the text to that
#' many characters. The approach taken here searches the VARCHAR(MAX) variables
#' for the longest length, and extracts the data in segments to be pasted
#' together in R.
#'
#' @param channel A valid ODBC channel to a SQL Server database.
#' @param id A character vector of ID variables that may be used to merge the
#' data from this query into another dataset.
#' @param varchar_max a character vector of variable names that are to be
#' treated as if they are VARCHAR(MAX) variables.
#' @param from A single character string providing the remainder of the query
#' to be run, beginning with the \code{FROM} statement.
#' @param stringsAsFactors \code{logical(1)}. Should character strings returned
#' from the database be converted to factors?
#' @param ... Additional arguments to \code{sqlExecute} when running the full
#' query.
#'
#' @details \code{query_varchar_max} operates by determining how many columns of up to
#' 8000 characters each are required to export a complete VARCHAR(MAX) variable.
#' It then creates the necessary number of intermediate variables and queries the
#' data using the SQL Server \code{SUBSTRING} command, extracting the VARCHAR(MAX)
#' variable in increments of 8000 characters. After completing the query,
#' the intemediary variables are concatenated and removed from the data.
#'
#' The function makes accommodation for multi-part queries as far as [TABLE].[VARIABLE]
#' formats are concerned. It is not intended for use in [SCHEMA].[TABLE].[VARIABLE]
#' formats. This at least allows \code{from} to include joins for more complex
#' queries. Parameterized queries are also supported through \code{sqlExecute}.
#'
#' @export
query_varchar_max <- function(channel, id, varchar_max, from,
stringsAsFactors = FALSE, ...)
{
coll <- checkmate::makeAssertCollection()
checkmate::assert_class(x = channel,
classes = "RODBC",
add = coll)
checkmate::assert_character(x = id,
add = coll)
checkmate::assert_character(x = varchar_max,
add = coll)
checkmate::assert_character(x = from,
len = 1,
add = coll)
checkmate::assert_logical(x = stringsAsFactors,
len = 1,
add = coll)
checkmate::reportAssertions(coll)
varchar_max_len <-
paste0(
sprintf("MAX(LEN(%s)) AS len_%s",
varchar_max,
sub("[.]", "_", varchar_max)),
collapse = ", "
)
varchar_len <-
unlist(
RODBCext::sqlExecute(
channel = channel,
query = sprintf("SELECT %s %s",
varchar_max_len,
from),
fetch = TRUE
)
)
varchar_max_cols <-
unlist(
mapply(expand_varchar_max,
varchar_max,
varchar_len,
SIMPLIFY = FALSE)
)
Prelim <-
RODBCext::sqlExecute(
channel = channel,
query = sprintf("SELECT %s, %s %s",
paste0(id, collapse = ", "),
paste0(varchar_max_cols, collapse = ", "),
from),
fetch = TRUE,
stringsAsFactors = stringsAsFactors,
...
)
var_stub_to_combine <-
unique(
sub(
"(part)(\\d{1,3})",
"\\1",
sub(".+AS ", "", varchar_max_cols)
)
)
col_to_combine <-
lapply(var_stub_to_combine,
grep,
names(Prelim))
Prelim[sub(".+[.]", "", varchar_max)] <-
lapply(col_to_combine,
function(col) apply(Prelim[col], 1, paste0, collapse = ""))
Prelim[-unlist(col_to_combine)]
}
expand_varchar_max <- function(varchar_max, varchar_len)
{
nvar <- varchar_len %/% 8000 + 1
var_list <- vector("character", length = nvar)
for (i in seq_along(var_list))
{
var_list[i] <-
sprintf("SUBSTRING(%s, %s, %s) AS %s_part%s",
varchar_max,
1 + (i - 1) * 8000,
8000,
paste0(sub("[.]", "_", varchar_max)),
i)
}
var_list
}
这篇关于RODBC 和 Microsoft SQL Server:截断长字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!