Why does the SSIS TOKEN function fail to count adjacent column delimiters?

Problem description

I ran into a problem with SQL Server Integration Services 2012's new string function in the Expression Editor called TOKEN().

This is supposed to help you parse a delimited record. If the record comes out of a flat file, you can do this with the Flat File Source. In this case, I am dealing with old delimited import records that were stored as strings in a database VARCHAR field. Now they need to be extracted, massaged, and re-exported as delimited strings. For example:

1^Apple^0001^01/01/2010^Anteater^A1
2^Banana^0002^03/15/2010^Bear^B2
3^Cranberry^0003^4/15/2010^Crow^C3

If these strings are in a column called OldImportRecord, the delimiter is a caret (as shown), and we wish to put the fifth field into a Derived Column, we would use an expression like:

TOKEN(OldImportRecord,"^",5)

This returns Anteater, Bear, Crow, etc. In fact, we can create Derived Columns for each of the fields in this record (note that the index is one-based), change them as needed, and then build another delimited record for export.

Here's the problem. What if some of our data includes empty strings (or Nulls rendered as empty strings)?

4^^0004^6/15/2010^Duck^D4

TOKEN() fails to count the adjacent column delimiters, which throws off the column count. Now it sees only five columns instead of six. Our TOKEN(OldImportRecord,"^",5) returns "D4" instead of the intended "Duck". When we extract the fourth column, we wind up trying to put "Duck" into a Date column, and all sorts of fun ensues.
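The shift described above can be reproduced outside SSIS. The following is a minimal Python sketch, not SSIS code: the function token is a hypothetical stand-in that assumes TOKEN() skips empty fields the way the question observes, and its out-of-range handling is a placeholder rather than a claim about SSIS.

```python
def token(text: str, delimiters: str, occurrence: int) -> str:
    """Hypothetical model of SSIS TOKEN(): 1-based, empty fields are skipped."""
    tokens = []
    current = ""
    for ch in text:
        if ch in delimiters:
            if current:              # adjacent delimiters produce no token
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    # Out-of-range handling is a placeholder, not a claim about SSIS.
    return tokens[occurrence - 1] if 1 <= occurrence <= len(tokens) else ""


record = "4^^0004^6/15/2010^Duck^D4"
fifth = token(record, "^", 5)    # lands on the last field instead of "Duck"
fourth = token(record, "^", 4)   # the value that ends up in the date slot
```

Because the empty second field never becomes a token, every later field moves one position to the left.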

Here's a partial workaround:

TOKEN(REPLACE(OldImportRecord,"^^","^ ^"),"^",5)

Notice that this misses every second delimiter pair, because REPLACE() does not rescan the text it has just inserted. It will therefore fail for a string like "5^^^^Emu^E5", which looks like "5^ ^^ ^Emu^E5" after the REPLACE(). The column count is still wrong.
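The single-pass behaviour can be checked with a short Python sketch. It assumes that Python's str.replace, which also works left to right without overlapping matches, is a fair stand-in for the SSIS REPLACE() call; pad_once is a hypothetical helper name.

```python
def pad_once(record: str, delimiter: str = "^") -> str:
    """One left-to-right, non-overlapping pass, like a single REPLACE()."""
    return record.replace(delimiter * 2, delimiter + " " + delimiter)


single = pad_once("4^^0004^6/15/2010^Duck^D4")   # one empty field: fully padded
run = pad_once("5^^^^Emu^E5")                    # three empty fields in a row
still_adjacent = "^^" in run                     # a pair survives the pass
```

The surviving pair sits between the two replacements, which is why a second pass is needed for runs of empty fields.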

So here's my full workaround. It uses two nested REPLACE() calls, an RTRIM() to remove the superfluous spaces, and a DT_STR cast because I would like to keep the result in VARCHAR:

(DT_STR,255,1252)RTRIM(TOKEN(REPLACE(REPLACE(OldImportRecord,"^^","^ ^"),"^^","^ ^"),"^",5))
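As a rough check of this workaround, here is a Python sketch of the same steps: two replace passes, a token lookup, and a right trim. The names token and fifth_field are hypothetical, and token only models the skip-empty-fields behaviour described in the question.

```python
def token(text: str, delimiter: str, occurrence: int) -> str:
    """Hypothetical stand-in for TOKEN(): 1-based, empty fields are skipped."""
    tokens = [piece for piece in text.split(delimiter) if piece]
    # Out-of-range handling is a placeholder, not a claim about SSIS.
    return tokens[occurrence - 1] if 1 <= occurrence <= len(tokens) else ""


def fifth_field(record: str) -> str:
    """Mirror of RTRIM(TOKEN(REPLACE(REPLACE(...)),"^",5)) in this model."""
    padded = record.replace("^^", "^ ^").replace("^^", "^ ^")
    return token(padded, "^", 5).rstrip(" ")


results = {
    record: fifth_field(record)
    for record in ("4^^0004^6/15/2010^Duck^D4", "5^^^^Emu^E5", "^^^^Pheasant^G7")
}
```

In this model, a record that starts with the delimiter, such as "^^^^Pheasant^G7", still comes out one field short, because nothing is inserted before the leading caret.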

I am posting this for information, since others may also run into this problem.

Does anyone have a better workaround, or even a real solution?

Solution

Reason for the issue:

The TOKEN function in SSIS uses the implementation of the strtok function in C++. I gathered this information while reading the book Microsoft® SQL Server® 2012 Integration Services, where it is mentioned in a note on page 113 (I like this book! Lots of nice information).

I searched for the implementation of the strtok function and found the following links.

INFO: strtok(): C Function -- Documentation Supplement: the code sample in this article shows that the function does ignore consecutive delimiter characters.

The answers to the following SO questions point out that the strtok function is designed to ignore consecutive delimiters.

Need to know when no data appears between two token separators using strtok()

strtok_s behaviour with consecutive delimiters

I think that the TOKEN and TOKENCOUNT functions are working as designed, but whether that is how SSIS should behave might be a question for the Microsoft SSIS team.
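To make the difference concrete, the sketch below contrasts the two ways of splitting a record in Python. It is an illustration only: strtok_style imitates the collapse-consecutive-delimiters behaviour attributed to strtok above, keep_empty is a plain positional split, and both names are hypothetical.

```python
def strtok_style(record: str, delimiter: str) -> list:
    """Consecutive delimiters collapse, so empty fields disappear."""
    return [piece for piece in record.split(delimiter) if piece]


def keep_empty(record: str, delimiter: str) -> list:
    """Every delimiter starts a new field, so positions are preserved."""
    return record.split(delimiter)


record = "5^^^^Emu^E5"
collapsed = strtok_style(record, "^")    # three tokens
positional = keep_empty(record, "^")     # six fields, three of them empty
```

With the positional split, the fifth field stays at index 5 regardless of how many earlier fields are empty; with the strtok-style split, the count itself changes from row to row.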

Original Post - Above section is an update:

I created a simple package in SSIS 2012 based on your data inputs. As you had described in your question, the TOKEN function does not behave as intended. I agree with you that the function doesn't seem to work. This post is not an answer to your original issue.

Here is an alternative way to write the expression that is somewhat simpler. It only works if the last segment in your input record always has a value (say A1, B2, C3, etc.).

The expression can be rewritten as:

(DT_STR,50,1252)TOKEN(OldImportRecord,"^",TOKENCOUNT(OldImportRecord,"^") - 1)

This statement takes the input record as the first parameter and the caret delimiter (^) as the second. The third parameter uses TOKENCOUNT to calculate the total number of segments in the record when it is split by the delimiter, then subtracts 1 to fetch the penultimate segment. If the last segment has data, and at least one earlier segment does too, you are guaranteed to have at least two segments, so the index is valid.

I created a simple package with a data flow task. An OLE DB source retrieves the data, and the derived column transformation parses and splits the data as per the screenshot below. The output is then inserted into the destination table. You can see the source and destination tables in the last screenshot. The destination table has two columns: the first stores the penultimate segment, and the second stores the segment count based on the delimiter (which again isn't correct). Notice that the last record didn't fetch the correct result. If the last record had not had the value 8, the above expression would fail because the index would evaluate to zero.
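The outcome for each sample row can be worked through with a small Python sketch of the TOKENCOUNT(...) - 1 idea. It assumes the same skip-empty-fields counting discussed above; tokens and penultimate are hypothetical helper names, and returning None for an index of zero is a placeholder for the failure, not a statement about what SSIS does.

```python
def tokens(record: str, delimiter: str = "^") -> list:
    """Strtok-style split: empty fields are not counted."""
    return [piece for piece in record.split(delimiter) if piece]


def penultimate(record: str):
    """Model of TOKEN(record, "^", TOKENCOUNT(record, "^") - 1)."""
    parts = tokens(record)
    index = len(parts) - 1          # 1-based index of the penultimate token
    if index < 1:
        return None                 # placeholder: no valid token index
    return parts[index - 1]


samples = [
    "1^Apple^0001^01/01/2010^Anteater^A1",
    "4^^0004^6/15/2010^Duck^D4",
    "5^^^^Emu^E5",
    "^^^^Pheasant^G7",
    "8^^^^Sparrow^",
]
results = {record: penultimate(record) for record in samples}
```

In this model, the rows with a value in the last segment return the expected animal name, while "8^^^^Sparrow^" returns "8" because the empty last segment is never counted.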

Hope that helps to simplify your expression.

If you don't hear from anyone else, I would recommend logging this issue on the Microsoft Connect website.

Create table and populate scripts:

CREATE TABLE [dbo].[SourceTable](
    [OldImportRecord] [varchar](50) NOT NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[DestinationTable](
    [NewImportRecord] [varchar](50) NOT NULL,
    [CaretCount] [int] NOT NULL
) ON [PRIMARY]
GO

INSERT INTO dbo.SourceTable (OldImportRecord) VALUES
    ('1^Apple^0001^01/01/2010^Anteater^A1'),
    ('2^Banana^0002^03/15/2010^Bear^B2'),
    ('3^Cranberry^0003^4/15/2010^Crow^C3'),
    ('4^^0004^6/15/2010^Duck^D4'),
    ('5^^^^Emu^E5'),
    ('6^^^^Geese^F6'),
    ('^^^^Pheasant^G7'),
    ('8^^^^Sparrow^');
GO

Derived column transformation inside data flow task:

Data in source and destination tables:
