问题描述
我对 SSIS 很陌生.我正在使用 SSIS 2008.我看到有许多工具可以执行与某些 SQL 运算符相同的功能.什么时候应该使用 SSIS 工具和 TSQL 操作符?另外,这里有关于更有效解决方案的任何建议吗?
I am very new to SSIS. I am using SSIS 2008. I see that there are many tools which perform same function as some SQL operators. When should I use the SSIS tools vs. TSQL operators? Also, any advice on a more efficient solution here?
下面是我从 SSIS 导入/导出向导中选择的 tsql 查询.所以我当前的解决方案除了一个数据流源和一个数据流目的地之外不使用任何 SSIS 工具.
Below is the tsql query that I selected from the SSIS Import/Export wizard. So my current solution does not use any SSIS tools other than one Data flow source and one data flow destination.
SELECT
en.uniqueid_c AS enrollment_id,
CONVERT(nvarchar (20),c.clientcode_c) AS client_id, --Legacy CDT# (note this has to be the same value as client_id on the other tables)
CONVERT(nvarchar (20),CASE --Program codes included here for enrollment data (Excludes enrollments with modifiers)
WHEN en.agency_c = 'ADO' THEN 'ADO'
WHEN en.agency_c = 'ADOT' THEN 'ADO'
WHEN en.agency_c in ('MRDD/IHS','MRDD/PSH','MRDD/REP','MRDD/RTC') THEN 'CHOICES'
WHEN en.agency_c = 'FPP' THEN 'CM'
WHEN en.agency_c = 'CMGT' THEN 'CM'
WHEN en.agency_c = 'EDU' THEN 'EDU'
WHEN en.agency_c = 'COM' THEN 'GH'
WHEN en.agency_c = 'CSUP' THEN 'INT'
WHEN en.agency_c = 'IHS' THEN 'INT'
WHEN en.agency_c = 'IHST' THEN 'INT'
WHEN en.agency_c = 'MST' THEN 'MST'
WHEN en.agency_c = 'OMHS' THEN 'OMHS'
WHEN en.agency_c = 'ORTC' THEN 'RESA'
WHEN en.agency_c = 'MRTC' THEN 'RESA'
WHEN en.agency_c = 'RTC' THEN 'RESA'
WHEN en.agency_c = 'RFC' THEN 'RFC'
WHEN en.agency_c in ('SCCR','SCMN','SCRP') THEN 'SCS'
WHEN en.agency_c = 'SUB' THEN 'SUB' --uncertain about this one KMH - 06/23/10
WHEN en.agency_c = 'STFC' THEN 'TFC'
WHEN en.agency_c = 'MTFC' THEN 'TFC'
WHEN en.agency_c = 'TFC' THEN 'TFC'
WHEN en.agency_c = 'TL' THEN 'TL'
WHEN en.agency_c = 'TLT' THEN 'TL'
ELSE en.agency_c
END) AS program_code,
-------------------------------------------------------------------------------------------------------------------------------------------
--2nd Program_code entry handles program_modifier_code.
--The codes need to be grouped and cased out to match the Evolv codes. --This was fixed.
-- NOTE!!!! The codes below will need to be replaced with the finance modifiers just for TN. --per deneen and diane.
UPPER(CONVERT(nvarchar(20), CASE --PROGRAM MODIFIERS --This is pulled into the program_code in the 2nd run. These should exclude non-modifiers.
WHEN en.agency_c in ('ADO','ADOT','IHST','TLT') THEN 'TRANS'
WHEN en.agency_c in ('CHOM','CGMT','COM','CSUP','EDU','IHS','LHS','MRTC','MST','MTFC','RTC', --use this to exclude recs
'SCCR','SCRP','SUB','TFC','TL','ZADMIN','ZAWOL','ZDET','ZHOSP') THEN en.agency_c
WHEN en.agency_c = 'FPP' THEN 'FPP'
WHEN en.agency_c = 'RFC' THEN 'RFC'
WHEN en.agency_c = 'TFC' THEN 'TFC'
WHEN en.agency_c = 'MRDD/IHS' THEN 'MRIHS'
WHEN en.agency_c = 'MRDD/PSH' THEN 'MRPSH'
WHEN en.agency_c = 'MRDD/REP' THEN 'MRREP'
WHEN en.agency_c = 'MRDD/RTC' THEN 'MRRTC'
WHEN en.agency_c = 'PD40' THEN 'PD40'
WHEN en.agency_c = 'PDET' THEN 'PDET'
WHEN en.agency_c = 'PINT' THEN 'PINT'
WHEN en.agency_c = 'PLV4' THEN 'PLV4'
WHEN en.agency_c = 'PWIL' THEN 'PWIL'
WHEN en.agency_c = 'PYDC' THEN 'PYDC'
WHEN en.agency_c = 'SCMN' THEN 'SCMN'
WHEN en.agency_c = 'STFC' THEN 'STFC'
ELSE en.agency_c
END)) AS program_modifier_code,
-------------------------------------------------------------------------------------------------------------------------------------------
/*
Group Homes and Inner Harbour locations were added on 7/26/10 - KMH
*/
CONVERT(nvarchar(20),
CASE
WHEN en.location_c = 'ANNI' THEN 'AL-ANNI'
WHEN en.location_c = 'ASHE' THEN 'NC-ASHE'
WHEN en.location_c = 'ATL' THEN 'GA-ATL'
WHEN en.location_c = 'RMBT' THEN 'RTC-TN-BC'
WHEN en.location_c = 'BIL' THEN 'MS-BIL'
WHEN en.location_c = 'BIRM' THEN 'AL-BIRM'
WHEN en.location_c = 'BOST' THEN 'MA-BOST'
WHEN en.location_c = 'CIRT' THEN 'RTC-TN-CIRT'
WHEN en.location_c = 'CHAR' THEN 'NC-CHAR'
WHEN en.location_c = 'CHAT' THEN 'TN-CHAT'
WHEN en.location_c = 'CLAR' THEN 'TN-CHAR'
WHEN en.location_c = 'COL' THEN 'TN-COL'
WHEN en.location_c = 'CMS' THEN 'MS-COL'
WHEN en.location_c = 'CCRD' THEN 'NC-CCRD'
WHEN en.location_c = 'COOK' THEN 'TN-COOK'
WHEN en.location_c = 'DAL' THEN 'TX-DAL'
WHEN en.location_c = 'RDV' THEN 'RTC-TN-DV'
WHEN en.location_c = 'DKSN' THEN 'TN-DKSN'
WHEN en.location_c = 'RDW' THEN 'RTC-TN-DW'
WHEN en.location_c = 'DOTH' THEN 'AL-DOTH'
WHEN en.location_c = 'DUR' THEN 'NC-DURH'
WHEN en.location_c = 'DYER' THEN 'TN-DYER'
WHEN en.location_c = 'FAYE' THEN 'NC-FAYE'
WHEN en.location_c = 'GCRT' THEN 'RTC-TN-GCRT'
WHEN en.location_c = 'GRNB' THEN 'NC-GRNB'
WHEN en.location_c = 'GRNV' THEN 'NC-GRNV'
WHEN en.location_c = 'HMS' THEN 'MS-HMS'
WHEN en.location_c = 'DMS' THEN 'MD-DMS'
WHEN en.location_c = 'HICK' THEN 'NC-HICK'
WHEN en.location_c = 'HILL' THEN 'NC-HILL'
WHEN en.location_c = 'HUNT' THEN 'AL-HUNT'
WHEN en.location_c = 'INNH' THEN 'RTC-GA-INNH'
WHEN en.location_c = 'JMS' THEN 'MS-JMS'
WHEN en.location_c = 'JTN' THEN 'TN-JTN'
WHEN en.location_c = 'JCTN' THEN 'TN-JCTN'
WHEN en.location_c = 'KNOX' THEN 'TN-KNOX'
WHEN en.location_c = 'LAKE' THEN 'FL-LAKE'
WHEN en.location_c = 'LAWR' THEN 'MA-LAWR'
WHEN en.location_c = 'MANC' THEN 'NH-MANC'
WHEN en.location_c = 'MCB' THEN 'MS-MCC'
WHEN en.location_c = 'MEM' THEN 'TN-MEM'
WHEN en.location_c = 'MMS' THEN 'MS-MMS'
WHEN en.location_c = 'MIAM' THEN 'FL-MIAM'
WHEN en.location_c = 'MIDM' THEN 'TN-MIDM'
WHEN en.location_c = 'MOBI' THEN 'AL-MOBI'
WHEN en.location_c = 'MONT' THEN 'AL-MONT'
WHEN en.location_c = 'MRSN' THEN 'TN-MRSN'
WHEN en.location_c = 'NASH' THEN 'TN-NASH'
WHEN en.location_c = 'OCAL' THEN 'FL-OCAL'
WHEN en.location_c = 'PAR' THEN 'TN-PAR'
WHEN en.location_c = 'PINE' THEN 'NC-PINE'
WHEN en.location_c = 'ROAN' THEN 'VA-ROAN'
WHEN en.location_c = 'SPRG' THEN 'MA-SPRI/HOLY'
WHEN en.location_c = 'PETE' THEN 'FL-STPET'
WHEN en.location_c = 'TAMP' THEN 'FL-TAMP'
WHEN en.location_c = 'TUP' THEN 'MS-TUP'
WHEN en.location_c = 'WDC' THEN 'DC-WDC'
WHEN en.location_c = 'WILM' THEN 'NC-WILM'
WHEN en.location_c = 'WBRN' THEN 'MA-WBRN'
WHEN en.location_c = 'WORC' THEN 'MA-WORC'
WHEN en.location_c = 'GM' THEN 'TN-GM'
WHEN en.location_c = 'GN' THEN 'TN-GN'
ELSE en.location_c
END)
as service_facility_code,
en.startdate_d AS start_date,
en.enddate_d AS end_date,
c.refdate_d AS referral_date,
ep.enddate_d AS overall_discharge_date, --Episode end date
CONVERT(nvarchar(20),c.altclientcode_vc) AS org_id,-- TNKIDS#
UPPER(CONVERT(nvarchar(50), CASE
WHEN en.enddate_d = ep.enddate_d THEN ep.accountnumber_vc
WHEN en.enddate_d < ep.enddate_d THEN 'TWA'
END)) AS discharged_to_type,
UPPER(CONVERT (nvarchar(20), CASE
WHEN ep.accountnumber_vc in ('DORM','INDEP/SUP','INDEP/SELF','INDEP/NR','INDEP/FR') THEN 07
WHEN ep.accountnumber_vc in ('JAIL','DET') THEN 01
WHEN ep.accountnumber_vc in ('BIOL') THEN 02
WHEN ep.accountnumber_vc in ('ADOPT/DCS','ADOPT/PAR','ADOPT/YV') THEN 06
WHEN ep.accountnumber_vc in ('REL') THEN 03
WHEN ep.accountnumber_vc in ('PSYCH','EMER','RTC') THEN 04
ELSE 99
END)) AS discharged_to_type_code,
CONVERT(nvarchar(300),'cd.enrollments') AS original_table_name,
CONVERT(nvarchar (400), en.alerts_vc) AS remarks,
CONVERT(varchar(50), CASE
WHEN en.disreason_c = 'ADMI' THEN 'Administrative'
WHEN en.disreason_c = 'AMA' THEN 'Against Medical Advice'
WHEN en.disreason_c = 'AWOL' THEN 'Absent Without Leave'
WHEN en.disreason_c = 'DCSD' THEN 'Deceased'
WHEN en.disreason_c = 'JC' THEN 'Juvenille Court'
WHEN en.disreason_c = 'NP' THEN 'No Progress'
WHEN en.disreason_c = 'TMED' THEN 'Transfer to Medical Treatment Facility'
WHEN en.disreason_c = 'TPSY' THEN 'Transfer to Inpatient Psychiatric Facility'
WHEN en.disreason_c = 'TW' THEN 'Transfer within Agency'
WHEN en.disreason_c = 'WMA' THEN 'With Medical Advice'
ELSE 'Other'
END)AS outcome,
CONVERT(varchar(5), CASE
WHEN en.disreason_c in ('ADMI','AMA','AWOL','NP') THEN 'CBT'
WHEN en.disreason_c in ('DCSD','WMA') THEN 'DLR'
WHEN en.disreason_c in ('JC') THEN 'RSF'
WHEN en.disreason_c in ('TMED','TPSY') THEN 'DMR'
WHEN en.disreason_c in ('TW') THEN 'RPA'
ELSE 'CBT'
END) AS outcome_code,
--Populate service_facility_unit table and add case statement for loading CDT program_c into client_enrollment room_number 7/27/10 KMH
UPPER(CONVERT(varchar(10), CASE
WHEN en.program_c = 'BT1L' THEN 'BC1L'
WHEN en.program_c = 'BT1R' THEN 'BC1R'
WHEN en.program_c = 'BT2L' THEN 'BC2L'
WHEN en.program_c = 'BT2R' THEN 'BC2R'
WHEN en.program_c = 'BT3' THEN 'BC3'
WHEN en.program_c = 'BT3L' THEN 'BC3L'
WHEN en.program_c = 'BT3R' THEN 'BC3R'
WHEN en.program_c = 'BT4L' THEN 'BC4L'
WHEN en.program_c = 'BT4R' THEN 'BC4R'
WHEN en.program_c = 'BT5' THEN 'BC5'
WHEN en.program_c = 'BT6' THEN 'BC6'
WHEN en.program_c = 'CRT1' and en.location_c = 'CIRT' THEN 'BCRT1'
WHEN en.program_c = 'CRT2' and en.location_c = 'CIRT' THEN 'BCRT2'
WHEN en.program_c = 'CRT3' and en.location_c = 'CIRT' THEN 'BCRT3'
WHEN en.program_c = 'CRT4' and en.location_c = 'CIRT' THEN 'BCRT4'
WHEN en.program_c = 'CRT1' and en.location_c = 'GCRT' THEN 'GCRT1'
WHEN en.program_c = 'CRT2' and en.location_c = 'GCRT' THEN 'GCRT2'
WHEN en.program_c = 'CRT3' and en.location_c = 'GCRT' THEN 'GCRT3'
WHEN en.program_c = 'CRT4' and en.location_c = 'GCRT' THEN 'GCRT4'
WHEN en.program_c = 'DVC' THEN 'DV1'
WHEN en.program_c = 'DVM' THEN 'DV2'
WHEN en.program_c = 'DVN' THEN 'DV3'
WHEN en.program_c = 'DVP' THEN 'DV4'
WHEN en.program_c in ('DW1','DW2','DW3','DW4','DW5','DW6','DW7','DW8') and en.location_c = 'RDW' THEN en.program_c
WHEN en.program_c in ('IH01','IH02','IH03','IH04','IH05','IH06','IH07') and en.location_c = 'INNH' THEN 'IH3'
WHEN en.program_c = 'IH08' and en.location_c = 'INNH' THEN 'IH1'
WHEN en.program_c = 'IH09' and en.location_c = 'INNH' THEN 'IH2'
ELSE 'NA'
END)) as room_number
FROM
ar.client c
INNER JOIN cd.enrollments en ON (c.uniqueid_c = en.clientid_c)
INNER JOIN cd.episode ep ON (ep.uniqueid_c = en.episodeid_c and ep.clientid_c = c.uniqueid_c)
WHERE
(ep.enddate_d is NULL OR ep.enddate_d >= getdate()-730) and
en.location_c in (select code from dbo.yv_LKUP_OfficeLocation where state in ('TX', 'FL'))
order by 2
推荐答案
在我等待需求澄清时的通用 SSIS 建议.
General purpose SSIS advice while I wait for clarification of needs.
人们经常倾向于使用开箱即用的转换,因为它看起来是正确的做法.在下拉列表中选择表,添加排序,添加另一个数据源,也排序,合并连接,可能是聚合.
People are often tempted to use the out of the box transformations as it looks like the right thing to do. Select the table in the drop down list, add a sort, add another data source, sort that too, merge join, maybe an aggregate.
当问题域很小时,比如几万到几十万,处理上的差异可以忽略不计.如果一个包在 2 分钟内运行而不是 1 分钟,或者在处理过程中消耗 80% 的服务器内存 vs 40%,人们可能不会注意到.
When the problem domain is small, say tens to hundreds of thousands, the difference in processing is negligible. If a package runs in 2 minutes instead of 1 or consumes 80% of server memory vs 40% during processing, people may not notice.
然而,当数据量达到临界点时,糟糕的包装设计决定会吃掉你的午餐.
When the data volume hits the tipping point though, poor package design decisions are going to eat your lunch.
当你的源RDBMS有排序数据的请求时,数据库中可能有聚集索引之类的,可以节省实际排序数据的时间.当 SSIS 收到对已排序数据的请求时,您将为该操作支付多次费用.
When your source RDBMS has a request for sorted data, there may be a clustered index or something in the database that can save the time of actually sorting the data. When SSIS gets a request for sorted data, you're going to pay many times over for that operations.
SSIS 中的排序是一种完全阻塞的异步操作.这意味着流经该点的所有数据都必须到达该转换,在将其发送到下游之前对其进行操作.有 bajillion 行通过或非常慢的源,当它遇到这些操作之一时,您会真正注意到它.也许您会说,我可以等待,因为我确实需要对数据进行排序,但时间并不是您花费的唯一资源.由于异步转换需要将数据从一个缓冲区复制到另一个缓冲区,因此您的内存需求也增加了一倍.
A sort in SSIS is a fully-blocking, asynchronous operation. That means all the data flowing through that point must arrive at that transformation, be operated on before it can be sent downstream. Have a bajillion rows coming through or a really slow source, you'll really notice it when it hits one of these operations. Maybe you say, I can wait because I really need the data sorted but time's not the only resource you're spending. You are also doubling your memory requirement as asynchronous transformations require copying data from one buffer to another.
也许您仍然接受为便于使用 OOB 项目而付出的时间和内存使用成本,但您可能还没有完成支付.您的服务器有 32GB 的内存,SSIS 可以使用它们.每行花费 1000 字节,并且您有 1600 万行数据流经您的数据流.它命中排序,数据开始堆积.当最后一行到达时,您已经为原始数据消耗了 16GB 的内存.排序操作开始排序,它将 16GB 复制到另一个 16GB 的内存,哎呀,SSIS 内存不足.您现在支付临时文件存储的第三个价格.当执行引擎处于内存压力下时,它最终会开始分页到磁盘.一旦发生这种情况,如果您关心性能,游戏就结束了,但您的痛苦可能并非如此.如果您尚未为每个数据流设置 BlobTempStoragePath 值,则该文件将被写入默认的临时存储位置,可能是 C:\something 或其他.您的系统管理员切出一个非常精简的 C 分区,因为只有操作系统在那里运行,因此突然将 16GB 交换文件写入该驱动器会消耗所有可用空间,然后操作系统会变得不满意,程序包失败并且指指点点开始.
Maybe you still accept the costs time and memory usage for the ease of using an OOB item but you may not be finished paying. Your server has 32GB of memory and SSIS gets to use them all. Each row costs a thousand bytes and you have 16M rows of data flowing through your data flow. It hits the sort and data starts piling up. Once the last row arrives, you've consumed 16GB of memory for the original data. The sort operations starts to sort and it copies 16GB to another 16GB of memory and oops, SSIS is out of memory. You now pay the third price of temporary file storage. When the execution engine is under memory pressure, it will eventually start paging to disk. Once that happens, the game is over if you care about performance but your suffering may not be so. If you haven't set the BlobTempStoragePath value for each data flow, that file's going to be written to the default temporary storage location which is probably C:\something or other. Your systems admins sliced out a very lean C partition as only the OS goes on there so a 16GB swap file suddenly being written to that drive consumes all the available space and then the OS is going to get unhappy, the package fails and the finger pointing begins.
尽可能在源系统中做所有事情.上述场景适用于排序,但该课程适用于所有共享"运算符.
As much as possible, do everything in the source system. The above scenario is for Sorts but the lesson is applicable across all the "shared" operators.
至于如何清理查询,这些映射会让我发疯.您是否有机会创建 N 个查找表(或内联表值函数)来提供存储值和显示值之间的映射?然后,您可以抽象出所有案例逻辑.
As to how you can clean up your query, those mappings would drive me insane. Any chance you can create N lookup tables (or inline table valued function) to provide a mapping between the stored value and the presented value? You could then abstract away all the case logic.
- 阻止,半阻塞和非阻塞组件
- SSIS 卡玛经:爱 ETL 指南 <- 我关于 SSIS 的演讲
- Blocking, Semi-blocking and non-blocking components
- Kama Sutra of SSIS: A guide to loving ETL <- my talk on SSIS
最后,这篇文章中的数字惊人地依赖于硬件和工作负载
Finally, the numbers in this post are amazingly dependent on hardware and workloads
这篇关于如何创建 SSIS 导入原始文件包?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!