This article looks at why rows with duplicate CUST_ID values are not consolidated in R when the date filter spans multiple months, and how to fix it.

Problem description

I am using the following code to summarize my data by a column:

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/NAME/Documents/Raw Data/"

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "C:/Users/NAME/Documents/YTD Master/"
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2017-01-01"), as.Date("2017-01-31"))

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
date_filter <- TRUE


##########
## CODE ##
##########

starttime <- Sys.time()
mastertable <- NULL

for (j in 1:length(in_subfolders)) {
  subfolder <- in_subfolders[j]
  sub_directory <- paste0(in_directory, subfolder, "/")

  ## IMPORT DATA
  in_filenames <- dir(sub_directory, pattern =".txt")

  for (i in 1:length(in_filenames)) {

    # Default value provided for when fast filtering is disabled.
    read_this_file <- TRUE

    # To fast filter the data, we choose to include or exclude an entire file based on the date of its first line.
    # WARNING: This is only a valid method if filtering by entire months, since that is the amount of data housed in each file.
    if (date_filter) {
      temptable <- fread(paste0(sub_directory, in_filenames[i]), colClasses=c(CUSTOMER_TIER = "character"),
                         na.strings = "", nrows = 1)
      temptable[, INVOICE_DT := as.Date(INVOICE_DT)]

      # If date matches, set read flag to TRUE.  If date does not match, set read flag to FALSE.
      read_this_file <- temptable[, INVOICE_DT] %within% date_range
    }


    if (read_this_file) {
      print(Sys.time()-starttime)
      print(paste0("Reading in ", in_filenames[i]))
      temptable <- fread(paste0(sub_directory, in_filenames[i]), colClasses = c(CUSTOMER_TIER = "character"),
                         na.strings = "")


      temptable <- temptable[,lapply(.SD, sum), by = .(CUST_ID),
                                         .SDcols = c("Ext Sale")]

      # Combine into full list
      mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)
      # Release unneeded memory
      rm(temptable)

    }

  }

}

# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)

print(Sys.time()-starttime)

The output I receive after running the above script for the month of January is shown below, and it is the output I expect.

CUST_ID Ext Sale
AK0010001   209.97
CO0020001   1540.3

The problem arises when I use multiple months. Below is the output I receive when I run January through February with date_range <- interval(as.Date("2017-01-01"), as.Date("2017-02-28")):

CUST_ID Ext Sale
AK0010001   209.97
AK0010001   217.833
CO0020001   1540.3
CO0010001   -179.765

As you can see in the output above the CUST_ID is no longer consolidating.

Does anyone know why this would be happening?

Below I have provided some data to reproduce what I am working with. Just save the data into 4 separate text files, placed in folders as laid out in my code.

I have 2 separate folders saved as "AA-CA" and "CB-HZ"

File 1 saved as "AA-CA 2017-01.txt"

INVOICE_DT,BRANCH_CODE,INVOICE_NO,INV_SEQ_NO,INV_ITEM_ID,ITEM_DESCR,STD_ITEM,PRIVATE_LABEL,CATEGORY_PATH1,CATEGORY_PATH2,CUST_ID,CUSTOMER_TIER,IS_VENDING,SALE_PRICE,TOTAL_COST,POS_COST,CE100,CE110,CE120,CE200,CORP_PRICE,QTY_SOLD,PACKSLIP_WHSL,PRICING_GROUP,PGG_MIN_PRICE,PGY_MIN_PRICE,PGR_MIN_PRICE,Ext Sale,Ext Total Cost
2017-01-27,AK001,AK0016997,4,12772-00079,"3.75"""""""" 4.12"""""""" HOSE OD",N,N,08.5-Fleet & Automotive,01.6-DOT Hose & Tubing,AK0010001,Tier 3,No,42.74,22.438335,22.438335,21.37,,,0,,3,,PGR,168.2875125,134.63001,112.191675,128.22,67.315005
2017-01-27,AK001,AK0016997,3,12772-00022,"2.5"""""""" 2.87"""""""" HOSE OD C",N,N,08-Hydraulics & Pneumatics,02-Hose and Hose Reels,AK0010001,Tier 3,No,27.25,14.143396,14.143396,13.47,,,0,,3,,PGR,106.07547,84.860376,70.71698,81.75,42.430188

File 2 saved as "AA-CA 2017-02.txt"

INVOICE_DT,BRANCH_CODE,INVOICE_NO,INV_SEQ_NO,INV_ITEM_ID,ITEM_DESCR,STD_ITEM,PRIVATE_LABEL,CATEGORY_PATH1,CATEGORY_PATH2,CUST_ID,CUSTOMER_TIER,IS_VENDING,SALE_PRICE,TOTAL_COST,POS_COST,CE100,CE110,CE120,CE200,CORP_PRICE,QTY_SOLD,PACKSLIP_WHSL,PRICING_GROUP,PGG_MIN_PRICE,PGY_MIN_PRICE,PGR_MIN_PRICE,Ext Sale,Ext Total Cost
2017-02-28,AK001,AK0017107,1,12772-00307,3-WAY MALE HOUSING,N,N,09-Electrical,05.5-Terminals and Wire Connectors,AK0010001,Tier 3,No,95.21,74.591453,74.591453,71.04,,,0,,1,,PGG,0,0,0,95.21,74.591453
2017-02-28,AK001,AK0017105,3,99523968,PC58570 1/2 PRS BALL,Y,N,,,AK0010001,Tier 3,No,24.5246,12.356039,12.356039,11.767743,,,0,,5,,PGG,0,0,0,122.623,61.780195

File 3 saved as "CB-HZ 2017-01.txt"

INVOICE_DT,BRANCH_CODE,INVOICE_NO,INV_SEQ_NO,INV_ITEM_ID,ITEM_DESCR,STD_ITEM,PRIVATE_LABEL,CATEGORY_PATH1,CATEGORY_PATH2,CUST_ID,CUSTOMER_TIER,IS_VENDING,SALE_PRICE,TOTAL_COST,POS_COST,CE100,CE110,CE120,CE200,CORP_PRICE,QTY_SOLD,PACKSLIP_WHSL,PRICING_GROUP,PGG_MIN_PRICE,PGY_MIN_PRICE,PGR_MIN_PRICE,Ext Sale,Ext Total Cost
2017-01-31,CO002,CO0023603,19,13117-00095,8-32X5/16 BHSCS MAG,N,N,18-Work Order Parts,Finished Products,CO0020001,Tier 3,No,0.1858,0.037528,0.037528,0.01833,,,0,,6000,,PGG,0,0,0,1114.8,225.168
2017-01-31,CO002,CO0023603,20,13117-00186,"#8-16X3/4"""""""" 6-LOBE PA",N,N,01-Fasteners,03-Screws,CO0020001,Tier 3,No,0.0851,0.029652,0.029652,,,,0,,5000,,PGG,0,0,0,425.5,148.26

File 4 saved as "CB-HZ 2017-02.txt"

INVOICE_DT,BRANCH_CODE,INVOICE_NO,INV_SEQ_NO,INV_ITEM_ID,ITEM_DESCR,STD_ITEM,PRIVATE_LABEL,CATEGORY_PATH1,CATEGORY_PATH2,CUST_ID,CUSTOMER_TIER,IS_VENDING,SALE_PRICE,TOTAL_COST,POS_COST,CE100,CE110,CE120,CE200,CORP_PRICE,QTY_SOLD,PACKSLIP_WHSL,PRICING_GROUP,PGG_MIN_PRICE,PGY_MIN_PRICE,PGR_MIN_PRICE,Ext Sale,Ext Total Cost
2017-02-03,CO001,CO0019017,1,MN2550000A20000,M6-1.0 HEX NUT A-2,Y,N,01-Fasteners,04-Nuts,CO0010001,NA,No,0.0313,0.00767,0.00767,0.006215,0.000593,,0.001241,,-50,0.1058,,,,,-1.565,-0.3835
2017-02-16,CO001,CO0019018,1,11516769,RS37518BlkRndSpacer,Y,N,01.5-Hardware,Electronic Hardware,CO0010001,NA,No,0.0396,0.011245,0.011245,0.01071,,,0,,-4500,0.0543,,,,,-178.2,-50.6025

I have the data saved in 2 separate folders.

Solution

The OP is wondering why the result is not consolidated for CUST_ID if more than one month of data is processed.

The reason is that the monthly files are read in and aggregated one by one but a final aggregation step is needed to consolidate over all months.
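
The effect can be reproduced without any files. The following is a minimal sketch, assuming two in-memory tables (the names jan, feb, stacked and consolidated are hypothetical) that stand in for the per-file totals shown in the question:

```r
library(data.table)

# Hypothetical stand-ins for the per-file aggregates of two monthly files
jan <- data.table(CUST_ID = c("AK0010001", "CO0020001"),
                  `Ext Sale` = c(209.97, 1540.3))
feb <- data.table(CUST_ID = c("AK0010001", "CO0010001"),
                  `Ext Sale` = c(217.833, -179.765))

# Stacking the per-file results keeps one row per customer per file,
# so AK0010001 appears twice
stacked <- rbindlist(list(jan, feb), use.names = TRUE)

# A second aggregation over the stacked rows consolidates each customer
consolidated <- stacked[, lapply(.SD, sum), by = .(CUST_ID),
                        .SDcols = "Ext Sale"]
```

Here stacked still has four rows, while consolidated has one row per CUST_ID.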

The code below is a simplified replacement of the double for loops. I have left out the code for testing for "fast filtering".

The first part creates a list of files to be processed. The second part does the processing.

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(in_filenames, function(fn) {
    # code for "fast filter" test goes here
    message("Reading in ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate
    temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
  })
)[
  # THIS IS THE MISSING STEP:
  # second aggregation for overall totals
  , lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
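mastertable

Processing file: Raw Data/AA-CA/AA-CA 2017-01.txt
Processing file: Raw Data/AA-CA/AA-CA 2017-02.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-01.txt
Processing file: Raw Data/CB-HZ/CB-HZ 2017-02.txt

     CUST_ID Ext Sale
1: AK0010001  427.803
2: CO0020001 1540.300
3: CO0010001 -179.765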

Note that chaining of data.table expressions is used here.
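
For readers unfamiliar with chaining, here is a small sketch on made-up data (the table dt and its sales column are assumptions for illustration only). It compares a two-step version using an intermediate variable with the chained form, where the second pair of brackets operates on the result of the first:

```r
library(data.table)

# Made-up data for illustration
dt <- data.table(CUST_ID = c("A", "A", "B"), sales = c(1, 2, 5))

# Two separate steps with an intermediate variable
tmp <- dt[, .(sales = sum(sales)), by = CUST_ID]
two_step <- tmp[sales > 3]

# The same result with chaining: DT[...][...]
chained <- dt[, .(sales = sum(sales)), by = CUST_ID][sales > 3]
```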


Edit 1:

At the OP's request, here is the complete code (except for the "fast filtering" part). Some additional lines were modified as well. They are marked with ### MODIFIED.

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "Raw Data"   ### MODIFIED

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "YTD Master"   ### MODIFIED
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2017-01-01"), as.Date("2017-02-28"))   ### MODIFIED

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
date_filter <- TRUE


##########
## CODE ##
##########

starttime <- Sys.time()

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(in_filenames, function(fn) {
    # code for fast filter test goes here
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate by month
    temptable[, lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]
  })
)[
  # second aggregation overall
  , lapply(.SD, sum), by = .(CUST_ID), .SDcols = c("Ext Sale")]

# Save Final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))   ### MODIFIED
# rm(mastertable)   ### MODIFIED

print(Sys.time()-starttime)


Edit 2

The OP has asked to include the "fast filter" code which I had omitted for brevity.

However, I have a different approach. Instead of reading the first line of each file to check whether INVOICE_DT is within the given date_range, my approach filters the file names. The file names contain the year-month in ISO 8601 format.

So, a vector of allowed year-month strings is constructed from the given date_range. Only those file names which contain one of the allowed year-month strings are selected for further processing.
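
The idea can be sketched in base R as follows. The file names are hypothetical examples following the naming scheme from the question. Snapping the start of the sequence to the first day of its month is an addition of this sketch, made so that a range starting late in one month cannot skip the following month; it is not part of the code further down.

```r
# Hypothetical inputs for illustration
date_range <- c("2017-01-01", "2017-02-14")
in_filenames <- c("Raw Data/AA-CA/AA-CA 2017-01.txt",
                  "Raw Data/AA-CA/AA-CA 2017-02.txt",
                  "Raw Data/AA-CA/AA-CA 2017-03.txt")

# One date per month, starting at the first day of the start month
month_starts <- seq(as.Date(format(as.Date(date_range[1]), "%Y-%m-01")),
                    as.Date(date_range[2]), by = "1 month")
allowed_months <- format(month_starts, "%Y-%m")

# Keep file names whose base name contains one of the allowed year-month strings
pattern <- paste(allowed_months, collapse = "|")
selected <- in_filenames[grepl(pattern, basename(in_filenames))]
```

With these inputs, allowed_months holds "2017-01" and "2017-02", so the March file is dropped.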

However, selecting the proper files is only the first step. As the date range may start or end right in the middle of a month, we also need to filter the rows of each processed file. This step is missing from the OP's code.
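
A minimal sketch of the row filter, assuming a small in-memory table (temptable here is a hypothetical stand-in that mimics the two CO0010001 rows of the February file) and explicit Date comparisons instead of %between%:

```r
library(data.table)

date_range <- as.Date(c("2017-01-01", "2017-02-14"))

# Hypothetical rows mimicking the CO0010001 lines of the February file
temptable <- data.table(
  INVOICE_DT = as.Date(c("2017-02-03", "2017-02-16")),
  CUST_ID    = c("CO0010001", "CO0010001"),
  `Ext Sale` = c(-1.565, -178.2)
)

# Keep only rows inside the date range, then aggregate per customer
filtered <- temptable[INVOICE_DT >= date_range[1] & INVOICE_DT <= date_range[2],
                      lapply(.SD, sum), by = .(CUST_ID), .SDcols = "Ext Sale"]
```

Only the invoice dated 2017-02-03 falls inside the range, so the row dated 2017-02-16 does not contribute to the total.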

library(data.table, warn.conflicts = FALSE)
library(magrittr)   ### MODIFIED
# library(lubridate, warn.conflicts = FALSE)   ### MODIFIED

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "Raw Data"   ### MODIFIED

# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ")

# Set location for output
out_directory <- "YTD Master"   ### MODIFIED
out_filename <- "OUTPUT.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- c("2017-01-01", "2017-02-14")   ### MODIFIED

# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
# date_filter <- TRUE   ### MODIFIED


##########
## CODE ##
##########

starttime <- Sys.time()

# create vector of filenames to be processed
in_filenames <- list.files(
  file.path(in_directory, in_subfolders),
  pattern = "\\.txt$",
  full.names = TRUE,
  recursive = TRUE)

# filter filenames: keep only those containing an allowed year-month string
selected_in_filenames <-
  seq(as.Date(date_range[1]),
      as.Date(date_range[2]), by = "1 month") %>%
  format("%Y-%m") %>%
  lapply(function(x) stringr::str_subset(in_filenames, x)) %>%
  unlist()

# read and aggregate each file separately
mastertable <- rbindlist(
  lapply(selected_in_filenames, function(fn) {
    message("Processing file: ", fn)
    temptable <- fread(fn,
                       colClasses = c(CUSTOMER_TIER = "character"),
                       na.strings = "")
    # aggregate file but filtered for date_range
    temptable[INVOICE_DT %between% date_range,
              lapply(.SD, sum), by = .(CUST_ID, QTR = quarter(INVOICE_DT)),
              .SDcols = c("Ext Sale")]
  })
)[
  # second aggregation overall
  , lapply(.SD, sum), by = .(CUST_ID, QTR), .SDcols = c("Ext Sale")]

# Save Final table
print("Saving master table")
fwrite(mastertable, file.path(out_directory, out_filename))
# rm(mastertable)   ### MODIFIED

print(Sys.time()-starttime)

mastertable

     CUST_ID QTR Ext Sale
1: AK0010001   1  209.970
2: CO0020001   1 1540.300
3: CO0010001   1   -1.565

Note that date_range <- c("2017-01-01", "2017-02-14") now ends in mid-February.
