Importing text data into R and removing extraneous headers and other unwanted text

Problem description

I have a large text file that contains data from the uniform crime report. Ideally, what I would like to do is only import the data and leave out the other extraneous stuff in the file. The actual data is delimited by spaces and as the data goes onto another "page" the header information repeats itself. I first tried to import the data (and only the data) using the following code and to add my own headers manually:

  data <- read.fwf("2010SHRall.txt",
        c(-4,3,8,2,4,5,6,5,4,3,3,4,4,3,3,4,6,5,3,6,26,3),
        skip=5,
        col.names=c("AGE","AGENCY","G","MO","HOM","INC","SIT","VA","VS","VR","VE","OA","OS","OR","OE","WEAP","REL","CIR","SUB","AGENCYNAME","STATE"),
        strip.white=FALSE)

This works and then at line 51 it quits. I'm definitely a novice R programmer and I tried to Google the answer as well as to search Stack Overflow but I am at a loss for where to go from here. Here is a link to the text file that I am trying to import. Again, I am trying to import the data and remove any rows that have header info or other pieces that are not needed for the complete dataset.

Any help anyone could offer would be greatly appreciated.

Recommended answer

This should work:

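# Read the whole report as raw lines.
text <- readLines('/tmp/2010SHRall.txt')
# A line matching group.start opens a block of data rows;
# a line matching group.end closes it.
group.start <- '^      AGENCY'
group.end <- '(^B)|(^0END OF GROUP)'
data <- character()
inside.group <- FALSE
for (line in text) {
  if (inside.group) {
    if (grepl(group.end, line))
      inside.group <- FALSE        # end marker: stop collecting
    else
      data <- append(data, line)   # data row: keep it
  } else if (grepl(group.start, line)) {
    inside.group <- TRUE           # column-header line: start collecting after it
  }
}
# Parse the kept lines as fixed-width records and store the result.
con <- textConnection(data)
result <- read.fwf(con,
         widths=c(-4,3,8,2,4,5,6,5,4,3,3,4,4,3,3,4,6,5,3,6,26,3),
         header=FALSE,
         col.names=c("AGE","AGENCY","G","MO","HOM","INC","SIT","VA","VS","VR","VE","OA","OS","OR","OE","WEAP","REL","CIR","SUB","AGENCYNAME","STATE"),
         strip.white=TRUE)
close(con)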

It keeps all lines in between lines that match the group.start and group.end regular expressions and discards the rest.
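
To check the filtering logic without the real file, here is a minimal sketch on made-up input. The helper name keep_between, the toy lines, and the toy column widths are all hypothetical and not taken from the actual report layout; only the two regular expressions come from the answer. The helper applies the same keep/discard rule, but marks rows in a preallocated logical vector instead of growing a vector with append().

```r
# Hypothetical helper: keep the lines that fall between a start-marker line
# and the next end-marker line, dropping the marker lines themselves.
keep_between <- function(lines, start_pattern, end_pattern) {
  is_start <- grepl(start_pattern, lines)
  is_end <- grepl(end_pattern, lines)
  keep <- logical(length(lines))   # all FALSE to begin with
  inside <- FALSE
  for (i in seq_along(lines)) {
    if (inside) {
      if (is_end[i]) inside <- FALSE else keep[i] <- TRUE
    } else if (is_start[i]) {
      inside <- TRUE
    }
  }
  lines[keep]
}

# Made-up stand-in for the report: two "pages", each with a repeated header.
toy <- c(
  "1 REPORT TITLE",
  "      AGENCY  G  MO",
  "    AL001  1  01",
  "    AL002  2  02",
  "0END OF GROUP",
  "B PAGE 2",
  "      AGENCY  G  MO",
  "    AL003  3  03",
  "0END OF GROUP"
)

kept <- keep_between(toy, "^      AGENCY", "(^B)|(^0END OF GROUP)")

# Toy widths: skip 4 characters, then fields of 5, 3 and 4 characters.
con <- textConnection(kept)
parsed <- read.fwf(con, widths = c(-4, 5, 3, 4),
                   col.names = c("AGENCY", "G", "MO"),
                   strip.white = TRUE)
close(con)
print(parsed)
```

Only the three data rows survive; the title, the repeated header lines, and the end markers are dropped before read.fwf ever sees them.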
