How to efficiently read the first character from each line of a text file?

Problem description

I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

    x <- c(
      "Afklgjsdf;bosfu09[45y94hn9igf",
      "Basfgsdbsfgn",
      "Cajvw58723895yubjsdw409t809t80",
      "Djakfl09w50968509",
      "E3434t"
    )
    writeLines(x, "test.txt")

I can solve the problem by reading
everything with readLines and using substring to get the first character:

    lines <- readLines("test.txt")
    substring(lines, 1, 1)
    ## [1] "A" "B" "C" "D" "E"

This seems inefficient, though. Is there a way to persuade R to read only the first character of each line, rather than reading everything and discarding the rest?

I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low-level file manipulation (maybe with seek).

Since performance is only relevant for larger files, here's a bigger test file for benchmarking:

    set.seed(2015)
    nch <- sample(1:100, 1e4, replace = TRUE)
    x2 <- vapply(
      nch,
      function(nch) {
        paste0(sample(letters, nch, replace = TRUE), collapse = "")
      },
      character(1)
    )
    writeLines(x2, "bigtest.txt")

Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).

Solution

Edited 01/04/2015 to bring the better solution to the top.

Update 2: Changing the scan() method to run on an open connection, instead of opening and closing the file on every iteration, allows reading line by line and eliminates the looping. The timing improved quite a bit.

    ## scan() on an open connection
    conn <- file("bigtest.txt", "rt")
    substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
    close(conn)

I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
    ## stringi::stri_read_lines()
    library(stringi)
    stri_sub(stri_read_lines("bigtest.txt"), 1, 1)

Here are the timings for these two methods.

    ## timings
    library(microbenchmark)
    microbenchmark(
      scan = {
        conn <- file("bigtest.txt", "rt")
        substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
        close(conn)
      },
      stringi = {
        stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
      }
    )
    # Unit: milliseconds
    #     expr      min       lq     mean   median       uq      max neval
    #     scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646   100
    #  stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421   100

Original [slower] answer:

You could try read.fwf() (fixed-width file), setting the width to 1 to capture the first character on each line.

    read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
    # [1] "A" "B" "C" "D" "E"

Not fully tested, of course, but it works for the test file and is a nice function for getting substrings without having to read the entire file.

Update 1: read.fwf() is not very efficient, since it calls scan() and read.table() internally. We can skip the middlemen and try scan() directly.

    lines <- count.fields("test.txt")   ## length is the number of lines in the file
    skip <- seq_along(lines) - 1        ## set up the 'skip' arg for scan()
    read <- function(n) {
      ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet = TRUE)
      substr(ch, 1, 1)
    }
    vapply(skip, read, character(1L))
    # [1] "A" "B" "C" "D" "E"

    version$platform
    # [1] "x86_64-pc-linux-gnu"
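The question's update mentions treating the file as binary with readBin, but that approach is not shown here. The following is a minimal sketch of the general idea, not Martin Morgan's actual code: read the file as raw bytes, find the newline bytes, and take the byte that follows each one. The function name first_chars_raw is hypothetical, and the sketch assumes that the file fits in memory and that the first character of every line is a single byte (for example ASCII).

```r
# Sketch only: first character of each line, read from raw bytes.
# Assumptions: the file fits in memory, and every line starts with a
# single-byte character (a multi-byte UTF-8 character would be cut apart).
first_chars_raw <- function(path) {
  size <- file.info(path)$size
  if (is.na(size) || size == 0) return(character(0))

  bytes <- readBin(path, what = "raw", n = size)

  # A line starts at byte 1 and at the byte after every "\n" (0x0a).
  nl <- which(bytes == as.raw(0x0a))
  starts <- c(1L, nl + 1L)
  starts <- starts[starts <= length(bytes)]  # ignore the trailing newline

  chars <- rawToChar(bytes[starts], multiple = TRUE)

  # An empty line "starts" with its own line terminator; report it as "".
  chars[chars %in% c("\n", "\r")] <- ""
  chars
}

# Usage with the example data from the question, written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf;bosfu09[45y94hn9igf", "Basfgsdbsfgn",
             "Cajvw58723895yubjsdw409t809t80", "Djakfl09w50968509",
             "E3434t"), tmp)
first_chars_raw(tmp)
# [1] "A" "B" "C" "D" "E"
```

This still reads every byte of the file, in line with the observation that scanning the whole file cannot be avoided. Any saving would come from not building a character vector of full lines, and it would need to be benchmarked against the scan() and stringi versions before drawing conclusions.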