How can I efficiently read the first character from each line of a text file?

Problem description

I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

```r
x <- c(
  "Afklgjsdf;bosfu09[45y94hn9igf",
  "Basfgsdbsfgn",
  "Cajvw58723895yubjsdw409t809t80",
  "Djakfl09w50968509",
  "E3434t"
)
writeLines(x, "test.txt")
```

I can solve the problem by reading everything
with `readLines` and using `substring` to get the first character:

```r
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
```

This seems inefficient though. Is there a way to persuade R to read only the first character of each line, rather than reading whole lines and discarding the rest?

I suspect that there ought to be some incantation using `scan`, but I can't find it. An alternative might be low-level file manipulation (maybe with `seek`).

Since performance is only relevant for larger files, here's a bigger test file for benchmarking:

```r
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
  nch,
  function(nch) {
    paste0(
      sample(letters, nch, replace = TRUE),
      collapse = ""
    )
  },
  character(1)
)
writeLines(x2, "bigtest.txt")
```

Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to `readLines` (Richard Scriven's `stringi::stri_read_lines` solution and Josh O'Brien's `data.table::fread` solution), or from treating the file as binary (Martin Morgan's `readBin` solution).

Solution

01/04/2015 Edited to bring the better solution to the top.

Update 2: Changing the `scan()` method to run on an open connection, instead of opening and closing the file on every iteration, allows reading line by line and eliminates the looping. The timing improved quite a bit.

```r
## scan() on open connection
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
```

I also discovered the `stri_read_lines()` function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
```r
## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
```

Here are the timings for these two methods.

```r
## timings
library(microbenchmark)
microbenchmark(
  scan = {
    conn <- file("bigtest.txt", "rt")
    substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
    close(conn)
  },
  stringi = {
    stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
  }
)
# Unit: milliseconds
#     expr      min       lq     mean   median       uq      max neval
#     scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646   100
#  stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421   100
```

Original [slower] answer:

You could try `read.fwf()` (fixed width file), setting the width to a single 1 to capture the first character on each line.

```r
read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"
```

Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.

Update 1: `read.fwf()` is not very efficient, calling `scan()` and `read.table()` internally. We can skip the middle-men and try `scan()` directly.

```r
lines <- count.fields("test.txt")   ## length is num of lines in file
skip <- seq_along(lines) - 1        ## set up the 'skip' arg for scan()
read <- function(n) {
  ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet = TRUE)
  substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"
version$platform
# [1] "x86_64-pc-linux-gnu"
```
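The question's update mentions treating the file as binary with `readBin`, but no code for that approach appears here. The following is only a rough sketch of the idea, not Martin Morgan's actual solution: the helper name `first_chars_bin` is hypothetical, and it assumes the file fits in memory, has no empty lines, and that every line starts with a single-byte character. It still reads the whole file, which is consistent with the observation that scanning the file cannot be avoided.

```r
## Sketch (assumptions: no empty lines, single-byte first characters).
## first_chars_bin is a hypothetical helper name, not from any package.
first_chars_bin <- function(path) {
  ## Read the whole file as raw bytes in one call.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) == 0L) return(character(0))
  newline <- as.raw(10L)                          # byte value of "\n"
  ## A line starts at byte 1 and at the byte right after every newline.
  starts <- c(1L, which(bytes == newline) + 1L)
  ## Drop the position that follows a trailing newline at end of file.
  starts <- starts[starts <= length(bytes)]
  ## Convert each selected byte to a one-character string.
  rawToChar(bytes[starts], multiple = TRUE)
}

## Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf", "Basfgsdbsfgn", "Cajvw", "Djakfl", "E3434t"), tmp)
first_chars_bin(tmp)
# [1] "A" "B" "C" "D" "E"
```

Whether this beats `stri_read_lines()` on a given machine would need to be checked with `microbenchmark`; no timing is claimed for it here.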