I want to read data from different text, JSON, or CSV files. Which approach should I follow?
I have read these blog posts on the different approaches to file reading: File read and Read 2GB text file with small RAM.
Different approaches:
* Reading a file in chunks
* Reading file chunks concurrently
* Reading the entire file into memory
* Splitting a long string into words
* Scanning word by word
I'm not able to figure out the fastest way of reading a file with a small amount of RAM.
There are basically two different ways to approach parsing a file: document parsing and stream parsing.
Document parsing reads the data from the file and turns it into a big set of objects that you can query, like the HTML DOM in a browser. The advantage is that you have the complete data at your fingertips, which is often simpler. The disadvantage is that you have to store it all in memory.
dom = parse(stuff)
// now do whatever you like with the dom
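In Go, the document-parsing approach to a JSON file could look roughly like this. This is only a minimal sketch; the people.json file and the Person struct are made-up assumptions for illustration.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
)

// Person is a made-up type used only for this illustration.
type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func main() {
    // Read the entire file into memory at once.
    data, err := os.ReadFile("people.json")
    if err != nil {
        log.Fatal(err)
    }

    // Parse the whole document into one queryable structure.
    var people []Person
    if err := json.Unmarshal(data, &people); err != nil {
        log.Fatal(err)
    }

    // The complete data is now available in memory.
    for _, p := range people {
        fmt.Println(p.Name, p.Age)
    }
}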
Stream parsing instead reads a single element at a time and presents it to you for immediate use, then it moves on to the next one.
for element := range stream(stuff) {
...examine one element at a time...
}
The advantage is you don't have to load the whole thing into memory. The disadvantage is you must work with the data as it goes by. This is very useful for searches or anything else that needs to process the data one item at a time.
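For plain text files the same streaming idea applies; a minimal sketch using bufio.Scanner (assuming a file named big.txt) reads one line at a time, so only a small buffer is ever held in memory:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    file, err := os.Open("big.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    // scanner.Split(bufio.ScanWords) would scan word by word instead of line by line.
    for scanner.Scan() {
        fmt.Println(scanner.Text()) // only the current line is in memory
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}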
Fortunately Go provides libraries to handle the common formats for you.
A simple example is handling a CSV file.
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    file, err := os.Open("test.csv")
    if err != nil {
        log.Fatal(err)
    }

    parser := csv.NewReader(file)

    ...
}
We can slurp the whole thing into memory as a big [][]string.
records, err := parser.ReadAll()
if err != nil {
    log.Fatal(err)
}
for _, record := range records {
    fmt.Println(record)
}
Or we can save a bunch of memory and deal with the rows one at a time.
for {
    record, err := parser.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(record)
}
Since every line of a CSV is functionally the same, processing it one row at a time makes the most sense.
JSON and XML are more complex because they are large, nested structures, but they can also be streamed. There's an example of streaming in the encoding/json documentation.
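As a rough sketch of what that looks like (not the exact example from the documentation), a json.Decoder can decode one array element at a time instead of loading the whole document; the people.json file and the Person struct here are assumptions for illustration:
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
)

// Person is a made-up type used only for this illustration.
type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func main() {
    file, err := os.Open("people.json")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    dec := json.NewDecoder(file)

    // Consume the opening '[' of the top-level array.
    if _, err := dec.Token(); err != nil {
        log.Fatal(err)
    }

    // Decode one element at a time while more remain.
    for dec.More() {
        var p Person
        if err := dec.Decode(&p); err != nil {
            log.Fatal(err)
        }
        fmt.Println(p.Name, p.Age)
    }

    // Consume the closing ']'.
    if _, err := dec.Token(); err != nil {
        log.Fatal(err)
    }
}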
What if your code isn't a simple loop? What if you want to take advantage of concurrency? Use a channel and a goroutine to feed the records to the rest of the program concurrently.
records := make(chan []string)

go func() {
    parser := csv.NewReader(file)
    defer close(records)
    for {
        record, err := parser.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        records <- record
    }
}()
Now you can pass records to a function which can process them.
func print_records(records chan []string) {
    for record := range records {
        fmt.Println(record)
    }
}
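Putting the pieces together, a minimal sketch of a main function that wires the goroutine producer to that consumer might look like this (it reuses the imports from the first snippet and the print_records function above):
func main() {
    file, err := os.Open("test.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    records := make(chan []string)

    // Producer: parse the CSV in a goroutine and feed the channel.
    go func() {
        defer close(records)
        parser := csv.NewReader(file)
        for {
            record, err := parser.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            records <- record
        }
    }()

    // Consumer: print each record as it arrives; ranging over the channel
    // ends when the producer closes it.
    print_records(records)
}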