Parsing huge log files in Node.js, line by line

This article covers how to parse huge log files in Node.js by reading them line by line, and should be a useful reference for anyone facing the same problem.

Problem description

I need to do some parsing of large (5-10 GB) log files in Javascript/Node.js (I'm using Cube).

The log lines look something like:

10:00:43.343423 I'm a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".

We need to read each line, do some parsing (e.g. strip out 5, 7 and SUCCESS), then pump this data into Cube (https://github.com/square/cube) using their JS client.
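For illustration, a per-line parse for the sample line above might look like the following sketch; the regex and field names are assumptions, not part of the original question.

// A rough sketch of parsing one log line (assumed format based on the sample above)
function parseLine(line) {
    var match = /^(\d{2}:\d{2}:\d{2}\.\d+).*?(\d+) cats, and (\d+) dogs.*?state "(\w+)"/.exec(line);
    if (!match) return null;
    return {
        time:  match[1],               // raw timestamp string, e.g. "10:00:43.343423"
        cats:  parseInt(match[2], 10), // 5
        dogs:  parseInt(match[3], 10), // 7
        state: match[4]                // "SUCCESS"
    };
}

console.log(parseLine('10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".'));
// { time: '10:00:43.343423', cats: 5, dogs: 7, state: 'SUCCESS' }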

Firstly, what is the canonical way in Node to read in a file, line by line?

It seems to be a fairly common question online:

  • http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js
  • Read a file one line at a time in node.js?

A lot of the answers seem to point to a bunch of third-party modules:

  • https://github.com/nickewing/line-reader
  • https://github.com/jahewson/node-byline
  • https://github.com/pkrumins/node-lazy
  • https://github.com/Gagle/Node-BufferedReader

However, this seems like a fairly basic task - surely there's a simple way within the stdlib to read in a text file, line by line?
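For what it's worth, Node's standard library does ship a line reader: the built-in readline module can wrap a file read stream and emit one line at a time. A minimal sketch, assuming a placeholder file name and a reasonably current Node version:

var fs = require('fs');
var readline = require('readline');

// readline wraps the read stream and emits a 'line' event for each line,
// so the whole file is never held in memory at once
var rl = readline.createInterface({
    input: fs.createReadStream('very-large-file.log'),
    crlfDelay: Infinity // treat \r\n as a single line break
});

rl.on('line', function (line) {
    // handle one line at a time
    console.log('Line:', line);
});

rl.on('close', function () {
    console.log('Done reading file.');
});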

Secondly, I then need to process each line (e.g. convert the timestamp into a Date object, and extract useful fields).

What's the best way to do this, maximising throughput? Is there some way that won't block on either reading in each line, or on sending it to Cube?

Thirdly - I'm guessing using string splits, and the JS equivalent of contains (indexOf != -1?) will be a lot faster than regexes? Has anybody had much experience in parsing massive amounts of text data in Node.js?
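Purely as an illustration of the two styles being compared (this is not a benchmark, and the field layout is assumed from the sample line above), the indexOf/split approach and a regex approach might look like this:

var line = '10:00:43.343423 I\'m a friendly log message. There are 5 cats, and 7 dogs. We are in state "SUCCESS".';

// Style 1: indexOf / split, no regular expressions
function parseWithSplits(line) {
    if (line.indexOf('state "SUCCESS"') === -1) return null; // the "contains" check
    var parts = line.split(' ');
    return { time: parts[0], state: 'SUCCESS' };
}

// Style 2: a single precompiled regex
var LINE_RE = /^(\S+).*state "(\w+)"/;
function parseWithRegex(line) {
    var m = LINE_RE.exec(line);
    return m ? { time: m[1], state: m[2] } : null;
}

console.log(parseWithSplits(line)); // { time: '10:00:43.343423', state: 'SUCCESS' }
console.log(parseWithRegex(line));  // { time: '10:00:43.343423', state: 'SUCCESS' }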

Cheers, Victor

Recommended answer

I searched for a solution to parse very large files (GBs) line by line using a stream. None of the third-party libraries and examples suited my needs, since they either did not process the files line by line (like 1, 2, 3, 4 ...) or read the entire file into memory.

The following solution can parse very large files, line by line, using streams and pipes. For testing I used a 2.1 GB file with 17,000,000 records. RAM usage did not exceed 60 MB.

First, install the event-stream package:

npm install event-stream

Then:

var fs = require('fs')
    , es = require('event-stream');

var lineNr = 0;

// simple helper for logging line count and memory usage
// (the original answer referenced this function without showing it)
function logMemoryUsage(lineNr) {
    console.log(lineNr, Math.round(process.memoryUsage().heapUsed / 1024 / 1024) + ' MB');
}

var s = fs.createReadStream('very-large-file.csv')
    .pipe(es.split())
    .pipe(es.mapSync(function(line){

        // pause the readstream
        s.pause();

        lineNr += 1;

        // process the line here and call s.resume() when ready;
        // logMemoryUsage (above) was just used to watch memory while reading
        logMemoryUsage(lineNr);

        // resume the readstream, possibly from a callback
        s.resume();
    })
    .on('error', function(err){
        console.log('Error while reading file.', err);
    })
    .on('end', function(){
        console.log('Read entire file.');
    })
);

Please let me know how it goes!

That wraps up this article on parsing a huge log file in Node.js by reading it line by line; hopefully the recommended answer above is of some help.
