Problem description
I would like to parse a WARC file using Logstash and feed the result into Elasticsearch, so that I can visualize it with Kibana. I have tried this:
input {
  file {
    path => "/tmp/access_log"
    start_position => "beginning"
  }
}
filter {
  if [path] =~ "access" {
    mutate { replace => { "type" => "apache_access" } }
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
  stdout { codec => rubydebug }
}
This takes an Apache log and displays it. I would like to know how to do the same with a WARC file and visualize it in Kibana.
This is a sample of the WARC file that I would like to input:
WARC/0.17
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/robots.txt
WARC-Date: 2008-04-30T20:48:25Z
WARC-Concurrent-To: <urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>
WARC-Record-ID: <urn:uuid:545709ad-90c5-4c08-9eed-092bdf2e33a7>
Content-Type: text/anvl
Content-Length: 66
via: http://www.archive.org/
hopsFromSeed: P
fetchTimeMs: 47
WARC/0.17
WARC-Type: response
WARC-Target-URI: http://www.archive.org/
WARC-Date: 2008-04-30T20:48:26Z
WARC-Payload-Digest: sha1:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV
WARC-IP-Address: 207.241.229.39
WARC-Record-ID: <urn:uuid:4042c21b-d898-43f0-9c95-b50da2d1aa42>
Content-Type: application/http; msgtype=response
Content-Length: 680
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:25 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g
Last-Modified: Wed, 09 Jan 2008 23:18:29 GMT
ETag: "47ac-16e-4f9e5b40"
Accept-Ranges: bytes
Content-Length: 366
Connection: close
Content-Type: text/html; charset=UTF-8
<html>
<head>
<meta http-equiv="Refresh" content="0;URL=http://www.archive.org/index.php"/>
<script>
document.location="http://www.archive.org/index.php";
</script>
</head>
<body>
<img width="70" height="56" src="http://www.archive.org/images/logoc.jpg"/><br/>
Please visit our website at:
<a href="http://www.archive.org">http://www.archive.org</a>
</body>
</html>
Here is the full sample file: Sample WARC Text in Text File Format
Hope to hear from you soon. I would be glad to get this query resolved.
Recommended answer
This filter keeps only the lines that start with "WARC-Target-URI", "HTTP/1.1" or "Date: ", then extracts information from those lines.
input {
  file {
    path => "/tmp/access_log"
    start_position => "beginning"
  }
}
filter {
  # Drop every line that is not a target URI, an HTTP status line or an HTTP Date header
  if [message] !~ "^WARC-Target-URI" and [message] !~ "^HTTP\/1.1" and [message] !~ "^Date: " {
    drop {}
  }
  # Extract one field per line: date, url or response
  grok {
    match => {
      "message" => ["Date: %{GREEDYDATA:date}", "WARC-Target-URI: %{GREEDYDATA:url}", "HTTP/1.1 %{NUMBER:response}"]
    }
  }
  # For "Wed, 30 Apr 2008 20:48:25 GMT"
  date {
    match => ["date", "EEE, dd MMM YYYY HH:mm:ss ZZZ"]
    target => "date"
    locale => "en"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "webinfo"
  }
}
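To see which lines survive the filter without starting Logstash, the same logic can be approximated in a few lines of Python. This is only a rough sketch, not Logstash itself: the function name filter_line is made up for illustration, the regular expressions are hand-written stand-ins for the grok patterns (for example \d+ instead of %{NUMBER}), and the date is printed in Python's ISO format rather than the one Elasticsearch stores.

```python
import re
from datetime import datetime, timezone

# Prefixes kept by the Logstash conditional; every other line is dropped.
KEEP = re.compile(r"^(WARC-Target-URI|HTTP/1\.1|Date: )")

# Hand-written approximations of the three grok patterns, tried in order.
PATTERNS = [
    re.compile(r"Date: (?P<date>.*)"),
    re.compile(r"WARC-Target-URI: (?P<url>.*)"),
    re.compile(r"HTTP/1\.1 (?P<response>\d+)"),
]


def filter_line(line):
    """Return the extracted fields for one line, or None if it is dropped."""
    if not KEEP.match(line):
        return None
    event = {"message": line}
    for pattern in PATTERNS:
        found = pattern.search(line)
        if found:
            event.update(found.groupdict())
            break  # like grok, stop at the first pattern that matches
    if "date" in event:
        try:
            # "Wed, 30 Apr 2008 20:48:25 GMT" -> timezone-aware UTC datetime
            parsed = datetime.strptime(event["date"], "%a, %d %b %Y %H:%M:%S GMT")
            event["date"] = parsed.replace(tzinfo=timezone.utc).isoformat()
        except ValueError:
            pass  # keep the raw string; Logstash would tag the event instead
    return event


sample = [
    "WARC/0.17",
    "WARC-Type: response",
    "WARC-Target-URI: http://www.archive.org/",
    "WARC-Date: 2008-04-30T20:48:26Z",
    "HTTP/1.1 200 OK",
    "Date: Wed, 30 Apr 2008 20:48:25 GMT",
]
events = [e for e in map(filter_line, sample) if e is not None]
for e in events:
    print(e)
```

Note that the "WARC-Date" line is dropped because the check is anchored at the start of the line, so only the HTTP "Date:" header is parsed.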
With the sample file, this Logstash configuration inserts the following JSON documents into Elasticsearch:
{"message":"WARC-Target-URI: http://www.archive.org/robots.txt","@version":"1","@timestamp":"2016-11-22T12:55:48.151Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/robots.txt"}
{"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.151Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
{"message":"HTTP/1.1 200 OK","@version":"1","@timestamp":"2016-11-22T12:55:48.167Z","path":"D:\\better.txt","host":"FREIFDKT0021127","response":"200"}
{"message":"Date: Wed, 30 Apr 2008 20:48:25 GMT","@version":"1","@timestamp":"2016-11-22T12:55:48.167Z","path":"D:\\better.txt","host":"FREIFDKT0021127","date":"2008-04-30T20:48:25.000Z"}
{"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
{"message":"WARC-Target-URI: http://www.archive.org/","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/"}
{"message":"WARC-Target-URI: http://www.archive.org/index.php","@version":"1","@timestamp":"2016-11-22T12:55:48.183Z","path":"D:\\better.txt","host":"FREIFDKT0021127","url":"http://www.archive.org/index.php"}
{"message":"HTTP/1.1 200 OK","@version":"1","@timestamp":"2016-11-22T12:55:48.198Z","path":"D:\\better.txt","host":"FREIFDKT0021127","response":"200"}
{"message":"Date: Wed, 30 Apr 2008 20:48:25 GMT","@version":"1","@timestamp":"2016-11-22T12:55:48.198Z","path":"D:\\better.txt","host":"FREIFDKT0021127","date":"2008-04-30T20:48:25.000Z"}
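As the documents show, each one carries a single field, so the URL, status and date of one WARC record end up in separate documents. If one document per record is preferred, the lines could be grouped before indexing. Here is a minimal sketch in Python, assuming every record starts with a "WARC/<version>" line as in the sample; split_records and summarize are hypothetical helper names, and the result would still have to be sent to Elasticsearch separately.

```python
import re


def split_records(text):
    """Split WARC text into records; each one starts at a 'WARC/<version>' line."""
    records, current = [], None
    for line in text.splitlines():
        if re.match(r"^WARC/\d", line):
            current = []
            records.append(current)
        if current is not None:
            current.append(line)
    return records


def summarize(record):
    """Collect the target URI, HTTP status and HTTP Date header of one record."""
    doc = {}
    for line in record:
        if line.startswith("WARC-Target-URI: "):
            doc["url"] = line[len("WARC-Target-URI: "):]
        elif line.startswith("HTTP/1.1 ") and "response" not in doc:
            doc["response"] = line.split()[1]
        elif line.startswith("Date: ") and "date" not in doc:
            # Only the first Date header counts; later ones may be page content.
            doc["date"] = line[len("Date: "):]
    return doc


sample = """WARC/0.17
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/robots.txt
WARC-Date: 2008-04-30T20:48:25Z
Content-Type: text/anvl
WARC/0.17
WARC-Type: response
WARC-Target-URI: http://www.archive.org/
WARC-Date: 2008-04-30T20:48:26Z
HTTP/1.1 200 OK
Date: Wed, 30 Apr 2008 20:48:25 GMT
Content-Type: text/html; charset=UTF-8
"""

docs = [summarize(record) for record in split_records(sample)]
for doc in docs:
    print(doc)
```

The metadata record yields only a url, while the response record yields url, response and date together.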