Python script: converting TSV to JSON

Question

I have a very large chunk of data that looks like this:

#Software: SGOS 5.4.3.7
#Version: 1.0
#Start-Date: 2011-07-22 20:34:51
#Date: 2011-06-02 10:20:47
#Fields: date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id
#Remark: 2610140037 "SG-42" "82.137.200.42" "main"
2011-07-22 20:34:51 282 ce6de14af68ce198 - - - OBSERVED "unavailable" http://www.surfjunky.com/members/sj-a.php?r=44864  200 TCP_NC_MISS GET text/html http www.surfjunky.com 80 /members/sj-a.php ?r=66556 php "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24" 82.137.200.42 1395 663 -
2011-07-22 20:34:51 216 6154d919f8d56690 - - - OBSERVED "unavailable" http://x31.iloveim.com/build_3.9.2.1/comet.html  200 TCP_NC_MISS GET text/html;charset=UTF-8 http x31.iloveim.com 80 /servlets/events ?1122064400327 - "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18" 82.137.200.42 473 1129 -
2011-07-22 20:34:51 307 6d98469a3f1de6f4 - - - OBSERVED "unavailable" http://www.xnxx.com/  200 TCP_MISS GET image/jpeg http img100.xvideos.com 80 /videos/thumbsl/2/e/5/2e5fd679f1118757314fb9a94c0f626c.25.jpg - jpg "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C)" 82.137.200.42 16188 415 -

I have about 10 to 15 files that range from 700 MB to 20 GB. I need to convert them to JSON so I can use them for testing some JSON analytics software. I have the start of a script, but I am getting some errors and have some questions.

Here is my conversion script:

#!/usr/bin/python
import csv

jsfile = file('sg_main__test.json', 'w')
jsfile.write('[\r\n')

with open('sg_main__test.log','r') as f:
    next(f)
    reader=csv.reader(f,delimiter='\t')

    row_count = len(list(reader))
    ite = 0

    f.seek(0)
    next(f)

    for date,time,time_taken,c_ip,cs_username,cs_auth_group,x_exception_id,sc_filter_result,cs_categories,cs_Referer,sc_status,s_action,cs_method,rs_Content_Type,cs_uri_scheme,cs_host,cs_uri_port,cs_uri_path,cs_uri_query,cs_uri_extension,cs_User_Agent,s_ip,sc_bytes,cs_bytes,x_virus_id in reader:

        ite+= 1

        jsfile.write('\t{\r\n')

        d = '\t\t\"date\": \"' + date + '\",\r\n'
        t = '\t\t\"time\": \"' + time + '\",\r\n'
        tt = '\t\t\"date\": \"' + date + '\",\r\n'
        ci = '\t\t\"c_ip\": \"' + c_ip + '\",\r\n'
        c = '\t\t\"cs_username\": \"' + cs_username + '\",\r\n'
        ca = '\t\t\"cs_auth_group\": \"' + cs_auth_group + '\",\r\n'
        xe = '\t\t\"x_exception_id\": \"' + x_exception_id + '\",\r\n'
        sf = '\t\t\"sc_filter_result\": \"' + sc_filter_result + '\",\r\n'
        cc = '\t\t\"cs_categories\": \"' + cs_categories + '\",\r\n'
        cr = '\t\t\"cs_Referer\": \"' + cs_Referer + '\",\r\n'
        ss = '\t\t\"sc_status\": \"' + sc_status + '\",\r\n'
        sa = '\t\t\"s_action\": \"' + s_action + '\",\r\n'
        cm = '\t\t\"cs_method\": \"' + cs_method + '\",\r\n'
        rc = '\t\t\"rs_Content_Type\": \"' + rs_Content_Type + '\",\r\n'
        cu = '\t\t\"cs_uri_scheme\": \"' + cs_uri_scheme + '\",\r\n'
        ch = '\t\t\"cs_host\": \"' + cs_host + '\",\r\n'
        cp = '\t\t\"cs_uri_port\": \"' + cs_uri_port + '\",\r\n'
        cpa = '\t\t\"cs_uri_path\": \"' + cs_uri_path + '\",\r\n'
        cq = '\t\t\"cs_uri_query\": \"' + cs_uri_query + '\",\r\n'
        ce = '\t\t\"cs_uri_extension\": \"' + cs_uri_extension + '\",\r\n'
        cua = '\t\t\"cs_User_Agent\": \"' + cs_User_Agent + '\",\r\n'
        si = '\t\t\"s_ip\": \"' + s_ip + '\",\r\n'
        sb = '\t\t\"sc_bytes\": \"' + sc_bytes + '\",\r\n'
        cb = '\t\t\"cs_bytes\": \"' + cs_bytes + '\",\r\n'
        xv = '\t\t\"x_virus_id\": \"' + x_virus_id + '\",\r\n'

        jsfile.write(d)
        jsfile.write(t)
        jsfile.write(tt)
        jsfile.write(ci)
        jsfile.write(c)
        jsfile.write(ca)
        jsfile.write(xe)
        jsfile.write(sf)
        jsfile.write(cc)
        jsfile.write(cr)
        jsfile.write(ss)
        jsfile.write(sa)
        jsfile.write(cm)
        jsfile.write(rc)
        jsfile.write(cu)
        jsfile.write(ch)
        jsfile.write(cp)
        jsfile.write(cpa)
        jsfile.write(cq)
        jsfile.write(ce)
        jsfile.write(cua)
        jsfile.write(si)
        jsfile.write(sb)
        jsfile.write(cb)
        jsfile.write(xv)

        jsfile.write('\t}')

        if ite < row_count:
            jsfile.write('\r\n')

        jsfile.write('\r\n')

jsfile.write(']')
jsfile.close()

Running it returns an error.

Questions: Is there something in the context I'm missing that explains why it returns "need more than 1 value to unpack"?

Is there a way I can have it read a directory and convert all the files in it without having to define the input file name?

Related to the question above, can I have it save the exported file under the original filename but with a .json extension, without having to define the output filename manually?

Recommended answer

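Your sample data isn't in TSV format: the fields are separated by spaces, not tabs, and the formatting is inconsistent (for example, there are two spaces between the referer and the status code).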
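A quick way to see where "need more than 1 value to unpack" comes from is to parse one line with both delimiters. The following is only a minimal sketch, written for Python 3, and the `sample` string is a shortened, made-up line in the same style as the log (it is not taken from the real data). With `delimiter='\t'` the reader finds no tabs and returns the whole line as a single field, which cannot be unpacked into 25 names; with `delimiter=' '` the line splits as intended.

```python
import csv
import io

# Shortened, illustrative line in the same space-separated style as the sample data.
sample = '2011-07-22 20:34:51 282 ce6de14af68ce198 "Mozilla/5.0 (Windows NT 6.1)" -\n'

# Tab-delimited parsing finds no tabs, so the row has exactly one field.
tab_row = next(csv.reader(io.StringIO(sample), delimiter='\t'))

# Space-delimited parsing splits the fields and keeps the quoted string whole.
space_row = next(csv.reader(io.StringIO(sample), delimiter=' ', quotechar='"'))

print(len(tab_row))    # number of fields seen with a tab delimiter
print(len(space_row))  # number of fields seen with a space delimiter
```

Even with a genuinely tab-separated file, the question's script would hit the same error, because `next(f)` skips only the first of the six `#` lines and the next one (`#Version: 1.0`) is again a single field.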

#!/usr/bin/env python

import csv
import json
import collections

# aliases
OrderedDict = collections.OrderedDict

src = '/tmp/data.log'
dst = '/tmp/data.json'
header = [
    'date', 'time', 'time_taken', 'c_ip', 'cs_username', 'cs_auth_group',
    'x_exception_id', 'sc_filter_result', 'cs_categories', 'cs_Referer',
    'sc_status', 's_action', 'cs_method', 'rs_Content_Type', 'cs_uri_scheme',
    'cs_host', 'cs_uri_port', 'cs_uri_path', 'cs_uri_query',
    'cs_uri_extension', 'cs_User_Agent', 's_ip', 'sc_bytes', 'cs_bytes',
    'x_virus_id'
]

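data = []
with open(src, 'r') as csvfile:
    # The sample is space-separated; quoted fields (e.g. the User-Agent) keep their spaces.
    reader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in reader:
        # Skip blank lines and the '#' metadata lines at the top of the log.
        if not row or row[0].lstrip().startswith('#'):
            continue
        # Drop the empty fields produced by consecutive spaces.
        row = [field for field in row if field]
        data.append(OrderedDict(zip(header, row)))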

with open(dst, 'w') as jsonfile:
    json.dump(data, jsonfile, indent=2)

Which gives me this output:

$ cat data.json
[
  {
    "date": "2011-07-22",
    "time": "20:34:51",
    "time_taken": "282",
    "c_ip": "ce6de14af68ce198",
    "cs_username": "-",
    "cs_auth_group": "-",
    "x_exception_id": "-",
    "sc_filter_result": "OBSERVED",
    "cs_categories": "unavailable",
    "cs_Referer": "http://www.surfjunky.com/members/sj-a.php?r=44864",
    "sc_status": "200",
    "s_action": "TCP_NC_MISS",
    "cs_method": "GET",
    "rs_Content_Type": "text/html",
    "cs_uri_scheme": "http",
    "cs_host": "www.surfjunky.com",
    "cs_uri_port": "80",
    "cs_uri_path": "/members/sj-a.php",
    "cs_uri_query": "?r=66556",
    "cs_uri_extension": "php",
    "cs_User_Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24",
    "s_ip": "82.137.200.42",
    "sc_bytes": "1395",
    "cs_bytes": "663",
    "x_virus_id": "-"
  },
  {
    "date": "2011-07-22",
    "time": "20:34:51",
    "time_taken": "216",
    "c_ip": "6154d919f8d56690",
    "cs_username": "-",
    "cs_auth_group": "-",
    "x_exception_id": "-",
    "sc_filter_result": "OBSERVED",
    "cs_categories": "unavailable",
    "cs_Referer": "http://x31.iloveim.com/build_3.9.2.1/comet.html",
    "sc_status": "200",
    "s_action": "TCP_NC_MISS",
    "cs_method": "GET",
    "rs_Content_Type": "text/html;charset=UTF-8",
    "cs_uri_scheme": "http",
    "cs_host": "x31.iloveim.com",
    "cs_uri_port": "80",
    "cs_uri_path": "/servlets/events",
    "cs_uri_query": "?1122064400327",
    "cs_uri_extension": "-",
    "cs_User_Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18",
    "s_ip": "82.137.200.42",
    "sc_bytes": "473",
    "cs_bytes": "1129",
    "x_virus_id": "-"
  },
  {
    "date": "2011-07-22",
    "time": "20:34:51",
    "time_taken": "307",
    "c_ip": "6d98469a3f1de6f4",
    "cs_username": "-",
    "cs_auth_group": "-",
    "x_exception_id": "-",
    "sc_filter_result": "OBSERVED",
    "cs_categories": "unavailable",
    "cs_Referer": "http://www.xnxx.com/",
    "sc_status": "200",
    "s_action": "TCP_MISS",
    "cs_method": "GET",
    "rs_Content_Type": "image/jpeg",
    "cs_uri_scheme": "http",
    "cs_host": "img100.xvideos.com",
    "cs_uri_port": "80",
    "cs_uri_path": "/videos/thumbsl/2/e/5/2e5fd679f1118757314fb9a94c0f626c.25.jpg",
    "cs_uri_query": "-",
    "cs_uri_extension": "jpg",
    "cs_User_Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; .NET4.0C)",
    "s_ip": "82.137.200.42",
    "sc_bytes": "16188",
    "cs_bytes": "415",
    "x_virus_id": "-"
  }
]
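The script above does not address the other two questions (converting every file in a directory and deriving the output name from the input name), and it builds the whole list in memory, which may be a problem for a 20 GB file. The following is a rough sketch of one way to handle both, assuming Python 3.7+ (where plain `dict` keeps insertion order) and assuming the logs end in `.log` as in the question's `sg_main__test.log`. The function names `iter_records`, `convert_file` and `convert_directory` are made up for this example.

```python
import csv
import glob
import json
import os

# Column names, in the same order as the header list used in the answer.
HEADER = [
    'date', 'time', 'time_taken', 'c_ip', 'cs_username', 'cs_auth_group',
    'x_exception_id', 'sc_filter_result', 'cs_categories', 'cs_Referer',
    'sc_status', 's_action', 'cs_method', 'rs_Content_Type', 'cs_uri_scheme',
    'cs_host', 'cs_uri_port', 'cs_uri_path', 'cs_uri_query',
    'cs_uri_extension', 'cs_User_Agent', 's_ip', 'sc_bytes', 'cs_bytes',
    'x_virus_id'
]


def iter_records(src):
    """Yield one dict per data line, skipping blank lines and '#' lines."""
    with open(src, 'r', newline='') as logfile:
        reader = csv.reader(logfile, delimiter=' ', quotechar='"')
        for row in reader:
            if not row or row[0].lstrip().startswith('#'):
                continue
            # Drop the empty fields produced by consecutive spaces.
            fields = [field for field in row if field]
            yield dict(zip(HEADER, fields))


def convert_file(src, dst):
    """Write the records of src to dst as a JSON array, one record at a time."""
    with open(dst, 'w') as jsonfile:
        jsonfile.write('[\n')
        for index, record in enumerate(iter_records(src)):
            if index:
                # Separator goes before every record except the first.
                jsonfile.write(',\n')
            json.dump(record, jsonfile)
        jsonfile.write('\n]\n')


def convert_directory(directory, pattern='*.log'):
    """Convert every matching file; each output keeps the input name with a .json extension."""
    written = []
    for src in sorted(glob.glob(os.path.join(directory, pattern))):
        dst = os.path.splitext(src)[0] + '.json'
        convert_file(src, dst)
        written.append(dst)
    return written
```

Calling `convert_directory('/path/to/logs')` (a placeholder path) would write `sg_main__test.json` next to `sg_main__test.log`, and so on for each matching file. Because records are written as they are read, memory use stays roughly constant regardless of file size; the trade-off is that the output is compact rather than indented.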
