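I'm new to pandas. I have a log text file and I'm trying to extract some data points from it. The code below gets the data I need, but not in the format I want. I want a pandas DataFrame with two columns.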
import os
from collections import Counter
import pandas as pd
#print(os.getcwd())
infile = "myfile.txt"
important = []
keep_phrases = ["Host",
                "User-Agent"
                ]
with open(infile) as f:
    f = f.readlines()
for line in f:
    for phrase in keep_phrases:
        if phrase in line:
            important.append(line)
            break
#print(type(important))
print(important)
#Counter(important)
pd.DataFrame(important)
This does not give me two-column output. I want Host and User-Agent together on one row.
A sample of the text file follows:
15 SessionOpen c aa.bb.cc.ddd 62667 :8080
15 SessionClose c pipe
15 ReqStart c aa.bb.cc.ddd 62667 442374415
15 RxURL c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag666
15 RxHeader c Host: ll.abrstream.channel4.com
15 RxHeader c Connection: keep-alive
15 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
15 RxHeader c X-Requested-With: ShockwaveFlash/21.0.0.216
15 RxHeader c Accept: */*
15 RxHeader c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
15 RxHeader c Accept-Encoding: gzip, deflate, sdch
15 RxHeader c Accept-Language: en-US,en;q=0.8
15 ReqEnd c 442374415 1461870946.496117592 1461870947.112555504 0.000315428 0.001363039 0.615074873
15 SessionOpen c aa1.bb1.cc1.ddd1 59409 :8080
15 SessionClose c pipe
15 ReqStart c aa1.bb1.cc1.ddd1 59409 442374416
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212
15 SessionOpen c aa1.bb1.cc1.ddd1 52179 :8080
15 SessionClose c pipe
15 ReqStart c aa1.bb1.cc1.ddd1 52179 442374417
15 RxURL c /gpsApi.php
15 RxHeader c Content-Length: 0
15 RxHeader c Host: map.yanue.net
15 RxHeader c Connection: Keep-Alive
15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)
15 ReqEnd c 442374417 1461870951.776547432 1461870952.448071241 0.000062943 0.001109123 0.670414686
18 SessionOpen c aa.bb.cc.ddd 62670 :8080
18 SessionClose c pipe
18 ReqStart c aa.bb.cc.ddd 62670 442374418
18 RxURL c /61665002001003_001/CH4_08_02_24_61665002001003_001_16x9_1500000_Seg1-Frag667
18 RxHeader c Host: ll.abrstream.channel4.com
18 RxHeader c Connection: keep-alive
18 RxHeader c User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
18 RxHeader c X-Requested-With: ShockwaveFlash/21.0.0.216
18 RxHeader c Accept: */*
18 RxHeader c Referer: http://www.channel4.com/programmes/the-tiny-tots-talent-agency/on-demand/61665-002
18 RxHeader c Accept-Encoding: gzip, deflate, sdch
18 RxHeader c Accept-Language: en-US,en;q=0.8
18 ReqEnd c 442374418 1461870951.920178175 1461870952.507097483 0.001731873 0.001337051 0.585582256
15 SessionOpen c aa1.bb1.cc1.ddd1 48034 :8080
15 SessionClose c pipe
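Best answer
You can create the DataFrame by building a list of lists and then passing it to the DataFrame constructor.
As you started doing, loop over each line of the file, then split each line into separate columns. You can use re.split to build the list of columns, limiting the maximum number of splits so that the last column is kept as a single element. Alternatively, if you know the fields are always aligned at the same positions, you can use slicing to build that list.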
import re
df_list = []
with open(infile) as f:
    for line in f:
        # remove whitespace at the start and the newline at the end
        line = line.strip()
        # split on whitespace at most 4 times, so the fifth column keeps
        # any spaces it contains (e.g. the full User-Agent string)
        columns = re.split(r'\s+', line, maxsplit=4)
        df_list.append(columns)
You can then create the DataFrame using the method from this answer.
df = pd.DataFrame(df_list)
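The DataFrame above has one row per log line, not yet the two-column Host / User-Agent layout the question asks for. One possible way to get there, sketched under some assumptions: each request is delimited by a ReqStart and a ReqEnd line, the first column identifies the session so interleaved requests can be kept apart, and requests with neither header are skipped. The function name host_agent_frame and the variable names are hypothetical, chosen only for this illustration; the sample lines are copied from the log in the question.

```python
import re
import pandas as pd

def host_agent_frame(lines):
    """Return a DataFrame with one row per request and two columns."""
    wanted = ("Host", "User-Agent")
    pending = {}   # session id (first column) -> headers collected so far
    rows = []
    for line in lines:
        # Same split as above: at most 5 fields, the last keeps its spaces.
        fields = re.split(r"\s+", line.strip(), maxsplit=4)
        if len(fields) < 2:
            continue
        key, tag = fields[0], fields[1]
        if tag == "ReqStart":
            # Assumption: ReqStart opens a new request for this session.
            pending[key] = {}
        elif tag == "RxHeader" and len(fields) == 5 and key in pending:
            name = fields[3].rstrip(":")
            if name in wanted:
                pending[key][name] = fields[4]
        elif tag == "ReqEnd" and key in pending:
            # Assumption: ReqEnd closes the request; emit its row.
            row = pending.pop(key)
            if row:
                rows.append(row)
    return pd.DataFrame(rows, columns=list(wanted))

sample = [
    "15 ReqStart c aa1.bb1.cc1.ddd1 59409 442374416",
    "15 RxURL c /gpsApi.php",
    "15 RxHeader c Content-Length: 0",
    "15 RxHeader c Host: map.yanue.net",
    "15 RxHeader c Connection: Keep-Alive",
    "15 RxHeader c User-Agent: Apache-HttpClient/UNAVAILABLE (java 1.4)",
    "15 ReqEnd c 442374416 1461870950.580444574 1461870951.139206648 0.000064135 0.001196861 0.557565212",
]
df = host_agent_frame(sample)
print(df)
```

For a real file, the open file object can be passed in directly, for example `with open(infile) as f: df = host_agent_frame(f)`. If a request has only one of the two headers, the missing cell is left as NaN.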