Problem description
I am using Python's logging module to generate log files during processing, and I am trying to read those log files into a list/dict that will then be converted to JSON and loaded into a NoSQL database for processing.
The file is generated in the following format:
2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files
NOTE: In the actual file there is a \n line break before each new date, but I can't seem to represent that here.
Basically, I am trying to read this text file and produce JSON objects that look like this:
{
'Date': '2015-05-22 16:46:46,985',
'Type': 'INFO',
'Message':'Starting to Wait for Files'
}
...
{
'Date': '2015-05-22 16:48:48,180',
'Type': 'ERROR',
'Message':'Failed: Waiting for files the Files from Cloud Storage: gs://folder/anotherfolder/ Traceback (most recent call last):
File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}
The problem I am having:
I can add each line to a list or dict, but an ERROR message sometimes spans multiple lines, so I end up splitting it incorrectly.
What I have tried:
I have tried code like the snippet below to split the lines only on valid dates, but I can't seem to capture the error messages that span multiple lines. I also tried regular expressions and think they are a possible solution, but I can't find the right regex to use. I don't really understand how they work, so I tried a lot of copy and paste without any success.
import itertools as it

listNew = []
with open(filename, 'r') as f:
    for key, group in it.groupby(f, lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)
I tried some crazy regex, but had no luck there either:
logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)
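One likely reason the split above gives unusable results: when a pattern contains capturing groups, re.split also returns the text of every group, so the list fills up with fragments such as '20', '05' and '22'. The following is a minimal sketch of that effect and of one possible alternative, assuming every record starts with a YYYY-MM-DD HH:MM:SS timestamp at the beginning of a line. The names sample and RECORD_RE are made up for illustration.

```python
import re

# Hypothetical two-record sample; the second line belongs to the first record.
sample = (
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n"
)

# The pattern from the question: its three capturing groups end up in the result.
date_with_groups = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
fragments = re.split(date_with_groups, sample)

# One match per record: from a timestamp at the start of a line up to
# (but not including) the next such timestamp, or the end of the text.
RECORD_RE = re.compile(
    r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.*?'
    r'(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)',
    re.MULTILINE | re.DOTALL,
)
records = RECORD_RE.findall(sample)

print(fragments[:4])
print(len(records))
```

Here re.MULTILINE lets ^ match at the start of every line and re.DOTALL lets .*? run across line breaks, so a traceback stays attached to the record it follows.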
I would appreciate any help. Thanks.
I have posted a solution below for anyone else struggling with the same thing.
Recommended answer
Using @Joran Beasley's answer, I came up with the following solution, and it seems to work:
Key points:
- My log files ALWAYS follow the same structure: {Date} - {Logger name} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed them. For example, the {Date} is always 23 characters and I only want the first 19 characters.
- Using line.startswith("2015") is crazy because the dates will eventually change, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could be specific.
- The file is read into the first function, "generateDicts()", which calls the "matchDate()" function to see whether the line being processed starts with the {Date} format I am looking for.
- A NEW dict is created every time a valid {Date} is found, and every following line is added to it until the NEXT valid {Date} is encountered.
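A function that checks whether the line being processed starts with a {Date} in the format I am looking for:
import re

# Timestamp at the start of every new record, e.g. "2015-05-22 16:46:46".
DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def matchDate(line):
    # re.match only matches at the start of the line; returns True or False.
    return DATE_RE.match(line) is not None

# Group the lines of the log into one dict per record.
def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if matchDate(line):
            # A new record starts here, so emit the one collected so far.
            if currentDict:
                yield currentDict
            # Layout: {date} - {logger name} - {type} - {message}
            date, _name, level, text = line.split(" - ", 3)
            currentDict = {"date": date[:19], "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. a traceback): append it to the current message.
            currentDict["text"] += line
    # Emit the last record, if the file contained any.
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))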
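To get from the parsed records to the JSON described in the question, something like the following might work. It is only a sketch under stated assumptions: the helper name parse_log_records and the in-memory sample_log are hypothetical, the field layout {date} - {logger name} - {type} - {message} is inferred from the sample log, and the Date/Type/Message keys are the ones shown in the question.

```python
import io
import json
import re

# Assumed timestamp layout, taken from the sample log in the question.
DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def parse_log_records(lines):
    """Hypothetical helper: yield one dict per record, folding continuation lines in."""
    record = None
    for line in lines:
        if DATE_RE.match(line):
            if record is not None:
                yield record
            # Assumed layout: {date} - {logger name} - {type} - {message}
            date, _logger, level, message = line.rstrip("\n").split(" - ", 3)
            record = {"Date": date, "Type": level, "Message": message}
        elif record is not None and line.strip():
            # Non-blank continuation line (e.g. a traceback): join it to the message.
            record["Message"] += " " + line.strip()
    if record is not None:
        yield record

# Stand-in for an open log file, shortened from the sample in the question.
sample_log = io.StringIO(
    "2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files "
    "the Files from Cloud Storage: gs://folder/folder/\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
    "\n"
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n"
)

records = list(parse_log_records(sample_log))
json_text = json.dumps(records, indent=2)
print(json_text)
```

Splitting on " - " with a limit of 3 keeps any later " - " inside the message (for example "Return Code - 0") intact. Joining continuation lines with a single space is a design choice for compact JSON; keeping the original line breaks would work just as well.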