Problem description
I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best tool I've found so far is dateutil.
Unfortunately, if a string has multiple date values in it, dateutil raises an error:
>>> from dateutil.parser import parse
>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format
Any thoughts on how to parse all the dates out of a long string? Ideally the result would be a list, but I can build that myself if I need to.
I'm using Python, but at this point other languages are probably OK if they get the job done.
PS - I guess I could recursively split the input in the middle and try again until it works, but that's a hell of a hack.
Recommended answer
Looking at it, the least hacky way would be to modify the dateutil parser to have a fuzzy-multiple option.
parser._parse takes your string, tokenizes it with _timelex, and then compares the tokens with the data defined in parserinfo.
Here, if a token doesn't match anything in parserinfo, the parse will fail unless fuzzy is True.
What I suggest is that you allow non-matches while you don't have any processed time tokens. Then, when you hit a non-match, process the data parsed up to that point and start looking for time tokens again.
It shouldn't take too much effort.
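The batching idea suggested above can be sketched without touching dateutil at all. This is only an illustration of the control flow, not dateutil's implementation: the regular expression, the `MONTHS` and `SKIP` vocabularies, and the function names `is_date_token` and `split_date_runs` are all assumptions made up for this example, and a real `parserinfo` recognizes far more tokens.

```python
import re

# Assumption: a tiny stand-in vocabulary. dateutil's parserinfo knows many more
# tokens (weekdays, abbreviations, am/pm, time zones, ...).
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
# Filler words that may appear inside a date without ending it.
SKIP = {"of", "on", "and", "st", "nd", "rd", "th"}

# Words and digit runs; punctuation is dropped, so it never ends a run either.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+")


def is_date_token(token):
    """Return True for a digit run or a month name."""
    return token.isdigit() or token.lower() in MONTHS


def split_date_runs(text):
    """Yield each run of consecutive date-like tokens as one string."""
    batch = []
    for token in TOKEN.findall(text):
        if token.lower() in SKIP:
            continue  # filler: ignore it, but keep the current run open
        if is_date_token(token):
            batch.append(token)
        elif batch:
            # A non-date token closes the run collected so far.
            yield " ".join(batch)
            batch = []
    if batch:
        yield " ".join(batch)


text = ("I like peas on 2011-04-23, and I also like them on easter "
        "and my birthday, the 29th of July, 1928")
found = list(split_date_runs(text))
print(found)
```

Each string in `found` is a candidate that could then be handed to a real date parser. Non-matches before a run are simply skipped, and the first non-match after a run flushes it, which is the behavior described in the answer.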
Update
While you're waiting for your patch to get rolled in...
This is a little hacky and uses non-public functions in the library, but it doesn't require modifying the library and it isn't trial and error. You might get false positives if the text contains lone tokens that can be converted to floats, so you might need to filter the results some more.
from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

def timetoken(token):
    # Anything that converts to a float counts as a possible date/time component.
    try:
        float(token)
        return True
    except ValueError:
        pass
    # Otherwise ask parserinfo whether it recognizes the token.
    return any(f(token) for f in (info.jump, info.weekday, info.month, info.hms,
                                  info.ampm, info.pertain, info.utczone, info.tzoffset))

def timesplit(input_string):
    batch = []
    for token in _timelex(input_string):
        if timetoken(token):
            # Jump tokens (separators, filler words) are dropped without ending the batch.
            if info.jump(token):
                continue
            batch.append(token)
        else:
            # A non-time token ends the current batch.
            if batch:
                yield " ".join(batch)
                batch = []
    if batch:
        yield " ".join(batch)

for item in timesplit(a):
    print("Found: %s" % item)
    print("Parsed: %s" % p.parse(item))
Yields:
Found: 2011 04 23
Parsed: 2011-04-23 00:00:00
Found: 29 July 1928
Parsed: 1928-07-29 00:00:00
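One possible way to filter out the float false positives mentioned above is to drop any candidate that is nothing more than a single bare number. This is a heuristic sketch, not part of dateutil: the helper name `is_plausible_date` and the sample candidates `"3"` and `"4.5"` are made up for illustration, and stricter rules may suit your data better.

```python
def is_plausible_date(candidate):
    """Heuristic (an assumption of this sketch): reject empty strings and lone numbers."""
    tokens = candidate.split()
    if not tokens:
        return False
    if len(tokens) >= 2:
        return True
    # A single token: keep it only if it is not just a number.
    try:
        float(tokens[0])
    except ValueError:
        return True
    return False


# Illustrative candidates; "3" and "4.5" stand in for stray numbers in the text.
candidates = ["2011 04 23", "3", "29 July 1928", "4.5"]
filtered = [c for c in candidates if is_plausible_date(c)]
print(filtered)
```

The same check could be applied to each item produced by timesplit before passing it to p.parse.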
Update for Dieter
Dateutil 2.1 appears to be written for compatibility with Python 3 and uses a "compatibility" library called six. Something isn't right with it, and it's not treating str objects as text.
This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:
from cStringIO import StringIO

for item in timesplit(StringIO(a)):
    print "Found:", item
    print "Parsed:", p.parse(StringIO(item))
If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. For example:
from dateutil.parser import _timelex, parser, parserinfo
info = parserinfo(dayfirst=True)
p = parser(info)