How to use the python_dateutil 1.5 'parse' function to handle Unicode?

Problem description

I need the parse() function of python_dateutil 1.5 to work with Unicode month names.

If I use fuzzy=True, it skips the month name and produces a result with month = 1.

When I use it without the fuzzy parameter, I get the following exception:

from dateutil.parser import parserinfo, parser, parse

class myparserinfo(parserinfo):
    MONTHS = parserinfo.MONTHS[:]
    MONTHS[3] = (u"Foo", u"Foo", u"Июнь")


>>> test = unicode('8th of Июнь', 'utf-8')
>>> tester = parse(test, parserinfo=myparserinfo())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 695, in parse
    return parser(parserinfo).parse(timestr, **kwargs)
  File "C:\Python27\lib\site-packages\python_dateutil-1.5-py2.7.egg\dateutil\parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format
Solution

Rik Poggi is right: the string 'Июнь' cannot be a month for python-dateutil. Digging a little into dateutil/parser.py, the basic problem is that this module is only internationalised enough to handle Western European languages written in Latin script. It is not designed to handle languages such as Russian that use a non-Latin script such as Cyrillic.

The biggest obstacle is in dateutil/parser.py:45-48, where the lexical analyser class _timelex defines the characters which can be used in tokens, including month and day names:

class _timelex(object):
    def __init__(self, instream):
        # ... [some material omitted] ...
        self.wordchars = ('abcdfeghijklmnopqrstuvwxyz'
                          'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
                          'ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
                          'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')
        self.numchars = '0123456789'
        self.whitespace = ' \t\r\n'

Because wordchars does not include Cyrillic letters, _timelex emits each byte in the date string as a separate character. This is what Rik observed.
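The effect of a restricted wordchars set can be illustrated with a toy tokenizer. This is a simplified sketch written for Python 3, not dateutil's actual _timelex; the names toy_tokenize, WORDCHARS and CYRILLIC are hypothetical and exist only for this illustration.

```python
# Simplified stand-in for a wordchars-based lexer; NOT dateutil's _timelex.
WORDCHARS = ('abcdefghijklmnopqrstuvwxyz'
             'ABCDEFGHIJKLMNOPQRSTUVWXYZ_')
NUMCHARS = '0123456789'
WHITESPACE = ' \t\r\n'
# Hypothetical extension: the basic Cyrillic letters U+0410..U+044F.
CYRILLIC = ''.join(chr(c) for c in range(0x0410, 0x0450))


def toy_tokenize(text, wordchars=WORDCHARS):
    """Group runs of word or digit characters; emit anything else singly."""
    tokens = []
    current = ''
    kind = None  # 'word', 'num', or None when no run is open
    for ch in text:
        if ch in wordchars:
            ch_kind = 'word'
        elif ch in NUMCHARS:
            ch_kind = 'num'
        else:
            ch_kind = None
        if ch_kind is not None and ch_kind == kind:
            current += ch  # extend the open run
            continue
        if current:
            tokens.append(current)  # close the previous run
        current = ''
        kind = None
        if ch_kind is not None:
            current = ch  # start a new run
            kind = ch_kind
        elif ch not in WHITESPACE:
            tokens.append(ch)  # unknown character: emitted on its own
    if current:
        tokens.append(current)
    return tokens


latin_only = toy_tokenize('8th of Июнь')
# ['8', 'th', 'of', 'И', 'ю', 'н', 'ь']
with_cyrillic = toy_tokenize('8th of Июнь', WORDCHARS + CYRILLIC)
# ['8', 'th', 'of', 'Июнь']
```

With the Latin-only set, the month name never reaches the month lookup as a single token, so no parserinfo entry can match it.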

Another large obstacle is that dateutil uses Python byte strings instead of Unicode strings internally for all of its processing. This means that, even if _timelex were extended to accept Cyrillic letters, there would still be mismatches between the handling of bytes and of characters, as well as problems caused by differences in string encoding between the caller and the python_dateutil source code.
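The byte/character mismatch can be seen without dateutil at all. The following is a minimal sketch in Python 3, where the distinction between bytes and str is explicit (in Python 2, a plain str is already a byte string); the variable names are only illustrative.

```python
name = 'Июнь'
encoded = name.encode('utf-8')

chars = len(name)       # 4 characters
nbytes = len(encoded)   # 8 bytes: each Cyrillic letter takes 2 bytes in UTF-8

# Case folding works on Unicode text ...
lower_text = name.lower() == 'июнь'                        # True
# ... but bytes.lower() only touches ASCII letters, so the
# encoded form of the capital letter is left unchanged.
lower_bytes = encoded.lower() == 'июнь'.encode('utf-8')    # False
```

Any code that lowercases names or counts characters on the encoded form will therefore disagree with code that works on the decoded text.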

There are other minor issues, such as an assumption that every month name is at least 3 characters long (not true for Japanese), and many details related to the Gregorian calendar. It would be helpful for the wordchars field to be picked up from parserinfo if present, so that parserinfo could define the right set of characters for its month and day names.
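The minimum-length assumption can be sketched as follows. This is a hypothetical lookup written for illustration (the names build_lookup, toy_month and min_length are assumptions, not dateutil's API); it only shows how a "3 characters or more" rule rejects short month names such as the Japanese '6月'.

```python
ENGLISH = [('Jan', 'January'), ('Feb', 'February'), ('Mar', 'March'),
           ('Apr', 'April'), ('May', 'May'), ('Jun', 'June'),
           ('Jul', 'July'), ('Aug', 'August'), ('Sep', 'September'),
           ('Oct', 'October'), ('Nov', 'November'), ('Dec', 'December')]
# Japanese month names are the month number followed by the character 月.
JAPANESE = [('%d月' % n,) for n in range(1, 13)]


def build_lookup(months):
    """Map every lowercased name to its 0-based month index."""
    lookup = {}
    for index, names in enumerate(months):
        for name in names:
            lookup[name.lower()] = index
    return lookup


def toy_month(name, lookup, min_length=3):
    """Return the 1-based month number, or None if the name is rejected."""
    if len(name) < min_length:
        return None  # too short to be considered a month name at all
    index = lookup.get(name.lower())
    return None if index is None else index + 1


english = build_lookup(ENGLISH)
japanese = build_lookup(JAPANESE)

jun = toy_month('Jun', english)                      # 6
short = toy_month('6月', japanese)                    # None: only 2 characters
relaxed = toy_month('6月', japanese, min_length=1)    # 6
october = toy_month('10月', japanese)                 # 10: happens to be 3 characters
```

Making the threshold a parameter, as min_length does here, is one way a locale-specific configuration could relax the rule.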

python_dateutil v2.0 has been ported to Python 3, but the design problems above are not significantly changed. The differences between 2.0 and 1.5 are there to handle Python language changes, not to change dateutil's design and data structures.

Oleg, you were able to modify parserinfo, and I suspect you succeeded because your test code didn't use the parser() (and _timelex) of python_dateutil. You in essence supplied your own parser and lexer.

Correcting this problem would require fairly major improvements to the text-handling of python_dateutil. It would be great if someone were to make a patch with that change, and the package maintainers were able to incorporate it.

