Problem description
We are running into a problem when parsing emails from Outlook with Python. Sometimes an email contains characters that cannot be appended to an Excel worksheet using openpyxl. The only error raised is IllegalCharacterError.
I am trying to force this to print out the actual characters that are considered "illegal".
While digging through the openpyxl source, I found this line in cell.py that raises the error:
if next(ILLEGAL_CHARACTERS_RE.finditer(value), None):
    raise IllegalCharacterError
Navigating to where ILLEGAL_CHARACTERS_RE is defined, we find:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
So I tried print(ILLEGAL_CHARACTERS_RE) in the hope that it would print out the values it represents. I am not very experienced with regex or with re.compile, so I was not sure what would happen, but all that was printed to the console was re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]').
Can someone help me figure out how to print these values, or at the very least understand how to find out what they represent?
Recommended answer
In a regular expression (regex for short), what you are seeing is an expression that matches characters in given ranges. The numbers are octal escapes, so \010 is character code 8, not 10. For example:
The first part of the RE:
[\000-\010]
This set contains any character with code 0 to 8, all of which are control characters: anything from NUL (\x00) to BS (backspace).
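Because the escapes in the pattern are written in octal, it can help to convert the range bounds to decimal before looking them up in an ASCII table. A small sketch using only built-ins (the variable names are just for illustration):

```python
# Octal escapes that appear as range bounds in the pattern.
octal_bounds = ["000", "010", "013", "014", "016", "037"]

# int(text, 8) parses a string as a base-8 number.
decimal_bounds = [int(o, 8) for o in octal_bounds]

print(decimal_bounds)
```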
The second part of the RE:
[\013-\014]
Again, these are control characters, specifically codes 11 and 12: VT (vertical tab) and FF (form feed). Neither is printable.
The third part of the RE:
[\016-\037]
This range covers codes 14 to 31, from SO (shift out) to US (unit separator). These are all control characters too; none of them is printable.
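To see exactly which code points the full pattern matches, one option is to test every ASCII character against it and print each match with repr(), which shows an escape sequence instead of the invisible character. This is a minimal sketch; the pattern is copied from the question rather than imported from openpyxl, so it assumes the installed version still uses the same expression.

```python
import re

# Pattern copied from the question (assumed to match openpyxl's definition).
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

# Collect every ASCII code point that the pattern matches.
illegal_codes = [i for i in range(128) if ILLEGAL_CHARACTERS_RE.match(chr(i))]

for code in illegal_codes:
    # repr() prints the escaped form, e.g. '\x07', rather than the raw control character.
    print(code, repr(chr(code)))
```

Tab (9), line feed (10), and carriage return (13) do not appear in the output, because the pattern deliberately skips them.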
So the reason you cannot see the illegal characters when you print them is that the pattern matches no printable characters at all. Printable ASCII runs from code 32 (the space character) to 126, while the pattern covers everything from \000 to \037 except \011 (tab), \012 (line feed), and \015 (carriage return), which are allowed. In other words, you are trying to print control characters that have no visible representation.
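For the original goal of finding out which characters in a particular email trigger the error, a sketch along these lines may help. The helper names find_illegal and strip_illegal and the sample string are hypothetical, and the pattern is again copied from the question rather than imported from openpyxl.

```python
import re

# Pattern copied from the question (assumed to match openpyxl's definition).
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

def find_illegal(value):
    """Return (index, escaped character) pairs for each illegal character in value."""
    return [(m.start(), repr(m.group())) for m in ILLEGAL_CHARACTERS_RE.finditer(value)]

def strip_illegal(value):
    """Return value with every illegal character removed."""
    return ILLEGAL_CHARACTERS_RE.sub("", value)

# Hypothetical email text containing a BEL (\x07) and a VT (\x0b).
sample = "Hello\x07 world\x0b!"

print(find_illegal(sample))
print(strip_illegal(sample))
```

Running find_illegal on a value before appending it to the worksheet shows where the offending characters are, and strip_illegal gives a cleaned string that should no longer match the pattern.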
Here is an ASCII table for reference: https://www.w3schools.com/charsets/ref_html_ascii.asp
I hope this helps!