Python 3: if a string contains only ASCII, is it equal to the string as bytes?

Question

Consider Python 3 SMTPD - the data received is contained in a string. http://docs.python.org/3.4/library/smtpd.html quote: "and data is a string containing the contents of the e-mail"

Facts (correct?):

  • Strings in Python 3 are Unicode.
  • Emails are always ASCII.
  • Pure ASCII is valid Unicode.

Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?

Thus my question, if I decode the SMTPD DATA string to ASCII, or convert the DATA string to bytes, is this equivalent to the bytes of the actual email message that arrived via SMTP?

Context, (and perhaps a better question) is "How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?" My concern is that when DATA goes through string to bytes conversion then somehow it has been changed from the original bytes that arrived via SMTP.

It seems the Python developers think SMTPD should be returning binary data anyway. This doesn't seem to have been fixed: http://bugs.python.org/issue19662

Recommended answer

No. It is not equal in Python 3:

>>> '1' == b'1'
False

A bytes object is not equal to a str (Unicode string) object, in much the same way that an integer is not equal to a string:

>>> '1' == 1
False

In some programming languages the above comparisons are true e.g., in Python 2:

>>> b'1' == u'1'
True

and 1 == '1' in Perl:

$ perl -e "print qq(True\n) if 1 == q(1)"
True

Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.

Yes. Strings are immutable sequences of Unicode code points in Python 3.

Most emails are transported as 7-bit messages (ASCII range: hex 00-7F), though "virtually all modern email servers are 8-bit clean", i.e., 8-bit content won't be corrupted. The 8BITMIME extension also sanctions the passing of some 8-bit content.

In other words: emails are not always ASCII.

ASCII is a character encoding. You can decode some byte sequences to Unicode using US-ASCII character encoding. Unicode strings have no associated character encoding i.e., you can encode them into bytes using any character encoding that can represent corresponding Unicode code points.
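As a minimal sketch of this point, using only the built-in str and bytes methods: the same text yields different bytes depending on the chosen codec, and US-ASCII can only represent code points in the 00-7F range. The sample values are illustrative and not taken from the question.

```python
# Text has no encoding of its own; bytes appear only once a codec is chosen.
text = "caf\u00e9"  # four Unicode code points, the last one is U+00E9

utf8_bytes = text.encode("utf-8")      # U+00E9 becomes two bytes
latin1_bytes = text.encode("latin-1")  # U+00E9 becomes one byte

# US-ASCII cannot represent U+00E9, so encoding with it fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# In the other direction, bytes in the 00-7F range decode as US-ASCII.
decoded = b"Subject: hello".decode("ascii")

print(utf8_bytes, latin1_bytes, ascii_ok, decoded)
```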

If the input is in the ASCII range, then data.decode('ascii', 'strict').encode('ascii') == data. However, Lib/smtpd.py applies some conversions to the input data (according to RFC 5321), so the content that you get as data may differ even if the input is pure ASCII.
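The two halves of that statement can be illustrated with a small sketch. The helper `undo_dot_stuffing` is hypothetical and is not the code in Lib/smtpd.py; it only imitates the transparency rule of RFC 5321 (section 4.5.2), under which the receiver deletes the extra leading period that the sender added to lines starting with a period. The sample message is made up.

```python
def undo_dot_stuffing(raw: bytes) -> bytes:
    """Hypothetical helper: drop one leading '.' from every line that
    starts with '.', as a receiver does for RFC 5321 transparency."""
    lines = raw.split(b"\r\n")
    return b"\r\n".join(
        line[1:] if line.startswith(b".") else line for line in lines
    )

# Pure-ASCII message body as it might appear on the wire (assumed sample).
wire = b"Hello\r\n..hidden dot\r\nBye"

# The ASCII decode/encode round trip is lossless for ASCII-range input.
round_trip_ok = wire.decode("ascii", "strict").encode("ascii") == wire

# After transparency handling the content no longer matches the wire bytes.
content = undo_dot_stuffing(wire)
print(round_trip_ok, content != wire)
```

So even when encoding and decoding are lossless, the handler sees the processed content rather than the exact bytes that arrived.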

My goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form in which they arrived.

The bug that you've linked (smtpd.py should not decode utf-8) makes smtpd.py not 8-bit clean.

You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as is.

It is true. It is a nice property of UTF-8 encoding. If you can decode a byte sequence into Unicode using US-ASCII character encoding then you can also decode the bytes using UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).
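A quick way to check this property, as a sketch with an assumed sample header line:

```python
# Every byte here is in the 00-7F range (assumed sample data).
data = b"From: alice@example.com\r\n"

as_ascii = data.decode("ascii")
as_utf8 = data.decode("utf-8")

# Both codecs produce the same text...
same_text = as_ascii == as_utf8
# ...and each code point equals the value of the byte it came from.
same_code_points = [ord(ch) for ch in as_ascii] == list(data)

print(same_text, same_code_points)
```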

smtpd.py should have used either latin1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ASCII byte) instead of utf-8 (it accepts some non-ASCII byte sequences, which is bad).
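The difference between the three codecs can be seen in a short sketch. The sample byte strings and the helper `try_decode` are illustrative assumptions, not part of smtpd.py.

```python
# Assumed samples: pure ASCII, valid UTF-8, a Latin-1 byte, and all 256 bytes.
samples = [b"plain ascii", b"caf\xc3\xa9", b"caf\xe9", bytes(range(256))]

def try_decode(raw, encoding):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return None

# For each codec, record which samples decode without error.
results = {
    enc: [try_decode(s, enc) is not None for s in samples]
    for enc in ("latin-1", "ascii", "utf-8")
}

# latin-1 maps every byte to one code point, so the round trip is lossless.
latin1_round_trips = all(
    s.decode("latin-1").encode("latin-1") == s for s in samples
)

for enc, ok in results.items():
    print(enc, ok)
```

latin-1 accepts every sample and ascii rejects every sample containing a byte above 7F, whereas utf-8 accepts only those non-ASCII sequences that happen to be valid UTF-8.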

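Keep in mind:

  • some emails may contain bytes outside the ASCII range
  • the de-transparency step required by RFC 5321 does not preserve the input bytes as is, even if they are all in the ASCII range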
