


I've been trying to debug this for far too long, and I obviously have no idea what I'm doing, so hopefully someone can help. I'm not even sure what I should be asking, but here it goes:


I'm trying to send Apple Push Notifications, which have a payload size limit of 256 bytes. After subtracting some overhead, I'm left with about 100 English characters of main message content.


So if a message is longer than the max, I truncate it:

body = (body[:MAX_PUSH_LENGTH]) if len(body) > MAX_PUSH_LENGTH else body


That works fine: no matter how long an English message is, the push notification sends successfully. However, now I have an Arabic string:

str = """هيك بنكون
عيش بجنون تون تون تون هيك بنكون
عيش بجنون تون تون تون
أوكي أ"""

>>> print len(str)


So that should truncate. But, I always get an invalid payload size error! Curious, I kept lowering the MAX_PUSH_LENGTH threshold to see what it would take for it to succeed, and it's not until I set the limit to around 60 that the push notification succeeded.


I'm not sure whether this has something to do with the byte size of languages other than English. My understanding is that an English character takes one byte, so does an Arabic character take 2 bytes? Might this have something to do with it?


Also, the string is JSON encoded before it is sent off, so it ends up looking something like this: \u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646 \n\u0639\u064a\u0634 ... Could it be that it is being interpreted as a raw string, and just u0647 is 5 bytes?


What should I be doing here? Are there any obvious errors or am I not asking the right question?



You need to truncate by byte length, not by character count: first .encode('utf-8') your string, then cut the resulting bytes at a code point boundary.


In UTF-8, ASCII characters (<= 127) take 1 byte each. Bytes with the two or more most significant bits set (>= 192) are character-starting bytes; the number of leading 1 bits gives the total length of the sequence in bytes. Everything else (128 to 191) is a continuation byte.
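To make the sizes concrete, here is a small Python 3 sketch (the snippets in this answer are Python 2). The helper names utf8_size and json_size are made up for illustration. It also shows why the JSON step from the question matters: json.dumps escapes non-ASCII characters by default, so if the size limit is applied to the JSON text as sent, each Arabic letter costs 6 bytes rather than 2.

```python
import json

def utf8_size(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def json_size(text, ensure_ascii=True):
    """Number of bytes of the JSON string literal, quotes included."""
    return len(json.dumps(text, ensure_ascii=ensure_ascii).encode("utf-8"))

letter = "\u0647"  # Arabic letter heh

print(utf8_size("a"))                         # ASCII letter
print(utf8_size(letter))                      # Arabic letter
print(json_size(letter))                      # escaped as \u0647, plus 2 quotes
print(json_size(letter, ensure_ascii=False))  # raw letter, plus 2 quotes
```

Passing ensure_ascii=False keeps the letters as raw UTF-8 in the JSON text, which is the form the byte-based truncation in this answer measures.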


A problem arises if you cut a multi-byte sequence in the middle: if a character does not fit entirely, drop it completely, starting byte included.
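A quick way to see the problem, sketched in Python 3 (decodes_cleanly is a hypothetical helper, not part of any library): a slice that ends between characters decodes fine, while a slice that ends inside a 2-byte letter raises UnicodeDecodeError.

```python
# Three Arabic letters, 2 bytes each in UTF-8 (6 bytes total).
data = "\u0647\u064a\u0643".encode("utf-8")

def decodes_cleanly(chunk):
    """Return True if the byte string is complete, valid UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes_cleanly(data[:4]))  # cut falls between the 2nd and 3rd letter
print(decodes_cleanly(data[:3]))  # cut falls inside the 2nd letter
```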


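# (first-byte mask, total sequence length), most specific mask first:
# a 3-byte lead such as 0xE0 also matches the 2-byte mask 0xC0.
LENGTH_BY_PREFIX = [
  (0xFC, 6),
  (0xF8, 5),
  (0xF0, 4),
  (0xE0, 3),
  (0xC0, 2),
]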

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1 # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    assert False, 'Invalid byte %r' % first_byte

def cut_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    cut_index = 0
    while cut_index < len(utf8_bytes):
        # Python 2: indexing a byte string gives a 1-char str, hence ord()
        step = codepoint_length(ord(utf8_bytes[cut_index]))
        if cut_index + step > byte_limit:
            # can't go a whole codepoint further, time to cut
            return utf8_bytes[:cut_index]
        cut_index += step
    # length limit is longer than our byte string, so no cutting
    return utf8_bytes


Now test. If .decode() succeeds, we have made a correct cut.

unicode_text = u"هيك بنكون" # note that the literal here is Unicode

print cut_to_bytes_length(unicode_text, 100).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 10).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 5).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 4).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 3).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 2).decode('UTF-8')

# This prints an empty string, because an Arabic letter
# requires at least 2 bytes to represent in UTF-8.
print cut_to_bytes_length(unicode_text, 1).decode('UTF-8')


You can test that the code works with ASCII as well.
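If you are on Python 3, a shorter variant is possible. This is only a sketch of a different technique from the lead-byte table above, and truncate_utf8 is a name made up here: slice the encoded bytes, then let the decoder discard the incomplete trailing sequence with errors="ignore". It relies on the input being valid text, so the only bytes that can be dropped are the leftover head of the last character.

```python
def truncate_utf8(text, byte_limit):
    """Return text encoded as UTF-8, cut to at most byte_limit bytes
    without leaving a partial multi-byte character at the end."""
    encoded = text.encode("utf-8")
    if len(encoded) <= byte_limit:
        return encoded
    # The slice may end inside a character; errors="ignore" drops that
    # incomplete tail, and re-encoding gives back whole characters only.
    return encoded[:byte_limit].decode("utf-8", errors="ignore").encode("utf-8")

arabic = "\u0647\u064a\u0643"  # three Arabic letters, 6 bytes in UTF-8

print(truncate_utf8(arabic, 3).decode("utf-8"))   # only the first letter fits
print(truncate_utf8("hello", 3).decode("utf-8"))  # ASCII cuts at any byte
```

The result is always decodable, so the same .decode() check used above applies to it as well.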


