Problem description
Suppose I have a bunch of UTF-8 files whose contents I send to an external API as unicode strings. The API operates on each unicode string and returns a list of (character_offset, substr) tuples.
The output I need is the begin and end byte offset for each found substring. If I'm lucky the input text contains only ASCII characters (making character offset and byte offset identical), but this is not always the case. How can I find the begin and end byte offsets for a known begin character offset and substring?
I've answered this question myself, but look forward to other solutions to this problem that are more robust, more efficient, and/or more readable.
Recommended answer
I'd solve this using a dictionary mapping character offsets to byte offsets and then looking up the offsets in that.
def get_char_to_byte_map(unicode_string):
    """
    Generates a dictionary mapping character offsets to byte offsets for unicode_string.
    """
    response = {}
    byte_offset = 0
    for char_offset, character in enumerate(unicode_string):
        response[char_offset] = byte_offset
        byte_offset += len(character.encode('utf-8'))
    # Also map the end-of-string offset, so that a substring ending at the
    # last character has a valid end offset.
    response[len(unicode_string)] = byte_offset
    return response

char_to_byte_map = get_char_to_byte_map(text)
for character_offset, substring in api_response:
    begin_offset = char_to_byte_map[character_offset]
    end_offset = char_to_byte_map[character_offset + len(substring)]
    # do something
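To make the lookup concrete, here is a small worked example. It is only a sketch: it assumes Python 3 (where `str` is a sequence of code points), it assumes the API counts offsets in code points, and the sample text and `api_response` value are made up for illustration. It uses a list instead of a dictionary, which works the same way because the keys are consecutive integers.

```python
def char_to_byte_offsets(text):
    """Return a list where entry i is the UTF-8 byte offset of character i.

    The list has len(text) + 1 entries, so the final entry is the total
    encoded length and can be used as an end offset.
    """
    offsets = [0]
    for character in text:
        offsets.append(offsets[-1] + len(character.encode("utf-8")))
    return offsets


# "naïve café": the characters ï and é each take two bytes in UTF-8.
text = "na\u00efve caf\u00e9"
# Hypothetical API result: a list of (character_offset, substr) tuples.
api_response = [(6, "caf\u00e9")]

offsets = char_to_byte_offsets(text)
spans = []
for character_offset, substring in api_response:
    begin = offsets[character_offset]
    end = offsets[character_offset + len(substring)]
    spans.append((begin, end))

# Slicing the encoded bytes with the computed span recovers the substring.
encoded = text.encode("utf-8")
recovered = encoded[spans[0][0]:spans[0][1]].decode("utf-8")
```

Here the substring starts at character offset 6 but at byte offset 7, because the two-byte ï precedes it.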
How this solution performs compared to yours depends a lot on the size of the input and the number of substrings involved. Local micro-benchmarking suggests that encoding each character of a text individually takes about 1000 times as long as encoding the entire text at once.
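If only a few substrings are involved, one way to avoid per-character encoding is to encode the prefix before each match in a single call. The following is a minimal sketch under the same assumptions (Python 3, offsets counted in code points); `byte_span` is a hypothetical helper name, not something from the question or the API.

```python
def byte_span(text, character_offset, substring):
    """Return (begin, end) UTF-8 byte offsets of substring in text.

    The begin offset is the encoded length of everything before the match;
    the end offset adds the encoded length of the match itself.
    """
    begin = len(text[:character_offset].encode("utf-8"))
    end = begin + len(substring.encode("utf-8"))
    return begin, end


# Same made-up sample as a quick check: "naïve café", match at character 6.
example_span = byte_span("na\u00efve caf\u00e9", 6, "caf\u00e9")
```

Each call re-encodes a prefix of the text, so the cost grows with both the text length and the number of substrings. For many substrings in a long text, the precomputed map is likely the better trade-off.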