Problem description
I am trying to clean all of the HTML out of a string so that the final output is a text file. I have done some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process, and there is a lot of variability in the quality of the underlying HTML. To begin comparing the speed of my solution with one of the alternatives, for example pyparsing, I decided to test replacing \xa0 using the string method replace. I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
The actual line of code is
s=unicodestring.replace('\xa0','')
Anyway, I decided that I needed to preface it with an r, so I ran this line of code:
s=unicodestring.replace(r'\xa0','')
It runs without error, but when I look at a slice of s I see that the \xa0 is still there.
Recommended answer
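You should probably pass unicode literals, so that both arguments are the same type as the string being searched: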
s=unicodestring.replace(u'\xa0',u'')
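To see why the two attempts in the question behaved as they did, here is a minimal sketch. The sample text in `unicodestring` is made up for illustration, and the `b'\xa0'.decode('ascii')` call is only a stand-in for the implicit ASCII decoding that Python 2 performs when a byte string is mixed with a unicode string; it is not the asker's actual code.

```python
# Hypothetical sample text; u'\xa0' is the non-breaking space (U+00A0).
unicodestring = u'price:\xa0100\xa0USD'

# A raw string keeps the backslash, so this pattern is four characters
# (backslash, x, a, 0) and never matches the single character U+00A0.
raw_pattern = r'\xa0'
unchanged = unicodestring.replace(raw_pattern, u'')

# A unicode literal yields the single character U+00A0, which does match.
cleaned = unicodestring.replace(u'\xa0', u'')

# Stand-in for what Python 2 attempts implicitly with a plain '\xa0':
# decoding the byte 0xa0 as ASCII, which fails.
try:
    b'\xa0'.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
```

Here `unchanged` is identical to the input, which matches the observation that the raw-string version ran without error but left the \xa0 in place, while `cleaned` has the non-breaking spaces removed.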