Problem description
I'm trying to import text from an OpenOffice document (saved as .sxw,
reading the data from content.xml inside the sxw archive using
ElementTree and similar tools).
The encoding that gives me the fewest problems seems to be cp1252,
but it's not perfect: there are still characters in it like \93 or \94.
Has anyone handled this before? I'd rather not reinvent the wheel and
start translating strings 'by hand'.
Anton
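For reference, the reading step described in the question can be sketched as follows. This is only an illustration in Python 3: the helper name `read_sxw_text` is hypothetical, and it assumes content.xml carries a correct XML encoding declaration (or is UTF-8), so the raw bytes are handed to the parser instead of being decoded by hand.

```python
import zipfile
import xml.etree.ElementTree as ET

def read_sxw_text(path):
    """Return all text found in content.xml of an .sxw archive.

    `path` may be a file name or a file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        raw = archive.read("content.xml")  # bytes, not yet decoded
    # Passing bytes lets the XML parser pick the encoding from the
    # XML declaration instead of guessing one such as cp1252.
    root = ET.fromstring(raw)
    return "".join(root.itertext())
```

Calling `read_sxw_text("xxxx.sxw")` would then return the document text as a Unicode string, with no explicit codec chosen by the caller.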
Recommended answer
this might help:
http://effbot.org/zone/unicode-gremlins.htm
</F>
Thanks a lot! The code below not only made the strange chars go away,
but it also fixed the xml-parsing errors ... Maybe it's useful to
someone else too; use at your own risk, though.
Anton
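# kill_gremlins is presumably the recipe from the page linked above,
# saved locally as gremlins.py
from gremlins import kill_gremlins
from zipfile import ZipFile, ZIP_DEFLATED

def repair(infn, outfn):
    # Copy every member of the archive, cleaning only the document body.
    zin = ZipFile(infn, 'r', ZIP_DEFLATED)
    zout = ZipFile(outfn, 'w', ZIP_DEFLATED)
    for x in zin.namelist():
        data = zin.read(x)
        # content.xml holds the document text (see the question)
        if x == 'content.xml':
            zout.writestr(x, kill_gremlins(data).encode('cp1252'))
        else:
            zout.writestr(x, data)
    zout.close()
    zin.close()

def test():
    infn = "xxxx.sxw"
    outfn = 'dg.sxw'
    repair(infn, outfn)

if __name__ == '__main__':
    test()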
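The `gremlins` module imported above is not shown in this thread. What follows is a minimal Python 3 sketch of what such a function might do, assuming the goal is to replace C1 control characters (U+0080 to U+009F), which appear when cp1252 bytes have been decoded as Latin-1, with the characters cp1252 assigns to those byte values. The name `kill_gremlins_sketch` is hypothetical, and this is not the code from the linked page.

```python
import re

def kill_gremlins_sketch(text):
    """Map stray C1 control characters in a str to their cp1252 meaning."""
    def fixup(match):
        ch = match.group(0)
        try:
            # U+0093 -> byte 0x93 -> cp1252 -> U+201C, and so on
            return ch.encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values (e.g. 0x81) are undefined in cp1252;
            # leave those characters unchanged.
            return ch
    return re.sub(r"[\x80-\x9f]", fixup, text)
```

Characters outside the C1 range pass through untouched, so running the function on already clean text should be harmless.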
Not sure I understand the question. If you process data in cp1252,
then \x93 and \x94 are legal characters, and the Python codec should
support them just fine.
Regards,
Martin
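A small Python 3 check of this point, offered as an illustration only (the sample bytes are made up): in cp1252 the bytes 0x93 and 0x94 are the curly double quotes, whereas Latin-1 maps the same bytes to control characters, which may be where the odd \93 and \94 values come from.

```python
raw = b"\x93quoted\x94"  # made-up sample bytes

# cp1252 decodes 0x93/0x94 to the curly quotes U+201C and U+201D
as_cp1252 = raw.decode("cp1252")

# Latin-1 decodes the same bytes to the control characters U+0093/U+0094
as_latin1 = raw.decode("latin-1")

print(ascii(as_cp1252))
print(ascii(as_latin1))
```

Encoding `as_cp1252` back with `"cp1252"` reproduces the original bytes, so the round trip is lossless for these characters.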