This article covers how to convert UTF-8 fancy (curly) quotes to neutral ASCII quotes.

Problem description

I'm writing a little Python script that parses Word docs and writes to a CSV file. However, some of the docs contain non-ASCII characters that my script can't process correctly.

Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with neutral ASCII quotes, so I can just write line.encode('ascii') to the csv file?

I have tried to find the left quote and replace it:

val = line.find(u'\u201c')
if val >= 0: line[val] = '"'

But to no avail:

TypeError: 'unicode' object does not support item assignment

Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?

Thank you

Solution

You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.

from unidecode import unidecode
line = unidecode(line)

This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
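On the TypeError in the question: Python strings are immutable, so `line[val] = '"'` can never work; any replacement has to build a new string. If pulling in Unidecode is not an option, a minimal standard-library sketch could look like the following. It assumes Python 3, and both the mapping table and the `neutralize_quotes` helper are hypothetical names made up for this illustration, not part of the original answer. It only covers the characters listed in the table, whereas Unidecode covers far more.

```python
# Hypothetical helper, not from the original answer: maps a few common
# "smart" punctuation characters to ASCII using only the standard library.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
})


def neutralize_quotes(text):
    """Return a new string with the mapped characters replaced by ASCII."""
    # str.translate builds and returns a new string; the input is untouched.
    return text.translate(SMART_PUNCTUATION)


line = "\u201cHello\u201d \u2014 it\u2019s fine"
print(neutralize_quotes(line))  # "Hello" -- it's fine
```

Any non-ASCII character that is not in the table passes through unchanged, so a later `line.encode('ascii')` can still fail on accented letters.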

Edit: a comment points out that if your language isn't English, you may find ASCII too restrictive. Here's an adaptation of the above code that uses a whitelist to indicate characters that shouldn't be converted.

>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"
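The question also asks whether the CSV could simply be written as UTF-8 instead. That depends on whether the consuming application accepts UTF-8, which is not known here. If it does, a rough Python 3 sketch using only the standard `csv` module might look like this; the `write_utf8_csv` helper and the `quotes.csv` file name are assumptions made up for the example.

```python
import csv


def write_utf8_csv(path, rows):
    """Write rows to a CSV file encoded as UTF-8 (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself,
    # as recommended in the csv module documentation.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Curly quotes and accented letters are kept as-is in the output file.
write_utf8_csv("quotes.csv", [["id", "text"], [1, "\u201cR\u00e9sum\u00e9\u201d"]])
```

With this approach no characters are altered, so the choice between it and the conversions above comes down to what the reader of the CSV can handle.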


07-31 09:33