Python 3 regular expressions with diacritics and ligatures

Question

Names in the form "Ceasar, Julius" are to be split into first name Julius and surname Ceasar.

Names may contain diacritics (á, à, é, ...) and ligatures (æ, ø).

This code seems to work OK in Python 3.3:

import re

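def doesmatch(pat, name):
    # re.search returns None when the pattern does not match
    yup = re.search(pat, name)
    if yup is None:
        print('no match for {0}'.format(name))
    else:
        # group(1) is the surname (before the comma), group(2) the first name
        print('Firstname {0} lastname {1}'.format(yup.group(2), yup.group(1)))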

s = 'Révèrberë, Harry'
t = 'Åapö, Renée'
u = 'C3po, Robby'
v = 'Mærsk, Efraïm'
w = 'MacDønald, Ron'
x = 'Sträßle, Mpopo'

pat = r'^([^\d\s]+), ([^\d\s]+)'
# [^\d\s]+ matches any run of characters that are neither digits nor whitespace:
# letters with diacritics and ligatures, but also punctuation

for i in s, t, u, v, w, x:
    doesmatch(pat, i)

All except u match (names containing digits do not match), but I wonder if there isn't a better way than the non-digit, non-space approach.

More important though: I'd like to refine the pattern so that it distinguishes capitals from lowercase letters, including capital letters with diacritics and capital ligatures, preferably using regex as well. That is, as if ([A-Z][a-z]+) would also match accented and combined characters.

Is this possible?

(What I've looked at so far: Dive Into Python 3 on UTF-8 vs Unicode; this regex tutorial on Unicode (which I'm not using); I think I don't need new regex, but I admit I haven't read all its documentation.)

Recommended answer

If you want to distinguish uppercase and lowercase letters using the standard library's re module, then I'm afraid you'll have to build a character class of all the relevant Unicode code points manually.
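One possible way to build such character classes is to generate them from the standard library's unicodedata module rather than typing them by hand. The sketch below is only an illustration under a few assumptions that are not part of the answer: "uppercase" is taken to mean the Unicode general category Lu, "lowercase" the category Ll, a name is one capital followed by one or more lowercase letters, and the helper names letters_in_category and split_cased_name are made up for this example. Input is normalized to NFC first so that a base letter followed by a combining accent is treated as a single precomposed letter where one exists.

```python
import re
import sys
import unicodedata

def letters_in_category(category):
    """Return a string of every code point in the given Unicode general category."""
    return ''.join(
        chr(cp) for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    )

# Assumption: Lu = "uppercase letter", Ll = "lowercase letter".
UPPER = '[' + re.escape(letters_in_category('Lu')) + ']'
LOWER = '[' + re.escape(letters_in_category('Ll')) + ']'

# One capital followed by lowercase letters, for both surname and first name.
CASED_NAME = re.compile('^(' + UPPER + LOWER + '+), (' + UPPER + LOWER + '+)$')

def split_cased_name(name):
    """Return (first_name, surname), or None if the name does not fit the pattern."""
    match = CASED_NAME.search(unicodedata.normalize('NFC', name))
    if match is None:
        return None
    return match.group(2), match.group(1)

for name in ('Åapö, Renée', 'Sträßle, Mpopo', 'C3po, Robby', 'MacDønald, Ron'):
    print(name, '->', split_cased_name(name))
```

Under these assumptions a name with an inner capital, such as MacDønald, is rejected, because only the first letter may be uppercase. The pattern would need to be relaxed if such names should be accepted.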

If you don't really need to do this, use

[^\W\d_]

to match any Unicode letter. This character class matches anything that is "not a non-alphanumeric character" (which is the same as "an alphanumeric character") and that is also neither a digit nor an underscore.
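As a rough sketch of how this character class could replace the non-digit, non-space pattern from the question, the example below plugs [^\W\d_] into the same "Surname, Firstname" shape. The function name split_name and the trailing $ anchor are assumptions added for illustration. It relies on Python 3 treating str patterns as Unicode-aware by default.

```python
import re

# [^\W\d_] matches a single Unicode letter: a word character
# that is neither a digit nor an underscore.
LETTERS = r'[^\W\d_]+'
NAME_PAT = re.compile(r'^({0}), ({0})$'.format(LETTERS))

def split_name(name):
    """Return (first_name, surname), or None if the name does not match."""
    match = NAME_PAT.search(name)
    if match is None:
        return None
    return match.group(2), match.group(1)

for name in ('Révèrberë, Harry', 'MacDønald, Ron', 'C3po, Robby'):
    print(name, '->', split_name(name))
```

Unlike [^\d\s], this class also rejects punctuation inside a name, but it does not distinguish uppercase from lowercase letters.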
