Python Notes - Regular Expressions

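A regular expression (regex for short) is a powerful tool for matching patterns in strings. In Python, regular expressions are provided by the re module. This article walks through regular expressions in Python, from basic concepts to more advanced usage.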

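1. What Is a Regular Expression?

A regular expression is a way of describing a pattern in a string. It can be used to match, find, and replace specific patterns in text: you define a set of rules, then search the text for content that satisfies them. This is useful in text processing, data extraction, and string matching.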

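2. Basic Concepts

Before looking at specific usage, here are a few basic concepts:

Pattern: the core of a regular expression, made up of ordinary characters and special symbols that together describe a rule for strings.
Match: whether a string (or part of it) conforms to the pattern.
Group: a sub-pattern defined with parentheses (), which makes it easy to extract a substring.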

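3. Common Symbols

Here are some commonly used regular expression symbols:

.: Matches any character except a newline (unless the re.DOTALL flag is set).
^: Matches the start of the string.
$: Matches the end of the string, or the position just before a trailing newline.
*: Matches the preceding element zero or more times.
+: Matches the preceding element one or more times.
?: Matches the preceding element zero or one time.
{n}: Matches the preceding element exactly n times.
{n,m}: Matches the preceding element between n and m times, inclusive.
[]: Matches any single character listed inside the brackets.
|: Matches either the expression on its left or the one on its right.
\d: Matches any digit. For str patterns this includes all Unicode decimal digits; it is equivalent to [0-9] only when the re.ASCII flag is set.
\D: Matches any non-digit character.
\w: Matches any word character: letters, digits, and the underscore.
\W: Matches any character that is not a word character.
\s: Matches any whitespace character (spaces, tabs, newlines, and so on).
\S: Matches any non-whitespace character.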
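
As a rough illustration of a few of the symbols above, the sketch below runs some small patterns through re.findall and re.search, two functions from Python's re module. The sample strings and variable names are made up for this example.

```python
import re

# All sample strings below are made up for illustration.
three_digits = re.findall(r'\d{3}', 'a 123 b 456')      # {n}: exactly three digits
spellings = re.findall(r'colou?r', 'color and colour')  # ?: the 'u' is optional
vowels = re.findall(r'[aeiou]', 'regex')                # []: any one listed character
pets = re.findall(r'cat|dog', 'cat, dog, bird')         # |: either alternative
two_words = re.search(r'^\w+\s\w+$', 'hello world')     # ^ and $ anchor both ends

print(three_digits)     # ['123', '456']
print(spellings)        # ['color', 'colour']
print(vowels)           # ['e', 'e']
print(pets)             # ['cat', 'dog']
print(bool(two_words))  # True
```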

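4. Regular Expressions in Python

In Python, regular expression operations are available through the re module. Here are some commonly used functions:

Importing the `re` module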

import re

`re.match()`

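re.match tries to match the regular expression only at the start of the string. It returns a match object, or None if the pattern does not match there.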

import re

pattern = r'hello'
text = 'hello world'
match = re.match(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match")

`re.search()`

re.search scans the whole string and returns the first successful match, or None if there is none.

import re

pattern = r'world'
text = 'hello world'
match = re.search(pattern, text)

if match:
    print("Match found:", match.group())
else:
    print("No match")
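
To make the difference between re.match and re.search concrete, here is a minimal sketch that applies the same pattern with both functions. The variable names at_start and anywhere are chosen just for this illustration.

```python
import re

text = 'hello world'

# re.match only tries at position 0, so 'world' is not found there.
at_start = re.match(r'world', text)
# re.search scans forward and finds 'world' further into the string.
anywhere = re.search(r'world', text)

print(at_start)         # None
print(anywhere.span())  # (6, 11)
```

Because both functions return None on failure, it is safer to test the result before calling group() on it.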

`re.findall()`

re.findall returns all non-overlapping matches in the string as a list.

import re

pattern = r'\d+'
text = 'There are 123 apples and 456 oranges.'
matches = re.findall(pattern, text)

print("Matches found:", matches)
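
One detail worth knowing: when the pattern contains more than one capturing group, re.findall returns a list of tuples of the groups rather than the full matches. A small sketch, using a made-up key=value string:

```python
import re

# Hypothetical sample text for illustration.
text = 'width=640, height=480'

# Two capturing groups, so each result is a (key, value) tuple.
pairs = re.findall(r'(\w+)=(\d+)', text)

print(pairs)  # [('width', '640'), ('height', '480')]

# The tuples can be turned into a dictionary directly.
settings = dict(pairs)
print(settings['height'])  # 480
```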

`re.sub()`

re.sub replaces every match in the string with a replacement and returns the new string.

import re

pattern = r'apples'
replacement = 'bananas'
text = 'I like apples'
new_text = re.sub(pattern, replacement, text)

print("Replaced text:", new_text)
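
The replacement string can also refer back to captured groups with \1, \2, and so on, and the optional count argument limits how many matches are replaced. The following sketch assumes a made-up string containing ISO-style dates:

```python
import re

# Hypothetical sample text for illustration.
text = '2024-01-15 and 2023-12-31'

# \1, \2 and \3 refer to the year, month and day groups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)
print(swapped)     # 15/01/2024 and 31/12/2023

# count=1 replaces only the first match.
first_only = re.sub(r'\d{4}', 'YYYY', text, count=1)
print(first_only)  # YYYY-01-15 and 2023-12-31
```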

`re.split()`

re.split splits the string at each match of the pattern and returns the pieces as a list.

import re

pattern = r'\s+'
text = 'Split this sentence into words.'
words = re.split(pattern, text)

print("Words:", words)
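
Unlike str.split, re.split can treat several different delimiters as one, and its maxsplit argument caps the number of splits. A brief sketch with invented input strings:

```python
import re

# Split on a comma or semicolon, followed by optional whitespace.
fields = re.split(r'[,;]\s*', 'a, b;c,  d')
print(fields)  # ['a', 'b', 'c', 'd']

# maxsplit=2 performs at most two splits; the rest stays in one piece.
head = re.split(r'\s+', 'one two three four', maxsplit=2)
print(head)    # ['one', 'two', 'three four']
```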

Using Groups

Grouping is one of the most useful features of regular expressions: it lets you extract substrings from a match.

import re

pattern = r'(\d+)-(\d+)-(\d+)'
text = 'My phone number is 123-456-7890'
match = re.search(pattern, text)

if match:
    print("Full match:", match.group(0))
    print("Area code:", match.group(1))
    print("Prefix:", match.group(2))
    print("Line number:", match.group(3))
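
When all groups are needed at once, match.groups() returns them as a tuple, which can be unpacked directly. A short sketch reusing the same pattern and text:

```python
import re

pattern = r'(\d+)-(\d+)-(\d+)'
text = 'My phone number is 123-456-7890'
match = re.search(pattern, text)

if match:
    # groups() returns every captured group, in order, as a tuple.
    area, prefix, line = match.groups()
    print(area, prefix, line)  # 123 456 7890
```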

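5. Advanced Usage

Non-Greedy Matching

By default, quantifiers are greedy and match as many characters as possible. Appending ? to a quantifier (for example *? or +?) makes it non-greedy, so it matches as few characters as possible.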

import re

# Two quoted phrases, so the greedy and non-greedy results differ.
text = 'He said: "Hello" and then "world!"'
pattern_greedy = r'".*"'
pattern_nongreedy = r'".*?"'

match_greedy = re.search(pattern_greedy, text)
match_nongreedy = re.search(pattern_nongreedy, text)

print("Greedy match:", match_greedy.group())
print("Non-greedy match:", match_nongreedy.group())
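
The difference is easiest to see when the text contains several delimited pieces. The sketch below uses a made-up snippet of tag-like text; it only illustrates quantifier behavior and is not a suggestion to parse HTML with regular expressions.

```python
import re

# Hypothetical sample text for illustration.
text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.search(r'<.+>', text).group()
print(greedy)  # <b>bold</b> and <i>italic</i>

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r'<.+?>', text)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```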

Named Groups

Named groups, written (?P<name>...), let you extract substrings by name instead of by position.

import re

pattern = r'(?P<area>\d+)-(?P<prefix>\d+)-(?P<line>\d+)'
text = 'My phone number is 123-456-7890'
match = re.search(pattern, text)

if match:
    print("Area code:", match.group('area'))
    print("Prefix:", match.group('prefix'))
    print("Line number:", match.group('line'))
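
Named groups can also be collected in one step: match.groupdict() returns a dictionary mapping each group name to the text it captured. A minimal sketch with the same pattern and text:

```python
import re

pattern = r'(?P<area>\d+)-(?P<prefix>\d+)-(?P<line>\d+)'
text = 'My phone number is 123-456-7890'
match = re.search(pattern, text)

if match:
    # groupdict() maps group names to their captured substrings.
    parts = match.groupdict()
    print(parts)  # {'area': '123', 'prefix': '456', 'line': '7890'}
```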

6. Practical Examples

Validating Email Addresses

import re

def is_valid_email(email):
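    # Simplified pattern: it does not cover every valid address.
    pattern = r'[\w.-]+@[\w.-]+\.\w+'
    # fullmatch requires the whole string to match, so a trailing newline is rejected.
    return re.fullmatch(pattern, email) is not None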

email = 'test@example.com'
print("Is valid email:", is_valid_email(email))
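
One subtlety with this kind of check: $ also matches just before a trailing newline, so a pattern anchored with ^ and $ and applied with re.match accepts an address followed by '\n'. The sketch below compares that with re.fullmatch on a few made-up inputs; the helper names check_with_match and check_with_fullmatch are hypothetical.

```python
import re

ANCHORED = r'^[\w\.-]+@[\w\.-]+\.\w+$'
UNANCHORED = r'[\w\.-]+@[\w\.-]+\.\w+'

def check_with_match(email):
    # $ may match just before a final newline.
    return re.match(ANCHORED, email) is not None

def check_with_fullmatch(email):
    # fullmatch must consume the entire string.
    return re.fullmatch(UNANCHORED, email) is not None

# Made-up sample inputs for illustration.
for sample in ['test@example.com', 'user@localhost', 'test@example.com\n']:
    print(repr(sample), check_with_match(sample), check_with_fullmatch(sample))
```

Only the last sample is treated differently: re.match with $ accepts it, while re.fullmatch rejects it.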

Extracting the Domain from a URL

import re

def extract_domain(url):
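    # Skip an optional "www." prefix, then capture the rest of the host name
    # (hyphens and additional dot-separated labels are allowed).
    pattern = r'https?://(?:www\.)?([\w-]+(?:\.[\w-]+)+)'
    match = re.search(pattern, url)
    if match:
        return match.group(1)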
    return None

url = 'https://www.example.com/path/to/page'
print("Domain:", extract_domain(url))

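7. Conclusion

Regular expressions are a powerful tool that can greatly simplify string processing tasks, and Python's re module provides a rich set of functions for working with them. This article covered the basic regular expression syntax and the most common operations; hopefully it helps you handle strings more efficiently in everyday programming.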

MerlinTheMagic