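I am trying to extract the exterior color, interior color, transmission, and similar fields separately from cars.com.
The HTML looks like this: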

<ul class="listing-row__meta">
 <li>
  <strong>
    Ext. Color:
  </strong>
    Gray
 </li>
 <li>
  <strong>
    Int. Color:
  </strong>
    White
 </li>
 <li>
  <strong>
    Transmission:
  </strong>
    Automatic
 </li>
</ul>

I tried the code below, but it fails with "expected string or bytes-like object". Any suggestions or solutions would be appreciated.
from bs4 import BeautifulSoup
import requests
import re

url ='https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all('div',{'class':'shop-srp-listings__listing-container'})

for each in all_matches:

    info=each.findAll('ul',class_='listing-row__meta')
    pattern=re.compile(r'Ext. Color:')
    matches=pattern.finditer(info)
    for match in matches:
        print(match.text)

Best answer

Maybe this gets closer to what you are trying to extract. My guess is an expression similar to:

(?is)<strong>\s*([^<]*?)\s*<\/strong>

Or,
(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)
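
To make the difference between the two expressions concrete, here is a minimal sketch that runs both against markup shaped like the snippet in the question. The `html` string and the variable names are illustrative assumptions, not part of the original answer.

```python
import re

# Sample markup shaped like the <ul> shown in the question (assumed input).
html = """
<ul class="listing-row__meta">
 <li>
  <strong>
    Ext. Color:
  </strong>
    Gray
 </li>
 <li>
  <strong>
    Int. Color:
  </strong>
    White
 </li>
</ul>
"""

# First expression: the label is captured in group 1, already trimmed.
capture = re.compile(r'(?is)<strong>\s*([^<]*?)\s*<\/strong>')
labels_from_group = capture.findall(html)

# Second expression: no capture group, so findall returns the whole match,
# including the surrounding whitespace, which is stripped afterwards.
lookaround = re.compile(r'(?is)(?<=<strong>)\s*[^<]*?\s*(?=<\/strong>)')
labels_from_lookaround = [m.strip() for m in lookaround.findall(html)]

print(labels_from_group)
print(labels_from_lookaround)
```

Both expressions pick out the labels inside the strong tags; the values that follow each closing tag need an extra capture group, as in the tests further down.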

Of course, you could also use BeautifulSoup's built-in functions to do this.
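
As a rough sketch of that regex-free route, the label and value can be read straight from the parsed tree. The helper name `parse_meta`, the inline `html` sample, and the choice of the `html.parser` backend are assumptions made for illustration; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Sample markup shaped like the <ul> shown in the question (assumed input).
html = """
<ul class="listing-row__meta">
 <li><strong> Ext. Color: </strong> Gray </li>
 <li><strong> Int. Color: </strong> White </li>
 <li><strong> Transmission: </strong> Automatic </li>
</ul>
"""


def parse_meta(ul):
    """Return {label: value} for one listing-row__meta <ul> (hypothetical helper)."""
    details = {}
    for li in ul.find_all('li'):
        strong = li.find('strong')
        if strong is None:
            continue  # skip list items that carry no label
        label = strong.get_text(strip=True).rstrip(':')
        # The value is the text node that directly follows </strong>.
        value = strong.next_sibling
        details[label] = value.strip() if isinstance(value, str) else ''
    return details


soup = BeautifulSoup(html, 'html.parser')
ul = soup.find('ul', class_='listing-row__meta')
print(parse_meta(ul))
```

This avoids converting tags back to strings at all, at the cost of depending on the value being a bare text node right after the label.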
Test 1
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

# One container <div> per listing on the results page.
all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.find_all('ul', class_='listing-row__meta')
    if not info:
        continue  # listing without a meta list
    # re works on strings, so turn the first <ul> tag back into markup.
    # The capture group grabs the text that follows each </strong>.
    matches = re.findall(
        r'(?is)<strong>\s*[^<]*?\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))
    for match in matches:
        print(match)

Output 1
Gray
Beige
Automatic
AWD
Gray
White
Automatic
AWD
Black
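
The error in the question most likely comes from handing the result of `findAll` (a list-like `ResultSet`) to `finditer`, which only accepts a string or bytes; Test 1 avoids it by passing `str(info[0])`. The sketch below reproduces this with a plain list standing in for the `ResultSet`, so the `info` value here is an assumption rather than real page data. Note also that match objects expose `.group()`, not `.text`.

```python
import re

pattern = re.compile(r'Ext\. Color:')

# A plain list stands in for the ResultSet that find_all() returns (assumption).
info = ['<ul class="listing-row__meta"><li><strong>Ext. Color:</strong> Gray</li></ul>']

try:
    list(pattern.finditer(info))  # same mistake as in the question
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Converting one element to str gives the regex something it can scan.
found = [m.group(0) for m in pattern.finditer(str(info[0]))]
print(found)
```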

Test 2
If you like, you could also build a dict:
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.find_all('ul', class_='listing-row__meta')
    if not info:
        continue  # listing without a meta list
    # Two capture groups: the label inside <strong> and the value after it.
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))

    for k, v in matches.items():
        print(f'{k} {v}')

Output 2
Ext. Color: Gray
Int. Color: Beige
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Gray
Int. Color: White
Transmission: Automatic
Drivetrain: AWD
Ext. Color: Black
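
The dict in Test 2 works because `re.findall` returns a list of tuples when the pattern has two capture groups, and `dict()` accepts such pairs directly. A small sketch on a made-up `snippet` string (an assumption, not scraped data):

```python
import re

# Made-up markup with two label/value pairs (assumed input).
snippet = ('<li><strong> Ext. Color: </strong> Gray </li>'
           '<li><strong> Drivetrain: </strong> AWD </li>')

# Group 1 is the label, group 2 is the value after </strong>.
pairs = re.findall(
    r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', snippet)
print(pairs)

# Each (label, value) tuple becomes one key/value entry.
details = dict(pairs)
print(details['Ext. Color:'])
```

If a label were repeated within one listing, `dict()` would keep only the last value for it.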

Test 3
If you would prefer lists:
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})

for each in all_matches:
    info = each.find_all('ul', class_='listing-row__meta')
    if not info:
        continue  # listing without a meta list
    # Each match is a (label, value) tuple.
    matches = re.findall(
        r'(?is)<strong>\s*([^<]*?)\s*<\/strong>\s*([^<]*?)\s*<', str(info[0]))

    for match in matches:
        print(list(match))

Output 3
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Gray']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Black']
['Transmission:', 'Automatic']
['Drivetrain:', 'RWD']
['Ext. Color:', 'White']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'Gray']
['Int. Color:', 'Beige']
['Transmission:', 'Automatic']
['Drivetrain:', 'AWD']
['Ext. Color:', 'White']

Test 4
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})


# Labels to collect across all listings (the regex drops the trailing colon).
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']

outputs = dict()

for each in all_matches:
    info = each.find_all('ul', class_='listing-row__meta')
    if not info:
        continue  # listing without a meta list
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))

    for item in matches.items():
        # First sighting of a label: start its list with this value.
        if item[0] not in outputs:
            outputs[item[0]] = [item[1]]
        # Listed labels are appended on every sighting, including the first.
        if item[0] in keys:
            outputs[item[0]].append(item[1])

print(outputs)
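Output 4
{'Ext. Color': ['Silver', 'Silver', 'White', 'White', 'Black', 'Gray',
'Gray', 'Black', 'Black', 'White', 'Blue', 'Red', 'Silver', 'Gray',
'Black', 'White', 'Black', 'Gray', 'White', 'Black', 'Black'],
'Int. Color': ['Beige', 'Beige', 'Black', 'White', 'Black', 'Gray',
'Beige', 'Black', 'Black', 'Beige', 'Beige', 'Black', 'Black',
'Black', 'Black', 'Black', 'Black', 'White', 'White', 'Black'],
'Transmission': ['Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'AWD', 'AWD',
'RWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD', 'RWD', 'RWD', 'AWD', 'RWD',
'RWD', 'AWD', 'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}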
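
One thing to watch in the Test 4 loop: a listed label's first value passes both `if` checks, so it is stored twice, which fits the repeated first entry of each list in the output above. A possible alternative, sketched here with `collections.defaultdict` on made-up markup, appends each value exactly once. The names `PAIR`, `KEYS`, `collect`, and the `listings` sample are assumptions, and this version drops labels outside `KEYS`, which may or may not match the original intent.

```python
import re
from collections import defaultdict

# Same expression as in Test 4: the label without its colon, then the value.
PAIR = re.compile(r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<')
KEYS = ('Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain')


def collect(listings):
    """Group values by label across several <ul> markup strings (hypothetical helper)."""
    outputs = defaultdict(list)
    for markup in listings:
        for key, value in PAIR.findall(markup):
            if key in KEYS:  # labels outside KEYS are ignored
                outputs[key].append(value)
    return dict(outputs)


# Made-up markup for two listings (assumed input).
listings = [
    '<ul><li><strong>Ext. Color:</strong> Gray </li>'
    '<li><strong>Drivetrain:</strong> AWD </li></ul>',
    '<ul><li><strong>Ext. Color:</strong> Black </li>'
    '<li><strong>Stock #:</strong> 123 </li></ul>',
]
print(collect(listings))
```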
Test 5
from bs4 import BeautifulSoup
import re
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?zc=92617&rd=10&stkTypId=28881&mkId=28263&searchSource=RESEARCH_SHOP_INDEX'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all(
    'div', {'class': 'shop-srp-listings__listing-container'})


# Labels to collect across all listings (the regex drops the trailing colon).
keys = ['Ext. Color', 'Int. Color', 'Transmission', 'Drivetrain']

outputs = dict()

for each in all_matches:
    info = each.find_all('ul', class_='listing-row__meta')
    if not info:
        continue  # listing without a meta list
    matches = dict(re.findall(
        r'(?is)<strong>\s*([^<:]*?)\s*:\s*<\/strong>\s*([^<]*?)\s*<', str(info[0])))

    for item in matches.items():
        if item[0] not in outputs:
            outputs[item[0]] = [item[1]]
        if item[0] in keys:
            outputs[item[0]].append(item[1])


print(outputs)

print('*' * 50)

# Collapse each list to its distinct values (set() does not keep order).
no_duplicate_outputs = dict()
for key, values in outputs.items():
    no_duplicate_outputs[key] = list(set(values))

print(no_duplicate_outputs)

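Output 5
{'Ext. Color': ['Black', 'Black', 'White', 'Black', 'Other', 'Gray',
'White', 'White', 'Gray', 'White', 'Gray', 'Silver', 'Blue', 'Black',
'Silver', 'Silver', 'Black', 'Blue', 'Blue', 'Black', 'White'],
'Int. Color': ['Black', 'Black', 'Beige', 'Beige', 'Black', 'Gray', 'Black',
'Beige', 'Beige', 'White', 'Black', 'Black', 'Gray', 'Black', 'Black',
'Gray', 'Black', 'Black', 'Black', 'White', 'Black'], 'Transmission':
['Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic', 'Automatic', 'Automatic', 'Automatic', 'Automatic',
'Automatic'], 'Drivetrain': ['AWD', 'AWD', 'RWD', 'RWD', 'RWD',
'RWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD', 'AWD',
'RWD', 'AWD', 'AWD', 'AWD', 'AWD']}
**************************************************
{'Ext. Color': ['Silver', 'White', 'Blue', 'Other', 'Black', 'Gray'],
'Int. Color': ['Beige', 'White', 'Black', 'Gray'],
'Transmission': ['Automatic'],
'Drivetrain': ['RWD', 'AWD']}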
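
Because `set()` does not preserve order, the de-duplicated lists can come out in a different order on each run. If first-seen order matters, one option is `dict.fromkeys`, sketched below on a small hand-written `outputs` dict (assumed sample data, not the scraped result).

```python
# Hand-written sample in the same shape as the collected outputs (assumption).
outputs = {
    'Ext. Color': ['Black', 'Black', 'White', 'Black', 'Other', 'Gray'],
    'Drivetrain': ['AWD', 'AWD', 'RWD'],
}

# dict keys are unique and keep insertion order, so this drops repeats
# while preserving the order in which values first appeared.
unique_in_order = {key: list(dict.fromkeys(values))
                   for key, values in outputs.items()}
print(unique_in_order)
```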
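If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also see in this link how it matches against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions: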

09-25 18:07