Link: http://fortune.com/worlds-most-admired-companies/2016/

I want every "href" inside the div that has a known class name.
I can't get past this:

import bs4 as bs
import urllib.request

raw = urllib.request.urlopen('http://fortune.com/worlds-most-admired-companies/2016/')
soup = bs.BeautifulSoup(raw, 'lxml')

listdiv = soup.find('div', class_="company-franchise-result-content current")

for url in listdiv.find_all('a'):
    print(url.get('href'))


Before that I used:

for a in soup.find_all('a'):
    print(a.get('href'))


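It works, but it only returns 10 items, from Apple to General Electric. That is still the case when I use the link I get after clicking the "View full list" button.
I have zero idea how JSON works, but it looks like this is heading in that direction.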

Best answer

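The full data is actually in the HTML. It just sits inside a JavaScript object within a script tag. You can locate that script tag, get its text, extract the JSON string, load it into a Python data structure with json.loads(), and read the data you need from there: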

In [1]: from bs4 import BeautifulSoup

In [2]: import json

In [3]: import re

In [4]: import requests

In [5]: url = "http://fortune.com/worlds-most-admired-companies/2016/"

In [6]: response = requests.get(url)

In [7]: soup = BeautifulSoup(response.content, "lxml")

In [8]: pattern = re.compile(r"var fortune_wp_vars = ({.*?});", re.DOTALL | re.MULTILINE)

In [9]: script = soup.find("script", text=pattern)

In [10]: data = json.loads(pattern.search(script.get_text()).group(1))

In [11]: companies = data["bootstrap"]["franchise"]["filtered_sorted_data"]

In [12]: for company in companies:
    ...:     print(company["title"])
    ...:
Apple
Alphabet
...
Yum Brands
ZF Friedrichshafen
Zurich Insurance Group
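
The extraction step above can also be tried offline. The following is a minimal sketch, not the answer's exact method: it assumes the embedded object is valid JSON, uses a made-up SAMPLE_HTML string in place of the real page, and introduces a hypothetical helper extract_js_object. Instead of a non-greedy regex that stops at the first "});", it finds the assignment and lets json.JSONDecoder.raw_decode read exactly one JSON value, so a "});" inside a string does not cut the match short. The "note" field is invented for the demonstration.

```python
import json
import re

# Hypothetical stand-in for the downloaded page; the real page is much larger
# and the "note" field is invented for this sketch.
SAMPLE_HTML = """
<html><body>
<script>
var fortune_wp_vars = {"bootstrap": {"franchise": {"filtered_sorted_data": [
    {"title": "Apple", "note": "text with }); inside"},
    {"title": "Alphabet", "note": ""}
]}}};
</script>
</body></html>
"""


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var <var_name> = ...` in html."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing semicolon).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


data = extract_js_object(SAMPLE_HTML, "fortune_wp_vars")
companies = data["bootstrap"]["franchise"]["filtered_sorted_data"]
titles = [company["title"] for company in companies]
print(titles)

# List the available fields before relying on any of them.
print(sorted(companies[0].keys()))
```

Printing the keys of one entry is a quick way to see which field, if any, holds each company's link on the real page, rather than guessing a key name.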

Source: the Stack Overflow question "python - At most 10 items because of the 'View full list' button": https://stackoverflow.com/questions/42905397/
