python - Unable to scrape
I am trying to get a list of companies from AngelList (https://angel.co/companies).
I tried this code:

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://angel.co/companies', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div' , {"class"," dc59 frw44 _a _jm"})
print p1

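But this returns an empty list.
I have also gone through similar questions. Some say to update BeautifulSoup, others say to change the parser. Nothing has worked for me.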

Best answer

You can get all the company information as HTML, without needing Selenium, by first fetching the request parameters from https://angel.co/company_filters/search_data:

import requests
from bs4 import BeautifulSoup

# post url: returns the search parameters (ids, page, total, hexdigest) as JSON
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

# get url: returns the company listing HTML wrapped in JSON
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"
with requests.Session() as s:
    params = s.post(js, data={"sort": "signal"}, headers=headers).json()
    # argument order must match the placeholders in u: ids, total, page, hexdigest
    companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]), headers=headers)
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(companies.json()["html"], "html.parser")
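
The query string that the code above assembles by hand with join and format could also be produced with urlencode from the standard library, which removes the need to keep positional arguments in step with the placeholders. The following is only a sketch under assumptions: it targets Python 3, the helper name build_startups_url is made up for illustration, and the example values are invented rather than taken from a real response.

```python
from urllib.parse import urlencode

BASE_URL = "https://angel.co/companies/startups"


def build_startups_url(params, sort="signal"):
    """Build the listing URL from the JSON dict returned by the search_data POST.

    Hypothetical helper: it expects the same keys the answer reads from the
    response (ids, total, page, hexdigest).
    """
    # One ("ids[]", id) pair per company; urlencode escapes "[]" as %5B%5D.
    query = [("ids[]", company_id) for company_id in params["ids"]]
    query += [
        ("total", params["total"]),
        ("page", params["page"]),
        ("sort", sort),
        ("new", "false"),
        ("hexdigest", params["hexdigest"]),
    ]
    return BASE_URL + "?" + urlencode(query)


# Invented values, only to show the shape of the resulting URL.
example = {"ids": [275696, 275832], "total": 2, "page": 1, "hexdigest": "abc123"}
print(build_startups_url(example))
```

With this helper the GET call would become s.get(build_startups_url(params), headers=headers), with the rest of the code unchanged.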

To iterate and simulate clicking "load more", you can pass the page number:
import requests
from bs4 import BeautifulSoup
import time

# post url
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


# get url
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"


def get_next_pages(js, u, start_page=1):
    with requests.Session() as s:
        params = s.post(js, data={"sort": "signal", "page": start_page}, headers=headers).json()
        companies = s.get(
            # argument order must match the placeholders in u: ids, total, page, hexdigest
            u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]),
            headers=headers)
        soup = BeautifulSoup(companies.json()["html"], "html.parser")
        comps = soup.select("div.company.column")
        yield comps
        while True:
            # increment page count from previous.
            page = params["page"] + 1
            params = s.post(js, data={"sort": "signal", "page": page}, headers=headers).json()
            # keep going until we have reached the maximum queries
            if "ids" not in params:
                break
            companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"],
                                       params["hexdigest"]),
                              headers=headers)
            soup = BeautifulSoup(companies.json()["html"], "html.parser")
            comps = soup.select("div.company.column")
            # don't hammer with requests
            time.sleep(.3)
            yield comps

for comps in get_next_pages(js, u):
    print(comps)
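
The stopping rule used above (keep requesting pages until the response no longer contains "ids") can be separated from the HTTP calls so that it is easy to try out offline. This is a rough sketch, not the answer's code: iter_pages, fetch_params, max_pages and fake_fetch are all hypothetical names, the page counter is kept locally instead of being read from params["page"], and the fake response with an "error" key is invented to stand in for whatever the server actually returns at the limit.

```python
def iter_pages(fetch_params, start_page=1, max_pages=None):
    """Yield (page, params) until a response has no "ids" key.

    fetch_params is any callable that takes a page number and returns the
    decoded JSON dict; with requests it would wrap the s.post(...) call.
    max_pages is an optional safety cap added for this sketch.
    """
    page = start_page
    fetched = 0
    while max_pages is None or fetched < max_pages:
        params = fetch_params(page)
        # Same check as the original loop: no "ids" means there is nothing more.
        if "ids" not in params:
            break
        yield page, params
        fetched += 1
        page += 1


# Stand-in for the real POST request: three pages of results, then none.
def fake_fetch(page):
    if page <= 3:
        return {"ids": [page * 10, page * 10 + 1], "page": page}
    return {"error": "limit reached"}  # invented shape, for illustration only


for page, params in iter_pages(fake_fetch):
    print(page, params["ids"])
```

In real use the time.sleep call from the code above would go inside the callable passed as fetch_params, so the delay between requests is preserved.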

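If we look at the network output in the developer tools, we can see the POST data that is sent when we click "load more"; the loop keeps running until we hit the limit.
A snippet of the output from running the code above: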
[<div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies" title="Dunwello"><img alt="Dunwello" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275696-99335faecd2fb01467c98d5032f23cf6-thumb_jpg.jpg?buster=1393099676"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies" title="GroupAhead"><img alt="GroupAhead" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275832-3541a563250008bd3f7f9b4d7fe9c33c-thumb_jpg.jpg?buster=1423077576"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies">GroupAhead</a>
</div>
<div class="pitch">
Dedicated apps for groups
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies" title="Workpop"><img alt="Workpop" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/431492-c1b857e30254da60f3847d5358db5c82-thumb_jpg.jpg?buster=1404420060"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies">Workpop</a>
</div>
<div class="pitch">
When can you start?
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies" title="Late Stage Pre-IPO @ Flight.vc"><img alt="Late Stage Pre-IPO @ Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/446358-3511ab7edb5192dad97cbccf2b67ddd7-thumb_jpg.jpg?buster=1428089778"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies">Late Stage Pre-IPO @ Flight.vc</a>
</div>
<div class="pitch">
Syndicated:  Beepi, Zirx, Boost Media, Rent the Runway, Life 360, Scripted
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies" title="Complex Polygon"><img alt="Complex Polygon" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/450451-4f00fd11b2d54533a5bac3cfa72acb1e-thumb_jpg.jpg?buster=1407937645"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies">Complex Polygon</a>
</div>
<div class="pitch">
Product studio based in San Francisco, California.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies" title="21"><img alt="21" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/457068-2e7b8c417c3a70aab3026f5f0ca3d8e9-thumb_jpg.jpg?buster=1425975133"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies">21</a>
</div>
<div class="pitch">
A bitcoin miner in every device and in every hand.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies" title="Parenthoods"><img alt="Parenthoods" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/460720-25bc7ca7afd4f7bf0fd7842cafa1bdd1-thumb_jpg.jpg?buster=1425426951"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies">Parenthoods</a>
</div>
<div class="pitch">
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies" title="Seed"><img alt="Seed" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/462906-f6b439e20a9d36b9e2d3792da92d160d-thumb_jpg.jpg?buster=1462318689"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies">Seed</a>
</div>
<div class="pitch">
Online Business Banking
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies" title="Zen99"><img alt="Zen99" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/470102-67da791cec4374a1046c53fe99b6f05f-thumb_jpg.jpg?buster=1410560341"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies">Zen99</a>
</div>
<div class="pitch">
Finance and insurance tools for freelancers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies" title="Maven Ventures Growth Labs"><img alt="Maven Ventures Growth Labs" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/488240-d467860829cac8b1a9fbfa2d14e05789-thumb_jpg.jpg?buster=1411577330"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies">Maven Ventures Growth Labs</a>
</div>
<div class="pitch">
Get a option to invest up to $500k in the best Maven grads
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies" title="Skydio"><img alt="Skydio" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/507975-aac9786d6c4cba99be634b7bc1969cf3-thumb_jpg.jpg?buster=1420952326"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies">Skydio</a>
</div>
<div class="pitch">
MIT, Google[x]ers with deep prior experience doing intelligent navigation for drones
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies" title="Fin Tech by Flight.vc"><img alt="Fin Tech by Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/517240-5bc50eb42d1e40a8ad437c6bd164a5a8-thumb_jpg.jpg?buster=1414004533"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies">Fin Tech by Flight.vc</a>
</div>
<div class="pitch">
Investing in Financial Services and Fin-Tech that has proprietary advantages
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies" title="Channel"><img alt="Channel" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/521452-b6bc15ef040fdf37d885aea71ecad3bb-thumb_jpg.jpg?buster=1446676191"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies">Channel</a>
</div>
<div class="pitch">
Watch the world.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies" title="HealthSherpa"><img alt="HealthSherpa" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/443932-63c6bcbbf9ba36a7fa3e532177222c9b-thumb_jpg.jpg?buster=1462374897"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies">HealthSherpa</a>
</div>
<div class="pitch">
Next-generation Healthcare.gov
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies" title="Sidewire"><img alt="Sidewire" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/558206-b416bf8347c7f766b5ea1cf79123c4d2-thumb_jpg.jpg?buster=1444189112"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies">Sidewire</a>
</div>
<div class="pitch">
Where Experts Chat in Public
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies" title="Brainchild &amp;amp; Co."><img alt="Brainchild &amp; Co." class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/570055-cc2c2309fefa21e3ebda6229d6a0b890-thumb_jpg.jpg?buster=1420474118"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies">Brainchild &amp; Co.</a>
</div>
<div class="pitch">
Building services and products for consumers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies" title="Signatures Capital"><img alt="Signatures Capital" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/571060-8a077d7cbac9cc7e2d81859adb8cd1c6-thumb_jpg.jpg?buster=1420664121"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies">Signatures Capital</a>
</div>
<div class="pitch">
Supporting founders committed to inventing the future.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies" title="Airtable"><img alt="Airtable" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/623000-9d210a39051abc7accec1dc686888dcc-thumb_jpg.jpg?buster=1449952044"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies">Airtable</a>
</div>
<div class="pitch">
Organize anything you can imagine
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies" title="Meerkat"><img alt="Meerkat" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/630861-820b9d4af09e110b150c9affe418d860-thumb_jpg.jpg?buster=1425688408"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies">Meerkat</a>
</div>
<div class="pitch">
Live Stream Video.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies" title="Flight Ventures"><img alt="Flight Ventures" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/658877-89ccd88502db9d964a651ecba6f86d9d-thumb_jpg.jpg?buster=1457552637"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies">Flight Ventures</a>
</div>
<div class="pitch">
Investing in the Top Companies and Entrepreneurs
</div>
</div>
</div>
</div>]
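
The markup printed above can be reduced to plain records of name, URL and pitch. The sketch below assumes Python 3 and uses only the standard library's html.parser so that it runs without extra packages; CompanyParser and extract_companies are hypothetical names, and the sample is a trimmed copy of the first entry in the output. It relies on the structure shown above (a "name" div containing the link and a "pitch" div containing the text) and would need adjusting if the markup changes.

```python
from html.parser import HTMLParser


class CompanyParser(HTMLParser):
    """Collect name, url and pitch from each 'company column' block."""

    def __init__(self):
        super().__init__()  # character references such as &amp; are decoded
        self.companies = []
        self._field = None  # "name" or "pitch" while inside that div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and classes == ["company", "column"]:
            self.companies.append({"name": "", "url": "", "pitch": ""})
        elif tag == "div" and classes in (["name"], ["pitch"]):
            self._field = classes[0]
        elif tag == "a" and self._field == "name" and self.companies:
            self.companies[-1]["url"] = attrs.get("href", "")

    def handle_endtag(self, tag):
        # The name and pitch divs contain no nested divs, so the first
        # closing div after entering one of them ends that field.
        if tag == "div":
            self._field = None

    def handle_data(self, data):
        if self._field and self.companies:
            self.companies[-1][self._field] += data


def extract_companies(html):
    parser = CompanyParser()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in c.items()} for c in parser.companies]


sample = """
<div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" href="https://angel.co/dunwello?utm_source=companies" title="Dunwello"></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>
"""

for row in extract_companies(sample):
    print(row)
```

To feed it the elements yielded by get_next_pages, the tags can be turned back into text first, for example extract_companies("".join(str(c) for c in comps)). Since the answer already has BeautifulSoup objects in hand, reading the same fields directly from each element with its select methods would be an equally reasonable choice.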

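There are more filters and so on. If you want to use them, select them in the browser and watch how the requests are made under the XHR tab of the network panel in Firebug or the developer tools; you can then set the same parameters in your own requests.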
