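I have written a script in Python using BeautifulSoup to scrape job postings from a website (I have permission).
Problem
The scraper works well, but it returns the same title for different postings, even though the titles differ on the postings themselves.
Code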
import requests
from bs4 import BeautifulSoup

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
links = soup.select("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res, 'html.parser')
        title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
        overview = soup.select_one("p.page-intro__longDescription").get_text()
        details = soup.select_one("div.rte").get_text()
        print(title, link, details)
Result
For some reason, every posting is given the same title, while everything else in the posting is unique (URL, body copy, etc.).
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-to-improve-value-creation-and-finance-functions/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-with-unique-competences-within-hr-excellence/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-supply-chain-management/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-leadership-development-or-change-management/"
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-to-help-our-customers-succeed-with-it/"
Expected result
The output should look something like the following, with a unique title for each posting:
TITLE: Management consultants to improve value creation and finance functions\r\n LINK href="/career/management-consultants-within-leadership-development-or-change-management/"
TITLE: Management Consultants to help our customers succeed with IT functions\r\n LINK href="/career/management-consultants-to-help-our-customers-succeed-with-it/"
Edit
I tried the following code, but I still see the same title for many of the postings:
import requests
from bs4 import BeautifulSoup

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
for link in soup.select("a"):
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res, 'html.parser')
        try:
            title = soup.select_one("h1.page-intro__title").get_text().strip()
        except:
            title = ''
        print(title)
Best answer
Try this; hopefully it solves the problem:
title = soup.select_one("h1.page-intro__title").get_text() if soup.select_one("h1.section__title") else ""
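One thing to note about the line above: it tests for `h1.section__title` but reads `h1.page-intro__title`. If the first element is missing the title silently becomes empty, and if only the second is missing the call raises `AttributeError`. A minimal sketch of a version that checks and reads the same selector follows. It assumes the title lives in `h1.page-intro__title`, as in the question; the helper name `extract_title` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_title(html, selector="h1.page-intro__title"):
    """Return the stripped text of the first element matching selector, or ""."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    # Check the same node we read from, so a missing element never raises.
    return node.get_text(strip=True) if node else ""


# Hypothetical markup, only for illustration (not copied from the real site).
sample = (
    '<h1 class="page-intro__title"> Management consultants '
    'within supply chain management\r\n</h1>'
)
print(extract_title(sample))
print(repr(extract_title("<h1>no matching class</h1>")))
```

Passing `strip=True` to `get_text()` also removes the trailing `\r\n` that shows up in the output listed in the question.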
Alternatively, you can do it like this:
import requests
from bs4 import BeautifulSoup

base = "http://implementconsultinggroup.com"
url = "http://implementconsultinggroup.com/career/#/1143"
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
for link in soup.select("a"):
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        res = requests.get(base + link.get("href")).text
        soup = BeautifulSoup(res, 'html.parser')
        try:
            title = soup.select_one("h1.page-intro__title").get_text().strip()
        except AttributeError:
            # select_one() returned None: the page has no such heading
            title = ''
        print(title)
The result is as follows:
Management consultants to improve value creation and finance functions
Management consultants with unique competences within Organisation & HR
Management consultants within supply chain management
Management consultants within leadership development or change management
Management consultants to help our customers succeed with IT
Management consultants within process improvement
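The link-collection step of the loop can also be pulled into its own function so it can be checked without network access. The sketch below is one way to do that, not the answerer's method. It assumes the listing page exposes plain `<a href>` links as in the question; the helper name `collect_job_links` and the sample markup are hypothetical. It selects only anchors that have an `href` (otherwise `link.get("href")` returns `None` and the `in` test raises `TypeError`), keeps the listing in its own variable instead of rebinding `soup`, and builds absolute URLs with `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://implementconsultinggroup.com"


def collect_job_links(listing_html, city="COPENHAGEN", base=BASE):
    """Return absolute URLs of career links whose text mentions the city."""
    listing = BeautifulSoup(listing_html, "html.parser")
    urls = []
    for anchor in listing.select("a[href]"):  # skip anchors without href
        href = anchor["href"]
        if "career" in href and city in anchor.get_text():
            full = urljoin(base, href)
            if full not in urls:  # drop duplicates, keep first-seen order
                urls.append(full)
    return urls


# Hypothetical listing markup, only for illustration.
sample = """
<a name="top">COPENHAGEN</a>
<a class="box-link" href="/career/management-consultants-within-supply-chain-management/"><h2>Title</h2> COPENHAGEN</a>
<a class="box-link" href="/career/other-role/">AARHUS</a>
<a href="/about/">COPENHAGEN office</a>
"""
for job_url in collect_job_links(sample):
    print(job_url)
```

Each returned URL can then be fetched with `requests.get(job_url).text` and parsed separately for its title.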
Updated result
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-to-improve-value-creation-and-finance-functions/">\n<h2
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-with-unique-competences-within-hr-excellence/">\n<h2
(u'Management consultants to improve value creation and finance functions', <a class="box-link" href="/career/management-consultants-within-supply-chain-management/">\n
Source: a similar question on Stack Overflow about getting the same job title for multiple postings when scraping a company website with BeautifulSoup in Python: https://stackoverflow.com/questions/46013405/