How to scrape the titles of different jobs from a website using requests?


Problem description

I'm trying to write a Python script using the requests module to scrape the titles of different jobs from a website. To parse the titles, I first need to get the relevant response from the site so that I can process the content with BeautifulSoup. However, when I run the following script, it produces gibberish that does not contain the titles I'm looking for.

Link to the website (make sure to refresh the page if you don't see any data)

I've tried with:

import requests
from bs4 import BeautifulSoup

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    s.headers.update({"Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="})
    res = s.get(link,params=query_string)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        print(item.text)

I even tried it like this:

import urllib.request
from bs4 import BeautifulSoup
from urllib.parse import urlencode

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="
}

def get_content(url,params):
    req = urllib.request.Request(f"{url}{params}",headers=headers)
    res = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(res,"lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        yield item.text

if __name__ == '__main__':
    params = urlencode(query_string)
    for item in get_content(link,params):
        print(item)

PS: A browser simulator is not an option for this task.

Recommended answer

I'd like to see what your gibberish looks like. When I ran your code, I got a bunch of Hebrew characters (unsurprising, since the website is in Hebrew) along with the job titles.

Is your problem that you want to filter out the Hebrew characters? That only requires a simple regex. Import the re package, and then replace your print statement with this:

print(re.sub('[^A-Za-z0-9]+',' ',item.text))
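
To see what this substitution does without hitting the site, here is a minimal offline sketch. The helper name `keep_ascii_alnum` and the sample title are made up for illustration and are not taken from the site. The character class here is written as `A-Za-z0-9`, since the range `A-z` also spans the punctuation characters that sit between `Z` and `a` in ASCII (such as `[`, `]`, `^` and `_`).

```python
import re

# Matches runs of anything other than ASCII letters and digits.
NON_ASCII_ALNUM = re.compile(r'[^A-Za-z0-9]+')


def keep_ascii_alnum(text):
    """Replace each run of non-ASCII-alphanumeric characters with one space.

    Hypothetical helper for illustration; leading and trailing spaces
    left over from the substitution are trimmed.
    """
    return NON_ASCII_ALNUM.sub(' ', text).strip()


# Made-up job title mixing Hebrew and English (an assumption, not site data).
sample = 'מפתח/ת Python בכיר/ה - Full Stack'
print(keep_ascii_alnum(sample))

# With the wider range 'A-z', characters such as '_' and '[' would survive.
print(re.sub('[^A-z0-9]+', ' ', 'a_b[c]'))
print(keep_ascii_alnum('a_b[c]'))
```

Note that this drops the Hebrew text entirely, so it is only useful if the English part of each title is what you are after.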

Hope this helps!
