Problem description
I'm trying to write a Python script that uses the requests module to scrape the titles of different jobs from a website. To parse the titles, I first need to get the relevant response from the site so that I can process the content with BeautifulSoup. However, when I run the following script, it produces gibberish that does not contain the titles I'm looking for.
Website link (if you don't see any data, make sure to refresh the page)
I've tried with:
import requests
from bs4 import BeautifulSoup

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'
query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    s.headers.update({"Referer": "https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="})
    res = s.get(link, params=query_string)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        print(item.text)
I even tried it like this:
import urllib.request
from bs4 import BeautifulSoup
from urllib.parse import urlencode

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'
query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Referer": "https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="
}

def get_content(url, params):
    req = urllib.request.Request(f"{url}{params}", headers=headers)
    res = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(res, "lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        yield item.text

if __name__ == '__main__':
    params = urlencode(query_string)
    for item in get_content(link, params):
        print(item)
PS: A browser simulator is not an option for this task.
Recommended answer
I'd like to see what your gibberish looks like. When I ran your code, I got a bunch of Hebrew characters (unsurprising, since the website is in Hebrew) and job titles:
Is your problem that you want to filter out the Hebrew characters? That only takes a simple regex. Import the re package, then replace your print statement with this:
print(re.sub('[^A-Za-z0-9]+',' ',item.text))
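To see what this substitution does without hitting the site, here is a minimal offline sketch. The helper name keep_ascii_alnum and the sample strings are hypothetical, made up for illustration rather than taken from the site's output, and the extra strip() call is my own addition. It also shows why the character class is best spelled A-Za-z: in ASCII the range A-z also covers the punctuation between Z and a ([ \ ] ^ _ `), so those characters would survive the filter.

```python
import re

def keep_ascii_alnum(text):
    """Replace each run of characters other than ASCII letters/digits with one space.

    The trailing strip() is an addition for tidier output; it is not part
    of the one-line print statement above.
    """
    return re.sub(r'[^A-Za-z0-9]+', ' ', text).strip()

# Hypothetical title: a Hebrew word (written with escapes) followed by English text.
sample = '\u05de\u05e4\u05ea\u05d7 Python Developer 3'
print(keep_ascii_alnum(sample))   # Python Developer 3

# The wider range A-z lets '_', '[' and ']' through unchanged.
print(re.sub(r'[^A-z0-9]+', ' ', 'C_[dev]'))   # C_[dev]
print(keep_ascii_alnum('C_[dev]'))             # C dev
```

Note that this drops every non-ASCII character, so titles written only in Hebrew come out as empty strings.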
Hope this helps!