Problem Description
Evening Folks,
I'm attempting to ask Google a question and pull all the relevant links from its respective search results (i.e. I search "site:wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.)
Here's my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
query = 'Thomas Jefferson'
query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes
soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.
for item in soup.find_all('h3', attrs={'class' : 'r'}):
print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
The goal here is for me to set the query variable, have python query Google, and have Beautiful Soup pull all the "green" links, if you will.
Here is a picture of a Google results page
I only wish to pull the green links, in full. What's weird is that Google's Source Code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just go and pull a href from an h3 tag. I am able to see the h3 hrefs when I Inspect Element, but not when I view source.
Here is a picture of the Inspect Element
My question is: How do I go about pulling the top 5 most relevant green links from Google via BeautifulSoup if I cannot access their Source Code, only Inspect Element?
PS: To give an idea of what I am trying to accomplish, I have found two relatively close Stack Overflow questions like mine:
beautiful soup extract a href from google search
How to collect data of Google Search with beautiful soup using python
I got a different URL than Rob M. when I tried searching with JavaScript disabled -
https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw
To make this work with any query, you should first make sure that your query has no spaces in it (that's why you get a 400: Bad Request). You can do this using urllib.quote_plus():
query = "Thomas Jefferson"
query = urllib.quote_plus(query)
which will urlencode all of the spaces as plus signs - creating a valid URL.
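As a side note, the original `query.replace(" ", "+")` call has no effect on its own: Python strings are immutable, so `replace` returns a new string that the question's code discards. `quote_plus` sidesteps that by being assigned back to `query`. A minimal sketch of the encoding step (written for Python 3, where the function lives in `urllib.parse` rather than `urllib`):

```python
# Python 3: quote_plus moved from urllib to urllib.parse
from urllib.parse import quote_plus

query = "Thomas Jefferson"
encoded = quote_plus(query)  # spaces become plus signs
print(encoded)  # Thomas+Jefferson
```

On Python 2 (as in the original post), `urllib.quote_plus(query)` behaves the same way.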
However, this does not work with urllib (you get a 403: Forbidden). I got it to work by using the python-requests module like this:
import requests
import urllib
from bs4 import BeautifulSoup
query = 'Thomas Jefferson'
query = urllib.quote_plus(query)
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.
links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
Printing links gives:
print links
# [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
# u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
# u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
# u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
# u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
# u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
# u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
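Note that the extracted URLs still carry Google's redirect parameters (`&sa=`, `&ved=`, `&usg=`) and are percent-encoded. A small post-processing sketch (Python 3 syntax) can clean one up, assuming the first `&sa=` marks the start of the tracking suffix; the `%25` prefixes in the second result suggest that link was double-encoded, hence the two `unquote` passes:

```python
from urllib.parse import unquote

# One of the raw links from the output above
raw = ('http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy'
       '&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg')

# Drop everything from the first tracking parameter onward,
# then undo the double percent-encoding.
clean = unquote(unquote(raw.split('&sa=')[0]))
print(clean)
```

Since the question asked for only the top 5 results, a plain slice (`links[:5]`) after the loop covers the "limiter" the comments mention.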