Problem description
I'm trying to make a web crawler that shows the basic idea of page rank. The code looks fine to me, but it gives me back errors, e.g.
`Traceback (most recent call last):
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 89, in <module>
    webpages()
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 17, in webpages
    get_single_item_data(href)
  File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 23, in get_single_item_data
    source_code = requests.get(item_url)
  File "C:\Python34\lib\site-packages\requests\api.py", line 65, in get
    return request('get', url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "C:\Python34\lib\site-packages\requests\sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python34\lib\site-packages\requests\models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "C:\Python34\lib\site-packages\requests\models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL '//www.hm.com/lv/logout': No schema supplied. Perhaps you meant http:////www.hm.com/lv/logout?`
The last line that Python prints before the error when I run it is:
//www.hm.com/lv/logout
Maybe the problem is the two leading slashes (//). In any case, when I try to crawl other web pages, e.g. http://en.wikipedia.org/wiki/Wiki, it gives me back None and the same errors.
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
from operator import itemgetter
all_links = defaultdict(int)
def webpages():
    url = 'http://www.hm.com/lv/'
    source_code = requests.get(url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        print(href)
        get_single_item_data(href)
    return all_links
def get_single_item_data(item_url):
    #if not item_url.startswith('http'):
        #item_url = 'http' + item_url
    source_code = requests.get(item_url)
    text = source_code.text
    soup = BeautifulSoup(text)
    for link in soup.findAll('a'):
        href = link.get('href')
        if href and href.startswith('http://www.'):
            if href:
                all_links[href] += 1
            print(href)
def sort_algorithm(list):
    for index in range(1, len(list)):
        value = list[index]
        i = index - 1
        while i >= 0:
            if value < list[i]:
                list[i+1] = list[i]
                list[i] = value
                i = i - 1
            else:
                break
vieni = ["", "viens", "divi", "tris", "cetri", "pieci",
         "sesi", "septini", "astoni", "devini"]
padsmiti = ["", "vienpadsmit", "divpadsmit", "trispadsmit", "cetrpadsmit",
            "piecpadsmit", "sespadsmit", "septinpadsmit", "astonpadsmit", "devinpadsmit"]
desmiti = ["", "desmit", "divdesmit", "trisdesmit", "cetrdesmit",
           "piecdesmit", "sesdesmit", "septindesmit", "astondesmit", "devindesmit"]
def num_to_words(n):
    words = []
    if n == 0:
        words.append("zero")
    else:
        num_str = "{}".format(n)
        groups = (len(num_str) + 2) // 3
        num_str = num_str.zfill(groups * 3)
        for i in range(0, groups * 3, 3):
            h = int(num_str[i])
            t = int(num_str[i + 1])
            u = int(num_str[i + 2])
            print()
            print(vieni[i])
            g = groups - (i // 3 + 1)
            if h >= 1:
                words.append(vieni[h])
                words.append("hundred")
                if int(num_str) % 100:
                    words.append("and")
            if t > 1:
                words.append(desmiti[t])
                if u >= 1:
                    words.append(vieni[u])
            elif t == 1:
                if u >= 1:
                    words.append(padsmiti[u])
                else:
                    words.append(desmiti[t])
            else:
                if u >= 1:
                    words.append(vieni[u])
    return " ".join(words)
webpages()
for k, v in sorted(webpages().items(), key=itemgetter(1), reverse=True):
    print(k, num_to_words(v))
Recommended answer
The links collected in the loop of the webpages function may start with two slashes. Such a link uses the scheme of the page it appears on. For example, on https://en.wikipedia.org/wiki/Wiki the link "//en.wikipedia.org/login" resolves to "https://en.wikipedia.org/login"; on http://en.wikipedia.org/wiki/Wiki it resolves to "http://en.wikipedia.org/login". requests does not know which page the link came from, so it sees no scheme at all and raises MissingSchema.
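To make the point concrete, here is a minimal sketch that takes the failing link from the traceback apart with the standard library. The helper names describe and with_page_scheme are made up for illustration and are not part of requests or of the original code.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (scheme, host, path) so the missing scheme is easy to see."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


def with_page_scheme(page_url, href):
    """Hypothetical helper: give a '//host/path' link the scheme of its page."""
    if href.startswith('//'):
        return urlsplit(page_url).scheme + ':' + href
    return href


# The link from the traceback has a host and a path, but an empty scheme.
print(describe('//www.hm.com/lv/logout'))

# Borrowing the scheme of the page the link was found on makes it usable.
print(with_page_scheme('http://www.hm.com/lv/', '//www.hm.com/lv/logout'))
```

This only handles the two-slash case; links such as "/lv/help" or "help.html" would still lack a scheme and a host.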
A better way to follow a URL taken from an HTML "a" tag is to use the urljoin function (urlparse.urljoin in Python 2; urllib.parse.urljoin in Python 3, which your traceback shows you are running). It joins the current page URL with the target URL, and works whether the target is an absolute or a relative path.
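As a rough sketch of how that could look, the function below resolves raw href values against the page they came from. The name absolute_links and the choice to keep only http and https links are my own assumptions, not something from your code; it also skips None, which link.get('href') returns for "a" tags that have no href.

```python
from urllib.parse import urljoin, urlsplit


def absolute_links(page_url, hrefs):
    """Resolve raw href values against page_url, keeping only http(s) links.

    absolute_links is a hypothetical helper for illustration.
    """
    resolved = []
    for href in hrefs:
        if not href:
            # <a> tags without an href attribute yield None; skip them.
            continue
        # urljoin handles '//host/path', '/path', 'path' and full URLs alike.
        full = urljoin(page_url, href)
        if urlsplit(full).scheme in ('http', 'https'):
            resolved.append(full)
    return resolved


page = 'http://www.hm.com/lv/'
raw = ['//www.hm.com/lv/logout', None, '/lv/help', 'javascript:void(0)']
for url in absolute_links(page, raw):
    print(url)
```

In your webpages function, this would mean passing urljoin(url, href) to get_single_item_data instead of the raw href, after checking that href is not None.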
Hope this helps.