Problem Description
I have made a web scraper in Python to give me information on when free bet offers from various bookie websites have changed or new ones have been added.
However, the bookies tend to record information relating to IP traffic and MAC addresses in order to flag up matched bettors.
How can I spoof my IP address when using the Request() method in the urllib.request module?
My code is as follows:
from urllib.request import Request, urlopen
import bs4

# Fetch the promotions page with a browser-like User-Agent and parse it
req = Request('https://www.888sport.com/online-sports-betting-promotions/', headers={'User-Agent': 'Mozilla/5.0'})
site = urlopen(req).read()
content = bs4.BeautifulSoup(site, 'html.parser')
Recommended Answer
I faced the same problem a while ago. Here is the code snippet I am using in order to scrape anonymously.
from urllib.request import Request, urlopen
from fake_useragent import UserAgent
import random
from bs4 import BeautifulSoup
from IPython.core.display import clear_output  # for use inside a Jupyter notebook

# Here I gather some proxies so as not to get caught while scraping
ua = UserAgent()  # From here we generate a random user agent
proxies = []      # Will contain proxies [ip, port]

# Main function
def main():
    # Retrieve the latest proxies
    proxies_req = Request('https://www.sslproxies.org/')
    proxies_req.add_header('User-Agent', ua.random)
    proxies_doc = urlopen(proxies_req).read().decode('utf8')

    soup = BeautifulSoup(proxies_doc, 'html.parser')
    proxies_table = soup.find(id='proxylisttable')

    # Save proxies in the array
    for row in proxies_table.tbody.find_all('tr'):
        proxies.append({
            'ip': row.find_all('td')[0].string,
            'port': row.find_all('td')[1].string
        })

    # Choose a random proxy
    proxy_index = random_proxy()
    proxy = proxies[proxy_index]

    for n in range(1, 20):
        req = Request('http://icanhazip.com')
        req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'http')

        # Every 10 requests, generate a new proxy
        if n % 10 == 0:
            proxy_index = random_proxy()
            proxy = proxies[proxy_index]

        # Make the call
        try:
            my_ip = urlopen(req).read().decode('utf8')
            print('#' + str(n) + ': ' + my_ip)
            clear_output(wait=True)
        except Exception:  # If error, delete this proxy and find another one
            del proxies[proxy_index]
            print('Proxy ' + proxy['ip'] + ':' + proxy['port'] + ' deleted.')
            proxy_index = random_proxy()
            proxy = proxies[proxy_index]

# Retrieve a random proxy index (we need the index to delete it if not working)
def random_proxy():
    return random.randint(0, len(proxies) - 1)

if __name__ == '__main__':
    main()
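One caveat worth noting (my addition, not part of the original answer): Request.set_proxy() registers the proxy for a single URL scheme, and the loop above registers it for 'http' only, which is fine for icanhazip.com. The bookmaker page in the question is served over HTTPS, though, so a request there needs the proxy registered under the 'https' scheme, and the proxy itself must support HTTPS, which not all free proxies do. A minimal sketch, assuming proxy is one of the entries collected above:

# Sketch: route an HTTPS request through a proxy with urllib.request.
# Assumes `proxy` is one of the {'ip': ..., 'port': ...} dicts built above
# and that this proxy actually supports HTTPS (many free proxies do not).
req = Request('https://www.888sport.com/online-sports-betting-promotions/',
              headers={'User-Agent': ua.random})
req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'https')
html = urlopen(req, timeout=10).read().decode('utf8')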
That script will give you some working proxies. And this part:
user_agent_list = (
#Chrome
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
#Internet Explorer
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
)
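As a side note (my addition): since the snippet above already imports fake_useragent, its ua.random property could supply these strings instead of a hand-maintained tuple; keeping a static list simply avoids depending on fake_useragent's online database at run time.

user_agent = ua.random  # alternative to random.choice(user_agent_list)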
The list above gives you different "headers" to rotate through, so each request pretends to come from a different browser. Last but not least, plug them into your request. Note that this final snippet uses the requests library (its get() function) rather than urllib.request:
from requests import get

# Make a GET request with a random user agent and a random proxy
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent, 'Accept-Language': 'en-US, en;q=0.5'}
proxy = random.choice(proxies)
# requests expects the proxies argument as a {scheme: 'host:port'} mapping
response = get('your url', headers=headers,
               proxies={'http': proxy['ip'] + ':' + proxy['port']})
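Since the question asks specifically about urllib.request rather than requests, the same rotation can also be expressed with Request() directly. A minimal sketch under the same assumptions (user_agent_list and proxies populated as above):

# Equivalent sketch with urllib.request instead of requests
user_agent = random.choice(user_agent_list)
proxy = random.choice(proxies)

req = Request('your url', headers={'User-Agent': user_agent,
                                   'Accept-Language': 'en-US, en;q=0.5'})
req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'http')
response = urlopen(req, timeout=10).read()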
Hope that solves your problem.
Otherwise, have a look here: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/
Cheers