Problem description
I am trying to build a multithreaded crawler that uses Tor proxies. I use the following to establish the Tor connection:
import random
import socket

import socks  # SocksiPy
from stem import Signal
from stem.control import Controller

# BROWSERS (a list of User-Agent strings) is not shown in the post.
controller = Controller.from_port(port=9151)

def connectTor():
    # Monkey patch: every socket created after this call goes through the SOCKS proxy.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
    socket.socket = socks.socksocket

def renew_tor():
    global request_headers
    request_headers = {
        "Accept-Language": "en-US,en;q=0.5",
        "User-Agent": random.choice(BROWSERS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "http://thewebsite2.com",
        "Connection": "close"
    }
    # Ask Tor for a new identity (new circuits) over the control port.
    controller.authenticate()
    controller.signal(Signal.NEWNYM)
This is the URL fetcher:
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    # Retry until a page without a captcha comes back.
    while True:
        try:
            connectTor()
            r = requests.Session()
            response = r.get(url, headers=request_headers)
            the_page = response.content.decode('utf-8', errors='ignore')
            the_soup = BeautifulSoup(the_page, 'html.parser')
            if "captcha" in the_page.lower():
                print("flag condition matched while url: ", url)
                #print(the_page)
                renew_tor()
            else:
                return the_soup
        except Exception as e:
            print("Error while URL :", url, str(e))
I then create a multithreaded fetch job:
from concurrent import futures

with futures.ThreadPoolExecutor(200) as executor:
    for url in zurls:
        future = executor.submit(fetchjob, url)
I then get the following error, which I do not see when I use multiprocessing:
Socket connection failed (Socket error: 0x01: General SOCKS server failure)
I would appreciate any advice on avoiding the SOCKS error and on improving the crawler's performance with multithreading.
Recommended answer
This is a perfect example of why monkey patching socket.socket is bad.
The patch replaces the socket class used by all socket connections in the process (which is nearly everything) with the SOCKS socket.
When you later connect to the controller, the connection attempts to speak the SOCKS protocol instead of being established directly.
Since you're already using requests, I'd suggest getting rid of SocksiPy and the socket.socket = socks.socksocket code and using the SOCKS proxy functionality built into requests (it needs the optional SOCKS extra, installed with pip install requests[socks]):
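# 9050 is the default SOCKS port of a standalone tor daemon;
# Tor Browser listens on 9150, the port used in the question.
# The socks5h scheme lets the proxy resolve hostnames.
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}
response = r.get(url, headers=request_headers, proxies=proxies)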
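With the proxy passed per request, the remaining shared state in a threaded crawler is the identity renewal: with 200 workers, many threads may hit a captcha at once and all try to signal NEWNYM on the same controller. One possible way to handle that is sketched below, assuming a lock plus a minimum interval between renewals. TorIdentity, send_newnym, min_interval, and clock are hypothetical names, not part of the original answer, and the 10-second default is an assumption (Tor rate-limits NEWNYM, so tune it to your setup).

```python
import threading
import time


class TorIdentity:
    """Serialises identity renewal across threads (hypothetical helper)."""

    def __init__(self, send_newnym, min_interval=10.0, clock=time.monotonic):
        # send_newnym is any zero-argument callable, for example
        # lambda: controller.signal(Signal.NEWNYM) with an authenticated stem controller.
        self._send_newnym = send_newnym
        self._min_interval = min_interval  # assumed value, not taken from the post
        self._clock = clock
        self._lock = threading.Lock()
        self._last_renewal = None

    def renew(self):
        """Send NEWNYM unless a renewal happened recently; return True if sent."""
        with self._lock:
            now = self._clock()
            recently = (
                self._last_renewal is not None
                and now - self._last_renewal < self._min_interval
            )
            if recently:
                return False
            self._send_newnym()
            self._last_renewal = now
            return True
```

Each worker would then fetch with r.get(url, headers=request_headers, proxies=proxies) as above and call renew() on a shared TorIdentity instance when a captcha page comes back. The controller connection stays a direct one, since nothing patches socket.socket any more.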