Problem description
I'm trying to scrape Twitter with Python 3.x, but I only ever collect the last 20 tweets returned for my request. I would like to collect all the data for a query from 2006 to now. To do this I think I need to create two more functions: one that collects the older tweets and one that collects the current tweets. How can I collect the data from this infinite-scroll page? I think I have to use the tweets' ids, but no matter what request I make, it's always the same last 20 tweets that I get.
from pprint import pprint
import datetime as dt
import requests
from bs4 import BeautifulSoup  # Python 3 package is bs4 (the old "BeautifulSoup" module is Python 2 only)

def search_twitter(search):
    url = "https://twitter.com/search?f=tweets&vertical=default&q=" + search + "&src=typd&lang=fr"
    request = requests.get(url)
    sourceCode = BeautifulSoup(request.content, "lxml")  # lxml parser; no separate lxml import needed
    tweets = sourceCode.find_all('li', 'js-stream-item')
    return tweets
def filter_tweets(tweets):
    data = []
    for tweet in tweets:
        if tweet.find('p', 'tweet-text'):
            dtwee = [
                ['id', tweet['data-item-id']],
                ['username', tweet.find('span', 'username').text],
                ['time', tweet.find('a', 'tweet-timestamp')['title']],
                ['tweet', tweet.find('p', 'tweet-text').text.encode('utf-8')]]
            data.append(dtwee)
            # tweet_time = dt.datetime.strptime(tweet_time, '%H:%M - %d %B %Y')
        else:
            continue
    return data
def firstlastId_tweets(tweets):
    firstID = ""
    lastID = ""
    i = 0
    for tweet in tweets:
        if i == 0:
            firstID = tweet[0][1]
        elif i == (len(tweets) - 1):
            lastID = tweet[0][1]
        i += 1
    return firstID, lastID
def last_tweets(search, lastID):
    url = "https://twitter.com/search?f=tweets&vertical=default&q=" + search + "&src=typd&lang=fr&max_position=TWEET-" + lastID
    request = requests.get(url)
    sourceCode = BeautifulSoup(request.content, "lxml")
    tweets = sourceCode.find_all('li', 'js-stream-item')
    return tweets
tweets = search_twitter("lol")
tweets = filter_tweets(tweets)
pprint(tweets)
firstID, lastID = firstlastId_tweets(tweets)
print(firstID, lastID)
while True:
    # filter first: firstlastId_tweets expects the [key, value] pairs built by filter_tweets
    lastTweets = filter_tweets(last_tweets("lol", lastID))
    pprint(lastTweets)
    firstID, lastID = firstlastId_tweets(lastTweets)
    print(firstID, lastID)
Recommended answer
I found a good solution based on this webpage:
http://ataspinar.com/2015/11/09/collecting-data-from-twitter/
What I did was create a variable called max_pos where I stored this string:
'&max_position=TWEET-'+last_id+'-'+first_id
I stored first_id (the id of the tweet at position 1) and last_id (the id of the tweet at position 20).
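For reference, those two boundary ids can be read straight off the data-item-id attribute of the parsed stream items, the same attribute the question's filter_tweets uses (a minimal sketch based on the question's markup, not an official API):

tweets = search_twitter("lol")
first_id = tweets[0]['data-item-id']   # id of the tweet at position 1
last_id = tweets[-1]['data-item-id']   # id of the tweet at position 20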
So for the request, I used something like this:
request = requests.get(url+max_pos)
starting with max_pos empty.
I see this can be a common issue, so we could post a working solution here. I still don't have the results displayed the way I need them, but I was able to simulate "scroll down till the end" by following the guide from the link.
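Putting the pieces together, here is a minimal sketch of the full pagination loop, assuming the legacy twitter.com HTML search endpoint and its max_position=TWEET-<last_id>-<first_id> parameter still behave as described in the linked guide (Twitter has since changed this interface, so treat it as illustrative rather than a definitive implementation):

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://twitter.com/search?f=tweets&vertical=default&q=lol&src=typd&lang=fr"

def fetch_page(max_pos=""):
    # max_pos is empty for the first page, then '&max_position=TWEET-<last_id>-<first_id>'
    response = requests.get(BASE_URL + max_pos)
    soup = BeautifulSoup(response.content, "lxml")
    return soup.find_all('li', 'js-stream-item')

max_pos = ""        # start with max_pos empty, as described above
first_id = ""       # id of the very first tweet, fixed for the whole crawl
prev_last_id = None
while True:
    tweets = fetch_page(max_pos)
    if not tweets:
        break                              # no more results
    if not first_id:
        first_id = tweets[0]['data-item-id']
    last_id = tweets[-1]['data-item-id']
    if last_id == prev_last_id:
        break                              # pagination stopped advancing
    prev_last_id = last_id
    # process/store this batch of ~20 tweets here, e.g. with filter_tweets(tweets)
    max_pos = '&max_position=TWEET-' + last_id + '-' + first_id

The loop keeps first_id fixed and slides last_id forward on each page, which is what makes the server return the next batch instead of the same 20 tweets every time.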