Problem description
I am doing content analysis on tweets. I'm using tweepy to return tweets that match certain terms and then writing N tweets to a CSV file for analysis. Creating the files and getting the data is not an issue, but I would like to reduce data collection time. Currently I am iterating through a list of terms from a file. Once N is reached (e.g. 500 tweets), the script moves on to the next filter term.
I would like to put all my terms (fewer than 400) into a single variable and collect every tweet that matches any of them. This works too. What I cannot get is a return value from Twitter telling me which term matched in the status.
import datetime
import sys

import tweepy  # StreamListener is the Tweepy 3.x API; it was removed in Tweepy 4.0

MAX_TWEETS = 500  # N: number of tweets to capture per search term


class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, output_file, api=None):
        super(CustomStreamListener, self).__init__(api)
        self.num_tweets = 0
        self.output_file = output_file

    def on_status(self, status):
        # Strip characters that would break the comma-separated output
        cleaned = status.text.replace('\'','').replace('&','').replace('>','').replace(',','').replace("\n",'')
        location = status.user.location or ''  # location may be None
        self.num_tweets = self.num_tweets + 1
        self.output_file.write(topicName + ',' + location + ',' + cleaned + "\n")
        print("capturing tweet number " + str(self.num_tweets) + " for search term: " + topicName)
        # Returning False disconnects the stream once N tweets are captured
        return self.num_tweets < MAX_TWEETS

    def on_error(self, status_code):
        print('Encountered error with status code:', status_code, file=sys.stderr)
        return True  # Don't kill the stream

    def on_timeout(self):
        print('Timeout...', file=sys.stderr)
        return True  # Don't kill the stream


# `auth` is assumed to be a tweepy.OAuthHandler set up elsewhere with your credentials
with open('termList.txt', 'r') as f:
    topics = [line.strip() for line in f]

for topicName in topics:
    stamp = datetime.datetime.now().strftime(topicName + '-%Y-%m-%d-%H%M%S')
    with open(stamp + '.csv', 'w+', encoding='utf-8') as topicFile:
        sapi = tweepy.streaming.Stream(auth, CustomStreamListener(topicFile))
        sapi.filter(track=[topicName])
Specifically, my question is this: how do I find out which term matched when the track parameter has multiple entries? I should add that I am relatively new to Python and tweepy.
Thanks in advance for any advice and assistance!
You could check the tweet text against your tracked terms yourself. Something like:
>>> a = "hello this is a tweet"
>>> terms = ["this"]
>>> matches = []
>>> for i, term in enumerate(terms):
...     if term in a:
...         matches.append(i)
...
>>> matches
[0]
This gives you the indices of all the terms that the specific tweet, a, matched. In this case that is only index 0, the "this" term.
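The same check can be wrapped in a small helper and called from on_status. The following is only a sketch: matched_terms is a hypothetical name, not part of tweepy, and it approximates how the track parameter is documented to match (case-insensitive, with a multi-word term matching when all of its words are present). Twitter may also match on fields other than status.text, such as expanded URLs, so this helper can return an empty list for a tweet that the stream did deliver.

```python
import re


def matched_terms(text, terms):
    """Return the tracked terms that appear in the tweet text (hypothetical helper)."""
    # Lowercase and split into word tokens so matching ignores case and punctuation
    words = set(re.findall(r"\w+", text.lower()))
    matches = []
    for term in terms:
        # A multi-word term counts as a match only if every one of its words is present
        term_words = re.findall(r"\w+", term.lower())
        if term_words and all(word in words for word in term_words):
            matches.append(term)
    return matches


terms = ["this", "python", "a tweet"]
print(matched_terms("Hello, THIS is a tweet", terms))  # prints ['this', 'a tweet']
```

With something like this in place, the whole list can be passed at once (sapi.filter(track=topics)) and on_status can call matched_terms(status.text, topics) to decide which term or terms to record next to each tweet.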