本文介绍了具体的结果聚焦,而使用Python和美丽的汤4刮微博?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个跟进我的文章Using Python的刮嵌套的div,横跨在Twitter的?。

This is a follow up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.

我不使用Twitter的API,因为它不看的鸣叫
这包括hashtag远。完成code和产量均低于例子了。

I'm not using the Twitter API because it doesn't look at the tweets byhashtag this far back. Complete code and output are below after examples.

我想从每个鸣叫刮具体数据。 名称处理检索的正是我要找的,但我无法缩小其余元素

I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.

作为一个例子:

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

获取此:

 <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
 <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

有关的URL,我只需要在的href 从第一行的值。

For url, I only need the href value from the first line.

同样,锐推收藏夹命令返回的HTML大块,当我真正需要的是数值所显示为每一个值

Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.

我怎样才能缩小的结果,所需数据的URL,retweetcount和favcount输出?

How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?

我计划通过所有的鸣叫有这个周期一旦我得到它的工作,在对你的建议的影响情况。

I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.

完成code:

 from bs4 import BeautifulSoup
 import requests
 import sys

 url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
 headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
 r = requests.get(url, headers=headers)
 data = r.text.encode('utf-8')
 soup = BeautifulSoup(data, "html.parser")

 name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
 username = name[0].contents[0]

 handle = soup('span', {'class': 'username js-action-profile-name'})
 userhandle = handle[0].contents[1].contents[0]

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

 messagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
 message = messagetext[0]

 retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
 retweetcount = retweets[0]

 favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
 favcount = favorites[0]

 print (username, "\n", "@", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading

完整输出:

Michael Peel

@Mikepeeljourno

<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>

<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>

<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>

有人建议BeautifulSoup - 提取可能有这个问题的答案有属性值。不过,我认为,问题和答案,没有足够的上下文或解释是在更复杂的情况很有帮助。美丽的汤文档的相关部分的链接是有帮助的,虽然,的

It was suggested that BeautifulSoup - extracting attribute values may have an answer to this question there. However, I think the question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup Documentation is helpful though, http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags

推荐答案

使用的类似于字典的访问应用于标签的属性。

Use the dictionary-like access to the Tag's attributes.

例如,要获得的href 属性值:

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]["href"]

或者,如果你需要获得的href 对发现的每一个环节的值:

Or, if you need to get the href values for every link found:

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]


作为一个方面说明,你并不需要指定完整的价值定位的元素。 是一个特殊的多值属性的,你可以只使用其中的一个类(如果这足以缩小所需搜索元素)。例如,代替


As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:

soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})

您可以使用:

soup('a', {'class': 'tweet-timestamp'})

或者,:

soup.select("a.tweet-timestamp")

这篇关于具体的结果聚焦,而使用Python和美丽的汤4刮微博?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-22 20:50
查看更多