Problem description
I am scraping this website and get "title" and "category" as text using .get_text().strip(). I have a problem using the same approach for extracting the "author" as text.
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data2 = {
    'url': [],
    'title': [],
    'category': [],
    'author': [],
}
url_pattern = "https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&year=2018&page={}"
count_min = 1
count_max = 3
while count_min <= count_max:
    print (count_min)
    url = url_pattern.format(count_min)
    r = requests.get(url)
    try:
        soup = BeautifulSoup(r.content, 'lxml')
        for links in soup.find_all('article'):
            data2['url'].append(links.a.attrs['href'])
            data2['title'].append(links.h3.get_text().strip())
            data2["category"].append(links.span.get_text().strip())
            data2["author"].append(links.find('span', {"itemprop": "name"}).get_text().strip())  # ??????
    except Exception as exc:
        print(exc.__class__.__name__, exc)
    time.sleep(0.1)
    count_min = count_min + 1
print ("Fertig.")
df = pd.DataFrame( data2 )
df
df is supposed to print a table with "author", "category", "title", "url". The printed exception gives me the following hint: AttributeError 'NoneType' object has no attribute 'get_text'. But instead of the table I get the following message.
ValueError Traceback (most recent call last)
<ipython-input-34-9bfb92af1135> in <module>()
29
30 print ("Fertig.")
---> 31 df = pd.DataFrame( data2 )
32 df
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
328 dtype=dtype, copy=copy)
329 elif isinstance(data, dict):
--> 330 mgr = self._init_dict(data, index, columns, dtype=dtype)
331 elif isinstance(data, ma.MaskedArray):
332 import numpy.ma.mrecords as mrecords
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
459 arrays = [data[k] for k in keys]
460
--> 461 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
462
463 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
6161 # figure out the index, if necessary
6162 if index is None:
-> 6163 index = extract_index(arrays)
6164 else:
6165 index = _ensure_index(index)
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
6209 lengths = list(set(raw_lengths))
6210 if len(lengths) > 1:
-> 6211 raise ValueError('arrays must all be same length')
6212
6213 if have_dicts:
ValueError: arrays must all be same length
How can I improve my code to extract the "author" names?
Recommended answer
You're very close. There are a couple of things I'd recommend. First, take a closer look at the HTML: the author names are actually in a ul, where each li contains a span whose itemprop is 'name'. However, not all articles have author names. For those articles, with your current code, the call to links.find('span', {'itemprop': 'name'}) returns None, and None, of course, has no attribute get_text. That line therefore throws an error, so no value is appended to the data2 'author' list. Because the URL, title, and category of that article have already been appended by then, the lists end up with different lengths, which is why pd.DataFrame raises ValueError: arrays must all be same length. I'd recommend storing the author(s) in a list like so:
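authors = []
ul = links.find('ul', itemprop='creator')
# ul is None when the article lists no authors, so check before searching it
if ul is not None:
    for author in ul.find_all('span', itemprop='name'):
        authors.append(author.text.strip())
# the dict in the question uses the key 'author'
data2['author'].append(authors)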
This handles the case where there are no authors as we would expect: authors is then simply an empty list.
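Another way to rule out mismatched list lengths is to build one record per article and append it only once every field has a value. The following is a minimal sketch under assumptions, not the original code: the build_rows helper and the parsed_articles sample data are hypothetical, and each article is assumed to have already been reduced to a plain dict.

```python
def build_rows(parsed_articles):
    """Turn per-article dicts into rows that all share the same four keys."""
    rows = []
    for article in parsed_articles:
        row = {
            'url': article.get('url'),
            'title': article.get('title'),
            'category': article.get('category'),
            # a missing author list becomes an empty list instead of a skipped value
            'author': article.get('author') or [],
        }
        rows.append(row)
    return rows


# hypothetical sample data standing in for scraped articles
parsed_articles = [
    {'url': '/articles/a', 'title': 'First', 'category': 'News', 'author': ['A. Author']},
    {'url': '/articles/b', 'title': 'Second', 'category': 'Editorial'},  # no authors
]
rows = build_rows(parsed_articles)
print(rows)
```

Since every row carries all four keys, the columns cannot get out of sync, and a list of such dicts can be passed to pd.DataFrame(rows) directly.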
As a side note, putting your code inside a
try:
    ...
except:
    pass
construct is generally considered poor practice, for exactly the reason you're seeing now. Ignoring errors silently can give your program the appearance of running properly, while in fact any number of things could be going wrong. At the very least, it's rarely a bad idea to print error info to stdout. Even just doing something like this is better than nothing:
try:
    ...
except Exception as exc:
    print(exc.__class__.__name__, exc)
For debugging, however, having the full traceback is often desirable as well. For this you can use the traceback module.
import traceback

try:
    ...
except:
    traceback.print_exc()
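If you would rather keep going after a failure and inspect the errors later, the same module also offers traceback.format_exc(), which returns the traceback as a string. Here is a rough sketch of wrapping each item individually, so that one bad item does not abort the rest of the loop. The parse_all and parse_one helpers are hypothetical names for illustration, and parse_one just calls .strip() to fail on None the same way .get_text() does:

```python
import traceback


def parse_all(items, parse_one):
    """Parse each item; record failures instead of aborting the whole loop."""
    results, errors = [], []
    for item in items:
        try:
            results.append(parse_one(item))
        except Exception:
            # format_exc() returns the text that print_exc() would print
            errors.append((item, traceback.format_exc()))
    return results, errors


def parse_one(item):
    # hypothetical parser: raises AttributeError when item is None
    return item.strip()


results, errors = parse_all([' a ', None, ' b '], parse_one)
print(results)
for item, tb in errors:
    print('failed on', repr(item))
    print(tb)
```

With the try placed inside the loop, the items before and after the failing one are still processed, and the collected tracebacks show exactly which item broke and where.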