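I am working on a project that scrapes catalog information for books from a particular library. So far, my script can scrape every cell from the table. However, I am confused about how to return only the cells for the New Britain library.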
import requests
from bs4 import BeautifulSoup
mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)
soup = BeautifulSoup(response.text, 'html.parser')
data = []
table = soup.find('table', attrs={'class':'itemTable'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)
Here is a sample of the script's output for the New Britain library:
["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']
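Instead of returning all the cells, how can I return only the cells that relate to the New Britain library? I only want the library name and the checkout status.
The desired output is: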
["New Britain, Main Library - Children's Department", 'Check Shelf']
There can be more than one matching row, because a book can have several copies at the same library.
Best answer
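To simply filter the data by a particular field (the first one in your example), you can build a list comprehension. The `element and` guard skips empty lists, such as the one produced by the header row, which has no td cells:
[element for element in data if element and 'New Britain' in element[0]]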
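As a minimal sketch of that comprehension on data shaped like the question's output, the snippet below uses hand-written sample rows. Only the New Britain row comes from the question; the other library row is made up for illustration. It also assumes that the library name is the first item and the status is the last item of each non-empty row, which may not hold for every row once empty values have been dropped.

```python
# Sample rows shaped like the question's output. Only the New Britain row
# comes from the question; the other library row is hypothetical.
data = [
    [],  # a row without <td> cells (e.g. the header row) yields an empty list
    ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],
    ["Hypothetical Town Library", 'J FIC PALACIO', 'Checked Out'],
]

# Keep non-empty rows whose first item mentions New Britain, then keep
# only the first item (library name) and the last item (status).
new_britain = [
    [row[0], row[-1]]
    for row in data
    if row and 'New Britain' in row[0]
]

for row in new_britain:
    print(row)
```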
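The code you provided strips out empty values, which leaves the data elements with different lengths. That makes it hard to know which field each value corresponds to. Using a dict makes the data easier to understand and to process.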
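To illustrate why keyed rows are easier to handle, here is a small sketch that pairs header names with cell values using zip. The header name 'Call No.' and the sample cell values are assumptions for illustration; 'Location' and 'Status' are the keys used in the full script in this answer.

```python
# 'Call No.' is an assumed header name; the cell values are sample data.
headers = ['Location', 'Call No.', 'Status']
cols = ["New Britain, Main Library - Children's Department", '', 'Check Shelf']

# Keep empty cells so that every value stays aligned with its header.
record = dict(zip(headers, cols))

print(record['Location'])
print(record['Status'])
```

Because the empty call number is kept rather than dropped, the status can be looked up by name instead of by a position that shifts from row to row.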
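Some fields also seem to contain blank chunks inside them (runs made up only of whitespace-like characters: '\n', '\r', '\t', ' '). strip only trims the two ends of a string, so it will not remove these. Combining it with a simple regular expression helps. For that I wrote a small function:
def squish(s):
    return re.sub(r'\s+', ' ', s)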
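A quick way to see the difference between strip alone and strip combined with squish is the sketch below. The raw string is a hypothetical cell value, not text taken from the catalog page.

```python
import re

def squish(s):
    # Collapse every run of whitespace characters into a single space.
    return re.sub(r'\s+', ' ', s)

# Hypothetical cell text with whitespace at both ends and in the middle.
raw = "\n  New Britain,\r\n\t Main Library  \n"

stripped = raw.strip()       # trims the ends only; inner whitespace survives
cleaned = squish(stripped)   # inner runs collapse to one space

print(repr(stripped))
print(repr(cleaned))
```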
Putting it all together, I believe this will help you:
import re
import requests
from bs4 import BeautifulSoup
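def squish(s):
    # Collapse every run of whitespace into a single space.
    return re.sub(r'\s+', ' ', s)
def filter_by_location(data, location_name):
    # Case-insensitive substring match on the 'Location' field.
    return [x for x in data if location_name.lower() in x['Location'].lower()]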
mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)
soup = BeautifulSoup(response.text, 'html.parser')
data = []
table = soup.find('table', attrs={'class':'itemTable'})
headers = [squish(element.text.strip()) for element in table.find('tr').find_all('th')]
for row in table.find_all('tr')[1:]:
    cols = [squish(element.text.strip()) for element in row.find_all('td')]
    data.append({k:v for k, v in zip(headers, cols)})
filtered_data = filter_by_location(data, 'New Britain')
for x in filtered_data:
    print('Location: {}'.format(x['Location']))
    print('Status: {}'.format(x['Status']))
    print()
Running it produces:
Location: New Britain, Jefferson Branch - Children's Department
Status: Check Shelf

Location: New Britain, Main Library - Children's Department
Status: Check Shelf

Location: New Britain, Main Library - Children's Department
Status: Check Shelf
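If the list format from the question is preferred over printed lines, a small sketch like the following could convert the filtered dicts. The filtered_data literal is hand-copied from the output shown above rather than scraped, and to_pairs is a hypothetical helper name, not part of the original answer.

```python
# Hand-copied from the output above so the sketch runs without a network call.
filtered_data = [
    {'Location': "New Britain, Jefferson Branch - Children's Department", 'Status': 'Check Shelf'},
    {'Location': "New Britain, Main Library - Children's Department", 'Status': 'Check Shelf'},
    {'Location': "New Britain, Main Library - Children's Department", 'Status': 'Check Shelf'},
]

def to_pairs(records):
    # Hypothetical helper: keep only the library name and the status.
    return [[r['Location'], r['Status']] for r in records]

pairs = to_pairs(filtered_data)
for pair in pairs:
    print(pair)
```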
Regarding python - collecting information from a library catalog, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50431651/