I am a beginner learning Python through small projects, and I am currently using BeautifulSoup to learn web scraping. The page's HTML looks like this:
<div class="BrandList">
<div><b>Brand Name: </b>ONCOTRON INJ</div>
<div><b>Manufacture Name: </b>SUN PHARMA</div>
<div><b>Compositions:
</b>
Mitoxantrone 2mg/ml injection,
</div>
</div>
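I need to parse this information and store it in a CSV with three columns: name, manufacturer name, and compositions.
I tried running the code below, but I can only extract the brand name, and I want the rest of the text in the divs as well.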
import requests
from bs4 import BeautifulSoup
data = requests.get ('http://www.inpharmation.in/Search/BrandList?Type=Manufacturer&ProductID=79').text
soup= BeautifulSoup(data, 'lxml')
brand = soup.find('div', attrs = {'id':'maincontent'})
out_filename = "Sunp.csv"
headers = "brand,Compositions \n"
f = open(out_filename, "w")
f.write(headers)
for BrandList in brand.findAll('div', attrs = {'class':'BrandList'}):
    BrandList['Name'] = Brand_Name.b.text
    BrandList['Compositions'] = Compositions.b.text
    print("brand: " + brand + "\n")
    print("Compositions: " + Compositions + "\n")
    f.write (brand + "," + Compositions + "\n")
f.close()
I expect the output to contain the brand name, compositions, and manufacturer name, but I only get the brand name.
Best answer
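Python's built-in string method strip() removes all leading and trailing whitespace from a string.
The find_all method returns a collection of matching elements, so you can loop over every BrandList block and read its inner divs.
Use the pandas library to save the data to a CSV file.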
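As a quick illustration of the strip() step, here is a minimal sketch that uses only the standard library. The string `raw` is an assumption: it imitates the text one inner div from the sample HTML would yield, and it is not fetched from the real page.

```python
# Assumed sample: the text of the "Compositions" div, stray newlines included
raw = "Compositions:\n\nMitoxantrone 2mg/ml injection,\n"

# Split on the first colon only, so a colon inside the value would be kept
label, value = raw.split(":", 1)

# strip() with no arguments removes leading and trailing whitespace
print(repr(value.strip()))

# Passing characters to strip() removes those instead; here the
# trailing comma goes away together with the spaces and newlines
cleaned = value.strip(" ,\n")
print(repr(cleaned))
```

Whether to drop the trailing comma depends on how you want the compositions column to look in the CSV.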
from bs4 import BeautifulSoup
import requests
import pandas as pd

data = requests.get('http://www.inpharmation.in/Search/BrandList?Type=Manufacturer&ProductID=79').text
soup = BeautifulSoup(data, 'lxml')
# Each product sits in its own <div class="BrandList"> block
brand_list = soup.find_all('div', attrs={'class': 'BrandList'})
brand_json = []
for brand in brand_list:
    my_dict = {}
    # Inner divs in page order: brand name, manufacturer, compositions
    fields = brand.find_all("div")
    # Split on the first colon only, then strip surrounding whitespace
    my_dict['brand_name'] = fields[0].text.split(":", 1)[1].strip()
    my_dict['manufacture'] = fields[1].text.split(":", 1)[1].strip()
    my_dict['compositions'] = fields[2].text.split(":", 1)[1].strip()
    brand_json.append(my_dict)
print(brand_json)
df = pd.DataFrame(brand_json)
# Save the dataframe to a csv file; index=False leaves out the row-number column
df.to_csv("sunp.csv", index=False)
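If you would rather not depend on pandas, or want to try the parsing without hitting the site, the following is a rough sketch of the same idea run against the HTML sample from the question. The function name `parse_brands`, the `SAMPLE_HTML` constant, and the choice to key each value by its <b> label are assumptions made for illustration; the real page may contain blocks that do not follow this layout.

```python
import csv
import io

from bs4 import BeautifulSoup

# Assumed input: the snippet from the question, with the outer div closed
SAMPLE_HTML = """
<div class="BrandList">
<div><b>Brand Name: </b>ONCOTRON INJ</div>
<div><b>Manufacture Name: </b>SUN PHARMA</div>
<div><b>Compositions:
</b>
Mitoxantrone 2mg/ml injection,
</div>
</div>
"""


def parse_brands(html):
    """Return one dict per BrandList block, keyed by the <b> label text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="BrandList"):
        row = {}
        # Only look at the divs directly inside the block
        for field in block.find_all("div", recursive=False):
            label_tag = field.find("b")
            if label_tag is None:
                continue  # skip divs without a bold label
            label = label_tag.get_text().strip().rstrip(":").strip()
            # Everything after the first colon is the value
            _, _, value = field.get_text().partition(":")
            row[label] = value.strip(" ,\n")
        rows.append(row)
    return rows


rows = parse_brands(SAMPLE_HTML)

# Write to an in-memory buffer; pass an open file object to write to disk instead
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["Brand Name", "Manufacture Name", "Compositions"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Looking fields up by label rather than by position means a block with its divs in a different order still lands in the right columns, at the cost of relying on the label text staying the same.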