This article explains how to scrape the text inside h3 and div tags with BeautifulSoup and Python; it should be a useful reference for anyone solving the same problem.
Problem Description
I have no experience with Python, BeautifulSoup, Selenium etc., but I'm eager to scrape data from a website and store it as a csv file. A single sample of the data I need is coded as follows (a single row of data).
<div class="box effect">
<div class="row">
<div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i> NAME</div>
<div><i class="fa phone"></i> MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
<div><i class="fa address"></i> XYZ_ADDRESS</div>
<div class="space"> </div>
<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> more info</a></div>
</div>
<div class="col-lg-2">
</div>
</div>
</div>
The output I need is Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS
I found that those data don't have an id or class, yet they appear on the website as general text. I'm trying BeautifulSoup and Python Selenium separately for that, but I got stuck on extracting the text in both methods, as none of the tutorials I saw showed how to extract text from these div and h3 tags.
My code using BeautifulSoup
import urllib2
from bs4 import BeautifulSoup
import requests
import csv
MAX = 2
'''with open("lg.csv", "a") as f:
w=csv.writer(f)'''
##for i in range(1,MAX+1)
url="http://www.example_site.com"
page=requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
for h in soup.find_all('h3'):
    print(h.get('h3'))
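(A side note on why the loop above prints None: in BeautifulSoup, Tag.get() looks up an HTML *attribute* of the tag, and the h3 element has no attribute named "h3". The element's text comes from get_text(). A minimal sketch:)

```python
from bs4 import BeautifulSoup

snippet = '<div class="col-lg-10"><h3>HEADING</h3></div>'
soup = BeautifulSoup(snippet, "html.parser")
h = soup.find("h3")
# .get() reads a tag attribute; <h3> has no attribute called "h3"
print(h.get("h3"))    # None
# .get_text() returns the text inside the element
print(h.get_text())   # HEADING
```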
My selenium code
import csv
from selenium import webdriver
MAX_PAGE_NUM = 2
driver = webdriver.Firefox()
for i in range(1, MAX_PAGE_NUM+1):
    url = "http://www.example_site.com"
    driver.get(url)
    name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')
    #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')
    # phone =
    # mobile =
    # address =
    # print(len(buyers))
    # num_page_items = len(buyers)
    # with open('res.csv','a') as f:
    #     for i in range(num_page_items):
    #         f.write(buyers[i].text + "," + prices[i].text + "\n")
    print(name)
driver.close()
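For the Selenium route, the sibling divs after the h3 can be reached with XPath's following-sibling axis. A minimal sketch, tested here with lxml (which the answer below already assumes as a parser) rather than a live browser; the same XPath string could be passed to Selenium's find_elements:

```python
from lxml import html  # assumes lxml is installed

page = """
<div class="row">
  <div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i> NAME</div>
    <div><i class="fa phone"></i> MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
    <div><i class="fa address"></i> XYZ_ADDRESS</div>
  </div>
</div>
"""
tree = html.fromstring(page)
# select the div siblings that come after the h3 inside col-lg-10
rows = tree.xpath('//div[@class="col-lg-10"]/h3/following-sibling::div')
fields = [r.text_content().strip() for r in rows]
print(fields)  # ['NAME', 'MOBILE', 'NUMBER', 'XYZ_ADDRESS']
```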
Solution
You can use CSS selectors to find the data you need. In your case div > h3 ~ div will find all div elements that are directly inside a div element and are preceded by an h3 element.
import bs4
page = """
<div class="box effect">
<div class="row">
<div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i> NAME</div>
<div><i class="fa phone"></i> MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
<div><i class="fa address"></i> XYZ_ADDRESS</div>
</div>
</div>
</div>
"""
soup = bs4.BeautifulSoup(page, 'lxml')
# find all div elements that are inside a div element
# and are preceded by an h3 element
selector = 'div > h3 ~ div'
# find elements that contain the data we want
found = soup.select(selector)
# Extract data from the found elements
data = [x.text.split(';')[-1].strip() for x in found]
for x in data:
    print(x)
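Since the question ultimately wants a csv file, once data holds the extracted values the standard csv module can write them out as one row. A minimal sketch; the filename res.csv and the sample values are placeholders standing in for the list produced above:

```python
import csv

# sample values, as produced by the selector-based extraction above
data = ["NAME", "MOBILE", "NUMBER", "XYZ_ADDRESS"]

with open("res.csv", "w", newline="") as f:
    # csv.writer handles the comma separation and any needed quoting
    csv.writer(f).writerow(data)
```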
Edit: To scrape the text in the heading:
heading = soup.find('h3')
heading_data = heading.text
print(heading_data)
Edit: Or you can get the heading and the other div elements at once by using a selector like this: div.col-lg-10 > *. This finds all elements inside a div element that belongs to the col-lg-10 class.
soup = bs4.BeautifulSoup(page, 'lxml')
# find all elements inside a div element of class col-lg-10
selector = 'div.col-lg-10 > *'
# find elements that contain the data we want
found = soup.select(selector)
# Extract data from the found elements
data = [x.text.split(';')[-1].strip() for x in found]
for x in data:
    print(x)
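With the col-lg-10 selector, data already contains the heading followed by the four fields, so joining it reproduces the exact row the question asks for. A small self-contained check against the sample markup, using the stdlib html.parser so no extra dependency is assumed:

```python
import bs4

page = """
<div class="row">
  <div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i> NAME</div>
    <div><i class="fa phone"></i> MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
    <div><i class="fa address"></i> XYZ_ADDRESS</div>
  </div>
</div>
"""
soup = bs4.BeautifulSoup(page, "html.parser")
# every direct child of the col-lg-10 div: the h3 plus the four data divs
found = soup.select('div.col-lg-10 > *')
data = [x.get_text().strip() for x in found]
print(",".join(data))  # HEADING,NAME,MOBILE,NUMBER,XYZ_ADDRESS
```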