Problem Description
This website https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html seems to have a poorly organized HTML table: the only identifier of the table cells is the width attribute inside each tr tag. I want to scrape the information from all 60 pages. How can I find a way to scrape each row of the tables appropriately? I know the header is 10 columns wide, but since some tr tags contain 5 td tags and others contain more or fewer, it's not easy to assign the scraped data to its correct column.
Here you can see part of my code. It extracts the data for one row, but without keeping null values for the empty cells.
import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get('https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html')
soup = BeautifulSoup(page.content, 'lxml')  # Parse the HTML as a string
table = soup.find_all('table')[0]  # Grab the first table
new_table = pd.DataFrame(columns=range(0, 10), index=[0])  # I know the size

row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1
This is the output I get from this code (all values are packed into the row with no gaps between them):
0 1 2 3 4 5 6 7 8 9
0 62.00 PACL Palaeocene Claystones SWAP NaN NaN NaN NaN NaN
But the real output should look like this:
0 1 2 3 4 5 6 7 8 9
0 62.00 NaN NaN PACL Palaeocene Claystones NaN NaN NaN NaN SWAP
Recommended Answer
I've used the method I mentioned in the comments (using the width attribute) to determine the null values in the data. Here's the Python code:
import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')

count = 0
for table in tables:
    count += 1
    if count > 2:  # the first two tables are page furniture; the data rows start at the third
        row = table.tr
        cells = row.find_all('td')
        print()  # start a new output line for each data row
        x = 0            # index of the column we expect next
        width_diff = 0   # leftover width from a cell that spans empty columns
        cell_text = []
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:
                continue  # skip narrow spacer cells
            if width_diff > 0:
                # the previous cell was wider than its column: pad with NaN
                cell_text.append('NaN ')
                if width_diff > 50:
                    # wide enough to cover two extra columns
                    x += 2
                    cell_text.append('NaN ')
                else:
                    x += 1
                width_diff = 0
            # expected width range for the current column
            if x in (0, 1, 2, 3, 4, 6):
                width_range = list(range(35, 51))
            elif x == 5:
                width_range = list(range(220, 231))
            elif x == 7:
                width_range = [136]
            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')
            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1
        length = len(cell_text)
        for i in range(0, length):
            print(cell_text[i], end='')
        # pad the row out to the full column count
        diff = 9 - length
        if diff > 0:
            for j in range(0, diff):
                print('NaN ', end='')
As you can see, I've noticed that each column uses a certain width range. By comparing each cell's width against the range expected for its column, we can determine how many column slots it occupies. If the width difference is too great, the cell takes up the space of the next two cells.
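The same idea can be expressed more systematically: accumulate each cell's width to get its left offset within the row, then map that offset to a column index via the table's column start positions. This is only a sketch; the `COLUMN_STARTS` values below are illustrative guesses, not measured from the real page, so they would need to be read off the header row once before use.

```python
import bisect

# Assumed column start offsets (in pixels) for the 10-column table.
# These are hypothetical values -- measure them from the real header row.
COLUMN_STARTS = [0, 50, 100, 150, 200, 250, 470, 520, 660, 710]

def row_to_cells(widths_and_texts, n_cols=10):
    """widths_and_texts: list of (width, text) pairs taken from one <tr>.

    Returns a fixed-length row where None stands in for empty (NaN) cells.
    """
    row = [None] * n_cols
    offset = 0
    for width, text in widths_and_texts:
        # bisect_right - 1 gives the column whose start is <= the offset
        col = bisect.bisect_right(COLUMN_STARTS, offset) - 1
        if text.strip():
            row[col] = text.strip()
        offset += width
    return row
```

With this approach the spill-over bookkeeping disappears: an extra-wide cell simply pushes the running offset past the skipped columns, which stay `None`.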
It might need some refinement; you'll need to test the script against all the URLs to make sure the data comes out completely clean.
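To cover all 60 pages, a simple loop over the page URLs can feed each page into the parser above. Note that the `rptLithoStrat_1Page{n}.html` pattern is an assumption extrapolated from the single paginated URL in the question; verify it against the site before relying on it.

```python
import requests
import bs4

# Assumed URL pattern, extrapolated from the Page2 URL in the question.
BASE = ('https://itportal.ogauthority.co.uk/information/well_data/'
        'lithostratigraphy_hierarchy/rptLithoStrat_1Page{}.html')

def page_url(n):
    """Build the (assumed) URL for page n."""
    return BASE.format(n)

def fetch_pages(first=2, last=60):
    """Yield a parsed soup for each page; stop early if a page is missing."""
    for n in range(first, last + 1):
        response = requests.get(page_url(n))
        if response.status_code != 200:
            break  # assume the run of pages has ended
        yield bs4.BeautifulSoup(response.text, 'lxml')
```

Each yielded soup can then be passed through the table loop above, accumulating rows into one dataset.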
Here's a sample output from running this code:
61.00 SED TERT WBDS NaN Woolwich Beds GP NaN WLDB
62.00 NaN NaN PACL NaN Palaeocene Claystones NaN Nan SWAP
63.00 NaN NaN SMFC NaN Shallow Marine Facies NaN Nan SONS
64.00 NaN NaN DMFC NaN Deep Marine Facies NaN NaN NaN
65.00 NaN NaN SLSY NaN Selsey Member GN NaN WSXB
66.00 NaN NaN MFM NaN Marsh Farm Member NaN NaN NaN
67.00 NaN NaN ERNM NaN Earnley Member NaN NaN NaN
68.00 NaN NaN WITT NaN Wittering Member NaN NaN NaN
69.00 NaN NaN WHI NaN Whitecliff Beds GZ NaN NaN
70.00 NaN NaN Nan WFSM NaN Whitecliff Sand Member NaN Nan GN
71.00 NaN WESQ NaN Nan Westray Group Equivalent NL GW WESH
72.00 NaN WESR NaN Nan Westray Group NM GO CNSB
73.00 NaN NaN THEF NaN Thet Formation NaN Nan MOFI
74.00 NaN NaN SKAD NaN Skade Formation NB NaN NONS
75.00 NaN NORD NaN Nan Nordland NP Q CNSB
75.50 NaN NaN SWCH NaN Swatchway Formation Q NaN MOFI
75.60 NaN NaN CLPT NaN Coal Pit Formation NaN NaN NaN
75.70 NaN NaN LNGB NaN Ling Bank Formation NaN NaN NaN
76.00 NaN NaN SHKL NaN Shackleton Formation GO QP ROCK
77.00 NaN NaN UGNS NaN Upper Tertiary sands NaN NM NONS
78.00 NaN NaN CLSD NaN Claret Sand NP NaN SVIG
79.00 NaN NaN BLUE NaN Blue Sand NaN NaN NaN
80.00 NaN NaN ABGF NaN Aberdeen Ground Formation QH NaN CNSB
81.00 NaN NaN NUGU NaN Upper Glauconitic Unit NB NA MOFI
82.00 NaN NaN POWD NaN Powder Sand GN NaN SVIG
83.00 NaN NaN BASD NaN Basin Sand NaN Nan CNSB
84.00 NaN NaN CRND NaN Crenulate Sand NaN NaN NaN
85.00 NaN NaN NORS NaN Nordland Sand QP NaN SONS
86.00 NaN NaN MIOS NaN Miocene Sand NM NaN ESHB
87.00 NaN NaN MIOL NaN Miocene Limestone NaN Nan CNSB
88.00 NaN NaN FLSF NaN Fladen Sand Formation GP GO WYGG
Note: I don't know how the 0 in the first cell of your example is created, so I left it out of the answer. I don't know if it's supposed to be scraped as well, because I didn't find it anywhere.