How to scrape a poorly structured HTML table with BeautifulSoup in Python

This article explains how to scrape a poorly structured HTML table with BeautifulSoup in Python; the question and the recommended answer below may be useful to anyone facing the same problem.

Problem description

The website https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html seems to have a poorly organized HTML table. The only identifier of the table cells is the width attribute inside each tr tag. I want to scrape the information from all 60 pages. How can I scrape each table row appropriately? I know the header has 10 columns, but since some tr tags contain 5 td tags and others contain more or fewer, it is not easy to assign each piece of data to its correct column.

Here you can see part of the code, which extracts the data of a row but does not keep null values for the empty cells.

import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'lxml')  # Parse the HTML
table = soup.find_all('table')[0]  # Grab the first table
new_table = pd.DataFrame(columns=range(0, 10), index=[0])  # I know the size: 10 columns
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        # Cells are written one after another, so empty columns are not preserved
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1

This is the output I get from this code (all values are packed into the row one after another, with no gaps for the empty cells):

     0     1     2                  3     4   5    6    7    8    9

0  62.00    PACL  Palaeocene Claystones  SWAP  NaN  NaN  NaN  NaN  NaN

But the expected output should be something like this:

   0        1    2   3                        4   5    6    7    8    9

0  62.00   NaN NaN  PACL  Palaeocene Claystones  NaN  NaN  NaN  NaN  SWAP
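
Because the width attribute is the only hint of which column a cell belongs to, it is worth printing the raw widths before trying to rebuild the rows. The snippet below is only a diagnostic sketch (it uses the same page URL as above and is not part of the original question or answer); it lists the widths of every td per row, which makes it visible how empty columns are absorbed into wider neighbouring cells:

import requests
from bs4 import BeautifulSoup

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'
soup = BeautifulSoup(requests.get(URL).text, 'lxml')

# Print the width attribute of every cell, row by row
for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        widths = [td.get('width') for td in row.find_all('td')]
        if widths:
            print(widths)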

Recommended answer

I've used the method I mentioned in the comments (using the width attribute) to determine the null values in the data. Here's the Python code:

import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')

tables = soup.find_all('table')
count = 0

for table in tables:
    count += 1
    if count > 2:  # skip the first two tables; each data row sits in its own table
        row = table.tr
        cells = row.find_all('td')
        print()  # start a new output line for this row
        x = 0  # index of the column the current cell should fall into
        width_diff = 0
        cell_text = []
        for cell in cells:
            width = cell.get('width')
            if int(width) < 10:  # skip the narrow spacer cells
                continue

            if width_diff > 0:
                # the previous cell was wider than expected, so it swallowed
                # one or two empty columns: insert the missing placeholders
                cell_text.append('NaN ')
                if width_diff > 50:
                    x += 2
                    cell_text.append('Nan ')
                else:
                    x += 1
                width_diff = 0

            # expected width range for the column this cell should occupy
            if x in (0, 1, 2, 3, 4, 6):
                width_range = range(35, 51)
            elif x == 5:
                width_range = range(220, 231)
            elif x == 7:
                width_range = [136]

            if cell.text:
                cell_text.append(cell.text.strip() + ' ')
            else:
                cell_text.append('NaN ')

            # a cell wider than expected spans the following empty column(s)
            if int(width) not in width_range:
                width_diff = int(width) - width_range[-1]
            x += 1

        for text in cell_text:
            print(text, end=' ')
        # pad the row with NaN up to 9 columns
        for j in range(0, 9 - len(cell_text)):
            print('NaN ', end=' ')

As you can see, I noticed that each column uses a certain width range. By comparing each cell's width to the range expected for its column, we can determine how many column slots it occupies. If the difference in width is too great, the cell takes up the space of the next two cells.
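
If you would rather end up with a pandas DataFrame like the one in the question instead of printed text, the per-row lists could be collected and padded before building the frame. This is only a rough sketch of that idea; the rows list and the 9-column padding are assumptions layered on top of the loop above and are not part of the original answer:

import numpy as np
import pandas as pd

rows = []  # inside the loop above, do rows.append(cell_text) instead of printing

# Strip the trailing spaces, pad each row to 9 columns and turn the
# placeholder strings into real missing values
cleaned = [[c.strip() for c in r] + ['NaN'] * (9 - len(r)) for r in rows]
df = pd.DataFrame(cleaned)
df = df.replace(['NaN', 'Nan'], np.nan)
print(df)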

It might need some refining; you'll need to test the script against all the URLs to make sure the data comes out completely clean.
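
Since the question asks for all 60 pages, the same parsing could be wrapped in a loop over the page URLs. The sketch below assumes the pages follow the pattern rptLithoStrat_1Page{n}.html seen in the URL above; only Page2 actually appears in the question, so the exact pattern and the number of pages should be verified against the site before relying on it:

import requests
import bs4

BASE = ('https://itportal.ogauthority.co.uk/information/well_data/'
        'lithostratigraphy_hierarchy/rptLithoStrat_1Page{}.html')

for page_number in range(1, 61):      # assumed: pages 1..60
    url = BASE.format(page_number)
    response = requests.get(url)
    if response.status_code != 200:   # skip pages that do not exist
        continue
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    # ... run the same table/width parsing as above on this soup ...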

Here's a sample output from running this code:

61.00  SED  TERT  WBDS  NaN  Woolwich Beds  GP  NaN  WLDB
62.00  NaN  NaN  PACL  NaN  Palaeocene Claystones  NaN  Nan  SWAP
63.00  NaN  NaN  SMFC  NaN  Shallow Marine Facies  NaN  Nan  SONS
64.00  NaN  NaN  DMFC  NaN  Deep Marine Facies  NaN  NaN  NaN
65.00  NaN  NaN  SLSY  NaN  Selsey Member  GN  NaN  WSXB
66.00  NaN  NaN  MFM  NaN  Marsh Farm Member  NaN  NaN  NaN
67.00  NaN  NaN  ERNM  NaN  Earnley Member  NaN  NaN  NaN
68.00  NaN  NaN  WITT  NaN  Wittering Member  NaN  NaN  NaN
69.00  NaN  NaN  WHI  NaN  Whitecliff Beds  GZ  NaN  NaN
70.00  NaN  NaN  Nan  WFSM  NaN  Whitecliff Sand Member  NaN  Nan  GN
71.00  NaN  WESQ  NaN  Nan  Westray Group Equivalent  NL  GW  WESH
72.00  NaN  WESR  NaN  Nan  Westray Group  NM  GO  CNSB
73.00  NaN  NaN  THEF  NaN  Thet Formation  NaN  Nan  MOFI
74.00  NaN  NaN  SKAD  NaN  Skade Formation  NB  NaN  NONS
75.00  NaN  NORD  NaN  Nan  Nordland  NP  Q  CNSB
75.50  NaN  NaN  SWCH  NaN  Swatchway Formation  Q  NaN  MOFI
75.60  NaN  NaN  CLPT  NaN  Coal Pit Formation  NaN  NaN  NaN
75.70  NaN  NaN  LNGB  NaN  Ling Bank Formation  NaN  NaN  NaN
76.00  NaN  NaN  SHKL  NaN  Shackleton Formation  GO  QP  ROCK
77.00  NaN  NaN  UGNS  NaN  Upper Tertiary sands  NaN  NM  NONS
78.00  NaN  NaN  CLSD  NaN  Claret Sand  NP  NaN  SVIG
79.00  NaN  NaN  BLUE  NaN  Blue Sand  NaN  NaN  NaN
80.00  NaN  NaN  ABGF  NaN  Aberdeen Ground Formation  QH  NaN  CNSB
81.00  NaN  NaN  NUGU  NaN  Upper Glauconitic Unit  NB  NA  MOFI
82.00  NaN  NaN  POWD  NaN  Powder Sand  GN  NaN  SVIG
83.00  NaN  NaN  BASD  NaN  Basin Sand  NaN  Nan  CNSB
84.00  NaN  NaN  CRND  NaN  Crenulate Sand  NaN  NaN  NaN
85.00  NaN  NaN  NORS  NaN  Nordland Sand  QP  NaN  SONS
86.00  NaN  NaN  MIOS  NaN  Miocene Sand  NM  NaN  ESHB
87.00  NaN  NaN  MIOL  NaN  Miocene Limestone  NaN  Nan  CNSB
88.00  NaN  NaN  FLSF  NaN  Fladen Sand Formation  GP  GO  WYGG

Note: I don't know how the 0 in the first cell of your example is created, so I left it out of the answer. I also don't know whether it's supposed to be scraped, because I couldn't find it anywhere on the page.

That concludes this article on how to scrape a poorly structured HTML table with BeautifulSoup in Python. Hopefully the recommended answer is helpful.
