How can I keep rows aligned with the table on the Wikipedia page when a row contains a cell with a rowspan attribute?

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

# Indexing past the end of a list raises IndexError, not AttributeError
try:
    table = soup.find_all('table')[6]
except IndexError:
    raise SystemExit('No tables found, exiting')

try:
    first = table.find_all('tr')[0]
except IndexError:
    raise SystemExit('No table row found, exiting')

# Slicing never raises, so no try/except is needed here
allRows = table.find_all('tr')[1:-1]


headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]


df = pd.DataFrame(data=results, columns=headers)
df

I get the table as output, but for tables in which a row contains a rowspan, I get the following table:

Best answer

As you know, the problem is caused by the following case.

HTML content:

<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
      <td>2012</td>
</tr>

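So when a td has a rowspan attribute, its value has to be repeated at the same td position in the following tr elements. The rowspan value is the total number of tr elements the cell covers, including the one that contains it, so rowspan="2" means the value is repeated in one following row.
  • Collect all such rowspan information and store it in a variable: the index of the tr, the index of the td within that tr, the rowspan value (how many tr elements share the td), and the text of the td.
  • Update the results of every affected tr as described above.

  • Note: this was checked only against the given test cases. More test cases need to be checked.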
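
The steps above can be sketched on plain Python data, independent of the scraping code. This is a minimal illustration and not the code from this answer: the function name `fill_rowspans` and the `(text, rowspan)` pair format are assumptions made for the example. It tracks pending cells by column, so it is also meant to cope with a rowspan that is not in the first column.

```python
def fill_rowspans(rows):
    """Expand rowspans so every row has one value per column.

    `rows` is a list of rows; each row is a list of (text, rowspan)
    pairs in the order the cells appear in the HTML. (Hypothetical
    input format chosen for this sketch.)
    """
    pending = {}  # column index -> (text, number of rows still to fill)
    filled = []
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # A cell from an earlier row still covers this column.
                text, remaining = pending[col]
                out.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (text, remaining - 1)
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more cells of its own in this row
                text, span = cell
                out.append(text)
                if span > 1:
                    # Remember the value for the next span - 1 rows.
                    pending[col] = (text, span - 1)
            col += 1
        filled.append(out)
    return filled


# The two rows from the HTML fragment above.
rows = [
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]
for line in fill_rowspans(rows):
    print(line)
```

With BeautifulSoup, the pairs could be built from each cell as `(td.get_text(), int(td.get("rowspan", 1)))`. The sketch ignores colspan and malformed tables.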

    Code:
    from bs4 import BeautifulSoup
    import urllib2
    from lxml.html import fromstring
    import re
    import csv
    import pandas as pd
    
    
    wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
    header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki,headers=header)
    page = urllib2.urlopen(req)
    
    soup = BeautifulSoup(page)
    
    table = soup.find_all('table')[6]
    
    tmp = table.find_all('tr')
    
    first = tmp[0]
    allRows = tmp[1:-1]
    #table.find_all('tr')[1:-1]
    
    
    headers = [header.get_text() for header in first.find_all('th')]
    
    results = [[data.get_text() for data in row.find_all('td')] for row in allRows]
    
    # Example cell: <td rowspan="2">2=</td>
    # Each entry is a tuple: (tr index, td index within the tr, rowspan, text),
    # e.g. [(1, 0, 2, u'2=')] means: tr 1, first td, covers 2 rows, text "2=".
    rowspan = []
    
    for no, tr in enumerate(allRows):
        for td_no, data in enumerate(tr.find_all('td')):
            # has_attr replaces the deprecated has_key
            if data.has_attr("rowspan"):
                rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))
    
    
    # The tr that owns the rowspan cell already holds its text in results,
    # so only the following rowspan - 1 rows need the value inserted.
    for no, td_no, span, text in rowspan:
        for j in xrange(1, span):
            results[no + j].insert(td_no, text)
    
    
    df = pd.DataFrame(data=results, columns=headers)
    print df
    

    Output:
      Rank       Opponent No. wins Most recent venue Season
    0    1   South Africa        6            Lord's   1951
    1   2=    West Indies        4            Lord's   2009
    2   2=          India        4            Mumbai   2012
    3    4      Australia        3            Sydney   1932
    4    5       Pakistan        2      Trent Bridge   1967
    5    6      Sri Lanka        1      Old Trafford   2002
    

    This also works for table 10:
      Rank Hundreds            Player Matches Innings Average
    0    1       25     Alastair Cook     107     191   45.61
    1    2       23   Kevin Pietersen     104     181   47.28
    2    3       22     Colin Cowdrey     114     188   44.07
    3    3       22     Wally Hammond      85     140   58.46
    4    3       22  Geoffrey Boycott     108     193   47.72
    5    6       21    Andrew Strauss     100     178   40.91
    6    6       21          Ian Bell     103     178   45.30
    7   8=       20    Ken Barrington      82     131   58.67
    8   8=       20      Graham Gooch     118     215   42.58
    9   10       19        Len Hutton      79     138   56.67
    
