Problem description
Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods for extracting information from an HTML file.
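As a minimal sketch of the Beautiful Soup approach (the sample HTML and element ids here are invented for illustration), you can collect every `div` and `p`, then rank them by how much text they contain:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
  <div id="content"><p>This is the main article text, with enough words
  to look like real content rather than navigation.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <div> and <p>, longest first, so the block
# with the most text (likely the article body) comes out on top.
blocks = sorted(
    (tag.get_text(" ", strip=True) for tag in soup.find_all(["div", "p"])),
    key=len,
    reverse=True,
)
print(blocks[0])
```

On a real page you would fetch the HTML first (e.g. with `urllib.request`) and the "longest text wins" heuristic would need refinement, but it shows how little code the basic extraction takes.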
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there are almost infinitely many ways to lay out a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' tags and comparing the relative sizes of all the information on the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information..?
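A rough sketch of that "smallest element holding most of the text" idea (the sample HTML, the element ids, and the 70% threshold are all assumptions made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="page">
    <div id="nav">Home About Contact</div>
    <div id="main">
      <p>First paragraph of the article, which carries most of the text.</p>
      <p>Second paragraph, also part of the main content of the page here.</p>
    </div>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
total = len(soup.body.get_text(" ", strip=True))

# Walk the container elements and keep the one with the least text that
# still holds the bulk (here, an arbitrary 70%) of the page's text.
best = soup.body
for tag in soup.body.find_all(["div", "p"]):
    text_len = len(tag.get_text(" ", strip=True))
    if text_len >= 0.7 * total and text_len < len(best.get_text(" ", strip=True)):
        best = tag

print(best.get("id"))
```

The parse tree Beautiful Soup builds already gives you the parent/child structure, so the analysis reduces to comparing text lengths up and down the tree.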
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 point for every 100 words
+1 point for every child element that has > 100 words
-1 point if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
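The rules above could be sketched as a scoring function like this (the exact point values mirror the list, but the sample HTML, the `id`/`class`-based "section name", and the 100-word thresholds are illustrative assumptions):

```python
from bs4 import BeautifulSoup

def score(tag):
    """Assign points to an element using the rules listed above."""
    points = 0
    words = len(tag.get_text(" ", strip=True).split())
    points += words // 100  # +1 point for every 100 words
    for child in tag.find_all(recursive=False):
        if len(child.get_text(" ", strip=True).split()) > 100:
            points += 1      # +1 point per text-heavy child element
    # Treat the id and class attributes as the "section name".
    name = " ".join([tag.get("id", "")] + tag.get("class", []))
    if "nav" in name:
        points -= 1
    if "advert" in name:
        points -= 2
    return points

html = (
    '<html><body>'
    '<div class="nav">Home About Contact</div>'
    '<div class="advert">Buy now! Limited offer!</div>'
    '<div id="story">' + " ".join(["word"] * 250) + '</div>'
    '</body></html>'
)

soup = BeautifulSoup(html, "html.parser")
best = max(soup.body.find_all("div"), key=score)
print(best.get("id"))
```

Each rule is cheap on its own; it's the accumulation across many weak signals that makes the classification robust.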
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think this could evolve into a fairly powerful and robust technique.
[EDIT2] Looking at Readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?