Converting HTML to plain text while maintaining structure/formatting (Ruby)

Question

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.

The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).

I could put together a couple of regexes that get me 80% of the way there, but figured there might be some existing solutions with more intelligence.

Recommended answer

First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.

You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:

require 'nokogiri'

html = '
<html>
<body>
  <p>This is
  some text.</p>
  <p>This is some more text.</p>
  <pre>
  This is
  preformatted
  text.
  </pre>
</body>
</html>
'

doc = Nokogiri::HTML(html)
puts doc.text

>>  This is
>>  some text.
>>  This is some more text.
>>
>>  This is
>>  preformatted
>>  text.

The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.

The problem comes when you compare the output of a parser, or any other means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.

You can massage the HTML before extracting the content: remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.

If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.

An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.
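
A sketch of the text-browser approach using lynx might look like the following. It assumes a lynx executable is on the PATH; the helper name render_with_lynx is made up for this example, and this is not necessarily the browser or the invocation the answer originally used.

```ruby
require 'open3'

# Hypothetical helper: pipe HTML through lynx and return its rendered text.
# Returns nil when the lynx executable cannot be run.
def render_with_lynx(html)
  found = system('lynx', '-version', out: File::NULL, err: File::NULL)
  return nil unless found

  # -stdin reads the page from standard input, -dump prints the rendered
  # text and exits, -nolist omits the trailing list of link targets.
  output, status = Open3.capture2('lynx', '-stdin', '-dump', '-nolist', stdin_data: html)
  status.success? ? output : nil
end

text = render_with_lynx('<p>First line<br>second line</p>')
puts text || 'lynx is not available'
```

The trade-off is an external dependency and a process spawn per document, in exchange for layout decisions (paragraphs, lists, tables) that you would otherwise have to write yourself.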
