Background
Most questions about extracting text from HTML (i.e., stripping the tags) use:
jQuery( htmlString ).text();
While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).
Problem
Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.
A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.
I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.
Question
What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?
Example
Consider:
- Select and copy this entire question.
- Open the textarea example page.
- Paste the content into the textarea.
The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.
To further clarify, given any HTML content, such as:
<h1>Header</h1>
<p>Paragraph</p>
<ul>
<li>First</li>
<li>Second</li>
</ul>
<dl>
<dt>Term</dt>
<dd>Definition</dd>
</dl>
<div>Div with <span>span</span>.<br />After the <a href="...">break</a>.</div>
How would you produce:
Header
Paragraph
First
Second
Term
Definition
Div with span.
After the break.
Note: Neither indentation nor non-normalized whitespace is relevant.
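Since indentation and extra whitespace do not matter, two candidate outputs can be compared after normalizing them. The following is only a minimal sketch of one way to do that; `normalizeLines` is a hypothetical helper name, not something from the question or from jQuery.

```javascript
// Hypothetical helper (not part of the original question): reduces a text
// block to its non-blank lines, with whitespace runs collapsed to one space.
function normalizeLines(text) {
  return text
    .split(/\r?\n/)                               // one entry per line
    .map(function (line) {
      return line.replace(/\s+/g, ' ').trim();    // collapse runs, drop indentation
    })
    .filter(function (line) {
      return line.length > 0;                     // ignore blank lines
    })
    .join('\n');
}

var a = '  Header\n\n   Paragraph  \n\tDiv   with span.\n';
var b = 'Header\nParagraph\nDiv with span.';
console.log(normalizeLines(a) === normalizeLines(b)); // true
```

With this kind of comparison, an extraction routine counts as correct as long as it puts the line breaks in the right places, regardless of blank lines or leading spaces.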
Consider:
// Assumption: this code runs inside a constructor or similar scope, where
// _this keeps a reference to the enclosing object for the recursive calls.
var _this = this;

/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  // Old Internet Explorer exposes currentStyle; other browsers use getComputedStyle.
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle( n, null ).getPropertyValue( p );
};

/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';

  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /\s+/g, ' ' );
  }
  else {
    // Collect the text of all children first.
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }

    var d = _this.getStyle( node, 'display' );

    // End the line after block-like elements and explicit breaks.
    if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
        node.tagName == 'BR' || node.tagName == 'HR' ) {
      result += '\n';
    }
  }

  return result;
};
That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.
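To try the traversal outside a browser, the same idea can be sketched against plain objects shaped like DOM nodes. This is an illustration under stated assumptions, not the answer's method: the fixed `BREAK_AFTER` tag list stands in for the computed `display` lookup, and the `el` and `text` helpers are hypothetical names used only to build the sample tree.

```javascript
// Node type values, matching the DOM's Node.ELEMENT_NODE and Node.TEXT_NODE.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Assumption: a hard-coded tag list replaces the computed 'display' check.
var BREAK_AFTER = ['H1', 'P', 'UL', 'LI', 'DL', 'DT', 'DD', 'DIV', 'BR', 'HR'];

// Hypothetical helpers that build plain objects shaped like DOM nodes.
function el(tagName, children) {
  return { nodeType: ELEMENT_NODE, tagName: tagName, childNodes: children || [] };
}
function text(value) {
  return { nodeType: TEXT_NODE, nodeValue: value, childNodes: [] };
}

// Depth-first walk: emit text nodes, then a newline after block-like elements.
function toText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue.replace(/\s+/g, ' ');
  }
  var result = '';
  for (var i = 0; i < node.childNodes.length; i++) {
    result += toText(node.childNodes[i]);
  }
  if (BREAK_AFTER.indexOf(node.tagName) !== -1) {
    result += '\n';
  }
  return result;
}

var tree = el('DIV', [
  el('H1', [text('Header')]),
  el('UL', [el('LI', [text('First')]), el('LI', [text('Second')])]),
  el('DIV', [
    text('Div with '), el('SPAN', [text('span')]), text('.'),
    el('BR'), text('After the break.')
  ])
]);

console.log(toText(tree));
```

Nested block elements (an li inside a ul, for example) each add their own newline, so the output contains some blank lines. That is acceptable here, since the question states that non-normalized whitespace is not relevant.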