Problem description
I was recently advised to use JSoup to parse and modify HTML documents.
But suppose I have an HTML document that I want to modify (to send it, store it somewhere else, etc.) without changing the original file. How would I do that?
Say I have an HTML file like this:
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: </p>
<p>Address: </p>
<p>Phone Number: </p>
</body>
</html>
I want to fill in the appropriate data for Name, Address, Phone Number, and any other fields, without modifying the original HTML file. How can I do that with JSoup?
@MarcoS has an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses a NodeTraversor to build a list of nodes to change. I modified his method only slightly: it replaces each text node with a new node that contains the original text plus whatever information you want to add.
To keep the HTML in memory, I used static StringBuilder objects.
First we read in the HTML file (the path is hard-coded here, but that is easy to change), then we run a series of checks and change every node that should receive data.
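The read step can also be sketched with java.nio.file.Files, which loads the whole file into a String in one call. This is only an illustration: the class name HtmlFileReader and the UTF-8 encoding are assumptions, not part of the original answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlFileReader {
    /** Reads the whole file into memory; the file itself is only read, never written. */
    static String read(Path file) throws IOException {
        // Assumption: the file is UTF-8 encoded.
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway file so the sketch runs without any setup.
        Path tmp = Files.createTempFile("demo", ".html");
        Files.write(tmp, "<p>Name: </p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(tmp));
        Files.delete(tmp);
    }
}
```

The returned String can be passed straight to Jsoup.parse, so every later change happens on the in-memory copy.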
The one problem in MarcoS's solution that I didn't fix is that it splits the text into individual words instead of looking at the whole line. I worked around this by joining multi-word labels with '-' (for example, "Phone-Number"), because otherwise the string is inserted directly after the matching word rather than after the full label.
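One way around the word-splitting limitation is to compare the whole label instead of each word. The sketch below is a hypothetical helper (LabelFiller, fill, and the values map are names invented for this illustration) and assumes each text node holds just a label ending in a colon, as in the sample file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LabelFiller {
    /**
     * If the text, ignoring surrounding whitespace, is exactly "<label>:"
     * and a value is known for that label, returns "<label>: <value>".
     * Otherwise the text is returned unchanged.
     */
    static String fill(String text, Map<String, String> values) {
        String trimmed = text.trim();
        if (!trimmed.endsWith(":")) {
            return text; // e.g. "Title: title" is left alone
        }
        String label = trimmed.substring(0, trimmed.length() - 1).trim();
        String value = values.get(label);
        return value == null ? text : label + ": " + value;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Name", "John Doe");
        values.put("Phone Number", "(123) 456-7890");
        System.out.println(fill("Phone Number: ", values));
        System.out.println(fill("Title: title", values));
    }
}
```

Such a helper could be applied to textNode.getWholeText() in place of the per-word loop, which would let "Phone Number" keep its space.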
So a full implementation:
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;

public class memoryHTML
{
    static String htmlLocation = "C:\\Users\\User\\";
    static String fileName = "blah"; // Just for demonstration, easily modified.
    static StringBuilder buildTmpHTML = new StringBuilder(); // HTML as read from disk
    static StringBuilder buildHTML = new StringBuilder();    // HTML after modification
    static String name = "John Doe";
    static String address = "42 University Dr., Somewhere, Someplace";
    static String phoneNumber = "(123) 456-7890";

    public static void main(String[] args)
    {
        // You can pass the full path including the file name instead. I split them up because I used this for multiple files.
        readHTML(htmlLocation, fileName);
        modifyHTML();
        System.out.println(buildHTML.toString());
        // The StringBuilders are static, so clear them before processing another file,
        // otherwise the next document is appended to the previous one.
        buildTmpHTML.setLength(0);
        buildHTML.setLength(0);
        System.exit(0);
    }

    // Reads the HTML file into buildTmpHTML; the in-memory copy is what modifyHTML() changes.
    public static void readHTML(String directory, String fileName)
    {
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html")))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                buildTmpHTML.append(line);
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
    // It has its small problems, but it does the trick.
    public static void modifyHTML()
    {
        String htmld = buildTmpHTML.toString();
        Document doc = Jsoup.parse(htmld);
        final List<TextNode> nodesToChange = new ArrayList<TextNode>();
        // Written against an older jsoup release; later releases provide the static
        // NodeTraversor.traverse(visitor, root) instead of this constructor.
        NodeTraversor nd = new NodeTraversor(new NodeVisitor()
        {
            @Override
            public void tail(Node node, int depth)
            {
                // Collect text nodes first and replace them after the traversal,
                // so the tree is not modified while it is being walked.
                if (node instanceof TextNode)
                {
                    TextNode textNode = (TextNode) node;
                    nodesToChange.add(textNode);
                }
            }

            @Override
            public void head(Node node, int depth)
            {
            }
        });
        nd.traverse(doc.body());
        for (TextNode textNode : nodesToChange)
        {
            Node newNode = buildElementForText(textNode);
            textNode.replaceWith(newNode);
        }
        buildHTML.append(doc.html());
    }

    private static Node buildElementForText(TextNode textNode)
    {
        String text = textNode.getWholeText();
        String[] words = text.trim().split(" ");
        Set<String> units = new HashSet<String>();
        for (String word : words)
            units.add(word);
        String newText = text;
        for (String rpl : units)
        {
            // replace() treats rpl as literal text; replaceAll() would interpret it as a regular expression.
            if (rpl.contains("Name"))
                newText = newText.replace(rpl, "" + rpl + " " + name);
            if (rpl.contains("Address") || rpl.contains("Residence"))
                newText = newText.replace(rpl, "" + rpl + " " + address);
            if (rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
                newText = newText.replace(rpl, "" + rpl + " " + phoneNumber);
        }
        // A DataNode is written out without HTML escaping.
        // Later jsoup releases take only the data argument: new DataNode(newText).
        return new DataNode(newText, textNode.baseUri());
    }
}
This produces the following HTML (remember that I changed "Phone Number" to "Phone-Number" in the input file):
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: John Doe </p>
<p>Address: 42 University Dr., Somewhere, Someplace</p>
<p>Phone-Number: (123) 456-7890</p>
</body>
</html>
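For comparison with the word-by-word approach, here is a minimal sketch that matches each paragraph's whole text through jsoup's selector API. It is not the method from MarcoS's answer: the class name SelectorFill, the fill method, and the values map are hypothetical, and it assumes each label sits alone in its own <p> and ends with a colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class SelectorFill {
    /** Parses the HTML into an in-memory Document, fills known labels, and returns the body HTML. */
    static String fill(String html, Map<String, String> values) {
        Document doc = Jsoup.parse(html); // only this in-memory tree is modified
        for (Element p : doc.select("p")) {
            String label = p.ownText().trim(); // e.g. "Phone Number:"
            if (!label.endsWith(":")) {
                continue; // empty paragraphs and non-label text are skipped
            }
            String value = values.get(label.substring(0, label.length() - 1));
            if (value != null) {
                p.text(label + " " + value); // text() escapes the value as plain text
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Name", "John Doe");
        values.put("Address", "42 University Dr., Somewhere, Someplace");
        values.put("Phone Number", "(123) 456-7890");
        String html = "<html><head></head><body><h2>Title: title</h2>"
                + "<p>Name: </p><p>Address: </p><p>Phone Number: </p></body></html>";
        System.out.println(fill(html, values));
    }
}
```

Because the label is compared as a whole, "Phone Number" needs no hyphen here, and since only the parsed Document changes, the file on disk stays as it was.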