java - How to get the contents of all scripts in an HTML page

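I am writing a Java program that extracts tags from web pages. I use Jsoup for parsing, and it works well, but the number of tags in the downloaded pages is inconsistent.
I have 4 files:

goog1.htm (saved from https://www.google.co.in through my browser)
goog2.html (downloaded with the command "wget https://www.google.co.in")
goog3.html (downloaded by a Java program using BufferedReader and InputStreamReader)
goog4.html (obtained by copying the entire source from 'view-source:https://www.google.co.in/')

When I search these 4 files for the string "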

Best answer

1) The number of script tags differs because an HTML page can define more than one script tag.

2) Every script tag in the page is loaded and will run. Whether you need to test all of the script code depends on the scope of your testing.

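3) If you handle the content as text in your Java program, you can get the contents of all script tags by parsing it with substring methods. However, I would suggest using the Apache Commons StringUtils class to do this.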

import org.apache.commons.lang.StringUtils;

public class ScriptContentRetriever {

    public static void main(String[] args) {
        String yourScriptContent = "<script>This is Script 1 Content</script><script>This is Script 2 Content</script>";
        // Returns every substring found between the two delimiters, or null if nothing matches.
        String[] scriptStrings = StringUtils.substringsBetween(yourScriptContent, "<script>", "</script>");
        if (scriptStrings == null) {
            return; // no <script>...</script> pair in the input
        }
        for (String scriptString : scriptStrings) {
            // Do whatever you want with the script content right here.
            System.out.println(scriptString);
        }
    }

}
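
One caveat with the literal "<script>" delimiter: it only matches opening tags written exactly that way, so tags such as <script type="text/javascript"> or <SCRIPT> would be skipped. Here is a minimal sketch of a variant that tolerates attributes and mixed case, using only java.util.regex. The class name ScriptExtractor, the method extractScripts, and the sample HTML are made up for illustration; they are not part of the original answer. It assumes no attribute value contains a '>' character. For real-world pages a proper HTML parser (for example Jsoup, which the question already uses, via select("script")) is likely the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for this sketch.
class ScriptExtractor {

    // Matches <script ...>body</script>, ignoring case and letting '.' span newlines.
    // Group 1 captures the text between the opening and closing tags.
    // Assumption: attribute values do not contain a '>' character.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the body of every script element found in the given HTML text.
    // External scripts (src="...") have no inline body and yield an empty string.
    static List<String> extractScripts(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = SCRIPT_BLOCK.matcher(html);
        while (matcher.find()) {
            bodies.add(matcher.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Made-up sample input covering attributes, upper case, and an external script.
        String html = "<script type=\"text/javascript\">var a = 1;</script>"
                + "<SCRIPT>\nvar b = 2;\n</SCRIPT>"
                + "<script src=\"x.js\"></script>";
        for (String body : extractScripts(html)) {
            System.out.println("[" + body + "]");
        }
    }
}
```

Because the returned list also contains an empty entry for each external script, its size is the number of script elements in the text, which makes it easy to compare the counts across several saved copies of the same page.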