Reading equations and formulas from Word (Docx) to HTML and saving them to a database using Java

Question

I have a Word (.docx) file that contains equations like the ones in the images below.

I want to read the data from the Word (.docx) file and save it to my database, so that when needed I can fetch it from the database and show it on my HTML page. I used Apache POI to read the data from the .docx file, but it cannot extract the equations. Please help me!

Recommended answer

Word *.docx files are ZIP archives containing XML files in the Office Open XML format. The formulas contained in Word *.docx documents are stored as Office Math Markup Language (OMML).
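To make the "ZIP archive of XML files" point concrete, here is a minimal sketch that lists the entries of a ZIP stream using only the JDK. The class name DocxZipSketch and the helper sampleArchive() are made up for illustration; the sample archive only imitates the layout of a .docx and is not a valid Word file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxZipSketch {

    // Returns the names of all entries in a ZIP stream, in archive order.
    public static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Hypothetical helper: builds a tiny in-memory ZIP with two entry names
    // that a real .docx also contains. The entry bodies are placeholders.
    public static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // With a real file, pass new FileInputStream("Formula.docx") instead.
        for (String name : listEntries(new ByteArrayInputStream(sampleArchive()))) {
            System.out.println(name);
        }
    }
}
```

In a real document the main body, including any OMML, lives in the word/document.xml entry.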

Unfortunately, this XML format is not well known outside Microsoft Office, so it is not directly usable in HTML, for example. Fortunately, it is XML and can therefore be transformed with XSLT. So we can transform OMML into MathML, for example, which is usable in a wider range of use cases.

An XSLT transformation is driven by an XSL stylesheet that defines the transformation. Unfortunately, creating such a stylesheet is not easy either. Fortunately, Microsoft has already done so: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%\. If you cannot find it, search the web for it.
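The mechanics of an XSLT transformation in Java can be sketched with a toy stylesheet. The stylesheet below is made up for illustration and is far simpler than OMML2MML.XSL; it only rewrites a hypothetical <frac> element as a MathML-like <mfrac>. The class name XsltSketch is likewise an assumption. The javax.xml.transform calls are the same ones you would use with the real stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {

    // Toy stylesheet (hypothetical): maps <frac><num>..</num><den>..</den></frac>
    // to <mfrac><mn>..</mn><mn>..</mn></mfrac>.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"frac\">"
      + "<mfrac><mn><xsl:value-of select=\"num\"/></mn>"
      + "<mn><xsl:value-of select=\"den\"/></mn></mfrac>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the toy stylesheet to an XML string and returns the result as a string.
    public static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            // Leave out the <?xml ...?> header so the result can be embedded in HTML.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Prints the <mfrac> element produced from the <frac> input.
        System.out.println(transform("<frac><num>1</num><den>2</den></frac>"));
    }
}
```

To use the real stylesheet, the StreamSource would wrap the OMML2MML.XSL file instead of the inline string.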

Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and save that for later use.

My example stores the formulas it finds as MathML in an ArrayList of strings. You should also be able to store these strings in your database.
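Storing the strings in a database could look roughly like the following JDBC sketch. The table formulas, its columns id and mathml, and the class name FormulaStoreSketch are all assumptions for illustration; the question does not say which database or schema is used, and a matching JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;

class FormulaStoreSketch {

    // Hypothetical schema: CREATE TABLE formulas (id INT PRIMARY KEY, mathml TEXT)
    static final String INSERT_SQL = "INSERT INTO formulas (id, mathml) VALUES (?, ?)";

    // Inserts every MathML string in one batch and returns the number of
    // statements sent. The caller owns (and closes) the connection.
    public static int saveAll(Connection connection, List<String> mathMLList) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            int id = 1;
            for (String mathML : mathMLList) {
                statement.setInt(1, id++);
                statement.setString(2, mathML);
                statement.addBatch();
            }
            return statement.executeBatch().length;
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            System.out.println("usage: java FormulaStoreSketch <jdbc-url>");
            return;
        }
        try (Connection connection = DriverManager.getConnection(args[0])) {
            int saved = saveAll(connection, Arrays.asList("<math><mn>1</mn></math>"));
            System.out.println("saved " + saved + " formula(s)");
        }
    }
}
```

Using a PreparedStatement with placeholders avoids having to escape the quotes and angle brackets in the MathML by hand.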

The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not shipped with the smaller poi-ooxml-schemas jar.

Word document:

Java code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadFormulas {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet);

 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The stock OMML2MML.XSL emits MathML as XML with namespace prefixes and declarations.
  //We don't need these because we want to use the MathML in HTML, not in XML.
  //Ideally we would change OMML2MML.XSL so that it does not emit them.
  //To keep this example as simple as possible, we strip them with literal string replacements instead.
  mathML = mathML.replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replace("xmlns:mml", "xmlns");
  mathML = mathML.replace("mml:", "");

  return mathML;
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //storing the found MathML in an ArrayList of strings
  List<String> mathMLList = new ArrayList<String>();

  //getting the formulas out of all body elements
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
     mathMLList.add(getMathML(ctomath));
    }
    for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
     for (CTOMath ctomath : ctomathpara.getOMathList()) {
      mathMLList.add(getMathML(ctomath));
     }
    }
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement;
    for (XWPFTableRow row : table.getRows()) {
     for (XWPFTableCell cell : row.getTableCells()) {
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
        mathMLList.add(getMathML(ctomath));
       }
       for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
        for (CTOMath ctomath : ctomathpara.getOMathList()) {
         mathMLList.add(getMathML(ctomath));
        }
       }
      }
     }
    }
   }
  }

  document.close();

  //creating a sample HTML file
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");
  writer.write("<p>The following formulas were found in the Word document: </p>");

  int i = 1;
  for (String mathML : mathMLList) {
   writer.write("<p>Formula" + i++ + ":</p>");
   writer.write(mathML);
   writer.write("<p/>");
  }

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}

Result:
