Problem description
We are writing Java code that uses Apache POI to read Word documents (.docx) into our program. We get stuck when the document contains formulas and chemical equations. We have managed to read the formulas, but we have no idea how to find their position within the surrounding string.
INPUT (format is *.docx)
text before formulae **CHEMICAL EQUATION** text after
OUTPUT we produce (format shall be HTML)
text before formulae text after **CHEMICAL EQUATION**
We are unable to fetch the string and reconstruct it in its original form.
Question
Is there any way to locate the position of images and formulas within the stripped line, so that they can be restored to their original place when the string is reconstructed, instead of being appended at the end of the string?
Recommended answer
If the needed format is HTML, then Word text content together with Office MathML equations can be read in the following way.
In Reading equations & formula from Word (Docx) to html and save database using java I provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context.
If one wants to get those OMath elements in context with the other elements in the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.
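Before the full POI example, the core idea (visit the paragraph's children in document order so a formula stays between the text before and after it) can be sketched with the JDK's own DOM API. This is only an illustration under assumptions: the paragraph XML is hand-written, the class name DocumentOrderSketch, the method flatten and the "[FORMULA]" placeholder are hypothetical, and only direct children of w:p are inspected, whereas real documents may nest oMath deeper (for example inside m:oMathPara).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DocumentOrderSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Walks the direct children of a w:p element in document order and
    // emits run text and a formula placeholder at the position where each occurs.
    static String flatten(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to compare namespace URIs and local names
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(paragraphXml)));
        Element paragraph = doc.getDocumentElement();

        StringBuilder out = new StringBuilder();
        for (Node child = paragraph.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (W_NS.equals(child.getNamespaceURI()) && "r".equals(child.getLocalName())) {
                // w:r is a text run: append its text
                out.append(child.getTextContent());
            } else if (M_NS.equals(child.getNamespaceURI()) && "oMath".equals(child.getLocalName())) {
                // m:oMath is a formula: a real converter would append MathML here
                out.append("[FORMULA]");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
            + "<w:r><w:t xml:space=\"preserve\">text before </w:t></w:r>"
            + "<m:oMath><m:r><m:t>H2O</m:t></m:r></m:oMath>"
            + "<w:r><w:t xml:space=\"preserve\"> text after</w:t></w:r>"
            + "</w:p>";
        System.out.println(flatten(xml)); // prints "text before [FORMULA] text after"
    }
}
```

Because the output is built in the same order as the XML children, the formula's index in the string is simply the length of the output at the moment the oMath element is reached.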
The transformation from Office MathML into MathML uses the same XSLT approach as in Reading equations & formula from Word (Docx) to html and save database using java. That answer also describes where the OMML2MML.XSL comes from.
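The mechanics of that XSLT step can be tried without Word or POI. The sketch below is an assumption-laden stand-in: it uses a tiny inline stylesheet (not the real OMML2MML.XSL, which handles far more) that maps one m:r inside m:oMath to a MathML mi element, reads the input from a string instead of a DOM node, and then strips the mml: prefix the same way the answer's code does. The names XsltSketch, TOY_XSL and toHtmlMathML are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Toy stylesheet standing in for OMML2MML.XSL (assumption, for illustration only).
    static final String TOY_XSL =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:m=\"" + M_NS + "\""
        + " xmlns:mml=\"http://www.w3.org/1998/Math/MathML\">"
        + "<xsl:template match=\"m:oMath\">"
        + "<mml:math><xsl:apply-templates select=\"m:r\"/></mml:math>"
        + "</xsl:template>"
        + "<xsl:template match=\"m:r\">"
        + "<mml:mi><xsl:value-of select=\"m:t\"/></mml:mi>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms an OMML fragment and removes the XML namespace prefixes
    // so the MathML can be embedded directly in HTML.
    static String toHtmlMathML(String omml) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(TOY_XSL)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter writer = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(omml)), new StreamResult(writer));

        String mathML = writer.toString();
        mathML = mathML.replace("xmlns:m=\"" + M_NS + "\"", ""); // drop the unused OMML namespace
        mathML = mathML.replace("xmlns:mml", "xmlns");            // make MathML the default namespace
        mathML = mathML.replace("mml:", "");                      // drop the element prefix
        return mathML;
    }

    public static void main(String[] args) throws Exception {
        String omml = "<m:oMath xmlns:m=\"" + M_NS + "\"><m:r><m:t>H2O</m:t></m:r></m:oMath>";
        System.out.println(toHtmlMathML(omml));
    }
}
```

String.replace is used here because the search strings are literals; the answer's replaceAll behaves the same for these patterns but interprets them as regular expressions.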
The file Formula.docx looks like this (screenshot not preserved):
Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.apache.xmlbeans.XmlCursor;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;
/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/
public class WordReadTextWithFormulasAsHTML {

    static File stylesheet = new File("OMML2MML.XSL");
    static TransformerFactory tFactory = TransformerFactory.newInstance();
    static StreamSource stylesource = new StreamSource(stylesheet);

    //method for getting MathML from oMath
    static String getMathML(CTOMath ctomath) throws Exception {
        Transformer transformer = tFactory.newTransformer(stylesource);

        Node node = ctomath.getDomNode();

        DOMSource source = new DOMSource(node);
        StringWriter stringwriter = new StringWriter();
        StreamResult result = new StreamResult(stringwriter);
        transformer.setOutputProperty("omit-xml-declaration", "yes");
        transformer.transform(source, result);

        String mathML = stringwriter.toString();
        stringwriter.close();

        //The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
        //We don't need this since we want to use the MathML in HTML, not in XML.
        //So ideally we should change the OMML2MML.XSL to not do so.
        //But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
        mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        mathML = mathML.replaceAll("xmlns:mml", "xmlns");
        mathML = mathML.replaceAll("mml:", "");

        return mathML;
    }

    //method for getting HTML including MathML from XWPFParagraph
    static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {

        StringBuffer textWithFormulas = new StringBuffer();

        //using a cursor to go through the paragraph from top to bottom
        XmlCursor xmlcursor = paragraph.getCTP().newCursor();
        while (xmlcursor.hasNextToken()) {
            XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
            if (tokentype.isStart()) {
                if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
                    //elements w:r are text runs within the paragraph
                    //simply append the text data
                    textWithFormulas.append(xmlcursor.getTextValue());
                } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
                    //we have oMath
                    //append the oMath as MathML
                    textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
                }
            } else if (tokentype.isEnd()) {
                //we have to check whether we are at the end of the paragraph
                xmlcursor.push();
                xmlcursor.toParent();
                if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
                    break;
                }
                xmlcursor.pop();
            }
        }

        return textWithFormulas.toString();
    }

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

        //using a StringBuffer for appending all the content as HTML
        StringBuffer allHTML = new StringBuffer();

        //loop over all IBodyElements - should be self-explanatory
        for (IBodyElement ibodyelement : document.getBodyElements()) {
            if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
                XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
                allHTML.append("<p>");
                allHTML.append(getTextAndFormulas(paragraph));
                allHTML.append("</p>");
            } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
                XWPFTable table = (XWPFTable)ibodyelement;
                allHTML.append("<table border=1>");
                for (XWPFTableRow row : table.getRows()) {
                    allHTML.append("<tr>");
                    for (XWPFTableCell cell : row.getTableCells()) {
                        allHTML.append("<td>");
                        for (XWPFParagraph paragraph : cell.getParagraphs()) {
                            allHTML.append("<p>");
                            allHTML.append(getTextAndFormulas(paragraph));
                            allHTML.append("</p>");
                        }
                        allHTML.append("</td>");
                    }
                    allHTML.append("</tr>");
                }
                allHTML.append("</table>");
            }
        }

        document.close();

        //creating a sample HTML file
        String encoding = "UTF-8";
        FileOutputStream fos = new FileOutputStream("result.html");
        OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
        writer.write("<!DOCTYPE html>\n");
        writer.write("<html lang=\"en\">");
        writer.write("<head>");
        writer.write("<meta charset=\"utf-8\"/>");

        //using MathJax for helping all browsers to interpret MathML
        writer.write("<script type=\"text/javascript\"");
        writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
        writer.write(">");
        writer.write("</script>");

        writer.write("</head>");
        writer.write("<body>");

        writer.write(allHTML.toString());

        writer.write("</body>");
        writer.write("</html>");
        writer.close();

        Desktop.getDesktop().browse(new File("result.html").toURI());
    }
}
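One thing the code above does not do is escape the run text before appending it to the HTML, so a run containing characters such as < or & could break the generated markup. A minimal sketch of a helper that could wrap xmlcursor.getTextValue() for the w:r case follows; the class HtmlEscapeSketch and the method escapeHtml are hypothetical names, not part of the original answer or of Apache POI, and the MathML produced by getMathML should not be passed through it.

```java
public class HtmlEscapeSketch {

    // Replaces the characters that have special meaning in HTML text and attribute values.
    static String escapeHtml(String text) {
        StringBuilder escaped = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': escaped.append("&amp;"); break;
                case '<': escaped.append("&lt;"); break;
                case '>': escaped.append("&gt;"); break;
                case '"': escaped.append("&quot;"); break;
                default: escaped.append(c);
            }
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("a < b & c")); // prints "a &lt; b &amp; c"
    }
}
```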
Result (screenshot not preserved):