Question
Is Apache Tika able to extract text in foreign languages such as Chinese and Japanese?
I have the following code:
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
InputStream stream = new ByteArrayInputStream(bytes);
OutputStream outputstream = new ByteArrayOutputStream();
ContentHandler textHandler = new BodyContentHandler(outputstream);
Metadata metadata = new Metadata();
// Set<String> langs = LanguageIdentifier.getSupportedLanguages();
// metadata.set(Metadata.CONTENT_LANGUAGE, lang);
// metadata.set(Metadata.FORMAT, hint);
ParseContext context = new ParseContext();
try {
    parser.parse(stream, textHandler, metadata, context);
    String extractedText = outputstream.toString();
    return extractedText;
} catch (IOException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (TikaException e) {
    e.printStackTrace();
}
If the input is a doc file that contains Chinese characters, each Chinese character is extracted as "?".
Thanks a lot!
Recommended answer
Apache Tika is able to extract Unicode text from its supported file formats. As long as the file format can store Unicode text (e.g. Chinese or Japanese characters), Apache Tika can extract it.
Tika also includes a number of unit tests that verify this works. One such test uses a sample Chinese email (testMSG_chinese.msg). If we run the Tika command-line app on it and grab the first few lines, we see it working:
$ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
Alfresco MSG format testing ( MSG 格式測試 )
From
Tests Chang@FT (張毓倫)
To
Tests Chang@FT (張毓倫)
Recipients
[email protected]
Or with this Japanese file:
$ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
ゾルゲの処刑記録、
ゾルゲと尾崎、淡々と最期
You just need to ensure that any text output you generate is stored in a suitable encoding (e.g. UTF-8), and that the font you use to display it supports those glyphs!
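To illustrate the encoding point, here is a minimal sketch that uses only the Java standard library and does not call Tika at all. The class name EncodingDemo and the method roundTrip are hypothetical names introduced for this example. It stores a short Chinese string as bytes and reads it back, once with UTF-8 and once with US-ASCII, which cannot represent Chinese characters. Whether this is the exact cause of the "?" output in the question depends on the Tika version and the platform default charset, so treat it as an illustration of the failure mode rather than a diagnosis.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; it is not part of Tika.
class EncodingDemo {

    // Encode the text to bytes with the given charset, then decode those
    // bytes with the same charset. Characters the charset cannot represent
    // are replaced during encoding (with '?' for US-ASCII).
    static String roundTrip(String text, Charset charset) {
        byte[] stored = text.getBytes(charset);
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters (中文), written as Unicode escapes so the
        // example does not depend on the source file encoding.
        String extracted = "\u4E2D\u6587";

        // UTF-8 can represent every Unicode character: the round trip
        // returns a string equal to the original.
        String viaUtf8 = roundTrip(extracted, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(extracted));

        // US-ASCII cannot represent these characters: each one is replaced,
        // so the round trip returns "??".
        String viaAscii = roundTrip(extracted, StandardCharsets.US_ASCII);
        System.out.println(viaAscii);
    }
}
```

The same reasoning applies wherever text crosses a byte boundary: pick a charset that covers the characters (UTF-8 is a common choice) and name it explicitly when both writing and reading, instead of relying on the platform default.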