Is there any way to extract Malayalam text (Identity-H encoding) from a PDF in C#?

Problem description


I need to extract Malayalam text from a PDF. When I extract the text with an iTextSharp text extraction strategy, some special characters come through along with the text.

What I have tried:

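I have tried this:
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// "text" was not declared in the original snippet; a StringBuilder is assumed.
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(fileName);

// iTextSharp page numbers are 1-based.
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    // Use a fresh extraction strategy for each page.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

    text.Append(currentText);
}
pdfReader.Close();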

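I have also tried:
// ParseText is not defined in the snippet; it presumably holds the extracted text.
byte[] bytes = Encoding.UTF8.GetBytes(ParseText);
// Reinterpret the UTF-8 bytes as the system default code page and convert them back to UTF-8.
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);
string final = Encoding.UTF8.GetString(converted);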

Recommended answer



10-25 03:08