Problem description
I need to extract Malayalam text from a PDF (Identity-H encoding). When I extract the text with an iTextSharp text extraction strategy, some special characters come out along with the text.
What I have tried:
I have tried this:
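// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
// "text" was not declared in the original snippet; a StringBuilder is assumed here.
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    // Use a fresh strategy per page so text from earlier pages is not carried over.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(currentText);
}
pdfReader.Close();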
I have also tried:
// Encoding.Convert decodes "bytes" as Encoding.Default and re-encodes the result as UTF-8.
byte[] bytes = Encoding.UTF8.GetBytes(ParseText);
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);
string final = Encoding.UTF8.GetString(converted);
Recommended answer