Question
I use the following to read a PDF file and get the text strings of a page:
my $pdf = CAM::PDF->new($pdf_file);
my $pagetree = $pdf->getPageContentTree($page_no);
# Get all text strings of the page
# MyRenderer is a separate package which implements getTextBlocks and
# renderText methods
my @text = $pagetree->traverse('MyRenderer')->getTextBlocks;
Now @text holds all the text strings, along with the starting x,y position of each one.
How can I get the width (and possibly the height) of each string?
The MyRenderer package is as follows:
package MyRenderer;
use base 'CAM::PDF::GS';
sub new {
    my ($pkg, @args) = @_;
    my $self = $pkg->SUPER::new(@args);
    $self->{refs}->{text} = [];
    return $self;
}
sub getTextBlocks {
    my ($self) = @_;
    return @{$self->{refs}->{text}};
}
sub renderText {
    my ($self, $string, $width) = @_;
    my ($x, $y) = $self->textToDevice(0,0);
    push @{$self->{refs}->{text}}, {
        str => $string,
        left => $x,
        bottom => $y,
        right => $x + $width,
    };
    return;
}
Update 1: There's a function getStringWidth($fontmetrics, $string) in CAM::PDF. Although that function takes a $fontmetrics parameter, it returns the same value for a given string irrespective of what I pass for that parameter.
Also, I am not sure what unit of measure the returned value uses.
Update 2: I changed the renderText function to the following:
sub renderText {
    my ($self, $string, $width) = @_;
    my ($x, $y) = $self->textToDevice(0,0);
    push @{$self->{refs}->{text}}, {
        str => $string,
        left => $x,
        bottom => $y,
        right => $x + ($width * $self->{Tfs}),
        font => $self->{Tf},
        font_size => $self->{Tfs},
    };
    return;
}
Note that in addition to recording font and font_size, I multiplied $width by the font size to get the real width of the string.
Now the only thing missing is the height.
Recommended answer
getStringWidth() depends heavily on the font metrics you provide. If it can't find the character widths in that data structure, it falls back to the following code:
if ($width == 0)
{
   # HACK!!!
   #warn "Using klugy width!\n";
   $width = 0.2 * length $string;
}
This may be what you're seeing. When I wrote that, I thought it was better than returning 0. If your font metrics seem good and you think there's a bug in CAM::PDF, feel free to post more details and I'll take a look.
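To illustrate why the result can look independent of the metrics argument, here is a minimal standalone sketch. It is not CAM::PDF's actual code: the helper sketch_string_width and its plain-hash metrics layout (FirstChar plus a Widths array of numbers in 1/1000 of a text-space unit) are assumptions made for illustration, and the sample widths are invented. Only the final fallback mirrors the snippet quoted above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a width lookup; NOT CAM::PDF's implementation.
# $metrics is assumed to be a plain hash such as
#   { FirstChar => 65, Widths => [ 722, 667, 667 ] }
# with widths expressed in 1/1000 of a text-space unit.
sub sketch_string_width {
    my ($metrics, $string) = @_;
    return 0 if !defined $string || $string eq q{};

    my $width = 0;
    if (ref($metrics) eq 'HASH' && ref($metrics->{Widths}) eq 'ARRAY') {
        my $first = $metrics->{FirstChar} || 0;
        for my $code (unpack 'C*', $string) {
            my $i = $code - $first;
            # Characters outside the table contribute nothing in this sketch.
            next if $i < 0 || $i > $#{ $metrics->{Widths} };
            $width += $metrics->{Widths}[$i] || 0;
        }
        $width /= 1000;    # convert from 1/1000 units
    }

    # Same fallback as quoted above: with no usable widths, the result
    # depends only on the length of the string.
    if ($width == 0) {
        $width = 0.2 * length $string;
    }
    return $width;
}

my %metrics = (FirstChar => 65, Widths => [ 722, 667, 667 ]);    # made-up widths for A, B, C

printf "with metrics: %.3f\n", sketch_string_width(\%metrics, 'ABC');
printf "no metrics:   %.3f\n", sketch_string_width(undef, 'ABC');
printf "wrong shape:  %.3f\n", sketch_string_width({ foo => 1 }, 'ABC');
```

The last two calls return the same value because both fall through to the length-based guess. If every metrics value you try gives an identical width, it is worth checking whether the structure you pass actually contains per-character widths.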