问题描述
我在Tesseract OCR方面苦苦挣扎.我有一个血液检查图像,它有一个带有压痕的桌子.尽管tesseract能够很好地识别字符,但最终输出中并未保留其结构.例如,查看缩进的"Emocromo con公式"(英语翻译:带有公式的血球计数)下面的行.我想保留该缩进.
I'm struggling with Tesseract OCR.I have a blood examination image, it has a table with indentation. Although tesseract recognizes the characters very well, its structure isn't preserved in the final output. For example, look the lines below "Emocromo con formula" (Eng. Translation: blood count with formula) that are indented. I want to preserve that indentation.
我阅读了其他相关讨论,找到了选项preserve_interword_spaces=1
.结果略有改善,但是如您所见,它并不完美.
I read the other related discussions and I found the option preserve_interword_spaces=1
. The result became slightly better but as you can see, it isn't perfect.
有什么建议吗?
更新:
我尝试过Tesseract v5.0,结果是相同的.
I tried Tesseract v5.0 and the result is the same.
代码:
Tesseract版本为4.0.0.20190314
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
text_file.write(extracted_text)
比较结果
GITHUB:
我创建了一个 GitHub 存储库自己尝试一下.
I have created a GitHub repository if you want to try it yourself.
感谢您的帮助和时间
推荐答案
image_to_data()
函数提供了更多信息.对于每个单词,它将返回其边界矩形.您可以使用它.
image_to_data()
function provides much more information. For each word it will return it's bounding rectangle. You can use that.
Tesseract
自动将图像分割为块.然后,您可以按块的垂直位置对块进行排序,对于每个块,您可以找到平均字符宽度(取决于块的识别字体).然后,对于块中的每个单词,检查它是否与前一个单词接近,如果没有,则相应地添加空格.我正在使用pandas
来简化计算,但是没有必要使用它.不要忘记,结果应使用等宽字体显示.
Tesseract
segments the image automatically to blocks. Then you can sort block by their vertical position and for each block you can find mean character width (that depends on the block's recognized font). Then for each word in the block check if it is close to the previous one, if not add spaces accordingly. I'm using pandas
to ease on calculations, but it's usage is not necessary. Don't forget that the result should be displayed using monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# clean up blanks
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
curr = df1[df1['block_num']==block]
sel = curr[curr.text.str.len()>3]
char_w = (sel.width/sel.text.str.len()).mean()
prev_par, prev_line, prev_left = 0, 0, 0
text = ''
for ix, ln in curr.iterrows():
# add new line when necessary
if prev_par != ln['par_num']:
text += '\n'
prev_par = ln['par_num']
prev_line = ln['line_num']
prev_left = 0
elif prev_line != ln['line_num']:
text += '\n'
prev_line = ln['line_num']
prev_left = 0
added = 0 # num of spaces that should be added
if ln['left']/char_w > prev_left + 1:
added = int((ln['left'])/char_w) - prev_left
text += ' ' * added
text += ln['text'] + ' '
prev_left += len(ln['text']) + added + 1
text += '\n'
print(text)
此代码将产生以下输出:
This code will produce following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
«ee @@e@ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo é€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
i! Laureato: Dr. CRISTINA ROTA
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
这篇关于使用Tesseract OCR 4.x保留缩进的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!