Preserving indentation with Tesseract OCR 4.x

Question

I'm struggling with Tesseract OCR. I have an image of a blood test report that contains a table with indented rows. Although Tesseract recognizes the characters very well, the layout isn't preserved in the final output. For example, look at the indented lines below "Emocromo con formula" (English translation: blood count with formula). I want to preserve that indentation.

I read the other related discussions and found the option preserve_interword_spaces=1. The result got slightly better but, as you can see, it isn't perfect.

Any suggestions?

Update:

I tried Tesseract v5.0 and the result is the same.

Code:

Tesseract version is 4.0.0.20190314.

from PIL import Image
import pytesseract

# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'

# default_config = r'-c -l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)

print(extracted_text)

# saving to a txt file

with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)

Comparison of results

GITHUB:

I have created a GitHub repository if you want to try it yourself.

Thanks for your help and your time.

Recommended answer

The image_to_data() function provides much more information. For each word it returns its bounding rectangle, which you can use to reconstruct the layout.
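For orientation, with output_type=Output.DICT the result is a dictionary of parallel lists, one entry per detected element. The sketch below does not call Tesseract. It uses a small hand-written dictionary whose key names match what pytesseract returns, while the values are invented for illustration. The helper word_boxes is a hypothetical name, not part of pytesseract.

```python
# Hand-written stand-in for pytesseract.image_to_data(..., output_type=Output.DICT).
# Key names follow pytesseract's output; the values are invented, not real OCR results.
sample = {
    "block_num": [1, 1, 1],
    "line_num": [1, 1, 2],
    "left": [40, 120, 80],
    "top": [10, 10, 40],
    "width": [60, 20, 50],
    "height": [20, 20, 20],
    "conf": [96, 91, -1],
    "text": ["Glucosio", "95", ""],
}


def word_boxes(data):
    """Return (text, left, top, width, height) for every recognized word."""
    boxes = []
    for i, word in enumerate(data["text"]):
        # conf == -1 marks non-word rows (pages, blocks, lines); skip those and blanks.
        if word.strip() and float(data["conf"][i]) != -1:
            boxes.append((word, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


for box in word_boxes(sample):
    print(box)
```

The left and width values are in pixels, so they have to be converted to character columns before they can drive indentation in plain text.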

Tesseract automatically segments the image into blocks. You can sort the blocks by their vertical position and, for each block, find the mean character width (which depends on the block's recognized font). Then, for each word in the block, check whether it is close to the previous one; if not, add spaces accordingly. I'm using pandas to simplify the calculations, but it isn't required. Don't forget that the result should be displayed in a monospaced font.

import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

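# clean up blanks; conf may arrive as str, int or float depending on the pytesseract version
df1 = df[(pd.to_numeric(df['conf'], errors='coerce') != -1) & (df['text'].str.strip() != '')]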
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

This code produces the following output:

    ssseeess+ SERVIZIO SANITARIO REGIONALE                          Pagina 2 di3
   seoeeeees EMILIA-RROMAGNA
     ©2888   800
     ©9868  6 006   :       pe   ‘  ‘        "
     «ee @@e@ecee Azienda Unita Sanitaria Locale di Modena
     Seat se  ces Amends Ospedaliero-Universitaria Policlinico di Modena
         Dipartimento  interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
                                                  Direttore dr. T.Trenti
                                           Ospedale Civile S.Agostino-Estense
                                             S.C. Medicina  di Laboratorio
                                           S.S. Patologia  Clinica - Corelab
                            Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
                                              Responsabile dr.ssa M.Varani
        Richiesta (CDA):   49/073914                                    Data di accettazione: 18/12/2018
                                                                        Data di check-in:    18/12/2018 10:27:06
                                                                        Referto del          18/12/2018 16:39:53
                                                                        Provenienza:         D4-cp sassuolo

                                                           Sig.
                                                           Data di Nascita:
                                                           Domicilio:
          ANALISI                                              RISULTATO  __UNITA'DI MISURA VALORI DI RIFERIMENTO
       Glucosio                                                     95     mg/dl            (70  - 110 )
       Creatinina                                                 1.03     mg/dl            ( 0.50 - 1.40 )
       eGFR  Filtrato glomerulare stimato                         >60      ml/min           Cut-off per rischio di  I.R.
             7                                                                              <60. Il calcolo é€ riferito
       Equazione  CKD-EPI                                                                   ad una superfice corporea
                                                                                            Standard  (1,73 mq)x In Caso
                                                                                            di etnia afroamericana
                                                                                            moltiplicare per  il fattore
                                                                                            1,159.
       Colesterolo                                                212   *  mg/dl            < 200 v.desiderabile
       Trigliceridi                                                106     mg/dl            < 180 v.desiderabile
       Bilirubina totale                                          0.60     mg/dl            ( 0.16 - 1.10 )
       Bilirubina diretta                                         0.10     mg/dl            ( 0.01 - 0.3 )
       GOT  - AST                                                   17     U/L              (1-37)
       GPT  - ALT                                                   ay     U/L              (1-   40 )
       Gamma-GT                                                     15     U/L              (1-55)
       Sodio                                                       142     mEq/L            ( 136 - 146 )
       Potassio                                                    4.3     mEq/L            (3.5  - 5.3)
       Vitamina B12                                               342      pg/ml            ( 200 - 960 )
       TSH                                                        5.47  *  ulU/ml           (0.35  - 4.94 )
       FT4                                                         9.7     pg/ml            (7  = 15)
       Urine chimico fisico morfologico
          u-Colore                                     giallo paglierino
          u-Peso specifico                                       1.012                      ( 1.010 - 1.027  )
          u-pH                                                     5.5                      (5.5  - 6.5)
          u-Glucosio                                           assente     mg/dl            assente
          u-Proteine                                           assente     mg/dl            (0  -10 )
          u-Emoglobina                                         assente     mg/dl            assente
          u-Corpi chetonici                                    assente     mg/dl            assente
          u-Bilirubina                                         assente     mg/dl            assente
          u-Urobilinogeno                                         0.20     mg/dl            (0-   1.0 )
          sedimento                                    non significativo
                                                                                          Il Laureato:
                                                                                                     Dott. CRISTINA ROTA
       Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
       Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
       D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
       Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
       i! Laureato: Dr. CRISTINA ROTA
       1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
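Since the answer notes that pandas is not strictly necessary, the core idea (convert each word's pixel offset into a character column using the mean character width, then pad with spaces) can also be sketched in plain Python. This is a simplified variant under stated assumptions, not the answer's exact code: mean_char_width and layout_line are hypothetical helper names, and the input is assumed to be one line's words already sorted left to right.

```python
def mean_char_width(words, min_len=4):
    """Estimate pixels per character from (text, width_px) pairs.

    Short words are skipped because their widths are noisy
    (the answer's code likewise uses only words longer than 3 characters).
    """
    ratios = [width / len(text) for text, width in words if len(text) >= min_len]
    return sum(ratios) / len(ratios) if ratios else None


def layout_line(words, char_w):
    """Rebuild one text line from (text, left_px) pairs sorted left to right."""
    line = ""
    for text, left in words:
        target_col = int(left / char_w)  # column where the word should start
        if target_col > len(line):
            line += " " * (target_col - len(line))  # pad up to that column
        elif line:
            line += " "  # words overlap after rounding: keep a single space
        line += text
    return line


# Invented measurements for illustration, not real OCR output.
char_w = mean_char_width([("Glucosio", 80), ("95", 30)])
row = layout_line([("Glucosio", 70), ("95", 610), ("mg/dl", 680)], char_w)
print(repr(row))
```

Padding to an absolute column, rather than counting spaces relative to the previous word, keeps rounding errors from accumulating across a row. As with the answer's code, the result only lines up when viewed in a monospaced font.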

