Problem description
I am working on a project that requires me to get prices from a commodity exchange. Unfortunately, the exchange has no web service or other plugin that would let me read the prices from the trading screen.
I figured that I could automatically take a screenshot of the prices and split it into one image per price. I then process these images with the Pytesser V 0.0.1 library for Tesseract 3.0.2, combined with Pillow 3.1.0, in Python v2.7. However, the image-to-text conversion (via the image_to_string function) is very poor: in most cases a 0 becomes an o or a 5 becomes an s, and sometimes the errors look random, which makes it difficult to simply replace these characters afterwards. I have already resized the images to a larger size and used anti-aliasing, but the results did not improve. Is there a way to limit the character set to digits and a dot for decimals? And how can the quality of the conversion be improved?
Perhaps my method is too cumbersome and you know a better way to do it? Your help is appreciated :)
Yes! You can restrict the character set with the pyslibtesseract package:
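from pyslibtesseract import TesseractConfig, PageSegMode

# Treat each cropped price image as a single line of text.
config_line = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE)
# Only allow digits and the decimal point in the recognized output.
config_line.add_variable('tessedit_char_whitelist', '0123456789.')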
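Even with a whitelist, a misread can still slip through (for example a dropped digit or a stray extra dot), so it may be worth validating each result before treating it as a price. Below is a minimal sketch using only the standard library. It assumes Python 3 rather than the 2.7 mentioned in the question, and `parse_price` is a hypothetical helper name, not part of pyslibtesseract or Pytesser.

```python
import re
from decimal import Decimal
from typing import Optional

# Digits, optionally followed by one dot and more digits (e.g. "1052.75").
PRICE_PATTERN = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def parse_price(ocr_text: str) -> Optional[Decimal]:
    """Return the OCR result as a Decimal, or None if it is not a clean price."""
    # OCR output often carries trailing newlines or spaces.
    cleaned = ocr_text.strip()
    if PRICE_PATTERN.fullmatch(cleaned) is None:
        return None
    return Decimal(cleaned)
```

Rejecting anything that is not a well-formed number, instead of guessing replacements such as o for 0, keeps a bad read from silently turning into a wrong price; a rejected image can be captured and processed again.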
As for improving the conversion quality: you need to use OpenCV to improve the image quality before running the OCR.