Why do I get extra characters (an arrow symbol) when fetching text via Tesseract?

Problem description

Whenever I fetch text in any language, the output contains an extra character (an arrow symbol) that is not in the image. I'd like to understand why it is there and how to keep these extra characters out of the output.

Recommended answer

That's most likely the implicit page separator \f (the form feed character, U+000C), which Notepad renders as that arrow. For more details on the topic, see: What page separators are used in txt output by Tesseract 4.0.0?
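To confirm that the stray symbol really is a form feed, it can help to look at the raw characters instead of trusting how an editor draws them. The following is a minimal sketch in Python; the `sample` string is a made-up stand-in for OCR output, and `find_form_feeds` is a hypothetical helper name, not part of Tesseract.

```python
def find_form_feeds(text):
    """Return the index of every form feed (U+000C) found in text."""
    return [i for i, ch in enumerate(text) if ch == "\f"]


# Hypothetical OCR output: recognized text followed by a page separator.
sample = "Hello world\n\f"

# repr() prints control characters as escapes, so the form feed becomes visible.
print(repr(sample))
print(find_form_feeds(sample))
```

If the list is non-empty, the extra character is the page separator rather than something misread from the image.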

You can try adding -c page_separator="" to your config; that symbol should then no longer appear in your output. Note that this also disables page breaks entirely.
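As a rough illustration of the suggestion above, here is a sketch that calls the tesseract command-line tool from Python with the page separator set to an empty string. It assumes the `tesseract` binary is on the PATH; the function names, the default language, and any image path passed in are illustrative assumptions, not taken from the original answer. `strip_form_feeds` is an alternative that removes the character after the fact if you would rather leave the config alone.

```python
import subprocess


def build_command(image_path, lang="eng"):
    """Build a tesseract call that writes text to stdout without page separators."""
    # In a shell, -c page_separator="" reaches tesseract as the single
    # argument "page_separator=", which is what is passed here.
    return ["tesseract", image_path, "stdout", "-l", lang, "-c", "page_separator="]


def run_ocr(image_path, lang="eng"):
    """Run tesseract (assumed to be installed) and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def strip_form_feeds(text):
    """Fallback: remove form feed characters from text that was already produced."""
    return text.replace("\f", "")
```

Calling `run_ocr` with the path to one of your own images should return the text without the trailing arrow. With the separator disabled, multi-page output no longer carries any marker between pages, so keep the default if you need to split the text by page later.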

07-23 11:09