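Installing and configuring tesseract-OCR on a Mac
About tesseract
OCR (Optical Character Recognition) is a technique for recognizing the text in an image and extracting it as plain text.
The tesseract-ocr engine was originally developed at HP Labs. It later became an open-source project, and most of the subsequent improvement and optimization has been done by Google.
Installation steps
Installing Homebrew
Homebrew is a package manager for macOS, similar to apt-get on Ubuntu or yum on CentOS. Installing it takes a single command: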
    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Once the installation finishes, you can verify it with brew -v:
    Homebrew 1.3.1
    Homebrew/homebrew-core (git revision 0290; last commit 2017-08-23)
Installing tesseract
    brew install --with-training-tools tesseract  # also installs the training tools, which are needed later for building custom language data
Once the installation finishes, verify it with tesseract -v:
    tesseract 3.05.01
     leptonica-1.74.4
      libjpeg 9b : libpng 1.6.31 : libtiff 4.0.8 : zlib 1.2.8
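If you want to check the installed version from a script rather than by eye, the following is a minimal sketch. It assumes Python 3 and that the first line of the output has the form `tesseract X.Y.Z`, as in the output shown here. The function names `parse_tesseract_version` and `installed_tesseract_version` are made up for this illustration and are not part of any library.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Extract (major, minor, patch) from `tesseract -v` output, or None."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def installed_tesseract_version():
    """Run `tesseract -v` if the binary is on PATH and parse its output."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: the banner may go to stdout or stderr depending on the
    # version, so both streams are searched.
    return parse_tesseract_version(result.stdout + result.stderr)


sample = (
    "tesseract 3.05.01\n"
    " leptonica-1.74.4\n"
    "  libjpeg 9b : libpng 1.6.31 : libtiff 4.0.8 : zlib 1.2.8"
)
print(parse_tesseract_version(sample))  # (3, 5, 1)
```

Calling `installed_tesseract_version()` returns `None` when tesseract is not on the PATH, which makes it usable as a quick pre-flight check.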
Basic usage
    tesseract test.png output  # recognizes the text in test.png and writes the result to output.txt
test.png
output.txt is generated automatically.
More optional parameters are listed by tesseract -h.
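The same command can also be driven from a script. The sketch below assumes Python 3 and uses only the `-l LANG` option of the tesseract command line. The helper names `build_tesseract_command` and `run_tesseract` are hypothetical and exist only for this example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for `tesseract IMAGE OUTPUTBASE [-l LANG]`."""
    command = ["tesseract", image_path, output_base]
    if lang:
        command += ["-l", lang]
    return command


def run_tesseract(image_path, output_base, lang=None):
    """Run tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang), check=True
    )
    # tesseract appends ".txt" to the output base name.
    return output_base + ".txt"


print(build_tesseract_command("test.png", "output", lang="eng"))
# ['tesseract', 'test.png', 'output', '-l', 'eng']
```

Building the argument list in a separate function keeps it easy to inspect or log the exact command before running it.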
Python interface
Python offers a more elegant way to call the system's tesseract tool. First install the pytesseract module:
    sudo pip install pytesseract
pytesseract is a wrapper around tesseract and is used together with PIL. Basic usage:
    import pytesseract
    from PIL import Image
    img = Image.open('./test.png')  # create the Image object first
    text = pytesseract.image_to_string(img)  # convert directly to a string; see the documentation for more parameters
    print(repr(text))  # u'Hello world!\n1234' (under Python 2)
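The string returned by `image_to_string` keeps the line breaks of the recognized text, so in practice it is often split into lines before further processing. A minimal sketch, assuming Python 3 and using the sample result from the example above as input (the helper name `split_ocr_lines` is hypothetical):

```python
def split_ocr_lines(text):
    """Return the non-empty lines of an OCR result, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = "Hello world!\n1234"  # sample result from the example above
print(split_ocr_lines(sample))  # ['Hello world!', '1234']
```

Blank lines, which OCR output frequently contains between text blocks, are dropped by the `if line.strip()` filter.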
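Closing remarks
Out of the box, the tesseract-ocr tool has limited recognition ability, and many use cases (Chinese text, for example) need customization. I am still learning this myself and will write a follow-up later. Comments and discussion are welcome.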