
pytesseract: questions about the image_to_boxes and image_to_string functions

Posted: 2018-07-25 07:43

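CAPTCHA recognition with Python (Part 1)
I. In automated testing there are two common ways to deal with a CAPTCHA:
1. Ask the developers to disable the CAPTCHA, or to provide a universal code that always passes.
2. Recognize it automatically with OCR.
The first approach only requires coordinating with the development team.
The second approach uses pytesseract. Its recognition rate is generally not very high, but it handles simple CAPTCHAs well enough.
The code is short, just a few lines: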
from pytesseract.pytesseract import image_to_string
from PIL import Image

# Change this to the path of your own image
image = Image.open('../new.jpg')
print(image)  # show basic image information (format, mode, size)
vcode = image_to_string(image)
print(vcode)
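On macOS, install the following dependencies first (otherwise pytesseract raises an error). Run these two commands in a terminal:
brew install leptonica
brew install tesseract
II. A practical difficulty in Python test automation is getting hold of the CAPTCHA image in the first place, because the Python WebDriver API has no dedicated call for it. The workaround:
Read the CAPTCHA element's coordinates from the page, then use the Image module from PIL to crop that region out of a screenshot.
Approach: save a screenshot of the web page --> locate the CAPTCHA's coordinates --> crop the CAPTCHA region out of the screenshot.
The code: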
from PIL import Image
import pytesseract
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

url = 'https://www.baidu.com/'
# Selenium 4 API; change the driver path to your own
driver = webdriver.Chrome(service=Service("../chromedriver"))
driver.maximize_window()  # maximize the browser window
driver.get(url)
time.sleep(3)
driver.find_element(By.XPATH, '//*[@id="u1"]/a[7]').click()
time.sleep(2)
driver.find_element(By.ID, "TANGRAM__PSP_8__userName").send_keys("qqqq")
driver.find_element(By.ID, "TANGRAM__PSP_8__password").send_keys("qqqq")
time.sleep(5)
# Screenshot the current page, which contains the CAPTCHA we need
driver.save_screenshot('../aa.png')
# Locate the CAPTCHA image element
imgelement = driver.find_element(By.XPATH, '//*[@id="TANGRAM__PSP_8__verifyCodeImg"]')
location = imgelement.location  # x, y of the element's top-left corner
size = imgelement.size  # width and height of the element
# (left, upper, right, lower) box to cut out of the screenshot
rangle = (int(location['x']), int(location['y']),
          int(location['x'] + size['width']), int(location['y'] + size['height']))
i = Image.open("../aa.png")  # open the screenshot
frame4 = i.crop(rangle)  # crop the CAPTCHA region out of the screenshot
# Save as PNG: the screenshot may be in RGBA mode, which JPEG cannot store
frame4.save('../frame4.png')
qq = Image.open('../frame4.png')
text = pytesseract.image_to_string(qq).strip()  # recognize the CAPTCHA with image_to_string
print(text)
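The crop box in the listing above is plain arithmetic on the element's location and size. The following is a minimal sketch of that step as a standalone function. The `scale` parameter is an assumption added for illustration and is not part of the original code: on high-DPI displays a screenshot can contain more pixels than the page's CSS coordinates suggest, in which case the box has to be multiplied by the device pixel ratio.

```python
def crop_box(location, size, scale=1.0):
    """Return a (left, upper, right, lower) box for PIL's Image.crop.

    location and size are dicts shaped like Selenium's WebElement.location
    and WebElement.size. scale is a hypothetical device-pixel-ratio factor.
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return (left, upper, right, lower)


if __name__ == '__main__':
    # Same element, first at 1:1 and then on a display with pixel ratio 2
    print(crop_box({'x': 10, 'y': 20}, {'width': 100, 'height': 40}))
    print(crop_box({'x': 10, 'y': 20}, {'width': 100, 'height': 40}, scale=2))
```

Selenium's WebElement also offers a screenshot(filename) method that saves only that element, which can avoid the cropping step altogether when the driver supports it.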
Installing tesseract on Linux
Summary: this article walks through installing tesseract on Linux, mainly CentOS 7 with a short Ubuntu section, and lists the build errors met along the way together with their fixes.

Installing on CentOS 7

Install the system dependencies:
yum install -y automake autoconf libtool gcc gcc-c++
yum install -y libpng-devel libjpeg-devel libtiff-devel
yum -y install python-devel
yum -y install openssl-devel
yum -y install opencv
yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel
yum install -y libffi libffi-devel
yum install libmount -y
These are the packages the system needs. If openssl is not installed, you will see the error "command 'gcc' failed with exit status 1".

For tesseract 3.0, install leptonica 1.72:
wget http://www.leptonica.org/source/leptonica-1.72.tar.gz
tar xvzf leptonica-1.72.tar.gz
cd leptonica-1.72/
./configure
make && make install

For tesseract 4.0, install leptonica 1.74.4 instead (see http://www.leptonica.org/download.html):
wget http://www.leptonica.org/source/leptonica-1.74.4.tar.gz
tar xvzf leptonica-1.74.4.tar.gz
cd leptonica-1.74.4/
./configure
make && make install

Install tesseract-ocr 3.0:
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.zip
unzip 3.04.zip
cd tesseract-3.04/
autoreconf -I /usr/share/aclocal
./configure
make && make install
sudo ldconfig

Install tesseract 4.0:
wget https://codeload.github.com/tesseract-ocr/tesseract/zip/4.00.00dev
or alternatively:
wget http://linux-.cosgz.myqcloud.com/soft/tesseract/4.0.0-beta.1.tar.gz
unzip tesseract-4.00.00dev.zip
cd tesseract-4.00.00dev
autoreconf -I /usr/share/aclocal
./autogen.sh
./configure --prefix=$HOME/local/
make
make install
ldconfig

If ./autogen.sh fails because packages such as autoconf are missing, install them from source:
wget http://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz
tar -zxvf autoconf-2.69.tar.gz
cd autoconf-2.69
./configure
make && make install

wget http://ftp.gnu.org/gnu/automake/automake-1.14.tar.gz
tar -zxvf automake-1.14.tar.gz
cd automake-1.14
./bootstrap.sh
./configure
make && make install

Download autoconf-archive from http://mirrors.ustc.edu.cn/gnu/autoconf-archive/, unpack it, upload it to the server, and install it:
wget http://mirrors.ustc.edu.cn/gnu/autoconf-archive/autoconf-archive-.tar.xz
xz -d autoconf-archive-.tar.xz
tar xvf autoconf-archive-.tar
cd autoconf-archive-
./configure
make && make install

Installing glib requires pcre first:
wget ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.40.tar.gz
tar -zvxf pcre-8.40.tar.gz
cd pcre-8.40
./configure
make
make install

wget http://ftp.gnome.org/pub/gnome/sources/glib/2.56/glib-2.56.1.tar.xz
xz -d glib-2.56.1.tar.xz
tar xvf glib-2.56.1.tar
cd glib-2.56.1
./configure --enable-libmount=no
make
make install
If a plain ./configure fails for glib, run it with --enable-libmount=no as shown above.

wget http://pkgconfig.freedesktop.org/releases/pkg-config-0.29.2.tar.gz
tar -zvxf pkg-config-0.29.2.tar.gz
cd pkg-config-0.29.2
./configure --with-internal-glib
make
make install
If a plain ./configure fails for pkg-config, run it with --with-internal-glib as shown above.

References:
https://linux.cn/thread-.html
https://blog.csdn.net/windeal3203/article/details/

Deploying the model

Download the model file for the language you need from https://github.com/tesseract-ocr/tessdata. Since only phone numbers and English need to be recognized for now, eng.traineddata alone is enough. Move the model file to /usr/local/share/tessdata, after which recognition works.

Python libraries to install:
pip install pytesseract
pip install tesseract
pip install tesseract-ocr
python -m pip install --upgrade pip setuptools
python -m pip install "django<2"
pip install image

Installing on Ubuntu

sudo apt-get install tesseract-ocr
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install automake
1. Install tesseract-ocr:
sudo apt-get install tesseract-ocr
2. Install pytesseract:
sudo pip install pytesseract
3. Install Pillow:
sudo pip install pillow

Example:
import pytesseract
from PIL import Image
image = Image.open('bb.png')
code = pytesseract.image_to_string(image)
print(code)

Errors encountered on CentOS

Error 1:
pytesseract.pytesseract.TesseractError: (127, u'tesseract: error while loading shared libraries: libtesseract.so.3: cannot open shared object file: No such file or directory')
This kind of error means the system does not know which directory the .so file lives in, so that directory has to be added to /etc/ld.so.conf. Many shared libraries end up under /usr/local/lib, and the required .so file was indeed there. Add the line /usr/local/lib to /etc/ld.so.conf, save it, and refresh the configuration:
/sbin/ldconfig -v
sudo ldconfig
Reference: http://www.eefocus.com/winter1988/blog/13-03/d5b.html

Error 2:
Running automake --add-missing --copy
src/api/Makefile.am:17: error: Libtool library used but 'LIBTOOL' is undefined
This happens because the path to aclocal's LIBTOOL.m4 is not configured correctly. To fix it, first look up the aclocal path:
aclocal --print-ac-dir
then run autoreconf with that path:
autoreconf -I /usr/local/share/aclocal
Reference: https://blog.csdn.net/sky_qing/article/details/9707647

Error 3:
While installing tesseract, ./configure prints:
./configure: line 4193: syntax error near unexpected token `-mavx,'
./configure: line 4193: `AX_CHECK_COMPILE_FLAG(-mavx, avx=true, avx=false)'
Issue #647 (https://github.com/tesseract-ocr/tesseract/issues/647) has the answer: "You should install autoconf-archive ." So autoconf-archive is the cause, even though it was already installed. It had been installed under /local/share/aclocal, which holds a pile of files with the .m4 suffix.

m4 is a macro processor: it copies its input to its output while expanding macros. Macros can be built in or user-defined. Besides expanding macros, m4 has built-in functions for including files, running Unix commands, integer arithmetic, text manipulation, loops, and so on. It can serve as a compiler front end or be used on its own.

The question is how to get these files picked up. ./configure --help offers nothing relevant. An article on automated builds, https://jin-yang.github.io/post/linux-package.html, contains this sentence: "aclocal generates aclocal.m4 in the same directory as configure.ac; while scanning configure.ac it copies in third-party extensions and the developer's own macro definitions, so that when autoconf meets a macro it does not know, it looks it up in aclocal.m4." The build did generate an aclocal.m4 file, so the .m4 files in /local/share/aclocal have to be pulled in somehow.

In autogen.sh, line 81 reads: echo "Running aclocal" followed by aclocal -I config. That is where the macros are referenced. aclocal --help lists these options:
-I DIR                add directory to search list for .m4 files
--install             copy third-party files to the first -I directory
--system-acdir=DIR    directory holding third-party system-wide files
These look like the right options, so try:
aclocal -I m4 --install --system-acdir=$HOME/local/share/aclocal
Here $HOME/local/share/aclocal is the aclocal directory; on most systems it is /usr/local/share/aclocal.
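Before debugging pytesseract itself, it can help to confirm from Python that the tesseract executable is reachable and starts at all. The sketch below uses only the standard library; the function names are illustrative and not part of pytesseract. A non-zero return code together with a "cannot open shared object file" message would correspond to Error 1 above.

```python
import shutil
import subprocess


def find_tesseract(name='tesseract'):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def tesseract_version(path):
    """Run '<path> --version' and return (return_code, first_output_line)."""
    result = subprocess.run([path, '--version'],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True)
    lines = result.stdout.splitlines()
    return result.returncode, (lines[0] if lines else '')


if __name__ == '__main__':
    exe = find_tesseract()
    if exe is None:
        print('tesseract is not on PATH')
    else:
        code, first_line = tesseract_version(exe)
        print(exe, code, first_line)
```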
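Image recognition with Python: reading the text in a picture with one line of Python 3
Day 5 of teaching myself Python 3. Today I had the sudden idea of using Python to read the text in a picture. It turned out to be surprisingly simple: a single line of code does the recognition.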
from PIL import Image
import pytesseract
text=pytesseract.image_to_string(Image.open('denggao.jpeg'),lang='chi_sim')
print(text)
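We use a classical poem as the example.
[Figure: the image to be recognized]
[Figure: the recognition result]
Running the code gives the result below. A few characters are misrecognized, but most of them come out correctly.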
风急天高猿啸哀渚芸胄芳少白鸟飞凤
无边落木萧萧下, 不尽长量工盲衮宕衮来
万里悲秋常1乍窨, 百年多病独登氤
艰难苦恨擎霜量漂倒新停澍酉帆
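That one line depends on some preparation behind the scenes.
We need two libraries: pytesseract and PIL (installed as the Pillow package).
We also need to install the recognition engine, tesseract-ocr.
The sections below cover installing each of them, because the one-line recognition only works once all of them are in place.
I. Installing pytesseract and PIL
Both packages can be installed with pip.
- 1. From the command line
pip install Pillow
pip install pytesseract
- 2. If you use the PyCharm editor, you can install them from within PyCharm.
On PyCharm's Settings page, search for pytesseract in the package installer and click install.
That installs pytesseract. To install PIL, search for Pillow in the same dialog and click install.
With the libraries installed, run the following code: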
from PIL import Image
import pytesseract
text=pytesseract.image_to_string(Image.open('denggao.jpeg'),lang='chi_sim')
print(text)
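This raises an error, because the recognition engine tesseract-ocr is not installed yet.
II. Installing the recognition engine tesseract-ocr
1. Download the tesseract-ocr installer and run it.
tesseract-ocr does not support Chinese out of the box. After installing it, add the Simplified Chinese language data (the code above uses lang='chi_sim', which needs chi_sim.traineddata in the tessdata directory).
2. After tesseract-ocr is installed, one more piece of configuration is needed.
In C:\Users\huxiu\AppData\Local\Programs\Python\Python35\Lib\site-packages\pytesseract, open pytesseract.py and change it as follows:
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
#tesseract_cmd = 'tesseract'
tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
You can also open pytesseract.py quickly from within PyCharm.
That completes the configuration. Running the code shown earlier now turns the picture of Du Fu's poem "Deng Gao" (登高) into text.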
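An edit made inside site-packages is lost whenever the package is reinstalled or upgraded. pytesseract exposes the same setting as the module attribute pytesseract.pytesseract.tesseract_cmd, so it can be assigned from your own script instead. The sketch below assumes that approach; the helper names and the candidate list are illustrative, and the Windows path is the one used in the article.

```python
import os

# Candidate locations to try, in order. Only the first entry comes from the
# article; extend the list for your own machine.
CANDIDATES = [
    'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe',
]


def first_existing(paths):
    """Return the first path that points at an existing file, else None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def use_tesseract(executable):
    """Point pytesseract at the given tesseract executable."""
    import pytesseract  # imported here so the helper above has no dependency
    pytesseract.pytesseract.tesseract_cmd = executable


if __name__ == '__main__':
    exe = first_existing(CANDIDATES)
    print('tesseract executable:', exe)
    if exe is not None:
        use_tesseract(exe)
```

If none of the candidates exists, pytesseract keeps its default of looking for tesseract on PATH.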
Automatic login with pytesseract + mechanize CAPTCHA recognition
Edited by: Run
Required modules
Install Pillow, the standard image-processing library for Python:
pip install pillow
Install pytesseract, the text-recognition library:
pip install pytesseract
Install tesseract-ocr, the recognition engine.
Windows: download tesseract-ocr-setup-3.05.02 or tesseract-ocr-setup-4.0.0-alpha
Linux: download the matching version from GitHub
Problem encountered and fix:
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path
Fix (on Windows): find the tesseract-ocr installation directory and copy the path, for example C:\Program Files (x86)\Tesseract-OCR\tesseract.exe. Then open pytesseract.py and set tesseract_cmd to that path.
Install mechanize, a Python module that simulates a browser:
pip install mechanize
How the program works:
1. Open the target site, find the URL of the CAPTCHA image, and download it.
2. Use pytesseract to recognize the CAPTCHA in the image (training improves the recognition rate) and return the result as a str.
3. Use mechanize to simulate the login: find the form and submit the account, password, CAPTCHA, and other fields.
4. Once logged in, scrape the content you want.
Full code:
#!/usr/bin/env python
# coding: utf-8
import mechanize
from bs4 import BeautifulSoup
from PIL import Image
import pytesseract


class Item(object):  # fields to be scraped
    landing_name = None  # login account
    landing_time = None  # login time


class SimulateLogin(object):
    def __init__(self, url, username, password, img_url):
        # Initialization
        self.url = url            # login page URL
        self.img_url = img_url    # CAPTCHA download URL
        self.username = username  # account
        self.password = password  # password
        self.bs4_filter()

    def mechanize_setting(self):
        # Create the browser object
        br = mechanize.Browser()
        # Browser options
        br.set_handle_equiv(True)
        br.set_handle_redirect(True)
        br.set_handle_referer(True)
        br.set_handle_robots(False)
        br.set_handle_gzip(False)
        # Follows refresh 0 but does not hang on refresh > 0
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
        # Set the user-agent
        br.addheaders = [('User-agent',
                          'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/'
                          ' Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        return br

    def login(self):  # simulate the login
        br = self.mechanize_setting()
        self.img_download(br)
        vf_code = self.img_to_str()
        br.open(self.url)
        # Print the fields each form expects
        for form in br.forms():
            print(form)
        # Notes:
        # post is the request method
        # TextControl(username=) is the account field
        # PasswordControl(password=) is the password field
        # TextControl(captcha=) is the CAPTCHA field
        try:
            br.select_form(method='post')
            br.form['username'] = self.username
            br.form['password'] = self.password
            br.form['captcha'] = vf_code
            br.submit()
        except Exception as e:
            print('Error filling in the form: %s' % e)
        else:
            ret = br.response().read()
            return ret

    def img_download(self, br):  # download the CAPTCHA image
        img = br.open(self.img_url)
        with open('1.jpg', 'wb') as f:
            f.write(img.read())

    def bs4_filter(self):  # after a successful login, scrape the content
        items = []
        ret = self.login()
        if ret is None:  # login() returns nothing when the form step failed
            return
        # Use bs4 to extract information from the page returned after login
        soup = BeautifulSoup(ret, 'lxml')
        print(soup)  # the returned page already reports a successful login

    def initTable(self, threshold=140):
        # Binarization lookup table: 0 below the threshold, 1 otherwise
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)
        return table

    def img_to_str(self):  # recognize the CAPTCHA (digits + letters) and return it as a string
        # Replacement table: add frequently misrecognized characters here.
        # The fourth entry was garbled in the source; a space key is assumed.
        rep = {'O': '0', 'I': '1', 'Z': '2', ' ': '', 'S': '8', 'R': 'A',
               'n': 'M', 'P': 'f', 'M': 'n',
               }
        images = Image.open('1.jpg')
        im = images.convert('L')
        binaryImage = im.point(self.initTable(), '1')
        # Page segmentation mode 7 treats the image as a single text line
        # (older tesseract 3.x releases spell the flag -psm)
        text = pytesseract.image_to_string(binaryImage, config='--psm 7')
        # Translate in a single pass so the n <-> M swap is not undone
        vf_code = text.strip().translate(str.maketrans(rep))
        print('pytesseract CAPTCHA result: %s' % vf_code)
        return vf_code


if __name__ == '__main__':
    url = 'target login URL'
    # The image is downloaded and recognized automatically; the success rate is
    # roughly 50%, and training your own data can improve it
    img_url = 'target CAPTCHA image URL'
    username = 'account'
    password = 'password'
    SimulateLogin(url, username, password, img_url)
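The replacement table in the listing maps n to M and M to n at the same time. Applying such a table one str.replace call after another lets a later rule rewrite the output of an earlier one. The sketch below contrasts that with a single-pass translation using str.translate; the function names are illustrative, and the garbled entry of the original table is left out.

```python
REPLACEMENTS = {'O': '0', 'I': '1', 'Z': '2', 'S': '8', 'R': 'A',
                'n': 'M', 'P': 'f', 'M': 'n'}


def build_threshold_table(threshold=140):
    """Lookup table for Image.point: 0 below the threshold, 1 otherwise."""
    return [0 if i < threshold else 1 for i in range(256)]


def chained_replace(text, rep):
    """Apply the rules one after another; later rules see earlier output."""
    for old, new in rep.items():
        text = text.replace(old, new)
    return text


def single_pass(text, rep):
    """Map every character exactly once using str.translate."""
    return text.translate(str.maketrans(rep))


if __name__ == '__main__':
    print(chained_replace('nM', REPLACEMENTS))  # both letters end up as 'n'
    print(single_pass('nM', REPLACEMENTS))      # the two letters are swapped
```

str.maketrans accepts a dict only when every key is a single character, which holds for this table.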
 1464
 1163
 949
 944
 922
 896
 887
 868
 867
 810
 48°
 74°
 88°
 318°
 80°
 135°
 78°
 117°
 104°
 116°
姓名：Run
职业：谜
邮箱：
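OCR (Optical Character Recognition) is the process by which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using character-recognition methods. The tesseract project has now moved entirely to GitHub, where all the main information can be found.
1. Installing tesseract
The full dependency installation looks like this (macOS as the example):
// First install the dependency libraries: libpng, jpeg, libtiff, leptonica
brew install leptonica
// Install tesseract together with the training tools (required for the training steps later)
brew install --with-training-tools tesseract
// Install tesseract together with all languages (not recommended; Simplified Chinese and English are enough)
brew install --all-languages tesseract
// Install tesseract only, without the training tools
brew install tesseract
After installation you need to add a Chinese language pack, since only English is available by default.
Choose the language packs you need. Here we want Simplified Chinese, so the files are chi_sim.traineddata and eng.traineddata.
Copy the files to the directory /usr/local/Cellar/tesseract/3.05.01/share/tessdata.
2. Installing the Python dependencies
We need a third-party OCR wrapper library: pytesseract
We need a library for reading images: opencv or pillow (recommended)
sudo pip install pytesseract
sudo pip install PILLOW
3. Using OCR (two ways)
First way: the command line.
// Reads the image in English by default; image is the image path, result is the output written to result.txt
tesseract image result
// Use a language pack with -l (short for language)
tesseract -l chi_sim image result
Plenty of text images can be found through Baidu; we take one as an example. (A TIFF image is needed here; online converters can do the conversion.)
Now let us recognize the content of this image.
I created a folder named OCR containing a single training image: train.tif
Run the command:
// Remember to select Chinese, otherwise the output is garbage
tesseract -l chi_sim train.tif result
Check the result:
Accuracy is acceptable, though the text is fairly simple: the 眷 in 眷顾 was recognized as 替, and this "artificial idiot" does not recognize 一 at all. That covers the command-line approach.
Second way: Python code.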
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Basic package: OCR
from PIL import Image
import pytesseract


def main(img):
    image = Image.open(img)
    text = pytesseract.image_to_string(image, lang='chi_sim')
    print(text)


if __name__ == '__main__':
    main('train.tif')
PIL is imported to read the image, then pytesseract reads the characters in it. The language must be set to Simplified Chinese, as on the command line.
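4. Training your own font library
How can we get OCR to recognize all of the text above? This is where the training tools installed earlier come in.
First generate a .box file to inspect what was recognized. The command:
// Create the box file
tesseract train.tif train -l chi_sim batch.nochop makebox
This produces a file named train.box.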
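A box file is plain text with one recognized character per line, in the form "character left bottom right top page", with coordinates measured from the bottom-left corner of the image. pytesseract.image_to_boxes returns text in this same layout. The sketch below parses one such line and converts it to the top-left-origin box that PIL expects; the function names and the sample line are made up for illustration.

```python
def parse_box_line(line):
    """Split one box-file line into (char, left, bottom, right, top, page)."""
    parts = line.rstrip('\n').rsplit(' ', 5)
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return char, left, bottom, right, top, page


def to_pil_box(left, bottom, right, top, image_height):
    """Convert bottom-left-origin coordinates to PIL's
    (left, upper, right, lower) tuple with a top-left origin."""
    return (left, image_height - top, right, image_height - bottom)


if __name__ == '__main__':
    # Hypothetical line, as it might appear in train.box
    char, left, bottom, right, top, page = parse_box_line('眷 10 20 50 60 0')
    print(char, to_pil_box(left, bottom, right, top, image_height=100))
```

The flipped y coordinates are what make it possible to draw the boxes on the image or crop single characters with Image.crop.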
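Download the tool jTessBoxEditor.
Open train.tif in it. The recognition result can only be viewed once train.box has been generated.
With train.tif open, you can see clearly whether each recognized character is correct.
Correct the wrong characters, then save.
5. Building your own font library file
Several commands are needed: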
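// Create the font properties file
echo font 0 0 0 0 0 > font_properties
// Generate the training file train.tr; it is essential for the training steps that follow
tesseract train.tif train -l chi_sim nobatch box.train
// Generate the character set
unicharset_extractor train.box
// Generate the shape table
shapeclustering -F font_properties -U unicharset -O unicharset train.tr
// Aggregate the character feature files
mftraining -F font_properties -U unicharset -O unicharset train.tr
// Generate the normalization file
cntraining train.tr
Check the result:
A number of new files have appeared:
unicharset, inttemp, pffmtable, shapetable, normproto
All of them need to be renamed with the train. prefix:
train.unicharset, train.inttemp, train.pffmtable, train.shapetable, train.normproto
Then combine them into a .traineddata file with this command:
// Combine the trained files
combine_tessdata train.
The result is train.traineddata. The official language files that ship with the OCR installation use this same .traineddata format, so in effect we have built a custom font library of our own.
Move the trained .traineddata file into the tessdata folder of the tesseract installation.
The command:
mv train.traineddata /usr/local/Cellar/tesseract/3.05.02/share/tessdata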
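The command sequence above can be collected into one script so that it is repeatable for other base names. The following is a sketch under the assumption that the tesseract training tools are on PATH and that train.tif and the corrected train.box sit in the current directory; the function names are illustrative, and the argument lists simply mirror the commands shown above.

```python
import os
import subprocess

# Files produced by the training tools that need the "<base>." prefix
TRAINING_OUTPUTS = ['unicharset', 'inttemp', 'pffmtable', 'shapetable', 'normproto']


def training_commands(base='train', lang='chi_sim'):
    """Return the training commands as argument lists, in execution order."""
    tif, box, tr = base + '.tif', base + '.box', base + '.tr'
    return [
        ['tesseract', tif, base, '-l', lang, 'nobatch', 'box.train'],
        ['unicharset_extractor', box],
        ['shapeclustering', '-F', 'font_properties', '-U', 'unicharset',
         '-O', 'unicharset', tr],
        ['mftraining', '-F', 'font_properties', '-U', 'unicharset',
         '-O', 'unicharset', tr],
        ['cntraining', tr],
    ]


def renames(base='train'):
    """Return (old_name, new_name) pairs for the prefix step."""
    return [(name, base + '.' + name) for name in TRAINING_OUTPUTS]


def run_pipeline(base='train', lang='chi_sim'):
    """Write font_properties, run every step, rename, and combine."""
    with open('font_properties', 'w') as f:
        f.write('font 0 0 0 0 0\n')
    for cmd in training_commands(base, lang):
        subprocess.run(cmd, check=True)  # stop at the first failing step
    for old, new in renames(base):
        os.rename(old, new)
    subprocess.run(['combine_tessdata', base + '.'], check=True)
```

Calling run_pipeline() should leave train.traineddata in the current directory, ready to be moved into tessdata as described above.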
6. Testing your own font library
The code barely changes: just switch the language to our own library, that is, change lang='chi_sim' to lang='train'. The code:
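#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Basic package: OCR
from PIL import Image
import pytesseract


def main(img):
    image = Image.open(img)
    text = pytesseract.image_to_string(image, lang='train')
    print(text)


if __name__ == '__main__':
    main('train.tif')
This time the result is perfect, 100% correct. But the library is still in its infancy: it only recognizes this one image, and anything else needs more training. The bonus section below shows how to train a large font library.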
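Building a large training library: there are about 3,500 commonly used Chinese characters. First build a list containing all of them.
Use OpenCV to produce a 100x100-pixel image with a white background.
import cv2
a = cv2.imread('1.png')  # 1.png is an existing white image
b = cv2.resize(a,(100,100))
cv2.imwrite('train.png',b)
Draw each of the 3,500 characters onto the image. You need to download the font you want to train; PingFang is used here as the example.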
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Basic package: model training
from PIL import Image, ImageFont, ImageDraw
import constants as cs  # the author's own module holding the CHINESE_COMMON list


def draw_text(img, font_path, text, index):
    image = Image.open(img)
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, 80)
    draw.text((10, 0), text, font=font, fill='#000000')
    image.save('train/%s.tif' % index)


if __name__ == '__main__':
    # enumerate yields each character's position without searching the list
    for index, text in enumerate(cs.CHINESE_COMMON):
        draw_text('train.png', '苹方 PingFang.ttc', text, index)
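[Figure: the generated character images]
Open jTessBoxEditor and choose Tools > Merge TIFF.
Merge these images into one large TIF file.
Select all of them and save the result as heben.tif for training.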
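As an alternative to the jTessBoxEditor merge step, Pillow can write a multi-page TIFF itself through the save_all and append_images arguments of Image.save. The sketch below assumes the per-character files are named by index (0.tif, 1.tif, ...) as produced by the script above; the function names are illustrative. Sorting numerically matters because a plain string sort would place 10.tif before 2.tif.

```python
import os
import re


def numeric_key(path):
    """Sort key: the first integer in the file name, so 2.tif precedes 10.tif."""
    match = re.search(r'\d+', os.path.basename(path))
    return int(match.group()) if match else -1


def merge_tiffs(paths, output='heben.tif'):
    """Write all images into one multi-page TIFF, ordered by numeric index."""
    from PIL import Image  # imported here so the sort helper has no dependency
    pages = [Image.open(p) for p in sorted(paths, key=numeric_key)]
    pages[0].save(output, save_all=True, append_images=pages[1:])


if __name__ == '__main__':
    print(sorted(['train/10.tif', 'train/2.tif', 'train/0.tif'], key=numeric_key))
```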
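The rest is the same as before: train to produce heben.box, heben.tr, and so on. Once the corrections are done, generate heben.traineddata, which is the large font library. Change the code to use it and try recognizing some text set in PingFang.
That concludes this half-manual "artificial idiot".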