Tesseract Open Source OCR Engine (Mirror)

zdenop 1e3bf29cf6 Merge pull request #1092 from stweil/fixtext 1 day ago
.github 1263941f1d Make less verbose 2 months ago
android 06fb88495c Update Android.mk 8 months ago
api 7afa05a03e Merge pull request #1072 from stweil/listlangs 1 week ago
arch f9b51d7983 suppress a strict aliasing warning; the original author was very clear about the nature of the problematic code 1 week ago
ccmain 8bb5a89d5a Don't add empty line to text output 1 day ago
ccstruct 4e9665debf Added ADAM optimizer, unless git screwed it up, cos there is no diff 2 weeks ago
ccutil c67c2e9f41 Add combine_lang_model to cmake and cppan builds. 2 weeks ago
classify 7111167497 fix a set-but-not-used warning and add casts for comparing signed+unsigned numbers 1 week ago
cmake 9f763e5466 Update SourceGroups.cmake 8 months ago
contrib 3458e7c981 helper script to generate dawg input files from text 10 months ago
cutil 9929587f36 Remove extra semicolons 1 month ago
dict 2633fef0b6 Part 2 of separating out the unicharset from the LSTM model, fixing command line for training 2 weeks ago
doc 2f48d69bcd doc: Fix use of MAINTAINER_MODE 3 months ago
googletest @ 4bab34d208 f36dc34c4f Add googletest submodule 4 weeks ago
java 5d60444f40 automake: Enable all warnings and fix a warning 3 months ago
lstm 77c44cdecd Added convert to int and directory listing to combine_tessdata 2 weeks ago
opencl da03e4e910 Fixes from pull of cleanups: clang tidied, reviewed, fixed new bugs, undeleted needed code. Probably breaks the build, due to some inclusion of changes in utf8/32 conversion 1 month ago
snap 91afb5540f Download the leptonica source from github 1 month ago
tessdata 82d62f89a2 Update Makefile.am (add 'lstm.train') 4 months ago
testdata 8e55e52be7 Harder unittest that uses file i/o and string manipulation 2 weeks ago
testing 934e612a3e testing: Fix warnings from shellcheck 4 months ago
tests 99755b0732 googletest: Add dummy test 4 weeks ago
textord 6f281c36a7 fix a problem I introduced in a previous commit 1 week ago
training d171488e21 Added CMake option to use system ICU library 5 days ago
unittest 8e55e52be7 Harder unittest that uses file i/o and string manipulation 2 weeks ago
viewer ba95a686aa Use lept_free to free memory allocated by Leptonica 1 month ago
vs2010 1cf8fe51a0 Remove mathfix.h 2 months ago
wordrec 9929587f36 Remove extra semicolons 1 month ago
.gitignore f36dc34c4f Add googletest submodule 4 weeks ago
.gitmodules f36dc34c4f Add googletest submodule 4 weeks ago
.travis.yml a2404ae735 Fix Travis CI for Leptonica 1.74.2 2 months ago
AUTHORS ec99d9f2b2 AUTHORS: Add more contributors 8 months ago
CMakeLists.txt 99755b0732 googletest: Add dummy test 4 weeks ago
CONTRIBUTING.md 3e1099157f Change Mac OS X -> macOS 2 months ago
COPYING 2c837dffc3 Result of clang tidy on recent merge 9 months ago
ChangeLog e2b1e9f977 Fix ChangeLog for Leptonica 1.74 4 months ago
Dockerfile defb399657 Fix and improve Dockerfile 3 months ago
INSTALL bf9f40cac6 Fix typos 6 months ago
INSTALL.GIT.md add00edfba Update documentation for installation 2 months ago
LICENSE 5913d7344f Added missing license headers 9 months ago
Makefile.am a0201831c3 Merge pull request #576 from stweil/shellcheck 8 months ago
NEWS 425d593ebe top-skimming import from sf.net 10 years ago
README.md 4506133aa2 Update readme for 3.05.01 2 months ago
appveyor.yml 6ba14f3909 Update appveyor.yml 3 months ago
autogen.sh 5d60444f40 automake: Enable all warnings and fix a warning 3 months ago
configure.ac 742b303548 Fix hint for training build 2 days ago
cppan.yml c67c2e9f41 Add combine_lang_model to cmake and cppan builds. 2 weeks ago
docker-compose.yml 71ad8c9bff Dockerifying using travis build script 1 year ago
tesseract.pc.cmake 6641989866 Cmake install 4 months ago
tesseract.pc.in ef26b312f9 improve tesseract.pc.in - fixes #241 1 year ago

README.md

Tesseract OCR

About

This package contains an OCR engine, libtesseract, and a command line program, tesseract.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors, see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, TSV, and invisible-text-only PDF.

In many cases, you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

This project does not include a GUI application. If you need one, please see the 3rdParty wiki page.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

The latest stable version is 3.05.01, released on June 1, 2017. The latest source code for 3.05 is available from the 3.05 branch on GitHub.

Source code for the new LSTM-based 4.00.00alpha version is available from the master branch on GitHub. Please note that this branch is under active development.

See Release Notes and Change Log for more details of the releases.

Installing Tesseract

You can either install Tesseract from a pre-built binary package or build it from source.

Supported compilers are:

  • GCC 4.8 and above
  • Clang 3.4 and above
  • MSVC 2015, 2017

Other compilers might work, but are not officially supported.

Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

For more information about the various command line options use tesseract --help or man tesseract.
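
The usage line above can be read as three fixed positions (program, image, output base) followed by optional flags and then any config file names. As a minimal sketch of that ordering, the following C++ helper assembles an argument list. `TesseractArgs`, `BuildArgv` and `Join` are hypothetical names used only for illustration and are not part of Tesseract; the file names and option values are placeholders, and `pdf` is assumed here to be one of the config names that selects an output format.

```cpp
#include <string>
#include <vector>

// Hypothetical helper type (not part of Tesseract). It mirrors the usage line:
//   tesseract imagename outputbase [-l lang] [--oem ocrenginemode]
//             [--psm pagesegmode] [configfiles...]
struct TesseractArgs {
  std::string image;                 // imagename
  std::string output_base;           // outputbase
  std::string lang;                  // empty: leave out -l
  int oem = -1;                      // negative: leave out --oem
  int psm = -1;                      // negative: leave out --psm
  std::vector<std::string> configs;  // config file names, passed last
};

// Builds the argument list in the order given by the usage line.
std::vector<std::string> BuildArgv(const TesseractArgs& a) {
  std::vector<std::string> argv = {"tesseract", a.image, a.output_base};
  if (!a.lang.empty()) {
    argv.push_back("-l");
    argv.push_back(a.lang);
  }
  if (a.oem >= 0) {
    argv.push_back("--oem");
    argv.push_back(std::to_string(a.oem));
  }
  if (a.psm >= 0) {
    argv.push_back("--psm");
    argv.push_back(std::to_string(a.psm));
  }
  argv.insert(argv.end(), a.configs.begin(), a.configs.end());
  return argv;
}

// Joins the arguments with single spaces for display.
// Note: no shell quoting is applied, so this is for logging only.
std::string Join(const std::vector<std::string>& argv) {
  std::string out;
  for (std::size_t i = 0; i < argv.size(); ++i) {
    if (i > 0) out += ' ';
    out += argv[i];
  }
  return out;
}
```

With `image = "page.png"`, `output_base = "out"`, `lang = "eng"`, `psm = 6` and `configs = {"pdf"}`, `Join(BuildArgv(args))` yields `tesseract page.png out -l eng --psm 6 pdf`. Which values `--oem` and `--psm` accept depends on the installed version, so check `tesseract --help` before relying on specific numbers.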

For developers

Developers can use the libtesseract C or C++ API to build their own applications. If you need bindings to libtesseract for other programming languages, please see the wrapper section on the AddOns wiki page.

Documentation of Tesseract, generated from the source code by Doxygen, can be found on tesseract-ocr.github.io.

Support

Before you submit an issue, please review the guidelines for this repository.

For support, first read the Wiki, particularly the FAQ, to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can't find what you need, ask for support on the mailing lists.

Mailing lists:

Please open an issue only to report a bug, not to ask a question.

License

The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Latest Version of README

For the latest online version of the README.md see:

https://github.com/tesseract-ocr/tesseract/blob/master/README.md