Source training data for Tesseract for lots of languages

afr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
amh 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
ara 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
asm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
aze 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
aze_cyrl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
bel 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
bel_tarask cc97fdcf4a add lexical data (using same character data as bel, for now) 4 years ago
ben 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
bih 60804a42c4 Merge pull request #113 from Shreeshrii/patch-5 1 year ago
bod 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
bos 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
bul 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
cat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
ceb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
ces 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
chi_sim 0309ca8ab4 Fix extra intra-word spacing in Chinese and Japanese (GitHub issue #991) 1 month ago
chi_sim_vert 0309ca8ab4 Fix extra intra-word spacing in Chinese and Japanese (GitHub issue #991) 1 month ago
chi_tra 0309ca8ab4 Fix extra intra-word spacing in Chinese and Japanese (GitHub issue #991) 1 month ago
chi_tra_vert 0309ca8ab4 Fix extra intra-word spacing in Chinese and Japanese (GitHub issue #991) 1 month ago
chr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
cym 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
dan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
deu 44d3e64acb Merge pull request #56 from stweil/master 2 years ago
div 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
dzo 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
ell 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
eng dbcc748c90 Add ï to desired English characters 2 years ago
enm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
epo 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
est 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
eus 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
fas 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
fin 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
fra 0f0d6e9810 Delete cube langdata 2 years ago
frk e435740a9e Remove more German confusions 2 years ago
frm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
gle 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
gle_uncial a0bb3beca0 Merge pull request #5 from tesseract-ocr/gle_uncial 1 year ago
glg 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
grc a3bbfeb8f7 Update grc to latest version upstream 3 years ago
guj 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
hat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
heb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
hin c4d030c908 Added some more Hindi words 1 year ago
hrv 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
hun 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
iast 6d1ecd31c9 add IAST version of san and hin training text 1 year ago
iku 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
ind 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
isl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
ita ee67adc472 ita: Remove user words 2 years ago
ita_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
jav 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
jpn 0309ca8ab4 Fix extra intra-word spacing in Chinese and Japanese (GitHub issue #991) 1 month ago
jpn_vert 0309ca8ab4 Fix extra intra-word spacing in Chinese and Japanese (GitHub issue #991) 1 month ago
kan 2a525c1e23 Fix file mode (remove execute permission) 1 year ago
kat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
kat_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
kaz 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
khm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
kir 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
kmr 1ca94d920e smaller representative sample from langdata_lstm/kmr.training_text 4 months ago
kor ebb1ed1e87 remove 'tessedit_load_sublangs chi_tra' 1 year ago
kur_ara fd813901a6 copy files from kur subdirectory 1 year ago
lao 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
lat dfcd1bd091 sort|uniq 1 year ago
lav 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
lit 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
mal 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
mar 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
mkd 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
mlt 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
mri 07739a869e training_text 1 year ago
msa 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
mya 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
nep 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
nld 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
nor 3000efef86 Fixed issue 15 2 years ago
ori bab8af1458 Update desired_characters 1 year ago
pan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
pol 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
por 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
pus 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
ron 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
rus 0f0d6e9810 Delete cube langdata 2 years ago
rus_accent 4f6a751b99 unicharset from rus; ligatures for accents todo 4 years ago
san 9dd588a674 Merge pull request #15 from Shreeshrii/master 1 year ago
sin 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
slk 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
slv 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
snd 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
spa 0f0d6e9810 Delete cube langdata 2 years ago
spa_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
sqi 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
srp 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
srp_latn 8207ac3d8c do not load Serbian Cyrillic for Serbian latin OCR 3 years ago
swa 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
swe 34af0cd7c4 Added ÅÄÖåäö to swedish desired_characters 7 months ago
syr 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
tam 2a525c1e23 Fix file mode (remove execute permission) 1 year ago
tel 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
tgk 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
tgl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
tha 57d1568395 Addresses extra spaces problem with 4.00 1 year ago
tir 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
tur 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
tyv 85400a5693 Updated training text. 3 years ago
uig 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 2 years ago
ukr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
urd d7691b34df urd.wordlist 1 year ago
uzb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
uzb_cyrl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
vie 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
yid 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
zlm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
Arabic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Arabic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Armenian.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Armenian.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bengali.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bengali.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bopomofo.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bopomofo.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Canadian_Aboriginal.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Canadian_Aboriginal.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Cherokee.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Cherokee.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Common.unicharset 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
Cyrillic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Cyrillic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Devanagari.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Devanagari.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ethiopic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ethiopic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Georgian.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Georgian.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Greek.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Greek.xheights 240bf0160c Add Ancient Greek langdata 3 years ago
Gujarati.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Gujarati.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Gurmukhi.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Gurmukhi.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Han.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Han.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hangul.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hangul.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hebrew.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hebrew.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hiragana.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hiragana.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Kannada.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Kannada.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Katakana.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Katakana.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Khmer.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Khmer.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
LICENSE b6068ee0b2 Add Apache license file 1 month ago
Lao.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Lao.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Latin.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Latin.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Malayalam.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Malayalam.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Myanmar.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Myanmar.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ogham.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ogham.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Oriya.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Oriya.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
README.md 5851060ff1 Create README.md 4 years ago
Runic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Runic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Sinhala.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Sinhala.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Syriac.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Syriac.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Tamil.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Tamil.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Telugu.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Telugu.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Thai.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Thai.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Tibetan.unicharset 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 4 years ago
common.punc 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
common.unicharambigs 2a525c1e23 Fix file mode (remove execute permission) 1 year ago
font_properties befed5697d Merge pull request #19 from nickjwhite/addgrc 1 year ago
forbidden_characters_default 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
radical-stroke.txt 3e32be3dc0 Changed stroke encoding to be based on wubi instead of radical stroke index 2 years ago


langdata

Source training data for Tesseract for lots of languages

Want to retrain Tesseract for a specific language by modifying or augmenting the original training data? Then you have come to the right place!

If you want to find a language data set to run Tesseract, then look at our tessdata repository instead.

To re-create the training of a single language, lang, you need the following:

  • All the data in the lang directory.
  • The corresponding unicharset/xheights files for the script(s) used by lang.
  • All the remaining non-lang-specific files in the top-level directory, such as font_properties.
  • The fonts needed to train the language. Some languages were trained with commercially available fonts, so to reproduce the training exactly you will need to buy them; otherwise, use substitutes.
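
The checklist above can be sketched as a small Python helper that gathers the first three kinds of files from a local checkout. This is only an illustration, not part of the repository. The function name `collect_training_inputs` and its parameters are hypothetical, and it assumes script files are named `<Script>.unicharset` and `<Script>.xheights` as in the top-level listing. The caller must say which scripts a language uses, because the README does not describe where that mapping is stored. Fonts are not covered, since they are obtained separately.

```python
from pathlib import Path

# Suffixes of the per-script files found in the top-level directory.
SCRIPT_SUFFIXES = (".unicharset", ".xheights")


def collect_training_inputs(root, lang, scripts):
    """Group the files needed to retrain `lang` (hypothetical helper).

    root    -- path to a local checkout of the langdata repository
    lang    -- language directory name, e.g. "eng"
    scripts -- script names used by the language, e.g. ["Latin"];
               supplied by the caller (an assumption of this sketch)
    """
    root = Path(root)
    lang_dir = root / lang
    if not lang_dir.is_dir():
        raise FileNotFoundError(f"no language directory: {lang_dir}")

    # 1. All the data in the lang directory.
    lang_files = sorted(p for p in lang_dir.rglob("*") if p.is_file())

    # 2. The unicharset/xheights files for each script. Some scripts in
    #    the listing have no .xheights file, so missing files are skipped.
    script_files = []
    for script in scripts:
        for suffix in SCRIPT_SUFFIXES:
            candidate = root / f"{script}{suffix}"
            if candidate.is_file():
                script_files.append(candidate)

    # 3. The remaining top-level files that are not per-script files,
    #    such as font_properties. This also picks up files like README.md
    #    and LICENSE if present; they are harmless to copy along.
    shared_files = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix not in SCRIPT_SUFFIXES
    )

    return {"lang": lang_files, "script": script_files, "shared": shared_files}
```

For example, `collect_training_inputs("langdata", "eng", ["Latin", "Common"])` would return the contents of `eng/`, the `Latin` and `Common` script files that exist, and the shared top-level files, assuming the checkout lives in a directory named `langdata`.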