Source training data for Tesseract for lots of languages

zdenop c171566095 Merge pull request #58 from stweil/eng 1 month ago
afr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
amh 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ara 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
asm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
aze 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
aze_cyrl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
bel 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ben 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
bih 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
bod 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
bos 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
bul 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
cat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ceb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ces 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
chi_sim 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
chi_sim_vert 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
chi_tra 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
chi_tra_vert 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
chr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
cym 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
dan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
deu 44d3e64acb Merge pull request #56 from stweil/master 9 months ago
div 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
dzo 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ell 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
eng dbcc748c90 Add ï to desired English characters 9 months ago
enm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
epo 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
est 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
eus 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
fas 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
fin 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
fra 0f0d6e9810 Delete cube langdata 11 months ago
frk e435740a9e Remove more German confusions 9 months ago
frm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
gle 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
gle_uncial fe4df312e8 files for gle_uncial 2 months ago
glg 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
guj 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
hat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
heb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
hin 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
hrv 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
hun 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
iku 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ind 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
isl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
ita ee67adc472 ita: Remove user words 5 months ago
ita_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
jav 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
jpn 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
jpn_vert 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
kan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
kat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
kat_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
kaz 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
khm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
kir 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
kor 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
kur 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
kur_ara 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
lao 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
lat b6cf9e3c75 Fix German words with "ließ" 9 months ago
lav 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
lit 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
mal 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
mar 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
mkd 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
mlt 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
msa 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
mya 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
nep 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
nld 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
nor 3000efef86 Fixed issue 15 10 months ago
ori 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
pan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
pol 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
por 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
pus 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
ron 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
rus 0f0d6e9810 Delete cube langdata 11 months ago
san 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
sin 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
slk 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
slv 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
snd 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
spa 0f0d6e9810 Delete cube langdata 11 months ago
spa_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
sqi 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
srp 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
srp_latn 8207ac3d8c do not load Serbian Cyrillic for Serbian latin OCR 1 year ago
swa 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
swe 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
syr 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
tam 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
tel 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
tgk 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
tgl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
tha 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
tir 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
tur 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
uig 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
ukr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
urd 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 10 months ago
uzb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
uzb_cyrl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
vie 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
yid 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
zlm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
Arabic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Arabic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Armenian.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Armenian.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Bengali.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Bengali.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Bopomofo.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Bopomofo.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Canadian_Aboriginal.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Canadian_Aboriginal.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Cherokee.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Cherokee.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Common.unicharset 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
Cyrillic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Cyrillic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Devanagari.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Devanagari.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Ethiopic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Ethiopic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Georgian.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Georgian.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Greek.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Greek.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Gujarati.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Gujarati.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Gurmukhi.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Gurmukhi.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Han.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Han.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Hangul.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Hangul.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Hebrew.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Hebrew.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Hiragana.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Hiragana.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Kannada.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Kannada.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Katakana.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Katakana.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Khmer.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Khmer.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Lao.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Lao.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Latin.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Latin.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Malayalam.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Malayalam.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Myanmar.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Myanmar.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Ogham.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Ogham.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Oriya.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Oriya.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
README.md 5851060ff1 Create README.md 2 years ago
Runic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Runic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Sinhala.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Sinhala.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Syriac.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Syriac.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Tamil.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Tamil.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Telugu.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Telugu.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Thai.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Thai.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
Tibetan.unicharset 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 2 years ago
common.punc 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
common.unicharambigs 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
font_properties 2ff3173a4b * Lines with less than six fields fixed. 1 year ago
forbidden_characters_default 9204c02c18 Initial commit of *all* the language source data (87 langs) 3 years ago
radical-stroke.txt 3e32be3dc0 Changed stroke encoding to be based on wubi instead of radical stroke index 3 months ago

README.md

langdata

Source training data for Tesseract for lots of languages

Want to re-train Tesseract for a specific language by modifying or augmenting the original training data? Then you have come to the right place!

If you only need a trained language data set to run Tesseract, look at our tessdata repository instead.

To re-create the training of a single language, lang, you need the following:

  • All the data in the lang directory.
  • The corresponding unicharset/xheights files for the script(s) used by lang.
  • All the remaining non-lang-specific files in the top-level directory, such as font_properties.
  • The fonts needed to train the language. Some languages were trained with commercially available fonts, so to reproduce the training exactly you will need to buy them; otherwise, use substitutes.
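
The first three items in the checklist above can be gathered with a short script. The following is a minimal sketch and not part of the repository: the function name `collect_training_inputs` and its `scripts` argument are hypothetical, and the caller is assumed to know which script names (for example `Latin`) apply to the language. It also assumes that "non-lang-specific files" means every top-level file other than the per-script `.unicharset`/`.xheights` files and `README.md`. Fonts are not covered and must be obtained separately.

```python
from pathlib import Path


def collect_training_inputs(repo_root, lang, scripts):
    """Gather the files needed to re-create training for one language.

    repo_root: path to a local checkout of this repository.
    lang:      language directory name, e.g. "eng".
    scripts:   script names used by the language, e.g. ["Latin"].
               (Assumption: the caller supplies these; nothing here
               infers them from the language code.)
    """
    root = Path(repo_root)
    lang_dir = root / lang
    if not lang_dir.is_dir():
        raise FileNotFoundError(f"no language directory: {lang_dir}")

    # 1. All the data in the lang directory.
    lang_files = sorted(p for p in lang_dir.rglob("*") if p.is_file())

    # 2. The unicharset/xheights files for each script. Some scripts
    #    ship only one of the two, so missing files are skipped.
    script_files = []
    for script in scripts:
        for ext in ("unicharset", "xheights"):
            candidate = root / f"{script}.{ext}"
            if candidate.is_file():
                script_files.append(candidate)

    # 3. Remaining top-level files that are not tied to a script,
    #    such as font_properties. Excluding README.md is an assumption.
    shared_files = sorted(
        p for p in root.iterdir()
        if p.is_file()
        and p.suffix not in (".unicharset", ".xheights")
        and p.name != "README.md"
    )

    return {"lang": lang_files, "script": script_files, "shared": shared_files}


# Example call (the checkout path is hypothetical):
#   inputs = collect_training_inputs("langdata", "eng", ["Latin"])
#   for group, paths in inputs.items():
#       print(group, len(paths))
```

Returning the three groups separately keeps them aligned with the checklist, so it is easy to see which group is incomplete if a file is missing.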