Source training data for Tesseract for lots of languages

zdenop 106c9b31be Merge pull request #123 from Shreeshrii/patch-1 4 months ago
afr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
amh 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
ara 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
asm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
aze 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
aze_cyrl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
bel 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
bel_tarask cc97fdcf4a add lexical data (using same character data as bel, for now) 4 years ago
ben 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
bih 60804a42c4 Merge pull request #113 from Shreeshrii/patch-5 5 months ago
bod 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
bos 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
bul 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
cat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
ceb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
ces 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
chi_sim e585075d88 Fixes extra intra-word spacing in Chinese for 4.0 5 months ago
chi_sim_vert 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
chi_tra 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
chi_tra_vert 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
chr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
cym 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
dan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
deu 44d3e64acb Merge pull request #56 from stweil/master 1 year ago
div 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
dzo 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
ell 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
eng dbcc748c90 Add ï to desired English characters 1 year ago
enm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
epo 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
est 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
eus 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
fas 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
fin 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
fra 0f0d6e9810 Delete cube langdata 1 year ago
frk e435740a9e Remove more German confusions 1 year ago
frm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
gle 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
gle_uncial a0bb3beca0 Merge pull request #5 from tesseract-ocr/gle_uncial 5 months ago
glg 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
grc a3bbfeb8f7 Update grc to latest version upstream 2 years ago
guj 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
hat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
heb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
hin 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
hrv 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
hun 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
iast 6d1ecd31c9 add IAST version of san and hin training text 5 months ago
iku 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
ind 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
isl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
ita ee67adc472 ita: Remove user words 1 year ago
ita_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
jav 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
jpn 6acaf7704a Remove parameter textord_tabfind_vertical_horizontal_mix 4 months ago
jpn_vert 6acaf7704a Remove parameter textord_tabfind_vertical_horizontal_mix 4 months ago
kan 2a525c1e23 Fix file mode (remove execute permission) 4 months ago
kat 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
kat_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
kaz 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
khm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
kir 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
kor ebb1ed1e87 remove 'tessedit_load_sublangs chi_tra' 4 months ago
kur 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
kur_ara fd813901a6 copy files from kur subdirectory 4 months ago
lao 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
lat dfcd1bd091 sort|uniq 5 months ago
lav 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
lit 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
mal 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
mar 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
mkd 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
mlt 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
mri 07739a869e training_text 5 months ago
msa 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
mya 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
nep 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
nld 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
nor 3000efef86 Fixed issue 15 1 year ago
ori bab8af1458 Update desired_characters 8 months ago
pan 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
pol 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
por 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
pus 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
ron 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
rus 0f0d6e9810 Delete cube langdata 1 year ago
rus_accent 4f6a751b99 unicharset from rus; ligatures for accents todo 3 years ago
san 9dd588a674 Merge pull request #15 from Shreeshrii/master 5 months ago
sin 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
slk 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
slv 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
snd 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
spa 0f0d6e9810 Delete cube langdata 1 year ago
spa_old 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
sqi 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
srp 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
srp_latn 8207ac3d8c do not load Serbian Cyrillic for Serbian latin OCR 2 years ago
swa 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
swe 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
syr 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
tam 2a525c1e23 Fix file mode (remove execute permission) 4 months ago
tel 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
tgk 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
tgl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
tha 57d1568395 Addresses extra spaces problem with 4.00 5 months ago
tir 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
tur 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
tyv 85400a5693 Updated training text. 2 years ago
uig 3ab6581a11 Updates to desired/forbidden characters to include Arabic diacritcs, extra Devanagari characters vertical Chinese, Japanese, some more sublang defaults 1 year ago
ukr 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
urd d7691b34df urd.wordlist 7 months ago
uzb 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
uzb_cyrl 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
vie 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
yid 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
zlm 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
Arabic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Arabic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Armenian.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Armenian.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bengali.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bengali.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bopomofo.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Bopomofo.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Canadian_Aboriginal.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Canadian_Aboriginal.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Cherokee.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Cherokee.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Common.unicharset 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
Cyrillic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Cyrillic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Devanagari.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Devanagari.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ethiopic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ethiopic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Georgian.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Georgian.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Greek.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Greek.xheights 240bf0160c Add Ancient Greek langdata 2 years ago
Gujarati.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Gujarati.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Gurmukhi.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Gurmukhi.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Han.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Han.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hangul.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hangul.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hebrew.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hebrew.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hiragana.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Hiragana.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Kannada.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Kannada.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Katakana.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Katakana.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Khmer.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Khmer.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Lao.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Lao.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Latin.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Latin.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Malayalam.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Malayalam.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Myanmar.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Myanmar.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ogham.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Ogham.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Oriya.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Oriya.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
README.md 5851060ff1 Create README.md 3 years ago
Runic.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Runic.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Sinhala.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Sinhala.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Syriac.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Syriac.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Tamil.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Tamil.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Telugu.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Telugu.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Thai.unicharset 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Thai.xheights 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
Tibetan.unicharset 05ec588fc0 Updated all langdata with newly generated source training data for 3.04 3 years ago
common.punc 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
common.unicharambigs 2a525c1e23 Fix file mode (remove execute permission) 4 months ago
font_properties befed5697d Merge pull request #19 from nickjwhite/addgrc 5 months ago
forbidden_characters_default 9204c02c18 Initial commit of *all* the language source data (87 langs) 4 years ago
radical-stroke.txt 3e32be3dc0 Changed stroke encoding to be based on wubi instead of radical stroke index 1 year ago

README.md

langdata

Want to re-train Tesseract for a specific language by modifying or augmenting the original training data? Then you have come to the right place!

If you are looking for a ready-made language data set to run Tesseract, see our tessdata repository instead.

To re-create the training of a single language, lang, you need the following:

  • All the data in the lang directory.
  • The corresponding unicharset/xheights files for the script(s) used by lang.
  • All the remaining non-lang-specific files in the top-level directory, such as font_properties.
  • The fonts needed to train the language. Some languages were trained with commercially available fonts, so you will need to buy them to reproduce the training exactly, or use substitutes.
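
The first three items in the list above can be gathered with a short script. The following is only a minimal sketch, not part of the repository: the function name `collect_training_inputs` and its parameters are hypothetical, the caller must supply the script names used by the language (the mapping from language to script is not derived here), and "non-lang-specific" is assumed to mean every top-level file that is not a `.unicharset` or `.xheights` file. Fonts are not handled at all.

```python
from pathlib import Path

# Suffixes of the per-script files kept in the top-level directory.
SCRIPT_SUFFIXES = (".unicharset", ".xheights")


def collect_training_inputs(root, lang, scripts):
    """Return the files needed to re-create training for one language.

    root    -- path to a local checkout of the langdata repository
    lang    -- language directory name, e.g. "eng"
    scripts -- script names chosen by the caller, e.g. ["Latin"]
               (assumption: the caller knows which scripts lang uses)
    """
    root = Path(root)
    lang_dir = root / lang
    if not lang_dir.is_dir():
        raise FileNotFoundError(f"no language directory: {lang_dir}")

    # 1. All the data in the lang directory.
    files = sorted(p for p in lang_dir.rglob("*") if p.is_file())

    # 2. The unicharset/xheights files for each script, where they exist
    #    (some scripts in the listing have no .xheights file).
    for script in scripts:
        for suffix in SCRIPT_SUFFIXES:
            candidate = root / f"{script}{suffix}"
            if candidate.is_file():
                files.append(candidate)

    # 3. The remaining top-level files, such as font_properties.
    #    Assumption: anything that is not a per-script file counts.
    files.extend(
        sorted(
            p
            for p in root.iterdir()
            if p.is_file() and p.suffix not in SCRIPT_SUFFIXES
        )
    )
    return files
```

For example, `collect_training_inputs("langdata", "eng", ["Latin"])` would return everything under `eng`, then `Latin.unicharset` and `Latin.xheights`, then shared files such as `font_properties` and `common.punc`. Passing the scripts explicitly keeps the sketch honest: it does not guess which scripts a language uses.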