Core part of tesseract.js, which compiles the original Tesseract from C++ to WebAssembly with a JavaScript wrapper.
To build tesseract-core.js by yourself, please install docker and run:
bash build-with-docker.sh
The generated files are stored in the repository root. Compilation sometimes fails because of race conditions (some dependencies do not appear to compile properly in parallel). Re-running the build generally resolves the problem.
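Since re-running is the usual fix for these intermittent failures, the retry can be scripted. The following is a minimal sketch, not part of the repository: the `retry` function name and the attempt count are assumptions made for illustration.

```shell
# Hypothetical helper (not shipped with this repo): run a command up to
# max_attempts times and stop at the first success.
retry() {
  max_attempts="$1"
  shift
  attempt=1
  while true; do
    # Run the command exactly as passed; success ends the loop.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying..." >&2
    attempt=$((attempt + 1))
  done
}

# Usage (uncomment to run the real build, trying at most 3 times):
# retry 3 bash build-with-docker.sh
```

A fixed attempt limit keeps a genuine build error from looping forever; if the build still fails after a few attempts, the cause is probably not a race condition.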
- Build scripts are in the build-scripts folder
- Javascript/wrapper files are in the javascript folder
- All dependencies (including Tesseract) are in the third_party folder
  - All dependencies are unmodified except for Tesseract, which uses a forked repo
- The Tesseract repo has the following changes:
  - Modified CMakeLists.txt to build with emscripten
  - Modified ltrresultiterator.h and ltrresultiterator.cpp to add WordChoiceIterator class
  - Added src/arch_sse folder, which is used instead of src/arch for the simd-enabled build
    - This hard-codes the use of the SSE function
  - Commented out "Empty page!!" message in src/textord/colfind.cpp to prevent this from printing to console
  - Added functions for detecting page angle and applying rotation
    - Modified src/ccmain/thresholder.cpp, src/ccmain/thresholder.h, src/api/baseapi.cpp, and include/tesseract/baseapi.h to add exif and angle arguments for rotating images
    - Changed FindLines from "protected" to "public" in baseapi.h to expose to Javascript
      - Allows for lines (and therefore page angle) to be detected without running unnecessary steps afterwards
    - Added public GetGradient function to baseapi.h and baseapi.cpp for reporting page angle
      - Also required minor changes to src/ccmain/tesseractclass.h, src/ccmain/pagesegmain.cpp, src/textord/textord.cpp, and src/textord/textord.h
  - Added WriteImage function to baseapi.h and baseapi.cpp for saving images (original, grey, and binary)
  - Added SaveParameters and RestoreParameters functions to baseapi.h and baseapi.cpp for saving and restoring parameters
  - Added calls to EM_ASM_ARGS to src/ccmain/control.cpp for progress logging (and added <emscripten.h> header)
  - Rewrote tprintf function in src/ccutil/tprintf.cpp to force flushing
  - Added new version of SetImage to src/api/baseapi.cpp and include/tesseract/baseapi.h that reads image from filesystem
    - This was done to resolve memory leak (see this issue)
  - Edited ParamUtils::PrintParams in src/ccutil/params.cpp to remove description text (resolves bug)
    - The bug was reported in this Git Issue, so we can cut this point if resolved in a future version of Tesseract
  - Edited src/ccmain/tessedit.cpp to save error log to separate file (/debugDev.txt)
  - Added JSON as an output format
    - Added src/api/jsonrenderer.cpp, modified CMakeLists.txt, include/tesseract/baseapi.h, and include/tesseract/renderer.h
To run the browser examples, launch a web server in the root of the repo (i.e. run http-server). Then navigate to the pages in examples/web/minimal/ in your browser.
To run the node examples, navigate to examples/node/minimal/ and then run e.g. node index.wasm.js [input_file].
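To try a node example on several inputs in one go, a small wrapper can loop over the files. This is only a sketch under stated assumptions: the `run_example_on_files` function and the `RUNNER` variable are hypothetical names introduced here, and the image paths in the usage line are placeholders.

```shell
# Hypothetical helper (not part of the repo): run one example script
# against each input file in turn, skipping files that do not exist.
# RUNNER defaults to node; it is overridable only to make dry runs easy.
run_example_on_files() {
  script="$1"
  shift
  for input in "$@"; do
    if [ ! -f "$input" ]; then
      echo "skipping missing file: $input" >&2
      continue
    fi
    echo "== $input =="
    "${RUNNER:-node}" "$script" "$input"
  done
}

# Usage, from examples/node/minimal/ (placeholder file names):
# run_example_on_files index.wasm.js page1.png page2.png
```

Setting `RUNNER=echo` prints the commands that would be run without invoking node, which is a quick way to check the file list.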
The "benchmark" examples behave similarly, except that they take longer to run and report runtime instead of recognition text. All other examples are experimental and should not be expected to run.
Because dependencies are managed with git submodules, remember to pass --recursive when cloning the repository:
git clone --recursive https://github.com/naptha/tesseract.js-core
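If the repository was already cloned without --recursive, the submodules can be fetched afterwards with git's standard `git submodule update --init --recursive` command. The wrapper below is a minimal sketch; the `init_submodules` function name is an assumption for illustration and is not part of this repo.

```shell
# Hypothetical helper: fetch submodules in an existing clone that was
# made without --recursive. Takes the repo directory (default: current).
init_submodules() {
  repo_dir="${1:-.}"
  # A repo that uses submodules has a .gitmodules file at its root.
  if [ ! -f "$repo_dir/.gitmodules" ]; then
    echo "no .gitmodules in $repo_dir; nothing to do"
    return 0
  fi
  git -C "$repo_dir" submodule update --init --recursive
}

# Usage, from anywhere:
# init_submodules path/to/tesseract.js-core
```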