[DRAFT] tesseract-ocr
Overview
Tesseract OCR is an open-source engine for extracting text from images. PDFs are not read directly; their pages must be converted to images first. It works best with clean, high-contrast text and supports many languages via language packs.
Install
macOS (Homebrew)
brew install tesseract
Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng
Python bindings
pip install pytesseract pillow
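After installing, it can help to confirm that the tesseract binary is on PATH and to see which language packs it finds. The sketch below is one way to do that using only the Python standard library. The helper names (parse_list_langs, installed_languages) are hypothetical, and the parser assumes that `tesseract --list-langs` prints one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Parse the text printed by `tesseract --list-langs`.

    Assumption: the first non-empty line is a header and every following
    non-empty line is a language code such as "eng".
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_languages() -> list[str]:
    """Return the installed language codes, or raise if tesseract is missing."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: some versions may write the list to stderr instead of stdout,
    # so fall back to stderr when stdout is empty.
    return parse_list_langs(result.stdout or result.stderr)
```

If "eng" is missing from the returned list, the language pack from the install step above was probably not installed.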
Basic usage (CLI)
tesseract input.png output -l eng --psm 6
# output.txt will contain recognized text
-l eng: language (install additional packs for other languages).
--psm: page segmentation mode; 6 = single uniform block of text. Common values:
- 3: fully automatic page segmentation.
- 6: single block of text.
- 7: single text line.
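When the CLI is called from a script, the same flags can be assembled programmatically. This is a minimal sketch using the standard library, assuming tesseract is on PATH; the constant and function names (PSM_AUTO, tesseract_command, run_tesseract) are hypothetical and not part of any library.

```python
import subprocess

# Page segmentation modes listed above.
PSM_AUTO = 3          # fully automatic page segmentation
PSM_SINGLE_BLOCK = 6  # single uniform block of text
PSM_SINGLE_LINE = 7   # single text line


def tesseract_command(image_path, output_base, lang="eng", psm=PSM_SINGLE_BLOCK):
    """Build the argument list for: tesseract IMAGE OUTPUT -l LANG --psm N."""
    return [
        "tesseract", str(image_path), str(output_base),
        "-l", lang,
        "--psm", str(psm),
    ]


def run_tesseract(image_path, output_base, lang="eng", psm=PSM_SINGLE_BLOCK):
    """Run tesseract and return the path of the text file it writes."""
    subprocess.run(tesseract_command(image_path, output_base, lang, psm), check=True)
    # Tesseract appends ".txt" to the output base name.
    return f"{output_base}.txt"
```

For example, run_tesseract("input.png", "output") is equivalent to the command shown earlier, and passing psm=PSM_SINGLE_LINE suits a cropped single line of text.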
Basic usage (Python)
import pytesseract
from PIL import Image
img = Image.open("input.png")
text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")
print(text)
Tips for better accuracy
- Preprocess images: grayscale, threshold/binarize, deskew, denoise.
- Use the closest language pack; for mixed languages, specify multiple (e.g., eng+spa).
- Provide the right --psm for the layout.
- Increase DPI (>=300) when scanning; avoid heavy compression artifacts.
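The preprocessing tip can be sketched with Pillow alone. The function below (preprocess_for_ocr, a hypothetical name) converts to grayscale, stretches contrast, applies a small median filter to reduce noise, and binarizes with a fixed threshold. The threshold of 128 is an assumed starting point and usually needs tuning per source. Deskewing is not covered here.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(img: Image.Image, threshold: int = 128) -> Image.Image:
    """Return a grayscale, denoised, binarized copy of img (mode "L")."""
    gray = ImageOps.grayscale(img)
    # Stretch the darkest pixel to 0 and the brightest to 255.
    gray = ImageOps.autocontrast(gray)
    # A 3x3 median filter removes isolated speckles.
    gray = gray.filter(ImageFilter.MedianFilter(size=3))
    # Fixed-threshold binarization: pixels become pure black or pure white.
    return gray.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to pytesseract.image_to_string in place of the original, for example with lang="eng+spa" for mixed English and Spanish text.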
Working with PDFs
- Convert pages to images first (e.g., pdftoppm or pdf2image), then run Tesseract.
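One possible shape for that pipeline, using the pdftoppm route and only the Python standard library, is sketched below. It assumes both pdftoppm (from Poppler) and tesseract are on PATH; the function names (pdftoppm_command, ocr_pdf) are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the argument list that renders each PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def ocr_pdf(pdf_path, lang="eng", psm=3, dpi=300) -> list[str]:
    """Return the recognized text of each page, in page order."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
        # pdftoppm names files page-<n>.png with zero-padded page numbers,
        # so a plain sort keeps them in page order.
        for page in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                # The output base "stdout" makes tesseract print the text
                # instead of writing a .txt file.
                ["tesseract", str(page), "stdout", "-l", lang, "--psm", str(psm)],
                capture_output=True,
                text=True,
                check=True,
            )
            texts.append(result.stdout)
    return texts
```

Rendering at 300 DPI follows the scanning tip above; lower values run faster but tend to reduce accuracy on small text.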
When to choose Tesseract
- Offline/air-gapped environments.
- Cost-free batch OCR where accuracy is acceptable after preprocessing.
- Scripts/pipelines that avoid cloud dependencies.
When to consider alternatives
- Need handwriting support or complex layouts: consider cloud OCR (Google Vision, Azure OCR) or specialized models.
- Need structured extraction (tables/forms): use layout-aware models (e.g., LayoutLM-based) or services that return bounding boxes and structure.
References
- Tesseract repo: https://github.com/tesseract-ocr/tesseract
- Pytesseract: https://pypi.org/project/pytesseract/