[DRAFT] tesseract-ocr

Overview

Tesseract OCR is an open-source engine for extracting text from images. It works best with clean, high-contrast text and supports many languages via language packs. It does not read PDFs directly, so PDF pages must be converted to images first.

Install

macOS (Homebrew)

brew install tesseract

Ubuntu/Debian

sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng

Python bindings

pip install pytesseract pillow

Basic usage (CLI)

tesseract input.png output -l eng --psm 6
# Tesseract appends .txt to the output base, so the recognized text is written to output.txt

-l eng: recognition language (install additional language packs for other languages).
--psm: page segmentation mode, which tells Tesseract what layout to expect. Common values:
- 3: fully automatic page segmentation, without orientation and script detection (the default).
- 6: single uniform block of text.
- 7: single text line.
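
The flags above can also be assembled from a script. The following is a minimal sketch, assuming the `tesseract` binary is on the PATH and that the installed version accepts page segmentation modes 0 through 13 (true of Tesseract 4 and 5). The helper names `build_tesseract_command` and `run_ocr` are hypothetical, not part of Tesseract.

```python
import subprocess
from pathlib import Path

# Assumption: Tesseract 4/5, which accept --psm values 0 through 13.
VALID_PSM = range(0, 14)


def build_tesseract_command(image, output_base, langs=("eng",), psm=6):
    """Return the argument list for one tesseract CLI invocation."""
    if psm not in VALID_PSM:
        raise ValueError(f"unsupported --psm value: {psm}")
    if not langs:
        raise ValueError("at least one language code is required")
    return [
        "tesseract",
        str(image),
        str(output_base),
        "-l", "+".join(langs),  # multiple languages are joined with '+', e.g. eng+spa
        "--psm", str(psm),
    ]


def run_ocr(image, output_base, langs=("eng",), psm=6):
    """Run tesseract and return the path of the text file it writes."""
    cmd = build_tesseract_command(image, output_base, langs, psm)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(f"{output_base}.txt")
```

Keeping command construction separate from execution makes the flag handling easy to check without Tesseract installed; `run_ocr("input.png", "output")` would then mirror the CLI call shown earlier.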

Basic usage (Python)

import pytesseract
from PIL import Image

img = Image.open("input.png")
text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")
print(text)
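
Beyond plain text, pytesseract can return per-word confidence scores through `image_to_data`. The sketch below keeps only words at or above a confidence cutoff. The helper names and the cutoff of 60 are illustrative assumptions, not recommended values. Confidence entries are converted with `float()` because their type varies between pytesseract versions.

```python
def confident_words(data, min_conf=60.0):
    """Filter an image_to_data dict down to words at or above min_conf.

    `data` is expected to have parallel "text" and "conf" lists, as returned
    by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # Non-word rows (blocks, lines) carry empty text and a conf of -1.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def ocr_confident_words(image_path, lang="eng", psm=6, min_conf=60.0):
    """Run OCR on one image and return only the confident words."""
    # Imported here so the filter above stays usable without Tesseract installed.
    import pytesseract
    from PIL import Image
    from pytesseract import Output

    with Image.open(image_path) as img:
        data = pytesseract.image_to_data(
            img, lang=lang, config=f"--psm {psm}", output_type=Output.DICT
        )
    return confident_words(data, min_conf)
```

Low-confidence words often point to regions that need better preprocessing or a different `--psm` value.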

Tips for better accuracy

- Preprocess images: grayscale, threshold/binarize, deskew, denoise.
- Use the closest language pack; for mixed languages, specify multiple (e.g., eng+spa).
- Provide the right --psm for the layout.
- Scan at 300 DPI or higher; avoid heavy compression artifacts.
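
As an illustration of the grayscale and threshold steps, here is a minimal sketch using Pillow. It picks the threshold with Otsu's method, which is one common choice and not something this guide prescribes. The function names are hypothetical, and deskewing and denoising are left out.

```python
def otsu_threshold(hist):
    """Return the gray level that maximizes between-class variance.

    `hist` is a 256-entry histogram of an 8-bit grayscale image.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_var = 0, -1.0
    weight_bg = 0  # pixel count at or below the candidate level
    sum_bg = 0     # intensity sum at or below the candidate level
    for level, count in enumerate(hist):
        weight_bg += count
        sum_bg += level * count
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_level = between_var, level
    return best_level


def binarize(img):
    """Convert a PIL image to grayscale, then to pure black and white."""
    gray = img.convert("L")
    threshold = otsu_threshold(gray.histogram())
    # Pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)
```

The result of `binarize(Image.open("input.png"))` can be passed to `pytesseract.image_to_string` in place of the original image.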

Working with PDFs

Convert pages to images first (e.g., pdftoppm or pdf2image), then run Tesseract.
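
The two-step flow can be sketched as follows, assuming both `pdftoppm` (from Poppler) and `tesseract` are on the PATH. The function names are hypothetical. Pages are rendered at 300 DPI to match the scanning guidance in this document.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Return the pdftoppm argument list that renders each page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def page_number(path):
    """Extract the page number from a name such as page-03.png."""
    match = re.search(r"-(\d+)\.png$", Path(path).name)
    return int(match.group(1)) if match else 0


def ocr_pdf(pdf_path, lang="eng", psm=3):
    """Return a list with the recognized text of each page, in page order."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        for png in pages:
            # "stdout" as the output base makes tesseract print the text
            # instead of writing a .txt file.
            result = subprocess.run(
                ["tesseract", str(png), "stdout", "-l", lang, "--psm", str(psm)],
                check=True, capture_output=True, text=True,
            )
            texts.append(result.stdout)
    return texts
```

Sorting by the numeric suffix keeps pages in order regardless of how the rendered file names are zero-padded.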

When to choose Tesseract

- Offline/air-gapped environments.
- Cost-free batch OCR where accuracy is acceptable after preprocessing.
- Scripts/pipelines that avoid cloud dependencies.

When to consider alternatives

- Handwriting or complex layouts: consider cloud OCR (Google Cloud Vision, Azure OCR) or specialized models.
- Structured extraction (tables/forms): use layout-aware models (e.g., LayoutLM-based) or services that return bounding boxes and structure.

References

Tesseract repo: https://github.com/tesseract-ocr/tesseract
Pytesseract: https://pypi.org/project/pytesseract/
