[DRAFT] tesseract-ocr

Overview

Tesseract OCR is an open-source engine for extracting text from images, including pages rasterized from PDFs. It works best on clean, high-contrast printed text and supports many languages through installable language packs.

Install

macOS (Homebrew)

brew install tesseract

Ubuntu/Debian

sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng

Python bindings

pip install pytesseract pillow
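
After installing, it can help to confirm that the tesseract binary is on the PATH and to see which language packs it can find. The sketch below wraps the tesseract --list-langs command. The function names are illustrative, and the parser assumes the output is an optional "List of available languages" header followed by one language code per line on stdout.

```python
import shutil
import subprocess


def parse_lang_list(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: any header line starts with "List of"; everything else is a code.
    return [line for line in lines if not line.startswith("List of")]


def installed_languages() -> list[str]:
    """Return the language packs the local tesseract binary reports."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return parse_lang_list(result.stdout)
```

Calling installed_languages() before a batch job is one way to fail early when a required pack such as spa is missing.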

Basic usage (CLI)

tesseract input.png output -l eng --psm 6
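# Writes the recognized text to output.txt (Tesseract appends the .txt extension)
  • -l eng: recognition language; install additional language packs to use other languages.
  • --psm: page segmentation mode, which tells Tesseract what layout to expect. Common values:
    • 3: fully automatic page segmentation without orientation and script detection (the default).
    • 6: a single uniform block of text.
    • 7: a single line of text.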
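
The same command can be driven from a script. The following is a minimal sketch, assuming the tesseract binary is on the PATH; build_tesseract_cmd and ocr_to_text are hypothetical helper names, not part of any library.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image: str, output_base: str,
                        lang: str = "eng", psm: int = 6) -> list[str]:
    """Return the argv for: tesseract IMAGE OUTPUT_BASE -l LANG --psm PSM."""
    return ["tesseract", image, output_base, "-l", lang, "--psm", str(psm)]


def ocr_to_text(image: str, output_base: str,
                lang: str = "eng", psm: int = 6) -> str:
    """Run Tesseract and return the contents of OUTPUT_BASE.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_cmd(image, output_base, lang, psm), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with file names that contain spaces.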

Basic usage (Python)

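import pytesseract
from PIL import Image

# Load the image and run OCR with the English model,
# treating the page as a single uniform block of text (--psm 6).
img = Image.open("input.png")
text = pytesseract.image_to_string(img, lang="eng", config="--psm 6")
print(text)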

Tips for better accuracy

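  • Preprocess images before OCR: convert to grayscale, binarize with a threshold, deskew, and remove noise.
  • Use the language pack that matches the text; for mixed-language documents, combine packs with + (e.g., -l eng+spa).
  • Choose the --psm value that matches the layout (e.g., 7 for a single line of text).
  • Scan at 300 DPI or higher, and avoid heavily compressed images, whose artifacts degrade recognition.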
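
Two of these tips, binarization and multi-language recognition, can be sketched in a few lines. This is an illustration under stated assumptions, not a tuned pipeline: the fixed threshold of 160 is an arbitrary starting point, the helper names are hypothetical, and deskewing and denoising are left out.

```python
from typing import Iterable


def lang_spec(langs: Iterable[str]) -> str:
    """Join language codes the way Tesseract expects: ["eng", "spa"] -> "eng+spa"."""
    return "+".join(langs)


def threshold_lut(threshold: int = 160) -> list[int]:
    """256-entry lookup table: values above the threshold become white, the rest black."""
    return [255 if value > threshold else 0 for value in range(256)]


def binarize(img, threshold: int = 160):
    """Convert a Pillow image to grayscale, then to pure black and white."""
    return img.convert("L").point(threshold_lut(threshold))


def ocr_file(path: str, langs: Iterable[str] = ("eng",), psm: int = 6) -> str:
    """Binarize the image at `path` and run OCR on the result."""
    # Third-party imports are local so the helpers above load without them.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        clean = binarize(img)
    return pytesseract.image_to_string(
        clean, lang=lang_spec(langs), config=f"--psm {psm}"
    )
```

For example, ocr_file("receipt.png", langs=("eng", "spa")) would request both language models, provided both packs are installed. A fixed threshold tends to suit evenly lit scans; unevenly lit photos usually need an adaptive method instead.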

Working with PDFs

  • Tesseract does not read PDFs directly: convert each page to an image first (e.g., with pdftoppm or pdf2image, ideally at 300 DPI or higher), then run Tesseract on each page image.
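
The convert-then-recognize flow can be sketched with the two command-line tools alone. This assumes pdftoppm (from Poppler) and tesseract are both on the PATH; build_pdftoppm_cmd and ocr_pdf are hypothetical helper names.

```python
import subprocess
import tempfile
from pathlib import Path


def build_pdftoppm_cmd(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    """Return the argv for: pdftoppm -r DPI -png PDF PREFIX (writes PREFIX-<n>.png)."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_pdf(pdf: str, lang: str = "eng", psm: int = 3) -> list[str]:
    """Rasterize each page, OCR it, and return one string per page."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(build_pdftoppm_cmd(pdf, prefix), check=True)
        # Assumption: pdftoppm zero-pads page numbers, so a lexical sort
        # keeps the pages in document order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            # Using "stdout" as the output base makes tesseract print the
            # text instead of writing a .txt file.
            result = subprocess.run(
                ["tesseract", str(image), "stdout", "-l", lang, "--psm", str(psm)],
                capture_output=True, encoding="utf-8", check=True,
            )
            pages.append(result.stdout)
    return pages
```

The temporary directory keeps the intermediate page images out of the working tree and removes them once OCR finishes.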

When to choose Tesseract

  • Offline/air-gapped environments.
  • Cost-free batch OCR where accuracy is acceptable after preprocessing.
  • Scripts/pipelines that avoid cloud dependencies.

When to consider alternatives

  • Handwriting or complex layouts: consider cloud OCR (e.g., Google Cloud Vision, Azure OCR) or specialized models.
  • Structured extraction (tables/forms): use layout-aware models (e.g., LayoutLM-based) or services that return document structure along with bounding boxes.
