Steps to Repair and OCR a Scanned or Corrupted PDF in Ubuntu

1. Clean or Repair the PDF

Use Ghostscript to rewrite the file. It reads the damaged PDF, recovers what it can (including a broken cross-reference table), and writes a new PDF from the result.

sudo apt install ghostscript

gs -o fixed.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dNOPAUSE -dBATCH "input.pdf"

What it does:

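  • Rebuilds broken cross-references (xref errors) where the file is still readable
  • Rewrites streams and recompresses content according to the /prepress preset
  • Writes a freshly generated PDF (fixed.pdf) instead of patching the original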

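If Ghostscript cannot fix the file, try qpdf. qpdf has no separate repair switch; it attempts to recover a damaged file automatically while reading it, so a plain rewrite is enough:

sudo apt install qpdf
qpdf "input.pdf" fixed.pdf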
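
The Ghostscript-then-qpdf fallback above can be sketched as a small shell function. This is a minimal sketch, not a definitive implementation: repair_pdf is a hypothetical helper name, and treating qpdf exit status 3 (finished with warnings) as success is an assumption that suits damaged input but may not suit every workflow.

```shell
#!/usr/bin/env bash
# repair_pdf INPUT OUTPUT
# Hypothetical helper: try Ghostscript first, then fall back to qpdf.
repair_pdf() {
    local input="$1" output="$2"

    # Ghostscript rewrites the file; -o already implies -dNOPAUSE -dBATCH.
    # Require a non-empty output file before calling it a success.
    if gs -o "$output" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$input" \
        && [ -s "$output" ]; then
        echo "repaired with ghostscript"
        return 0
    fi

    # qpdf attempts recovery on its own when it reads a damaged file.
    # Status 0 is a clean run; status 3 means it finished with warnings.
    local status=0
    qpdf "$input" "$output" || status=$?
    if [ "$status" -eq 0 ] || [ "$status" -eq 3 ]; then
        echo "repaired with qpdf"
        return 0
    fi

    echo "repair failed" >&2
    return 1
}

# Usage: repair_pdf input.pdf fixed.pdf
```

Even when either tool reports success, open fixed.pdf and compare the page count with the original, since recovery can silently drop pages that were unreadable.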

2. Run OCR on the Cleaned PDF

Use OCRmyPDF to embed a searchable text layer in the PDF. The --clean option depends on unpaper, so install it alongside the language packs.

sudo apt install ocrmypdf unpaper tesseract-ocr tesseract-ocr-eng tesseract-ocr-fil
ocrmypdf --jobs 4 --deskew --clean -l eng+fil fixed.pdf output_ocr.pdf

What it does:

  • Performs OCR using Tesseract (English + Filipino)
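  • Deskews pages and cleans them before OCR (the cleaned image is used for recognition only; add --clean-final to keep it in the output)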
  • Embeds text layer for search and selection
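
Before a long OCR run it can help to confirm that every language named in -l is actually installed. The sketch below assumes that tesseract --list-langs prints one language code per line after a header line; check_langs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_langs SPEC
# Hypothetical helper: SPEC uses the same "eng+fil" form that -l takes.
# Returns 0 if every language is installed, 1 otherwise.
check_langs() {
    local spec="$1" installed lang missing=0

    # Older Tesseract releases print the list on stderr, so capture both.
    installed=$(tesseract --list-langs 2>&1)

    # Split the spec on "+" and look for each code as a whole line.
    for lang in $(printf '%s' "$spec" | tr '+' ' '); do
        if ! printf '%s\n' "$installed" | grep -qxF "$lang"; then
            echo "missing tesseract language: $lang" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_langs eng+fil && ocrmypdf -l eng+fil fixed.pdf output_ocr.pdf
```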

If OCRmyPDF fails on rendering, use an alternate renderer:

ocrmypdf --pdf-renderer sandwich fixed.pdf output_ocr.pdf

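If that still fails, or if the pages already carry a broken text layer, force every page to be rasterized and OCRed. This discards any existing text layer: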

ocrmypdf --force-ocr fixed.pdf output_ocr.pdf
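
The three OCRmyPDF invocations above can be chained so that each one runs only if the previous one failed. This is a rough sketch: ocr_with_fallbacks is a hypothetical helper name, the default language string is taken from the earlier command, and retrying on any non-zero exit status is a simplifying assumption.

```shell
#!/usr/bin/env bash
# ocr_with_fallbacks INPUT OUTPUT [LANGS]
# Hypothetical helper: default run, then sandwich renderer, then --force-ocr.
ocr_with_fallbacks() {
    local input="$1" output="$2" langs="${3:-eng+fil}"

    # 1. Normal run with deskew and cleaning.
    ocrmypdf --jobs 4 --deskew --clean -l "$langs" "$input" "$output" && return 0

    # 2. Same input, alternate renderer.
    echo "default run failed; retrying with the sandwich renderer" >&2
    ocrmypdf --pdf-renderer sandwich -l "$langs" "$input" "$output" && return 0

    # 3. Last resort: rasterize every page and OCR it again.
    echo "still failing; retrying with --force-ocr" >&2
    ocrmypdf --force-ocr -l "$langs" "$input" "$output"
}

# Usage: ocr_with_fallbacks fixed.pdf output_ocr.pdf eng+fil
```

The function returns the exit status of the last attempt, so a caller can still tell when all three runs failed.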

3. Verify OCR Success

Check that text extraction works (pdftotext is part of the poppler-utils package):

pdftotext output_ocr.pdf - | head

If you see readable text, the OCR worked successfully.
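
For scripted use, the visual check can be replaced by a word count. The sketch below treats a file as successfully OCRed when pdftotext extracts at least a minimum number of words; has_text_layer is a hypothetical helper name and the default threshold of 20 words is an arbitrary assumption, not a recommended value.

```shell
#!/usr/bin/env bash
# has_text_layer PDF [MIN_WORDS]
# Hypothetical helper: succeeds if the PDF yields at least MIN_WORDS words.
has_text_layer() {
    local pdf="$1" min_words="${2:-20}" words

    # Extract text to stdout and count words; a failed extraction counts as 0.
    words=$(pdftotext "$pdf" - 2>/dev/null | wc -w | tr -d ' ')
    [ "$words" -ge "$min_words" ]
}

# Usage:
#   if has_text_layer output_ocr.pdf; then echo "OCR text found"; fi
```

A word count only shows that some text was embedded. It says nothing about recognition quality, so reading a sample of the output is still worthwhile.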


✅ Summary Workflow

Step  Tool         Command                                                      Purpose
1     Ghostscript  gs -o fixed.pdf -sDEVICE=pdfwrite ...                        Clean and repair corrupted PDF
2     qpdf         qpdf input.pdf fixed.pdf                                     Alternate PDF repair if GS fails
3     OCRmyPDF     ocrmypdf --jobs 4 --deskew --clean fixed.pdf output_ocr.pdf  Add searchable text layer
4     Verify       pdftotext output_ocr.pdf - | head                            Confirm OCR success