Steps to Repair and OCR a Scanned or Corrupted PDF in Ubuntu

Use Ghostscript to rebuild damaged cross-reference tables and fix malformed PDF structure.

sudo apt install ghostscript

gs -o fixed.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dNOPAUSE -dBATCH "input.pdf"

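What it does:

Ghostscript reads the file, recovers what it can from damaged cross-reference data, and writes a fresh PDF through the pdfwrite device. -o names the output file and already implies -dNOPAUSE -dBATCH, so those two flags are redundant but harmless. -dPDFSETTINGS=/prepress selects the high-quality preset, which favors image quality over file size.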

If Ghostscript cannot fix the file, try qpdf:

sudo apt install qpdf
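qpdf "input.pdf" fixed.pdf

qpdf attempts to recover a damaged file automatically while reading it, so no extra flag is needed. An exit status of 3 means the output was written but warnings were reported.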
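The two repair attempts above can be chained so that qpdf only runs when Ghostscript fails. The following is a minimal sketch, not a definitive script: repair_pdf is a hypothetical helper name, and treating an empty output file as a failure is an assumption made here for illustration.

```shell
# Sketch: try Ghostscript first, then fall back to qpdf.
# repair_pdf is a hypothetical helper, not part of either tool.
repair_pdf() {
    local input="$1" output="$2" status=0

    # Attempt 1: rewrite the file through Ghostscript's pdfwrite device.
    if gs -o "$output" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$input" >/dev/null 2>&1 \
        && [ -s "$output" ]; then
        echo "ghostscript"
        return 0
    fi
    rm -f "$output"

    # Attempt 2: qpdf tries to recover damaged files on its own.
    # Exit status 0 = clean, 3 = output written with warnings.
    qpdf "$input" "$output" >/dev/null 2>&1 || status=$?
    if { [ "$status" -eq 0 ] || [ "$status" -eq 3 ]; } && [ -s "$output" ]; then
        echo "qpdf"
        return 0
    fi

    echo "failed"
    return 1
}
```

Calling repair_pdf input.pdf fixed.pdf prints which tool produced fixed.pdf, or "failed" if neither did.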

Use OCRmyPDF to embed searchable text into the PDF.

sudo apt install ocrmypdf tesseract-ocr tesseract-ocr-eng tesseract-ocr-fil
ocrmypdf --jobs 4 --deskew --clean -l eng+fil fixed.pdf output_ocr.pdf

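What it does:

--jobs 4 processes up to four pages in parallel. --deskew straightens crooked scans. --clean runs each page through unpaper before OCR (this needs the unpaper package; the cleaned image is used only for recognition and is not written to the output). -l eng+fil tells Tesseract to recognize English and Filipino.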
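OCRmyPDF stops if a language passed to -l has no installed Tesseract data, so it can help to check first. A minimal sketch, assuming tesseract --list-langs prints one language code per line after a header line; missing_langs is a hypothetical helper name.

```shell
# Sketch: report which languages from an "eng+fil" style list
# are not installed. missing_langs is a hypothetical helper.
missing_langs() {
    local wanted="$1" installed lang missing=""

    # Capture stdout and stderr; older Tesseract releases print the list to stderr.
    installed="$(tesseract --list-langs 2>&1)"

    # Split the list on '+' and look for each code as a whole line.
    for lang in $(printf '%s' "$wanted" | tr '+' ' '); do
        if ! printf '%s\n' "$installed" | grep -Fqx -- "$lang"; then
            missing="${missing:+$missing }$lang"
        fi
    done

    # Prints the missing codes separated by spaces, or an empty line.
    printf '%s\n' "$missing"
}
```

For example, missing_langs eng+fil prints nothing when both packages are installed. Each code it does print can usually be installed as tesseract-ocr-<code>.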

If OCRmyPDF fails on rendering, use an alternate renderer:

ocrmypdf --pdf-renderer sandwich fixed.pdf output_ocr.pdf

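If the PDF is too broken, or OCRmyPDF refuses because pages already contain text, force rasterization and OCR of every page (any existing text layer is discarded):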

ocrmypdf --force-ocr fixed.pdf output_ocr.pdf
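The three OCRmyPDF invocations above form a natural fallback chain. The sketch below tries them in order and reports which extra flags succeeded. ocr_with_fallback is a hypothetical helper name, and reusing --jobs 4 --deskew --clean -l eng+fil on every attempt is an assumption, not something the individual commands above require.

```shell
# Sketch: try default settings, then the sandwich renderer,
# then --force-ocr. ocr_with_fallback is a hypothetical helper.
ocr_with_fallback() {
    local input="$1" output="$2" extra

    for extra in "" "--pdf-renderer sandwich" "--force-ocr"; do
        # $extra is deliberately unquoted so it splits into separate arguments.
        if ocrmypdf --jobs 4 --deskew --clean -l eng+fil $extra "$input" "$output"; then
            # Report which variant worked ("default" when no extra flag was needed).
            echo "${extra:-default}"
            return 0
        fi
    done

    echo "failed"
    return 1
}
```

Calling ocr_with_fallback fixed.pdf output_ocr.pdf leaves OCRmyPDF's own messages on stderr, so the reason for each failed attempt stays visible.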

Check if text extraction works (pdftotext is part of the poppler-utils package):

pdftotext output_ocr.pdf - | head

If you see readable text, the OCR worked successfully.
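The same check can be scripted by counting extracted words instead of reading them by eye. This is a rough sketch: has_text_layer is a hypothetical helper name and the default threshold of 20 words is an arbitrary assumption. A word count says nothing about recognition quality, only that a text layer exists.

```shell
# Sketch: succeed if pdftotext extracts at least a minimum number of words.
# has_text_layer and the default of 20 words are illustrative assumptions.
has_text_layer() {
    local pdf="$1" min_words="${2:-20}" words

    # Write extracted text to stdout ("-") and count whitespace-separated words.
    words="$(pdftotext "$pdf" - 2>/dev/null | wc -w)" || words=0
    # Arithmetic expansion strips any padding some wc builds add.
    words=$(( words ))

    [ "$words" -ge "$min_words" ]
}
```

Usage: has_text_layer output_ocr.pdf && echo "text layer found".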

Step	Tool	Command	Purpose
1	Ghostscript	`gs -o fixed.pdf -sDEVICE=pdfwrite ...`	Clean and repair corrupted PDF
2	QPDF	`qpdf input.pdf fixed.pdf`	Alternate PDF repair if GS fails
3	OCRmyPDF	`ocrmypdf --jobs 4 --deskew --clean fixed.pdf output_ocr.pdf`	Add searchable text layer
4	Verify	`pdftotext output_ocr.pdf - | head`	Confirm OCR success