2025

AI document classification system

Built a Claude Code-powered workflow to classify, name, and file a collection of 10,000 scanned paper documents, ranging from financial filings and bank statements to correspondence with insurance companies and other parties.

Each page runs through an open-source OCR tool. The system then:

  • Extracts text from the document
  • Corrects the orientation when a scan is rotated or skewed
  • Names the document and assigns its category based on an extensive rule set
  • Creates a searchable OCR PDF
  • Moves the document into the right place in the file system
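The naming and filing steps above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the category names, the keyword patterns, the ISO date format, and the "date_title.pdf" naming scheme are hypothetical stand-ins, not the actual rule set.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Rule:
    category: str
    pattern: re.Pattern


# Hypothetical rules; the real rule set is far more extensive.
RULES = [
    Rule("bank-statements", re.compile(r"account statement|closing balance", re.I)),
    Rule("insurance", re.compile(r"policy number|claim", re.I)),
    Rule("financial-filings", re.compile(r"tax return|annual report", re.I)),
]

# Assumes dates appear in ISO form (YYYY-MM-DD) somewhere in the OCR text.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def classify(text: str) -> str:
    """Return the category of the first matching rule, or 'unsorted'."""
    for rule in RULES:
        if rule.pattern.search(text):
            return rule.category
    return "unsorted"


def target_path(root: Path, text: str, title: str) -> Path:
    """Build the destination path: <root>/<category>/<date>_<slug>.pdf."""
    match = DATE_RE.search(text)
    date = match.group(0) if match else "undated"
    # Reduce the title to lowercase words joined by single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return root / classify(text) / f"{date}_{slug}.pdf"
```

Because the rules are checked in order, more specific patterns would need to come before more general ones in a setup like this.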

Recognizing document boundaries

The pages were scanned in a single concerted effort, but pages belonging to the same document were not merged. The system had to learn to recognize document boundaries and stitch together the pages that form one document.

Getting there took a long process of training and iteratively expanding the rule set, but the system now works very well. The collection is being classified and filed at a rate of 200 to 300 documents per hour.
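One way to picture boundary detection is a pass over the OCR text of consecutive pages that decides, page by page, whether a new document starts. The sketch below is only an illustration under stated assumptions: the "Page X of Y" marker and the opening cues ("Dear", "Statement date") are hypothetical heuristics, not the rules the system actually uses.

```python
import re
from typing import List

# Assumed cue: many multi-page documents print "Page X of Y".
PAGE_RE = re.compile(r"\bpage\s+(\d+)\s+of\s+(\d+)\b", re.I)
# Assumed fallback cue: a line that typically opens a letter or statement.
OPENING_RE = re.compile(r"^\s*(dear\b|statement date\b)", re.I | re.M)


def starts_new_document(text: str) -> bool:
    """Guess whether a page is the first page of a document."""
    match = PAGE_RE.search(text)
    if match:
        # An explicit page counter wins over any other cue.
        return int(match.group(1)) == 1
    return bool(OPENING_RE.search(text))


def group_pages(pages: List[str]) -> List[List[int]]:
    """Group page indices into documents, preserving scan order."""
    documents: List[List[int]] = []
    for index, text in enumerate(pages):
        if index == 0 or starts_new_document(text):
            documents.append([index])
        else:
            documents[-1].append(index)
    return documents
```

A page without any recognizable cue is attached to the preceding document, so new cues have to be added whenever a document type is found to be merged incorrectly. That matches the iterative expansion of the rule set described above.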

Token-optimized and economical

The key to making this workflow viable for thousands of documents without running up a huge AI API bill was aggressive optimization and heavy reliance on deterministic scripts, command-line tools, and Apple's high-quality Vision OCR framework. As a result, the entire workflow runs within the limits of a $20 Claude Max subscription and can still handle thousands of documents per month.
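The general idea of keeping token usage low can be sketched as "deterministic first, model last": cheap local checks handle whatever they can, and only the leftovers are sent to the model, as a short excerpt instead of the full text. The keyword checks, the 1,500-character budget, and the ask_model callable are all hypothetical placeholders and do not describe the actual setup.

```python
from typing import Callable, Optional

# Assumed budget for how much OCR text is ever passed to the model.
MAX_EXCERPT_CHARS = 1500


def rule_based_category(text: str) -> Optional[str]:
    """Cheap local check; returns None when no rule applies."""
    lowered = text.lower()
    if "closing balance" in lowered:
        return "bank-statements"
    if "policy number" in lowered:
        return "insurance"
    return None


def categorize(text: str, ask_model: Callable[[str], str]) -> str:
    """Use local rules when possible; fall back to the model otherwise."""
    category = rule_based_category(text)
    if category is not None:
        return category  # no tokens spent
    # Collapse whitespace and truncate so the model sees a short excerpt.
    excerpt = " ".join(text.split())[:MAX_EXCERPT_CHARS]
    return ask_model(excerpt)
```

Passing the model call in as a function keeps the sketch independent of any particular client and makes it easy to count how often the fallback is actually used.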