Using AI to Extract Data from 18 Million PDF Files

How to extract PDF files and parse specific data using AI

The client tasked us with tackling a large-scale data extraction project involving 18 million PDF files hosted on a website. These files contained financial reporting details from companies. The focus was on identifying and parsing information on company turnover.

Key challenges in the project

Original format

PDFs were scanned documents. This meant they didn’t contain embedded text that could be easily searched or extracted. The absence of readable text made identifying turnover data directly within the files impossible.

Varied structure

The turnover data we needed could appear as tables, plain text, or hidden among detailed descriptions. However, we uncovered two critical patterns during our investigation. First, the required information was usually located within the initial pages, reducing the need to process full documents and lowering operational costs. Second, the presence of the keyword “turnover” allowed us to filter out irrelevant files.

AI prompting

Initially, Anthropic tended to round numerical values, output numbers as text, or provide inconsistent responses. This unpredictability made it difficult to extract reliable turnover data. To mitigate this, we implemented a structured output format where the AI’s responses adhered to predefined fields. We also added a dedicated subprompt to pinpoint both the year and the turnover value.

Processing speed

Running the AI model locally posed serious time constraints. Given the scale of the task—processing 18 million PDF files—it would have taken months to complete on a single computer, virtual machine, or EC2 instance. This delay was unacceptable given the project’s deadlines and the client’s expectations.

Need assistance with your project?

Contact us to discuss how we can help you overcome bottlenecks and efficiently collect data from the web.

Our solution: How to extract a PDF document

How Intsurfing approached PDF parsing
Using PDF reader Python to convert to text

For parsing a PDF, we used Python and two libraries: Pdf2image to convert PDF into images and Pytesseract to extract text from images. We gathered information about all PDF files (S3 paths or EFS/EBS volumes, keywords to search for, and the number of pages to process). This information was then pushed to the AWS SQS queue.

We further containerized the entire setup with Docker. This container was deployed to AWS ECR through AWS ECS with scaling capabilities. The service starts processing when SQS has a certain number of messages and scales down when the queue is empty for a set time. The resulting TXT files were stored in a designated output destination: S3, EFS, or EBS.

Applying AI PDF analyzer to extract data | Intsurfing

Using Python, we relied on the Anthropic to locate and extract turnover data from the text. We implemented a second SQS queue specifically for this step. Messages in this queue included the location of the .txt files and keywords.

The extracted results of turnover values and their corresponding years were saved into structured CSV files.

Technologies we used in the project

Python

Nannostomus

Docker

Anthropic

EFS

S3

EBS

SQS

ECR

ECS

The results: parse a PDF file with speed and cost-savings

Using a carefully designed workflow, Intsurfing filtered the initial dataset of 18 million files down to 5 million after converting them to text. From these, 1.2 million files contained the required turnover values.

  • Even when accounting for AWS infrastructure and Anthropic, the cost per 1,000 files was only 0.6–0.7 cents (excluding the use of Nannostomus).
  • While downloading the PDFs took several weeks, the actual processing—text conversion and parsing—was completed in just three days.
How to extract a PDF fast and budget-friendly | Intsurfing

Make big data work for you

Reach out to us today. We'll review your requirements, provide a tailored solution and quote, and start your project once you agree.

Contact us

Complete the form with your personal and project details, so we can get back to you with a personalized solution.