How to extract PDF files and parse specific data using AI
The client tasked us with tackling a large-scale data extraction project involving 18 million PDF files hosted on a website. These files contained financial reporting details from companies. The focus was on identifying and parsing information on company turnover.
Key challenges in the project
Original format
PDFs were scanned documents. This meant they didn’t contain embedded text that could be easily searched or extracted. The absence of readable text made identifying turnover data directly within the files impossible.
Varied structure
The turnover data we needed could appear as tables, plain text, or hidden among detailed descriptions. However, we uncovered two critical patterns during our investigation. First, the required information was usually located within the initial pages, reducing the need to process full documents and lowering operational costs. Second, the presence of the keyword “turnover” allowed us to filter out irrelevant files.
AI prompting
Initially, Anthropic tended to round numerical values, output numbers as text, or provide inconsistent responses. This unpredictability made it difficult to extract reliable turnover data. To mitigate this, we implemented a structured output format where the AI’s responses adhered to predefined fields. We also added a dedicated subprompt to pinpoint both the year and the turnover value.
Processing speed
Running the AI model locally posed serious time constraints. Given the scale of the task—processing 18 million PDF files—it would have taken months to complete on a single computer, virtual machine, or EC2 instance. This delay was unacceptable given the project’s deadlines and the client’s expectations.
Our solution: How to extract a PDF document
We used Nannostomus, our in-house PDF parsing tool, to extract files from the website. These files were then stored in Amazon EFS/ S3. Our approach focused on minimizing costs by optimizing resource utilization. Additionally, the automated management of worker lifecycles streamlined the process further. The downloaded files were organized within the storage system to make them easily accessible for the next steps in the pipeline.
For parsing a PDF, we used Python and two libraries: Pdf2image to convert PDF into images and Pytesseract to extract text from images. We gathered information about all PDF files (S3 paths or EFS/EBS volumes, keywords to search for, and the number of pages to process). This information was then pushed to the AWS SQS queue.
We further containerized the entire setup with Docker. This container was deployed to AWS ECR through AWS ECS with scaling capabilities. The service starts processing when SQS has a certain number of messages and scales down when the queue is empty for a set time. The resulting TXT files were stored in a designated output destination: S3, EFS, or EBS.
Using Python, we relied on the Anthropic to locate and extract turnover data from the text. We implemented a second SQS queue specifically for this step. Messages in this queue included the location of the .txt files and keywords.
The extracted results of turnover values and their corresponding years were saved into structured CSV files.
Technologies we used in the project
Python
Nannostomus
Docker
Anthropic
EFS
S3
EBS
SQS
ECR
ECS
The results: parse a PDF file with speed and cost-savings
Using a carefully designed workflow, Intsurfing filtered the initial dataset of 18 million files down to 5 million after converting them to text. From these, 1.2 million files contained the required turnover values.
- Even when accounting for AWS infrastructure and Anthropic, the cost per 1,000 files was only 0.6–0.7 cents (excluding the use of Nannostomus).
- While downloading the PDFs took several weeks, the actual processing—text conversion and parsing—was completed in just three days.
Make big data work for you
Reach out to us today. We'll review your requirements, provide a tailored solution and quote, and start your project once you agree.
Contact us
Complete the form with your personal and project details, so we can get back to you with a personalized solution.