Turning PDF into text is tricky.
Sometimes, the text isn’t even text—it’s embedded in images.
Other times, the formatting turns into a mess as you try to process it.
Maybe you’ve tried PDF-to-text OCR tools that didn’t meet your accuracy requirements.
Or wrestled with PDFs hundreds of pages long and ended up wasting memory.
Perhaps you’ve hit a wall trying to automate the whole process.
These challenges can be a real headache, especially when you’re dealing with large datasets or building pipelines that demand precision.
So, in this article, we’ll dive into the tools, techniques, and workflows you can use to tackle these issues head-on.
How to Get Text From PDF Using pdf2image
First, we need to convert the PDF into images. Once each page is an image, OCR tools can extract the text from it.
The Python (3.7+) module pdf2image is a solid choice for this. It wraps pdftoppm and pdftocairo from Poppler-utils to convert a PDF into PIL Image objects.
If you’ve got Poppler-utils installed, you’re good to go. With its help, you’ll be able to transform a PDF into images (one file page = one image) and:
- Choose the output format (JPEG, PNG, etc.) using the save() method.
- Specify first_page and last_page to save memory. If the PDF has 80 pages, but you only need the first 15, limit the range. In our PDF processing project, this allowed us to speed up processing and reduce the cost by 15%.
import os
from pdf2image import convert_from_path

# Specify the file path and the page range
images = convert_from_path('example.pdf', first_page=1, last_page=15)

# Save images to a folder (created if it doesn't exist yet)
os.makedirs('output', exist_ok=True)
for i, img in enumerate(images):
    img.save(f'output/page_{i+1}.jpg', 'JPEG')
This way, you end up with one converted image per page in the specified range.
How to Extract Text from PDF with PyTesseract
For this, we’ll use PyTesseract. It’s a Python wrapper for Google’s Tesseract-OCR engine, which means you get the full power of Tesseract with Python’s flexibility.
Let’s break it down.
Why Use PyTesseract to Turn PDF Images into Text
PyTesseract makes it simple to convert image-based text into machine-readable formats.
With its help, you’ll turn PDF into readable text, whether you’re dealing with scanned documents or PDFs with embedded images.
Here are the features of PyTesseract that will be handy for this particular task (a short sketch follows the list):
- Text extraction from PNG, JPEG, BMP, GIF, and TIFF. Handles single characters, lines, or full paragraphs.
- Support for multiple languages out of the box. You can add new languages by training Tesseract with language-specific data files.
- Image preprocessing: it pairs well with OpenCV or Pillow (e.g., thresholding, resizing, and noise removal).
- Multiple output formats. Returns text as plain strings or as structured data (hOCR or TSV), which includes positional information for recognized text.
- Support for configuration options (e.g., specifying page segmentation modes, OCR engine modes, or disabling certain OCR features).
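Here’s a minimal sketch of those last points: image_to_string with an explicit page segmentation mode, and image_to_data for word-level positions. The page image path is a placeholder from the conversion step above.
import pytesseract
from PIL import Image

# Light preprocessing with Pillow: convert the page to grayscale
img = Image.open('output/page_1.jpg').convert('L')

# Plain-text output with an explicit page segmentation mode
text = pytesseract.image_to_string(img, lang='eng', config='--psm 6')

# Structured output with word-level bounding boxes
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
print(text)
print(list(zip(data['text'][:5], data['left'][:5], data['top'][:5])))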
Setting Up PyTesseract
Before jumping into the code, ensure you have everything installed.
- Install Tesseract-OCR and its system dependencies. In a Dockerfile, that looks like this:
RUN apt-get update && apt-get install -y \
libsm6 libxext6 libxrender-dev tesseract-ocr \
&& apt-get clean
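The apt packages above cover the system-level Tesseract binary; the Python wrappers are installed separately from PyPI:
pip install pytesseract pdf2image Pillow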
- By default, Tesseract uses English. For other languages, install the relevant language data. For instance, for the Ukrainian language:
sudo apt-get install tesseract-ocr-ukr
To install all supported languages:
sudo apt-get install tesseract-ocr-all
You can find the supported languages and data files in the Tesseract documentation.
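Once a language pack is installed, you request it via the lang parameter. A quick check, assuming the Ukrainian pack from above and a page image from the earlier conversion step:
import pytesseract
from PIL import Image

# 'ukr' corresponds to the tesseract-ocr-ukr package installed above
text = pytesseract.image_to_string(Image.open('output/page_1.jpg'), lang='ukr')
print(text[:200])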
How to Extract Text From a PDF Document
Here is how to process an image, extract text from it, and combine the extracted text files into a single output file.
These scripts were written for Python 3.8, but they work just fine on Python 3.12.
import os
import pytesseract

# new_path, txt_folder, i, and file_path come from the surrounding pipeline
text = pytesseract.image_to_string(new_path)  # OCR a single page image
text_path = os.path.join(txt_folder, str(i) + "_" + file_path.rsplit('/', 1)[-1]).replace('.pdf', '.txt')
with open(text_path, "w") as file:
    file.write(text)

# combine_text_files is the pipeline's helper that merges the per-page .txt files
output_path = os.path.join(txt_final_folder, file_clean_name + '.txt')
combine_text_files(txt_folder, output_path, date_path)
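To see how the pieces fit together, here’s a self-contained sketch that chains pdf2image and PyTesseract for a single document. The function and file names are illustrative, not the project’s actual helpers.
import pytesseract
from pdf2image import convert_from_path

def pdf_to_text(pdf_path, txt_path, first_page=1, last_page=None):
    # Convert the requested page range into PIL images
    images = convert_from_path(pdf_path, first_page=first_page, last_page=last_page)
    # OCR each page and join the results into a single text file
    pages = [pytesseract.image_to_string(img) for img in images]
    with open(txt_path, "w") as f:
        f.write("\n".join(pages))

pdf_to_text('example.pdf', 'example.txt', first_page=1, last_page=15)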
Managing Storage: S3, EBS, and EFS
You’ll often need to juggle object storage, block storage, and file systems depending on your workflow. AWS offers three options: S3, EBS, and EFS. Let’s break them down.
| Aspect | Amazon S3 | Amazon EBS | Amazon EFS |
|---|---|---|---|
| Type | Object storage | Block storage | File system |
| Scalability | Virtually unlimited | Limited to 64 TiB per volume | Automatically scales with usage |
| Performance | Higher latency; strong consistency for most operations | Low latency, high IOPS | Moderate; supports bursting and provisioned throughput |
| Accessibility | Accessible globally via HTTP/HTTPS | Attached to a single EC2 instance (or Multi-Attach) | Concurrent access by multiple EC2 instances |
| Cost | Low cost; tiers like S3 Glacier for infrequent access | More expensive, especially for high performance | Higher cost; must be managed carefully |
| Durability | 99.999999999% durability across multiple AZs | Persistent storage with backup snapshots | Data is distributed across multiple AZs |
| Management | Simple; lifecycle policies, cross-region replication | Needs manual provisioning and snapshot management | No need to manage file servers or provisioning |
| Key Use Cases | Backups, archival, data lakes, static website hosting | Databases, transactional applications | Shared storage, big data analytics |
Often, the best approach is to combine these services when you transform PDF to text. For example:
- Upload PDFs to EFS: Use it as a shared staging area for processing.
- Process Text with EBS: Attach EBS volumes to EC2 instances for quick, temporary processing.
- Store Outputs in S3: Move final files to S3 for long-term storage and cost savings.
In one of our projects where we converted PDFs to text, we used Nannostomus to extract PDF files from the source and load them to EFS. Then, we moved the processed files to an S3 bucket.
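That final hand-off to S3 is a single boto3 call. A minimal sketch, assuming configured AWS credentials; the bucket name and paths are hypothetical:
import boto3

s3 = boto3.client('s3')

# Move a finished text file from local (EBS-backed) storage to S3
s3.upload_file('/mnt/output/example.txt', 'my-pdf-text-bucket', 'outputs/example.txt')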
Handling Workflows with AWS SQS
Dealing with large-scale PDF-to-text pipelines often means juggling hundreds, or thousands, of files. You need a system that can handle these tasks, distribute the workload, and avoid bottlenecks.
For these tasks, AWS SQS (Simple Queue Service) is a sound solution.
Below, we’ll explore how SQS works, its strengths and weaknesses, and how you can integrate it into your processing pipeline with practical examples. Let’s get to it.
Pros and Cons of AWS SQS
SQS is a fully managed message queuing service that lets you decouple components in your workflow. It allows you to enqueue tasks, scale your processing dynamically, and ensure each file gets handled without losing track of the workflow. If you’re tired of dealing with local task lists or custom queue implementations, SQS simplifies all of that.
| Pros | Cons |
|---|---|
| Simplifies task management and decoupling of services | Cannot directly view all queued messages |
| Handles large-scale workloads well | Requires an active connection to AWS |
| Supports flexible delays and scheduling | Region-specific queues; switching regions requires additional setup |
| Integrates seamlessly with AWS services like Lambda and ECS | Higher complexity compared to simple task queues |
| Provides visibility into sent and processed messages | Latency-sensitive workloads may require additional configuration |
With SQS, you don’t need to maintain information about queued files locally, and you can monitor how many messages have been sent, processed, or are still pending.
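That visibility comes from queue attributes. A quick sketch, assuming a boto3 client and the queue URL used throughout this section:
import boto3

sqs = boto3.client('sqs')

# Approximate counts of pending (visible) and in-flight messages
attrs = sqs.get_queue_attributes(
    QueueUrl=SQS_QUEUE_URL,
    AttributeNames=['ApproximateNumberOfMessages', 'ApproximateNumberOfMessagesNotVisible']
)
print(attrs['Attributes'])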
How SQS Fits into the PDF-to-Text Pipeline
Here’s a typical workflow.
- Use the AWS Management Console, CLI, or SDK to create a queue. For most use cases, a Standard Queue works well.
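If you go the SDK route, creating the queue is one call. This sketch also captures the queue URL that the later snippets refer to as SQS_QUEUE_URL; the queue name is hypothetical:
import boto3

sqs = boto3.client('sqs')

# A Standard Queue is created by default; FIFO queues need a '.fifo' suffix
response = sqs.create_queue(QueueName='pdf-to-text-tasks')
SQS_QUEUE_URL = response['QueueUrl']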
- To push files into the queue, you’ll first create a list of files and package them as messages. Batch sending improves performance by reducing the number of API calls. Here’s how:
import json
import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

sqs = boto3.client('sqs')

def send_batch(batch):
    # SQS accepts up to 10 messages per send_message_batch call
    return sqs.send_message_batch(QueueUrl=SQS_QUEUE_URL, Entries=batch)

batch_messages = []
accumulated_batches = []
valid_message_count = 0

# file_list holds the PDF paths to enqueue; KEY_WORDS, MAX_PAGES_PER_FILE,
# max_messages_per_batch (<= 10), and max_batches_accumulated are pipeline settings
for index, part in enumerate(file_list):
    detail = json.dumps({
        "fileName": part,
        "keyWords": KEY_WORDS,
        "maxPagesPerFile": MAX_PAGES_PER_FILE
    })
    message = {
        'Id': f"msg-{index}",
        'MessageBody': detail
    }
    batch_messages.append(message)
    valid_message_count += 1
    print(f"Message prepared for {part}")

    # Close the current batch once it reaches the per-request limit
    if len(batch_messages) == max_messages_per_batch:
        accumulated_batches.append(batch_messages)
        batch_messages = []

    # Send the accumulated batches in parallel to cut down on API round trips
    if len(accumulated_batches) == max_batches_accumulated:
        try:
            print(f"Sending {len(accumulated_batches)} batches to SQS...")
            with ThreadPoolExecutor(max_workers=10) as executor:
                futures = [executor.submit(send_batch, batch) for batch in accumulated_batches]
                for future in as_completed(futures):
                    future.result()
            accumulated_batches = []
        except Exception as e:
            print(f"Error sending batches to SQS: {e}")
- Once the messages are queued, you can retrieve and process them. Here’s how to handle this part of the workflow:
response = sqs.receive_message(
    QueueUrl=SQS_QUEUE_URL,
    MaxNumberOfMessages=10,  # Adjust batch size as needed
    WaitTimeSeconds=20  # Long polling
)
- If the request contains messages, we parse the data we need and process each file:
if 'Messages' in response:
    with ThreadPoolExecutor() as executor:
        futures = []
        for message in response['Messages']:
            # Parse message body
            body = json.loads(message['Body'])
            file_path = body['fileName']
            key_words = body['keyWords'].split(",")
            max_pages_per_file = body['maxPagesPerFile']
            print(f"Scheduling processing for file: {file_path}")
            # Submit file processing to the thread pool; keep the message for deletion later
            future = executor.submit(process_file, file_path, message, processed_files, key_words, max_pages_per_file)
            futures.append((future, message))

        # Wait for all tasks to complete before checking the queue again
        for future, message in futures:
            future.result()
            processed_files += 1  # Increment the counter
            # Delete the message only after its file has been processed successfully
            sqs.delete_message(
                QueueUrl=SQS_QUEUE_URL,
                ReceiptHandle=message['ReceiptHandle']
            )
            print(f"Deleted message from SQS: {message['MessageId']}")
Futures operate in a way that avoids unloading the entire queue at once. Instead, they process messages gradually. This approach makes it possible to run the corresponding service on ECS, scaling dynamically based on the number of messages in the queue. You don’t have to worry about tasks stopping before all the files are processed.
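Wrapped in a polling loop, the same logic becomes a long-running ECS worker. A rough sketch; process_batch is a hypothetical stand-in for the handling code above:
while True:
    response = sqs.receive_message(
        QueueUrl=SQS_QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20  # Long polling keeps idle API calls to a minimum
    )
    if 'Messages' not in response:
        # Empty queue: the worker can exit and let ECS scale the service down
        break
    process_batch(response)  # hypothetical wrapper around the handling code above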
Wrapping It Up
With the right tools and a clear plan, you can turn PDFs into text with ease.
Break the flow into smaller steps. Convert PDF files to images. Run a PDF-to-text OCR converter over them. Choose the right storage for your data. And don’t forget to streamline everything with SQS.
Every choice you make matters. The storage you pick affects speed and cost. Preprocessing images can make or break OCR accuracy. And the way you handle your queues decides how well your system scales under pressure.
pdf2image, PyTesseract, S3, EBS, EFS, and SQS are here to make your life easier. Experiment with them. Refine your process. Scale it as you grow.
You’ve got everything you need to build a robust PDF-to-text pipeline. Now it’s time to get to work. You’ve got this!