Turning PDF into text is tricky.
Sometimes, the text isn’t even text—it’s embedded in images.
Other times, the formatting turns into a mess as you try to process it.
Maybe you’ve tried PDF-to-text OCR tools that didn’t meet your accuracy requirements.
Or wrestled with PDFs hundreds of pages long and ended up wasting memory.
Perhaps you’ve hit a wall trying to automate the whole process.
These challenges can be a real headache, especially when you’re dealing with large datasets or building pipelines that demand precision.
So, in this article, we’ll dive into the tools, techniques, and workflows you can use to tackle these issues head-on.
How to Get Text From PDF Using pdf2image
First, we need to convert the PDF into images. Once each page is an image, OCR tools can extract the text from it.
The Python (3.7+) module pdf2image is a solid choice for this. It wraps pdftoppm and pdftocairo from Poppler-utils to convert a PDF into PIL Image objects.
If you’ve got Poppler-utils installed, you’re good to go. With its help, you’ll be able to transform a PDF into images (one file page = one image) and:
- Choose the output format (JPEG, PNG, etc.) using the save() method.
- Specify first_page and last_page to save memory. If the PDF has 80 pages, but you only need the first 15, limit the range. In our PDF processing project, this allowed us to speed up processing and reduce the cost by 15%.
import os
from pdf2image import convert_from_path

# Specify the file path and the page range
images = convert_from_path('example.pdf', first_page=1, last_page=15)

# Save images to a folder (created if it doesn't exist yet)
os.makedirs('output', exist_ok=True)
for i, img in enumerate(images):
    img.save(f'output/page_{i+1}.jpg', 'JPEG')
This way, you end up with one converted image per page in the specified range.
How to Extract Text from PDF with PyTesseract
For this, we’ll use PyTesseract. It’s a Python wrapper for Google’s Tesseract-OCR engine, which means you get the full power of Tesseract with Python’s flexibility.
Let’s break it down.
Why Use PyTesseract to Turn PDF Images into Text
PyTesseract makes it simple to convert image-based text into machine-readable formats.
With its help, you’ll turn PDF into readable text, whether you’re dealing with scanned documents or PDFs with embedded images.
Here are the features of PyTesseract that will be handy for this particular task (a short sketch follows the list):
- Text extraction from PNG, JPEG, BMP, GIF, and TIFF. Handles single characters, lines, or full paragraphs.
- Support for multiple languages out of the box. You can add new languages by training Tesseract with language-specific data files.
- Image preprocessing: it pairs well with OpenCV or Pillow (e.g., thresholding, resizing, and noise removal).
- Multiple output formats. Returns text as plain strings or as structured data (hOCR or TSV), which includes positional information for recognized text.
- Support for configuration options (e.g., specifying page segmentation modes, OCR engine modes, or disabling certain OCR features).
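Here’s a minimal sketch of those last points: image_to_string with an explicit page segmentation mode, and image_to_data for word-level positions. The page image path is a placeholder from the conversion step above.
import pytesseract
from PIL import Image

# Light preprocessing with Pillow: convert the page to grayscale
img = Image.open('output/page_1.jpg').convert('L')

# Plain-text output with an explicit page segmentation mode
text = pytesseract.image_to_string(img, lang='eng', config='--psm 6')

# Structured output with word-level bounding boxes
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
print(text)
print(list(zip(data['text'][:5], data['left'][:5], data['top'][:5])))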
Setting Up PyTesseract
Before jumping into the code, ensure you have everything installed.
- Install Tesseract-OCR and its system dependencies. In a Dockerfile, that looks like this:
RUN apt-get update && apt-get install -y \
libsm6 libxext6 libxrender-dev tesseract-ocr \
&& apt-get clean
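The apt packages above cover the system-level Tesseract binary; the Python wrappers are installed separately from PyPI:
pip install pytesseract pdf2image Pillow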
- By default, Tesseract uses English. For other languages, install the relevant language data. For instance, for the Ukrainian language:
sudo apt-get install tesseract-ocr-ukr
To install all supported languages:
sudo apt-get install tesseract-ocr-all
You can find the supported languages and data files in the Tesseract documentation.
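Once a language pack is installed, you request it via the lang parameter. A quick check, assuming the Ukrainian pack from above and a page image from the earlier conversion step:
import pytesseract
from PIL import Image

# 'ukr' corresponds to the tesseract-ocr-ukr package installed above
text = pytesseract.image_to_string(Image.open('output/page_1.jpg'), lang='ukr')
print(text[:200])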
How to Extract Text From a PDF Document
Here is how to process an image, extract text from it, and combine the extracted text files into a single output file.
These scripts were written for Python 3.8, but they work just fine on Python 3.12.
import os
import pytesseract

# new_path, txt_folder, i, and file_path come from the surrounding pipeline
text = pytesseract.image_to_string(new_path)  # OCR a single page image
text_path = os.path.join(txt_folder, str(i) + "_" + file_path.rsplit('/', 1)[-1]).replace('.pdf', '.txt')
with open(text_path, "w") as file:
    file.write(text)

# combine_text_files is the pipeline's helper that merges the per-page .txt files
output_path = os.path.join(txt_final_folder, file_clean_name + '.txt')
combine_text_files(txt_folder, output_path, date_path)
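To see how the pieces fit together, here’s a self-contained sketch that chains pdf2image and PyTesseract for a single document. The function and file names are illustrative, not the project’s actual helpers.
import pytesseract
from pdf2image import convert_from_path

def pdf_to_text(pdf_path, txt_path, first_page=1, last_page=None):
    # Convert the requested page range into PIL images
    images = convert_from_path(pdf_path, first_page=first_page, last_page=last_page)
    # OCR each page and join the results into a single text file
    pages = [pytesseract.image_to_string(img) for img in images]
    with open(txt_path, "w") as f:
        f.write("\n".join(pages))

pdf_to_text('example.pdf', 'example.txt', first_page=1, last_page=15)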
Managing Storage: S3, EBS, and EFS
You’ll often need to juggle object storage, block storage, and file systems depending on your workflow. AWS offers three options: S3, EBS, and EFS. Let’s break them down.
| Aspect | Amazon S3 | Amazon EBS | Amazon EFS |
|---|---|---|---|
| Type | Object storage | Block storage | File system |
| Scalability | Virtually unlimited | Limited to 64 TiB per volume | Automatically scales with usage |
| Performance | Higher latency; strong consistency for most operations | Low latency, high IOPS | Moderate; supports bursting and provisioned throughput |
| Accessibility | Accessible globally via HTTP/HTTPS | Attached to a single EC2 instance (or Multi-Attach) | Concurrent access by multiple EC2 instances |
| Cost | Low cost; tiers like S3 Glacier for infrequent access | More expensive, especially for high performance | Higher cost; must be managed carefully |
| Durability | 99.999999999% durability across multiple AZs | Persistent storage with backup snapshots | Data is distributed across multiple AZs |
| Management | Simple; lifecycle policies, cross-region replication | Needs manual provisioning and snapshot management | No need to manage file servers or provisioning |
| Key Use Cases | Backups, archival, data lakes, static website hosting | Databases, transactional applications | Shared storage, big data analytics |
Often, the best approach is to combine these services when you transform PDF to text. For example:
- Upload PDFs to EFS: Use it as a shared staging area for processing.
- Process Text with EBS: Attach EBS volumes to EC2 instances for quick, temporary processing.
- Store Outputs in S3: Move final files to S3 for long-term storage and cost savings.
In one of our projects where we converted PDFs to text, we used Nannostomus to extract PDF files from the source and load them to EFS. Then, we moved the processed files to an S3 bucket.
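That final hand-off to S3 is a single boto3 call. A minimal sketch, assuming configured AWS credentials; the bucket name and paths are hypothetical:
import boto3

s3 = boto3.client('s3')

# Move a finished text file from local (EBS-backed) storage to S3
s3.upload_file('/mnt/output/example.txt', 'my-pdf-text-bucket', 'outputs/example.txt')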
Handling Workflows with AWS SQS
Dealing with large-scale PDF-to-text pipelines often means juggling hundreds, or thousands, of files. You need a system that can handle these tasks, distribute the workload, and avoid bottlenecks.
For these tasks, AWS SQS (Simple Queue Service) is a sound solution.
Below, we’ll explore how SQS works, its strengths and weaknesses, and how you can integrate it into your processing pipeline with practical examples. Let’s get to it.
Pros and Cons of AWS SQS
SQS is a fully managed message queuing service that lets you decouple components in your workflow. It allows you to enqueue tasks, scale your processing dynamically, and ensure each file gets handled without losing track of the workflow. If you’re tired of dealing with local task lists or custom queue implementations, SQS simplifies all of that.
| Pros | Cons |
|---|---|
| Simplifies task management and decoupling of services | Cannot directly view all queued messages |
| Handles large-scale workloads well | Requires an active connection to AWS |
| Supports flexible delays and scheduling | Region-specific queues; switching regions requires additional setup |
| Integrates seamlessly with AWS services like Lambda and ECS | Higher complexity compared to simple task queues |
| Provides visibility into sent and processed messages | Latency-sensitive workloads may require additional configuration |
With SQS, you don’t need to maintain information about queued files locally, and you can monitor how many messages have been sent, processed, or are still pending.
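That visibility comes from queue attributes. A quick sketch, assuming a boto3 client and the queue URL used throughout this section:
import boto3

sqs = boto3.client('sqs')

# Approximate counts of pending (visible) and in-flight messages
attrs = sqs.get_queue_attributes(
    QueueUrl=SQS_QUEUE_URL,
    AttributeNames=['ApproximateNumberOfMessages', 'ApproximateNumberOfMessagesNotVisible']
)
print(attrs['Attributes'])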
How SQS Fits into the PDF-to-Text Pipeline
Here’s a typical workflow.
- Use the AWS Management Console, CLI, or SDK to create a queue. For most use cases, a Standard Queue works well.
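If you go the SDK route, creating the queue is one call. This sketch also captures the queue URL that the later snippets refer to as SQS_QUEUE_URL; the queue name is hypothetical:
import boto3

sqs = boto3.client('sqs')

# A Standard Queue is created by default; FIFO queues need a '.fifo' suffix
response = sqs.create_queue(QueueName='pdf-to-text-tasks')
SQS_QUEUE_URL = response['QueueUrl']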
- To push files into the queue, you’ll first create a list of files and package them as messages. Batch sending improves performance by reducing the number of API calls. Here’s how:
import json
import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

sqs = boto3.client('sqs')

def send_batch(batch):
    # SQS accepts up to 10 messages per send_message_batch call
    return sqs.send_message_batch(QueueUrl=SQS_QUEUE_URL, Entries=batch)

batch_messages = []
accumulated_batches = []
valid_message_count = 0

# file_list holds the PDF paths to enqueue; KEY_WORDS, MAX_PAGES_PER_FILE,
# max_messages_per_batch (<= 10), and max_batches_accumulated are pipeline settings
for index, part in enumerate(file_list):
    detail = json.dumps({
        "fileName": part,
        "keyWords": KEY_WORDS,
        "maxPagesPerFile": MAX_PAGES_PER_FILE
    })
    message = {
        'Id': f"msg-{index}",
        'MessageBody': detail
    }
    batch_messages.append(message)
    valid_message_count += 1
    print(f"Message prepared for {part}")

    # Close the current batch once it reaches the per-request limit
    if len(batch_messages) == max_messages_per_batch:
        accumulated_batches.append(batch_messages)
        batch_messages = []

    # Send the accumulated batches in parallel to cut down on API round trips
    if len(accumulated_batches) == max_batches_accumulated:
        try:
            print(f"Sending {len(accumulated_batches)} batches to SQS...")
            with ThreadPoolExecutor(max_workers=10) as executor:
                futures = [executor.submit(send_batch, batch) for batch in accumulated_batches]
                for future in as_completed(futures):
                    future.result()
            accumulated_batches = []
        except Exception as e:
            print(f"Error sending batches to SQS: {e}")
- Once the messages are queued, you can retrieve and process them. Here’s how to handle this part of the workflow:
response = sqs.receive_message(
    QueueUrl=SQS_QUEUE_URL,
    MaxNumberOfMessages=10,  # Adjust batch size as needed
    WaitTimeSeconds=20  # Long polling
)
- If the request contains messages, we parse the data we need and process each file:
if 'Messages' in response:
    with ThreadPoolExecutor() as executor:
        futures = []
        for message in response['Messages']:
            # Parse message body
            body = json.loads(message['Body'])
            file_path = body['fileName']
            key_words = body['keyWords'].split(",")
            max_pages_per_file = body['maxPagesPerFile']
            print(f"Scheduling processing for file: {file_path}")
            # Submit file processing to the thread pool; keep the message for deletion later
            future = executor.submit(process_file, file_path, message, processed_files, key_words, max_pages_per_file)
            futures.append((future, message))

        # Wait for all tasks to complete before checking the queue again
        for future, message in futures:
            future.result()
            processed_files += 1  # Increment the counter
            # Delete the message only after its file has been processed successfully
            sqs.delete_message(
                QueueUrl=SQS_QUEUE_URL,
                ReceiptHandle=message['ReceiptHandle']
            )
            print(f"Deleted message from SQS: {message['MessageId']}")
Futures operate in a way that avoids unloading the entire queue at once. Instead, they process messages gradually. This approach makes it possible to run the corresponding service on ECS, scaling dynamically based on the number of messages in the queue. You don’t have to worry about tasks stopping before all the files are processed.
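Wrapped in a polling loop, the same logic becomes a long-running ECS worker. A rough sketch; process_batch is a hypothetical stand-in for the handling code above:
while True:
    response = sqs.receive_message(
        QueueUrl=SQS_QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20  # Long polling keeps idle API calls to a minimum
    )
    if 'Messages' not in response:
        # Empty queue: the worker can exit and let ECS scale the service down
        break
    process_batch(response)  # hypothetical wrapper around the handling code above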
Wrapping It Up
With the right tools and a clear plan, you can turn PDFs into text with ease.
Break the flow into smaller steps. Convert PDF files to images. Run a PDF-to-text OCR converter over them. Choose the right storage for your data. And don’t forget to streamline everything with SQS.
Every choice you make matters. The storage you pick affects speed and cost. Preprocessing images can make or break OCR accuracy. And the way you handle your queues decides how well your system scales under pressure.
pdf2image, PyTesseract, S3, EBS, EFS, and SQS are here to make your life easier. Experiment with them. Refine your process. Scale it as you grow.
You’ve got everything you need to build a robust PDF-to-text pipeline. Now it’s time to get to work. You’ve got this!