Quick facts about automated data collection software

Existing tools for collecting data are either too expensive or require specialized skills. Our client needed a solution that would allow their customers to launch data acquisition projects without building complex in-house systems or hiring technical experts.
The goal was to automate the process, reduce costs, and make data accessible to non-technical users. We developed a versatile data gathering tool that handles large-scale information collection with minimal setup.
Challenges when building an app for data collection
Uneven load
Previous scraping setups didn’t make the best use of AWS resources. Because tasks were poorly distributed, virtual machines were unevenly loaded: some EC2 instances stayed underutilized while others were overworked. This imbalance drove costs up, especially when operating at scale.
Dynamic content
Sites change formats, load content dynamically, and mix structured with unstructured data. That’s a problem—unless your system can adapt on the fly. We developed a data collection program that processes dynamic content and identifies patterns.
High load
Another issue was the high database load when saving results. When too many machines tried to write data simultaneously, it caused conflicts with updates to the same items. We used optimistic locking with per-item version checks, ensuring parallel writes didn’t overwrite each other.
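
Since DynamoDB is part of this stack, the conditional write behind that locking can be pictured with the sketch below, built on the AWS SDK for .NET. The table name, key schema, and attribute names are illustrative, not the production schema.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Minimal sketch of optimistic locking via a DynamoDB conditional write.
// Table ("ScrapeResults") and attributes ("Payload", "Version") are illustrative.
public class ResultWriter
{
    private readonly IAmazonDynamoDB _db = new AmazonDynamoDBClient();

    public async Task SaveAsync(string itemId, string payload, long expectedVersion)
    {
        var request = new UpdateItemRequest
        {
            TableName = "ScrapeResults",
            Key = new Dictionary<string, AttributeValue>
            {
                ["Id"] = new AttributeValue { S = itemId }
            },
            // "Version" is a DynamoDB reserved word, so it is aliased as #v.
            ExpressionAttributeNames = new Dictionary<string, string> { ["#v"] = "Version" },
            // Apply the write only if nobody bumped the version in the meantime.
            ConditionExpression = "#v = :expected",
            UpdateExpression = "SET Payload = :payload, #v = :next",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":expected"] = new AttributeValue { N = expectedVersion.ToString() },
                [":next"]     = new AttributeValue { N = (expectedVersion + 1).ToString() },
                [":payload"]  = new AttributeValue { S = payload }
            }
        };

        try
        {
            await _db.UpdateItemAsync(request);
        }
        catch (ConditionalCheckFailedException)
        {
            // Another machine wrote first: re-read the item and retry with its new version.
        }
    }
}
```

On a conflict the losing writer simply re-reads the item and retries, so no update silently disappears.
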
The architecture of this data collection application

This cloud app for collecting data employs a microservice architecture, so you can update or repair each part of the system without disrupting the entire operation. At its center sits Mediator, a serverless application that interfaces between the Console (the web interface you use) and the backend services: WBalancer for task distribution, WRegistry for worker management, and CRepository for code deployment. This tool for data collection is integrated with AWS.

Mediator ties the Console to the backend services. When you make a request through the console of this online data gathering tool, the Mediator passes your request to the right service and returns the response. Since it’s serverless, you don’t need to worry about managing any underlying servers.
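
As a rough illustration, the routing step could look like the hypothetical Lambda handler below. The ConsoleRequest shape and the IBackendService abstraction are our assumptions for the sketch, not the actual contract.

```csharp
using System;
using System.Collections.Generic;
using Amazon.Lambda.Core;

// Hypothetical sketch of Mediator's job: one serverless entry point that
// forwards each Console request to the right backend service and returns
// the response. Request shape and service abstraction are illustrative.
public interface IBackendService
{
    string Handle(string action, string payload);
}

public class ConsoleRequest
{
    public string Target { get; set; }   // "WBalancer", "WRegistry" or "CRepository"
    public string Action { get; set; }
    public string Payload { get; set; }
}

public class Mediator
{
    private readonly Dictionary<string, IBackendService> _services;

    public Mediator(Dictionary<string, IBackendService> services) => _services = services;

    public string Handle(ConsoleRequest request, ILambdaContext context)
    {
        context.Logger.LogLine($"Routing '{request.Action}' to {request.Target}");

        if (!_services.TryGetValue(request.Target, out var service))
            throw new ArgumentException($"Unknown target: {request.Target}");

        return service.Handle(request.Action, request.Payload);
    }
}
```
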

WBalancer (Work Balancer) distributes and balances all the tasks happening in the system. It looks at the available virtual machines (called NWorkers) and assigns small tasks (called chunks) to them. Thus, every machine gets just the right amount of work based on its current capacity. If a machine finishes its tasks early or starts to slow down, the WBalancer shifts the tasks around to other machines.
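
The core selection step can be pictured with a simple "most spare capacity wins" rule, as in the illustrative sketch below; the mid-run rebalancing the real WBalancer performs is omitted for brevity.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of WBalancer's assignment idea: give each chunk to
// the NWorker with the most spare capacity. Names and the capacity model
// are assumptions for the example.
public class NWorkerInfo
{
    public string Id { get; set; }
    public int Capacity { get; set; }      // max chunks this machine can run
    public int ActiveChunks { get; set; }  // chunks currently assigned

    public int SpareCapacity => Capacity - ActiveChunks;
}

public static class WBalancer
{
    public static Dictionary<string, List<string>> Assign(
        IEnumerable<string> chunks, List<NWorkerInfo> workers)
    {
        var plan = workers.ToDictionary(w => w.Id, _ => new List<string>());

        foreach (var chunk in chunks)
        {
            // Pick the least-loaded worker that still has room.
            var target = workers
                .Where(w => w.SpareCapacity > 0)
                .OrderByDescending(w => w.SpareCapacity)
                .FirstOrDefault();

            if (target == null) break; // no capacity left; leave the rest queued

            plan[target.Id].Add(chunk);
            target.ActiveChunks++;
        }

        return plan;
    }
}
```
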

The WRegistry (Worker Registry) manages the lifecycle of NWorkers in the system. It automates the start-up, shutdown, and health checks of each NWorker so that it’s always ready to handle tasks. This continuous management helps prevent system interruptions and reduce downtime. The WRegistry guarantees that no matter the demand, our NWorkers are operational and available to tackle any job.
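
In AWS terms, the lifecycle calls WRegistry automates might look like the sketch below, using the AWS SDK for .NET; the health criterion shown (plain EC2 status checks) is a simplification of the registry's real protocol.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.EC2;
using Amazon.EC2.Model;

// Simplified sketch of WRegistry-style lifecycle management for NWorker
// machines: start, stop, and probe EC2 instances. The real health checks
// are assumed to be richer than the bare status probe shown here.
public class WRegistry
{
    private readonly IAmazonEC2 _ec2 = new AmazonEC2Client();

    public Task StartWorkerAsync(string instanceId) =>
        _ec2.StartInstancesAsync(new StartInstancesRequest
        {
            InstanceIds = new List<string> { instanceId }
        });

    public Task StopWorkerAsync(string instanceId) =>
        _ec2.StopInstancesAsync(new StopInstancesRequest
        {
            InstanceIds = new List<string> { instanceId }
        });

    public async Task<bool> IsHealthyAsync(string instanceId)
    {
        var response = await _ec2.DescribeInstanceStatusAsync(new DescribeInstanceStatusRequest
        {
            InstanceIds = new List<string> { instanceId }
        });

        // Treat passing instance and system checks as healthy; anything
        // else makes the NWorker a candidate for restart or replacement.
        return response.InstanceStatuses.Count == 1
            && response.InstanceStatuses[0].Status.Status == SummaryStatus.Ok
            && response.InstanceStatuses[0].SystemStatus.Status == SummaryStatus.Ok;
    }
}
```
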

The CRepository (Code Repository) is a serverless application where specialized software packages, called CRunner images, are built. Each image packages the code for a specific kind of task. The CRepository takes a standard template (the base CRunner image), adds the specialized modules a particular job needs, and combines these elements into a complete, ready-to-use software package (a Docker image). Once a package is built, the CRepository sends it off to the NWorkers, ensuring every task-specific package is correctly put together and ready to go.
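
The build step might be driven from C# roughly as below; the Docker CLI invocation, image tags, paths, and the crunner-base name are placeholders for illustration, not the actual build pipeline.

```csharp
using System;
using System.Diagnostics;

// Illustrative sketch of CRepository's build step: layer task-specific
// modules on top of the base CRunner image and push the finished Docker
// image so NWorkers can pull it. All names here are hypothetical.
public static class CRepository
{
    public static void BuildCRunnerImage(string taskName, string modulesDir)
    {
        // Assumes a Dockerfile in modulesDir that starts from the base
        // template, e.g. "FROM crunner-base:latest", then copies modules in.
        Run("docker", $"build -t crunner-{taskName}:latest {modulesDir}");

        // Publish the ready-to-use package for the NWorkers.
        Run("docker", $"push crunner-{taskName}:latest");
    }

    private static void Run(string file, string args)
    {
        using var process = Process.Start(new ProcessStartInfo(file, args)
        {
            RedirectStandardOutput = true
        });
        process.WaitForExit();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"{file} {args} failed");
    }
}
```
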

NWorker is a C# application that manages smaller workers (called CRunners) inside a virtual machine. It’s in charge of starting, stopping, and checking the health of the CRunners, the programs that actually execute tasks. It makes sure each CRunner is set up correctly and working as it should. The NWorker can manage several CRunners at the same time, all within a single virtual machine (an AWS EC2 instance), allowing the system to process many tasks at once.
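
A stripped-down sketch of what that supervision can look like follows; launching CRunners as Docker containers via the CLI and the restart policy shown are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Simplified sketch of NWorker's supervision loop: run several CRunner
// containers on one machine and restart any that fail. Image names and
// the --chunk argument are illustrative.
public class NWorker
{
    private readonly Dictionary<string, Process> _crunners = new();

    public void StartCRunner(string chunkId, string image)
    {
        var process = Process.Start("docker", $"run --rm {image} --chunk {chunkId}");
        _crunners[chunkId] = process;
    }

    public void Supervise()
    {
        var failed = new List<string>();

        foreach (var (chunkId, process) in _crunners)
        {
            // A CRunner that exited non-zero failed its chunk.
            if (process.HasExited && process.ExitCode != 0)
                failed.Add(chunkId);
        }

        // Start replacements so every chunk still completes.
        foreach (var chunkId in failed)
            StartCRunner(chunkId, "crunner-base:latest");
    }
}
```
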

CRunner (Chunk Runner) is a C# application that executes tasks. Each CRunner is given a specific job (or chunk of work) to do. The NWorker supervises the CRunner to ensure it starts correctly, runs as expected, and finishes the task without issues. For large projects, the work is divided into smaller chunks, and multiple CRunners work on them at the same time, so the system handles big jobs faster by processing many small pieces at once.
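
Since RabbitMQ is part of this stack, one plausible way a CRunner receives its chunk is from a work queue, as in the hedged sketch below (classic RabbitMQ.Client 6.x API); the queue name, message format, and acknowledgment policy are assumptions.

```csharp
using System;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Hypothetical sketch of a CRunner pulling chunks from a RabbitMQ queue.
// Queue name ("chunks") and message format are assumptions.
public static class CRunner
{
    public static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        channel.QueueDeclare("chunks", durable: true, exclusive: false,
                             autoDelete: false, arguments: null);

        // Take one chunk at a time so the distribution of work stays even.
        channel.BasicQos(prefetchSize: 0, prefetchCount: 1, global: false);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (_, delivery) =>
        {
            var chunk = Encoding.UTF8.GetString(delivery.Body.ToArray());
            Console.WriteLine($"Processing chunk: {chunk}");

            // ... scrape the pages described by this chunk ...

            // Acknowledge only after the work is done, so a crashed
            // CRunner's chunk is redelivered to another worker.
            channel.BasicAck(delivery.DeliveryTag, multiple: false);
        };

        channel.BasicConsume("chunks", autoAck: false, consumer);
        Console.ReadLine(); // keep consuming until stopped
    }
}
```
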
Tools & Technologies we used for this data collection software

C#

Docker

RabbitMQ

AWS

Amplify

AppSync

SNS

Lambda

DynamoDB
