Developing data collection software

Quick facts about automated data collection software

Existing tools for collecting data are either too expensive or require specialized skills. Our client needed a solution that would allow their customers to launch data acquisition projects without building complex in-house systems or hiring technical experts.

The goal was to automate the process, reduce costs, and make data accessible to non-technical users. We developed a versatile data gathering tool that handles large-scale information collection with minimal setup.

Challenges when building an app for data collection

Uneven load

Previous scraping setups didn’t make the best use of AWS resources. For example, due to poor organization, virtual machines were unevenly loaded. Some EC2 instances stayed underutilized while others were overworked. This imbalance led to higher costs, especially when operating at scale.

Dynamic content

Sites change formats, load content dynamically, and mix structured with unstructured data. That's a problem unless your system can adapt on the fly. We developed a data collection program that processes dynamic content and identifies patterns.

High load

Another issue was the high database load when saving results. When too many machines tried to write data simultaneously, their updates to the same items conflicted. We used optimistic locking: each write checks the item's version first, so parallel writes don't overwrite each other.
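The optimistic locking idea can be sketched in a few lines. This is a minimal in-memory illustration in Python, not the production C# code: the names VersionedStore, VersionConflict, and update_with_retry are hypothetical, and the internal lock only stands in for the atomic conditional write that a real database would provide.

```python
import threading


class VersionConflict(Exception):
    """Raised when the stored version differs from the one the writer read."""


class VersionedStore:
    """Hypothetical in-memory store; each item carries a version number."""

    def __init__(self):
        self._items = {}  # key -> (version, value)
        # Stands in for the database's atomic conditional write.
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._items.get(key, (0, None))

    def write(self, key, value, expected_version):
        with self._lock:
            current_version, _ = self._items.get(key, (0, None))
            if current_version != expected_version:
                # Someone else wrote first: reject instead of overwriting.
                raise VersionConflict(key)
            self._items[key] = (current_version + 1, value)
            return current_version + 1


def update_with_retry(store, key, update, max_attempts=5):
    """Re-read and retry when another writer got there first."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, update(value), version)
        except VersionConflict:
            continue
    raise VersionConflict(key)
```

A writer holding a stale version gets a conflict instead of silently replacing newer data, and the retry loop applies its change on top of the latest value.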

The architecture of this data collection application

Components of online data collection tool

This cloud app for collecting data employs a microservice architecture, so you can update or repair each part of the system without disrupting the entire operation. At its center is Mediator, a serverless application that sits between the Console (the web interface you use) and the backend services: WBalancer for task distribution, WRegistry for worker management, and CRepository for code deployment. This tool for data collection is integrated with AWS.

Mediator in data gathering application

Mediator ties the Console to the backend services. When you make a request through the console of this online data gathering tool, Mediator passes it to the right service and returns the response. Since it's serverless, you don't need to manage any underlying servers.

How to balance load using tool for data gathering

WBalancer (Work Balancer) distributes and balances all the tasks happening in the system. It looks at the available virtual machines (called NWorkers) and assigns small tasks (called chunks) to them. Thus, every machine gets just the right amount of work based on its current capacity. If a machine finishes its tasks early or starts to slow down, the WBalancer shifts the tasks around to other machines.
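As an illustration of capacity-based assignment, here is a minimal Python sketch (the real system is written in C#). It assumes each NWorker reports a number of free slots and always hands the next chunk to the worker with the most free capacity. The function name assign_chunks and the idea of "free slots" are assumptions made for the example, not WBalancer's actual interface.

```python
import heapq
from collections import deque


def assign_chunks(chunks, free_slots):
    """Give each chunk to the worker with the most free slots.

    free_slots maps a worker name to how many more chunks it can take.
    Returns (assignments, unassigned), where unassigned holds the chunks
    that did not fit anywhere and must wait for capacity.
    """
    # heapq is a min-heap, so store negative capacity to pop the largest.
    heap = [(-free, name) for name, free in free_slots.items() if free > 0]
    heapq.heapify(heap)

    assignments = {name: [] for name in free_slots}
    pending = deque(chunks)

    while pending and heap:
        neg_free, name = heapq.heappop(heap)
        assignments[name].append(pending.popleft())
        if neg_free + 1 < 0:  # worker still has room after this chunk
            heapq.heappush(heap, (neg_free + 1, name))

    return assignments, list(pending)
```

Rerunning the function with fresh capacity numbers is one simple way to model rebalancing when a machine finishes early or slows down.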

Intsurfing built automatic data collection software

The WRegistry (Worker Registry) manages the lifecycle of NWorkers in the system. It automates the start-up, shutdown, and health checks of each NWorker so that it's always ready to handle tasks. This continuous management helps prevent system interruptions and reduce downtime. Whatever the demand, the WRegistry keeps NWorkers operational and available for the next job.
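One common way to implement health checks like these is a heartbeat timeout. The Python sketch below assumes that approach; the source does not say how WRegistry actually checks NWorkers, and the class name WorkerRegistry, the timeout value, and the injectable clock are all hypothetical.

```python
import time


class WorkerRegistry:
    """Hypothetical heartbeat tracker: a worker is healthy if seen recently."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}  # worker id -> time of last heartbeat

    def heartbeat(self, worker_id):
        """Record that a worker has just reported in."""
        self._last_seen[worker_id] = self._clock()

    def healthy_workers(self):
        """Workers whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(
            w for w, seen in self._last_seen.items()
            if now - seen <= self._timeout
        )

    def stale_workers(self):
        """Workers that missed the timeout and may need a restart."""
        now = self._clock()
        return sorted(
            w for w, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A registry built this way would restart or replace whatever stale_workers returns and offer only healthy_workers to the balancer.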

Software for data collection with code repository

The CRepository (Code Repository) is a serverless application where specialized software packages, called CRunner images, are built. These images perform certain tasks. The CRepository takes a standard template (base CRunner image) and adds specialized modules for a particular job. It combines these elements to create a complete, ready-to-use software package (Docker image). Once the package is built, the CRepository sends it off to the NWorkers. It ensures every task-specific software package is correctly put together and ready to go.
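The "base image plus modules" composition can be pictured as generating a Dockerfile. The Python sketch below is only an illustration under assumptions: the function name render_dockerfile, the image tag, and the modules/ directory layout are invented for the example and are not taken from the actual CRepository.

```python
def render_dockerfile(base_image, modules):
    """Build Dockerfile text: start from the base image, then copy in
    each task-specific module (paths are hypothetical)."""
    if not base_image:
        raise ValueError("base_image is required")
    lines = [f"FROM {base_image}"]
    for module in modules:
        lines.append(f"COPY modules/{module}/ /app/modules/{module}/")
    return "\n".join(lines) + "\n"


# Example with made-up names: one base image and two modules.
print(render_dockerfile("crunner-base:latest", ["parser", "exporter"]))
```

The resulting text would then go through an ordinary Docker build to produce the image that is sent to the NWorkers.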

Digital data collection tool for task execution

NWorker is a C# application that manages smaller workers (called CRunners) inside a virtual machine. It's in charge of starting, stopping, and checking the health of the CRunners, which are the programs that do tasks. It makes sure each CRunner is set up correctly and working as it should. The NWorker can manage several CRunners at the same time, all within a single virtual machine (AWS EC2 instance), allowing the system to process many tasks at once.

Batch processing with web data collection tool

CRunner (Chunk Runner) is a C# application that executes tasks. Each CRunner is given a specific job (or chunk of work) to do. The NWorker supervises the CRunner to ensure it starts and runs correctly, and finishes the task without issues. For large projects, the work is divided into smaller chunks, and multiple CRunners work on these tasks at the same time. This allows the system to handle big jobs faster by processing many small pieces at once.
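The chunking idea can be shown with a short Python sketch (the actual CRunners are C# applications running in containers). Here a thread pool stands in for several CRunners working in parallel, and run_chunk is a placeholder for the real extraction work; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Divide a large job into lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_chunk(chunk):
    """Placeholder for real work: here it just tags each item as done."""
    return [f"done:{item}" for item in chunk]


def process_job(items, chunk_size, max_runners):
    """Process all chunks in parallel and return results in input order."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=max_runners) as pool:
        # map() yields results in the same order as the chunks.
        chunk_results = list(pool.map(run_chunk, chunks))
    return [result for chunk in chunk_results for result in chunk]
```

Smaller chunks spread work more evenly across runners, at the cost of more scheduling overhead per item.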

Tools & Technologies we used for software to collect data

C#

Docker

RabbitMQ

AWS

Amplify

AppSync

SNS

Lambda

DynamoDB

Systems Manager

The results: automated data collection tool

We've built a scalable, microservices-based tool for data collection. It equips you with everything you need to automate your data-gathering processes. Here's a closer look at its core features:

  • C# library. Thanks to code snippets, even junior developers can create bots for data extraction.
  • Resource management. The system ensures each virtual machine operates at its peak efficiency as it spreads the workload across your machines.
  • Batch website processing. This business data collection tool allows you to gather every piece of available data from any number of sources.
  • Services. The integrated proxy network lets you use multiple IP addresses.
  • Project management. Keep tabs on everything with detailed statistics and billing modules.
Easy to use standardized tool for data collection

Make big data work for you

Reach out to us today. We'll review your requirements, provide a tailored solution and quote, and start your project once you agree.
