Optimizing address parsing for high-volume data

Efficient architecture for U.S. address processing

The client's existing system parsed addresses from the Address Information System (AIS) and Topologically Integrated Geographic Encoding and Referencing (TIGER) datasets. It was built on MS SQL Server and SSIS and relied on a combination of stored procedures, C# assemblies with regular expressions, and functions for parsing, verification, and cleaning.

While this setup worked, it struggled with high traffic and large datasets. Our team introduced an optimized system architecture, BAIST: we migrated functions to a Redis-based setup, optimized the algorithms, implemented caching, and set up deduplication mechanisms. These improvements resulted in a 50% increase in processing speed while minimizing the traffic load.

Key challenges in the project

Slow performance

The system started slowing down when processing more than a million records. MS SQL Server and SSIS were unable to scale with this volume, and the complexity of the stored procedures exacerbated the issue. The delays were unacceptable for the client, who needed rapid processing for business operations. To resolve this, we had to rethink the architecture entirely.

Hard to control

Controlling the data flow in the old system was another major hurdle. As requests were processed, the system took time to respond, and the delays were inconsistent, so it was impossible to predict when the data would actually be delivered. Moreover, we lacked the tools to take corrective action when delays occurred.

Updates

One of the client’s main requests was to receive fresh data every month. Around 80-90% of the new data turned out to be duplicates of what was already stored. Loading this old data over and over again would have led to unnecessary delays and wasted storage space. Our goal was to implement a solution that could differentiate between new and existing records and allow only fresh data to be processed.

Incorrect input

The system’s strict matching algorithms couldn’t accommodate minor variations, which left a large number of valid addresses unmatched. For example, a single-letter mistake in a street or city name would make the system return a negative result. To address this, we focused on improving the algorithms so they could still yield correct results despite slight input errors.

Different formats

Puerto Rican addresses follow a slightly different structure from the standard US mainland format. The system treated all addresses the same, which resulted in misinterpretations and data mismatches for Puerto Rican addresses and caused a significant number of errors in the data. To overcome this, we had to design a separate algorithm adapted to these regional differences.

Tech limitations

The combination of MS SQL Server, SSIS, and custom stored procedures wasn’t designed to handle the rapid growth in data volume. As the system expanded, it became increasingly complex to manage and maintain. It was clear that the current technologies had reached their limits, and we needed a more flexible, scalable solution that could handle the client’s evolving needs.

Face similar problems in your project?

Contact us to discuss how we can help you overcome bottlenecks and achieve peak system performance.

Steps for delivering an optimized address processing system

Analyzing the outdated address parsing system

We conducted a detailed performance analysis using JetBrains DotTrace. This allowed us to measure the speed of each function and identify where the system was being slowed down. Our primary focus was on MSSQL queries and the address parsing algorithms, which were responsible for the largest performance bottlenecks.

How Intsurfing revamped the address parsing system

To enhance performance, we transitioned the data from MSSQL, where it was initially stored in JSON format, to Redis. This enabled faster read/write operations and better handling of large datasets. Additionally, we split the data across multiple Redis instances, which helped optimize the address parsing and geocoding processes. This distributed architecture ensured that the system could process data much faster.
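
For illustration, here is a minimal C# sketch of one way address records could be serialized as JSON and spread across several Redis instances by a deterministic key hash. The connection strings, key scheme, and AddressRecord type are assumptions made for this example, not the actual BAIST code.

```csharp
// Illustrative sketch only: address records stored as JSON and sharded
// across several Redis instances by a deterministic hash of the record key.
// Connection strings, the key scheme, and AddressRecord are assumptions.
using System;
using System.Text.Json;
using StackExchange.Redis;

public sealed record AddressRecord(string Street, string City, string State, string Zip);

public class ShardedAddressStore
{
    private readonly IDatabase[] _shards;

    public ShardedAddressStore(params string[] connectionStrings)
    {
        // One connection per Redis instance; each exposes a database handle.
        _shards = Array.ConvertAll(connectionStrings,
            cs => ConnectionMultiplexer.Connect(cs).GetDatabase());
    }

    // FNV-1a: a process-independent hash, so the same key always maps to the same shard.
    private static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return hash;
    }

    private IDatabase ShardFor(string key) =>
        _shards[(int)(Fnv1a(key) % (uint)_shards.Length)];

    public void Save(string key, AddressRecord record) =>
        ShardFor(key).StringSet(key, JsonSerializer.Serialize(record));

    public AddressRecord? Load(string key)
    {
        RedisValue value = ShardFor(key).StringGet(key);
        if (!value.HasValue) return null;
        return JsonSerializer.Deserialize<AddressRecord>(value.ToString());
    }
}
```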

Reworked algorithms for parsing addresses | Intsurfing

We implemented pre-compiled MSSQL queries that interacted directly with Redis, which sped up the data flow between the parsing algorithms and the database. Intsurfing developers created six algorithms (including one specifically for Puerto Rico) that enabled the system to find the correct address even with minor input errors. Moreover, we broke ZIP+4 records down into two parts: the first was used for the initial parsing, and the second for building the complete address result.
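
As a rough illustration of error-tolerant matching and the ZIP+4 split, the sketch below uses a classic Levenshtein edit distance to accept street names within one character of a known entry, and splits a ZIP+4 code into its 5-digit and 4-digit parts. The helper names are hypothetical; the six production algorithms are not reproduced here.

```csharp
// Illustrative sketch only: edit-distance lookup and ZIP+4 splitting.
// These helpers are assumptions, not the actual BAIST algorithms.
using System;
using System.Collections.Generic;
using System.Linq;

public static class TolerantLookup
{
    // Classic Levenshtein edit distance between two strings.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        return d[a.Length, b.Length];
    }

    // Return the closest known street name within one edit, or null if none.
    public static string? FindStreet(string input, IEnumerable<string> knownStreets)
    {
        string normalized = input.Trim().ToUpperInvariant();
        return knownStreets
            .Select(s => (Street: s, Distance: EditDistance(normalized, s.ToUpperInvariant())))
            .Where(x => x.Distance <= 1)   // tolerate a single-character mistake
            .OrderBy(x => x.Distance)
            .Select(x => x.Street)
            .FirstOrDefault();
    }

    // Split "12345-6789" into the 5-digit ZIP (used for initial parsing)
    // and the +4 add-on (used to build the full address result).
    public static (string Zip5, string? Plus4) SplitZip(string zip)
    {
        var parts = zip.Split('-', 2);
        return (parts[0], parts.Length > 1 ? parts[1] : null);
    }
}
```

With this kind of check, an input such as "MAIN STRET" would still resolve to "MAIN STREET" when the latter is present in the street directory.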

Caching & dedupe solution for BAIST | Intsurfing

To minimize redundant operations, we introduced an internal caching system that allowed us to store and quickly access recently processed records. Thanks to this, the system's speed improved, especially during batch processing, where similar data was frequently encountered. For scaling and data deduplication, we used Hadoop. Apache Solr was employed for indexing, which improved the searchability and retrieval of data throughout the system.
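
As a minimal sketch of what such an internal cache for recently processed records could look like (assuming Microsoft.Extensions.Caching.Memory and the AddressRecord shape from the earlier sketch), the example below returns a cached parse for a recently seen raw address and falls back to the supplied parser otherwise. The real BAIST cache, its eviction policy, and the Hadoop-based deduplication pipeline are not shown.

```csharp
// Minimal sketch of an internal cache for recently processed records.
// MemoryCache usage here is an assumption; BAIST's actual cache is not shown.
using System;
using Microsoft.Extensions.Caching.Memory;

// Same shape as the AddressRecord in the earlier sketch.
public sealed record AddressRecord(string Street, string City, string State, string Zip);

public class ParsedAddressCache
{
    private readonly MemoryCache _cache = new(new MemoryCacheOptions
    {
        SizeLimit = 100_000 // cap the number of cached entries
    });

    // Return a cached parse if the same raw address was seen recently;
    // otherwise parse it and keep the result for subsequent batches.
    public AddressRecord GetOrParse(string rawAddress, Func<string, AddressRecord> parse)
    {
        if (_cache.TryGetValue(rawAddress, out AddressRecord? cached) && cached is not null)
            return cached;

        AddressRecord parsed = parse(rawAddress);
        _cache.Set(rawAddress, parsed, new MemoryCacheEntryOptions
        {
            Size = 1,                                    // each record counts as one unit
            SlidingExpiration = TimeSpan.FromMinutes(30) // drop entries not reused recently
        });
        return parsed;
    }
}
```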

Maintenance and monitoring for BAIST | Intsurfing

Throughout the project, we continuously tested the system using JetBrains DotTrace to ensure each component was performing at optimal levels. We fine-tuned the algorithms and caching mechanisms based on these tests, ensuring that the final solution met the speed, scalability, and performance goals we had set.

Technologies we used in the project

Redis

MS SQL

C#

Hadoop

Apache Solr

JetBrains DotTrace

The results: enhanced speed and scalability

We built a highly structured, scalable address directory using data from AIS and TIGER. The new system enables the client to automatically enrich incomplete addresses with missing elements (city, zip code, or state) and provides a robust validation mechanism to ensure all address components are accurate.
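
As a simplified illustration of that enrichment step, the sketch below fills in a missing city or state from a ZIP-code directory lookup. The directory source and the EnrichedAddress type are assumptions made for this example, not the client's actual directory.

```csharp
// Illustrative sketch only: enrich an incomplete address from a ZIP directory.
// The zipDirectory contents and EnrichedAddress shape are assumptions.
using System.Collections.Generic;

public sealed record EnrichedAddress(string Street, string? City, string? State, string Zip);

public static class AddressEnricher
{
    public static EnrichedAddress Enrich(
        string street, string? city, string? state, string zip,
        IReadOnlyDictionary<string, (string City, string State)> zipDirectory)
    {
        // Fill in whatever is missing from the structured ZIP directory.
        if ((city is null || state is null) && zipDirectory.TryGetValue(zip, out var known))
        {
            city ??= known.City;
            state ??= known.State;
        }
        return new EnrichedAddress(street, city, state, zip);
    }
}
```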

  • 2x faster data processing
  • 40% reduction in processing traffic through structured directories
  • Stable and easily scalable system that handles over 1 million records
  • Optimized batch processing with caching
  • Data deduplication for faster and resource-efficient new data loading
  • New algorithms for detecting addresses even with errors

Example of incorrect input data and BAIST results

Make big data work for you

Reach out to us today. We'll review your requirements, provide a tailored solution and quote, and start your project once you agree.

Contact us

Complete the form with your personal and project details, so we can get back to you with a personalized solution.