Efficient architecture for U.S. address processing
The client's existing system for address parsing from Address Information System (AIS) and Topologically Integrated Geographic Encoding and Referencing (TIGER) was built on MS SQL Server and SSIS. It used a combination of stored procedures, C# assemblies with regular expressions, and functions for parsing, verification, and cleaning.
While this setup worked, it struggled with high traffic and large datasets. Our team introduced an optimized system architecture, BAIST: we migrated functions to a Redis-based setup, optimized the algorithms, implemented caching, and set up deduplication mechanisms. These improvements resulted in a 50% increase in processing speed while minimizing the traffic load.
Key challenges in the project
Slow performance
The system started slowing down when processing more than a million records. MS SQL Server and SSIS were unable to scale with this volume, and the complexity of the stored procedures exacerbated the issue. The delays were unacceptable for the client, who needed rapid processing for business operations. To resolve this, we had to rethink the architecture entirely.
Hard to control
Controlling the data flow in the old system was another major hurdle. As requests were processed, it took time for the system to respond, and the delays were inconsistent, so it was impossible to predict when the data would actually be delivered. On top of that, we lacked the tools to take corrective action when delays occurred.
Updates
One of the client’s main requests was to receive fresh data every month. Around 80-90% of the new data turned out to be duplicates of what was already stored. Loading this old data over and over again would have led to unnecessary delays and wasted storage space. Our goal was to implement a solution that could differentiate between new and existing records and allow only fresh data to be processed.
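The deduplication idea can be sketched with a content fingerprint: hash the normalized fields of each record and skip anything whose fingerprint has already been stored. This is an illustrative Python sketch, not the client's C# implementation; field names and the normalization rule are assumptions.

```python
import hashlib

def record_fingerprint(record: dict) -> str:
    """Build a stable hash over the normalized, sorted record fields."""
    canonical = "|".join(str(record.get(k, "")).strip().upper()
                         for k in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def filter_new_records(batch, seen_fingerprints):
    """Return only records whose fingerprint has not been seen before."""
    fresh = []
    for record in batch:
        fp = record_fingerprint(record)
        if fp not in seen_fingerprints:
            seen_fingerprints.add(fp)
            fresh.append(record)
    return fresh
```

Because the fingerprint is computed over normalized fields, records that differ only in casing or stray whitespace collapse into one entry, which is exactly the 80-90% of monthly data that should never be reloaded.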
Incorrect input
The system’s strict matching algorithms couldn’t accommodate minor variations, which resulted in a large number of valid addresses going unfound. For example, a single-letter mistake in a street or city name would make the system return a negative result. To address this, we focused on improving the system’s algorithms so they could still yield correct results even with slight input errors.
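One standard way to tolerate single-letter mistakes is edit-distance matching: accept a candidate if it is within a small typo budget of a known name. The sketch below is a generic Levenshtein-based lookup, not the project's actual algorithm; the typo budget of 1 is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def find_street(query: str, known_streets, max_typos: int = 1):
    """Return the closest known street within the typo budget, else None."""
    q = query.strip().upper()
    best = min(known_streets, key=lambda s: edit_distance(q, s.upper()))
    return best if edit_distance(q, best.upper()) <= max_typos else None
```

With this approach, "Main Streat" still resolves to "MAIN STREET", while genuinely unknown input is rejected instead of silently matched.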
Different formats
Unlike the standard US mainland format, Puerto Rican addresses follow a slightly different structure. The system treated all addresses the same, which resulted in misinterpretations and data mismatches when it came to Puerto Rican addresses. This caused a significant number of errors in the data. To overcome this, we had to design a separate algorithm to adapt to these regional differences.
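Puerto Rican addresses commonly carry an urbanización ("URB") component that the mainland format has no slot for, and Puerto Rico ZIP codes fall in the 006xx-009xx range. A minimal sketch of routing addresses to a regional parsing branch might look like this; the detection heuristics here are illustrative assumptions, not the project's actual rules.

```python
import re

def route_parser(address: str) -> str:
    """Pick the regional parsing branch based on address features."""
    text = address.upper()
    # An urbanización marker or a Puerto Rico ZIP prefix hints at the PR format.
    if re.search(r"\bURB\b", text) or re.search(r"\bPR\s+00[6-9]\d{2}\b", text):
        return "puerto_rico"
    return "mainland"
```

Splitting the pipeline this early keeps the mainland parser simple while the dedicated branch can handle the extra urbanización element.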
Tech limitations
The combination of MS SQL Server, SSIS, and custom stored procedures wasn’t designed to handle the rapid growth in data volume. As the system expanded, it became increasingly complex to manage and maintain. It was clear that the current technologies had reached their limits, and we needed a more flexible, scalable solution that could handle the client’s evolving needs.
Steps for delivering an optimized address processing system
We conducted a detailed performance analysis using JetBrains DotTrace. This allowed us to measure the speed of each function and identify where the system was being slowed down. Our primary focus was on MSSQL queries and the address parsing algorithms, which were responsible for the largest performance bottlenecks.
To enhance performance, we transitioned the data from MSSQL, where it was initially stored in JSON format, to Redis. This enabled faster read/write operations and better handling of large datasets. Additionally, we split the data across multiple Redis instances, which helped optimize the address parsing and geocoding processes. This distributed architecture ensured that the system could process data much faster.
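Splitting data across Redis instances requires a deterministic shard function so that writes and later lookups agree on where a record lives. Below is an illustrative sketch of hash-based sharding by ZIP code; the endpoints, key layout, and shard count are hypothetical, not the project's actual topology.

```python
import hashlib

# Hypothetical shard endpoints; the real deployment would configure these.
REDIS_SHARDS = ["redis://10.0.0.1:6379",
                "redis://10.0.0.2:6379",
                "redis://10.0.0.3:6379"]

def shard_for(zip_code: str, shard_count: int = len(REDIS_SHARDS)) -> int:
    """Map a ZIP code to a shard index deterministically."""
    digest = hashlib.md5(zip_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shard_count

def record_key(zip_code: str, street: str) -> str:
    """Prefix keys by ZIP so related addresses stay co-located on one shard."""
    return f"addr:{zip_code}:{street.upper().replace(' ', '_')}"
```

Keying by ZIP means a batch of addresses from one area mostly hits a single shard, which keeps lookups local and lets the shards scale independently.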
We implemented pre-compiled MSSQL queries that interacted directly with Redis, which sped up the data flow between the parsing algorithms and the database. Intsurfing developers built six matching algorithms (including one specifically for Puerto Rico) that enabled the system to find the correct address even with minor input errors. Moreover, we broke ZIP+4 records into two parts: the first was used for the initial parsing, and the second for building the complete address result.
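The ZIP+4 split described above can be illustrated with a small helper that separates the 5-digit base (used for the initial parse) from the 4-digit add-on (used when assembling the full result). This is a Python sketch of the idea, not the project's actual code.

```python
def split_zip4(zip4: str):
    """Split a ZIP+4 code into its 5-digit base and optional 4-digit add-on."""
    code = zip4.strip()
    if "-" in code:
        base, _, plus4 = code.partition("-")
    elif len(code) == 9 and code.isdigit():
        base, plus4 = code[:5], code[5:]   # hyphen-less ZIP+4 form
    else:
        base, plus4 = code, ""
    if len(base) != 5 or not base.isdigit():
        raise ValueError(f"malformed ZIP: {zip4!r}")
    return base, plus4 or None
```

Parsing against only the 5-digit base narrows the search space quickly; the add-on is attached afterwards to produce the precise delivery-point result.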
To minimize redundant operations, we introduced an internal caching system that stored recently processed records for quick access. Thanks to this, the system's speed improved, especially during batch processing, where similar data was frequently encountered. For scaling and data deduplication, we used Hadoop. Apache Solr was employed for indexing, which improved the searchability and retrieval of data throughout the system.
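An internal cache of recently processed records is typically an LRU (least-recently-used) structure: repeated addresses within a batch hit the cache and skip the full parse/geocode path. The sketch below is an illustrative Python LRU cache; the capacity and key scheme are assumptions.

```python
from collections import OrderedDict

class RecentRecordCache:
    """Small LRU cache so repeated addresses in a batch skip reprocessing."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._items: "OrderedDict[str, dict]" = OrderedDict()

    def get(self, key: str):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: dict) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used
```

Because batches often contain near-identical addresses back to back, even a modest capacity gives a high hit rate during batch processing.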
Throughout the project, we continuously tested the system using JetBrains DotTrace to ensure each component was performing at optimal levels. We fine-tuned the algorithms and caching mechanisms based on these tests, ensuring that the final solution met the speed, scalability, and performance goals we had set.
Technologies we used in the project
Redis
MS SQL
C#
Hadoop
Apache Solr
JetBrains DotTrace
The results: enhanced speed and scalability
We built a highly structured, scalable address directory using data from AIS and TIGER. The new system enables the client to automatically enrich incomplete addresses with missing elements (city, zip code, or state) and provides a robust validation mechanism to ensure all address components are accurate.
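Enriching an incomplete address from a structured directory can be sketched as a lookup-and-fill step: take the fields the directory knows for a ZIP code and supply any that the input is missing. The directory contents and field names below are hypothetical placeholders, not the client's data.

```python
# Hypothetical directory entry; the real directory is built from AIS/TIGER.
ZIP_DIRECTORY = {"00926": {"city": "SAN JUAN", "state": "PR"}}

def enrich(address: dict) -> dict:
    """Fill in missing city/state from the ZIP directory when available."""
    info = ZIP_DIRECTORY.get(address.get("zip", ""))
    if info:
        for field in ("city", "state"):
            address.setdefault(field, info[field])
    return address
```

Note that `setdefault` only fills gaps: fields the caller already supplied are left untouched, so enrichment never overwrites user input.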
- 2x faster data processing
- 40% reduction in processing traffic through structured directories
- Stable and easily scalable system that handles over 1 million records
- Optimized batch processing with caching
- Data deduplication for faster and resource-efficient new data loading
- New algorithms for detecting addresses even with errors
Make big data work for you
Reach out to us today. We'll review your requirements, provide a tailored solution and quote, and start your project once you agree.
Contact us
Complete the form with your personal and project details, so we can get back to you with a personalized solution.