A Story About Liberating Public Data

For the past few years we have been working with the CT Secretary of the State (SOTS) on a variety of initiatives related to the business registration data that they process and manage. This dataset is the official dataset of record for tracking businesses that are required to register to operate within Connecticut. However, accessing this data has historically been difficult. The CONCORD system was built and optimized for the important work of managing transactions around business registration and reporting. It was not designed as a system to be searched or mined for data. The data is also old, with records going back centuries, which presents its own challenges. When we started on this project, we faced a significant initial obstacle: we were given raw data on a CD-ROM and a "data dictionary" that consisted of a photocopied database schema.

Variable names were not particularly readable, and while some of the relationships could be inferred from the table names, there was a lot of information that we were missing. Another technical challenge was that most of the values were stored as character strings. This meant that validation of incoming data took place in the application that staff used to enter it, not in the database itself. We couldn't rely on the database to have caught "mistakes" that made it through the business systems, so many input errors, such as spelling mistakes or typos, remained in the data.
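To make the string problem concrete, here is a minimal sketch of the kind of downstream coercion this forced on us; the field names and date format are hypothetical illustrations, not the actual CONCORD schema.

```python
from datetime import datetime

# Hypothetical raw row: every value arrives as a character string,
# including dates, and nothing upstream guarantees it parses.
raw_row = {"bus_id": " 0012345", "date_formed": "19870231"}

def parse_date(value):
    """Return a date if the string parses, else None. An impossible
    date like 19870231 (February 31st) fails here, in our code,
    because the source column was just text."""
    try:
        return datetime.strptime(value.strip(), "%Y%m%d").date()
    except ValueError:
        return None

cleaned = {
    "bus_id": raw_row["bus_id"].strip(),
    "date_formed": parse_date(raw_row["date_formed"]),  # -> None
}
```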

For example, we had to determine a solution for the many instances where a business has both a "primary" address and a mailing address. If you've ever filled out a form online where your address is required, you've probably encountered a check box that gives you the option to use your primary address as your mailing address. Checking that box usually results in all of your address data being copied, or sometimes the system will grey out the form fields and keep you from editing them. Either way, something captures the fact that both addresses are the same, most likely by saving the same data in both sets of fields in the database. This didn't seem to be the case with this data. We encountered situations with a primary address and no mailing address, which was easy enough to reconcile (the assumption is that they are the same).

But we also encountered the opposite, where a mailing address was present but no primary address. This poses a challenge. Does this mean that the business does not have a physical location? Is it a business on paper only? Or was there a data entry error resulting in the wrong set of address fields being updated? If the registering agent filled out a paper form, did they leave the primary address blank? What should the staff entering the data do in that case: copy the mailing address to the primary, or enter the data as submitted? These were hard questions to get answers to because the issues weren't limited to a particular time range. We saw address problems going back many decades, and business practices and systems have obviously changed over the years. Taken together, these inconsistencies make seemingly trivial questions, such as how many businesses were formed in town X in time period Y, hard to answer.

Given these issues, the first step in bringing this data into a modern, flexible search context was to build out the data dictionary and develop an understanding of the values that were present in the data. We conducted informational interviews with staff and contractors, explored the range of values in the data, and, bit by bit, built a model of how data flowed, what it meant, and how it was structured in the context of the regulations and procedures that business owners go through when submitting information to the Secretary of the State. Once we were confident that our model was accurate and complete enough for our purposes, we set about building the search system.

We had a data dictionary and a data schema, but the data we received was still in the form of flat text files. To get this data back into a searchable database, we built an ETL (extract, transform, and load) pipeline that would iterate through the rows of each data table, check that the values were valid according to our schema, and then load them into a database with more stringent data type requirements. We built this pipeline using Python and Postgres, an open source database that is widely trusted in the technology community. We chose Postgres because it supports natural language search out of the box with minimal pain.
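In outline, the pipeline looked something like the sketch below; the table, column, and file names are placeholders, and psycopg2 stands in as one common Postgres driver.

```python
import csv
import psycopg2

def validate(row):
    # Placeholder for the field-by-field coercion described above;
    # return a cleaned dict, or None if any value fails our schema.
    cleaned = {k: v.strip() for k, v in row.items()}
    return cleaned if cleaned.get("bus_id") else None

def load_table(conn, path):
    """Iterate the rows of one flat text export, validate each value,
    and load the survivors into Postgres, whose stricter column types
    (dates, integers) catch anything our checks miss."""
    with conn, conn.cursor() as cur, open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="|"):
            cleaned = validate(row)
            if cleaned is None:
                continue  # in practice, rejects were logged for review
            cur.execute(
                "INSERT INTO business (bus_id, name, date_formed) "
                "VALUES (%(bus_id)s, %(name)s, %(date_formed)s)",
                cleaned,
            )

load_table(psycopg2.connect("dbname=sots"), "business.txt")
```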

The search interface offered by the CONCORD system is quite fast, but it suffers from a few significant limitations: there is limited support for wildcard searches (e.g. a search for pizza will not find pizzeria), and the search is very character- and word-sensitive. For example, say you wanted to search for businesses that contain the term "New Haven" in their name. This probably indicates something about where they are located, but it might also be used as a name by housing developers or in a variety of other circumstances. With CONCORD, you'd get roughly 26 pages of results, all for businesses starting with "New Haven". You won't find the Advanced Nursing & Rehabilitation Center of New Haven, LLC, Peoples Church of New Haven, League of Women Voters of New Haven, or any number of other businesses whose name features but does not start with "New Haven". Moreover, if you want to search for businesses whose address is in New Haven, you cannot. These were the two problems that we wanted to resolve.
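In SQL terms, the gap looks roughly like this; the table and column names are from our rebuilt database, and the first query is only our reading of CONCORD's observed prefix-only behavior.

```python
# What a prefix-anchored search can find: only names that *start*
# with the term, so "Peoples Church of New Haven" never matches.
prefix_sql = "SELECT name FROM business WHERE name LIKE 'NEW HAVEN%'"

# What we wanted: matches anywhere in the name, plus the ability
# to filter on address fields at all.
anywhere_sql = """
    SELECT name FROM business
    WHERE name ILIKE '%New Haven%' OR city = 'NEW HAVEN'
"""
```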

Since we were using Postgres, we were able to leverage a number of "full text" search features with minimal additional work, most importantly word stemming and fuzzy matching to handle misspellings. Word stemming is the process of reducing words to common roots; for example, politics and political can both be reduced to 'polit'. There are other full text search solutions, but the ability to handle search in the same database that supports the rest of the application greatly simplifies the work required.
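Both features are available in stock Postgres, as in this minimal sketch; the connection string is a placeholder, while to_tsvector and the pg_trgm extension are standard.

```python
import psycopg2

conn = psycopg2.connect("dbname=sots")  # placeholder connection
with conn, conn.cursor() as cur:
    # Stemming: 'politics' and 'political' both reduce to 'polit',
    # so a query for one matches documents containing the other.
    cur.execute("SELECT to_tsvector('english', 'politics political')")
    print(cur.fetchone()[0])  # 'polit':1,2

    # Fuzzy matching via the pg_trgm extension: trigram similarity
    # returns a 0-1 score, and close misspellings score high.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
    cur.execute("SELECT similarity('New Havn', 'New Haven')")
    print(cur.fetchone()[0])  # well above the typical ~0.3 match cutoff
```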

The Secretary of the State also wanted to support searching by place and by date of formation. We handled this by building a search index table that would allow us to very quickly find matches and return enough information for users to decide whether they wanted to explore a given business in more detail. We went back to our ETL process and added a step that would extract the relevant fields, pre-process them, and add them to the search index. Addresses added a wrinkle, given the issues discussed above. We decided to handle this by first looking for a business address and, if it was missing, shifting to the mailing address.
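The fallback rule itself reduces to a couple of lines in the indexing step; the field names here are illustrative, not our actual column names.

```python
def index_row(record):
    """Build one search-index row from a business record.
    Prefer the business (primary) address; fall back to the
    mailing address when the primary is missing."""
    return {
        "bus_id": record["bus_id"],
        "name": record["name"].strip().upper(),  # normalized for search
        "city": record.get("business_city") or record.get("mailing_city"),
        "date_formed": record.get("date_formed"),
    }
```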

In the course of doing this work, we've worked with folks from other state agencies. Given the importance of business activity, a number of state agencies place high value on the business registration data. However, the structure of the live database limited how data could be extracted and worked with. Our workflow enabled us to support more free-form exploration of the data on their behalf and led to a number of additional projects where we linked this data with other internal datasets. Recently we worked with an agency to develop a methodology to link businesses registered with SOTS with internal and third party data.

Linking business names is hard to do manually. In some cases it is possible to handle lookups on an ad hoc basis, but bulk work is very time consuming. Moreover, it is quite common for businesses, particularly small ones, to use slight variations of their names in different contexts. They might be required to add something to their name when registering to avoid name conflicts, but in practice they may advertise themselves under a simpler name. Businesses can also formally change their name with the Secretary of the State, but those changes are almost certainly not reflected in other datasets and databases. Larger companies are often structured in complex ways for legal purposes: we may attach one name to a company when discussing it in policy terms, but from a legal perspective that business might exist as a group of distinct entities. All of these issues make matching lists of businesses challenging, so much so that there is a technical term for the problem: entity resolution.

Entity resolution takes two forms: grouping entities that can be functionally considered one "business" for a given context, and linking different representations of the same underlying entity. For example, a chain retail operation might be considered one business in a given context, despite having multiple locations. We again turned to open source solutions, specifically a tool called dedupe, which uses machine learning techniques to cluster, deduplicate, and link entities using a variety of criteria and fields. We were able to conduct a number of matching runs using business name, address, city, and the names of principals to build out a list of linked entities, eliminating the need to manually link tens of thousands of entities.
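A matching run with dedupe looks roughly like this; the field list mirrors what we matched on, the file name is a placeholder, and the exact API calls vary between dedupe versions.

```python
import csv
import dedupe

# The fields we matched on; the declared type tells dedupe which
# string comparators to use for each one.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "city", "type": "ShortString"},
    {"field": "principal", "type": "String", "has missing": True},
]

with open("businesses.csv") as f:  # placeholder export
    data = {i: row for i, row in enumerate(csv.DictReader(f))}

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Active learning step: a person labels a few dozen candidate pairs
# as duplicate / distinct, and dedupe learns weights from the labels.
dedupe.console_label(deduper)
deduper.train()

# Cluster records that appear to be the same underlying entity.
for record_ids, scores in deduper.partition(data, threshold=0.5):
    print(record_ids, scores)
```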

We have more to do with these technologies and with this data. We want to add a map interface to our business search portal, enabling users to search for businesses within a certain geographic area and ask questions like: how many businesses were formed in our main street district since we altered the zoning? We would also like to optimize our search backend; since we undertook this work, more powerful open source solutions have been developed for that piece as well.
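We haven't built the geographic piece yet, but with PostGIS it could be as simple as a radius filter on the search index; everything below (table, columns, the choice of PostGIS itself) is a sketch of one possible approach, not a description of our system.

```python
# Find businesses formed since a given date within a radius (meters)
# of a point, e.g. the center of a main street district.
radius_sql = """
    SELECT bus_id, name, date_formed
    FROM business_search
    WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
        %(meters)s
    )
    AND date_formed >= %(since)s
"""
```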

Liberating this data has been a fun challenge. It has demonstrated to us the value of investing in open source solutions and approaches, and by making this data open we have learned how valuable it is to many users across the state: not only other state agencies but also economic development organizations, regional planning associations, and chambers of commerce.

Sasha Cuerda
September 14, 2017