Data Build Process

Incorporating data from a multitude of sources raises issues such as standardization, de-duplication, and predictive AI enhancements. We discuss each of these here.

Standardization

Most data sources organize their data in their own way, so it needs to be normalized before importing. Common fields with differing values include company employee range estimates, industry classifications, and office addresses. For example, one source may list a company's industry as "Software" while another reports it as "Technology".

To maintain a single common schema, our team normalizes each new source before importing it and maps all of its possible values to our internal schema. Then, as data is imported and merged, it is normalized using that source's data mapping.
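
As an illustration only, a per-source value mapping could look like the sketch below. The source names (`source_a`, `source_b`), the internal value "Software & IT Services", and the function `normalize_industry` are all hypothetical and are not taken from our actual schema or mappings.

```python
from typing import Optional

# Hypothetical mappings: one dictionary per source, translating that
# source's raw industry labels into a single internal schema value.
INDUSTRY_MAP = {
    "source_a": {"software": "Software & IT Services"},
    "source_b": {"technology": "Software & IT Services"},
}


def normalize_industry(source: str, raw_value: str) -> Optional[str]:
    """Return the internal industry value, or None if the label is unmapped."""
    mapping = INDUSTRY_MAP.get(source, {})
    # Trim whitespace and lowercase so "Software " and "software" match.
    return mapping.get(raw_value.strip().lower())


if __name__ == "__main__":
    print(normalize_industry("source_a", "Software"))
    print(normalize_industry("source_b", "Technology"))
```

In this sketch both labels resolve to the same internal value, and an unmapped label returns None so that it can be reviewed rather than silently imported.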

De-Duplication

Our database is built around company domains. We believe that an active website is a good indicator that a company is still in business, so the domain serves as the unique primary key to which records from all sources are reduced.

When merging and de-duplicating sources, we group records using similar keys and internally developed predictive matching algorithms.
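
The key-based part of this grouping can be sketched roughly as follows. This is a minimal example under stated assumptions: the record layout, the `website` field name, and the helpers `domain_key` and `group_by_domain` are hypothetical, and the predictive matching algorithms are not represented here at all.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_key(website: str) -> str:
    """Reduce a URL or bare hostname to a lowercase domain key."""
    value = website.strip().lower()
    if "://" not in value:
        # Prefix "//" so urlsplit parses a bare hostname as the network location.
        value = "//" + value
    host = urlsplit(value).hostname or ""
    # Treat "www.example.com" and "example.com" as the same company domain.
    return host[4:] if host.startswith("www.") else host


def group_by_domain(records):
    """Group source records (dicts with a 'website' field) by domain key."""
    groups = defaultdict(list)
    for record in records:
        key = domain_key(record.get("website", ""))
        if key:  # records without a usable domain are left out of this sketch
            groups[key].append(record)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"name": "Acme", "website": "https://www.acme.com/about"},
        {"name": "Acme Inc", "website": "acme.com"},
        {"name": "Other Co", "website": "http://other.io"},
    ]
    for domain, members in group_by_domain(sample).items():
        print(domain, [m["name"] for m in members])
```

Each resulting group holds the records that share a domain and would then be merged into a single company entry.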

AI Enhancements

A number of the fields we provide are predictive, AI-powered enhancements based on other data points. Examples include annual revenue estimates for private companies and descriptive tags such as whether a business is B2B, B2C, or B2G.

We are constantly working to add new models and improve the existing ones. If you have a request for a new data point you'd like to see added, feel free to reach out to our team.