Data Quality Testing for Data Platforms

Jan 19, 2021

I’m currently working on a new approach to data-quality testing for a data platform (data warehouse). The whole approach is based on the assumption that you can estimate the probability distribution of an attribute (e.g. a column in a table) from its quantiles and then compare the distributions of subsequent loads/days.

  1. Divide the data of an attribute into buckets of equal size (e.g. with the NTILE function in MSSQL) and store the start/end value of each bucket.
  2. Calculate a numeric value for each bucket start/end (convert strings to numbers).
  3. Create a histogram for each of the two loads to compare (e.g. the latest and the penultimate load).
  4. Calculate the distance between the two histograms (e.g. the Earth Mover's Distance).
  5. Compare the distance over time with something like outlier detection: either the distance should be 0 (nothing changed) or similar to that of the previous days.
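
To make this more concrete, here is a minimal sketch of steps 1 and 3–5 in Python (NumPy/SciPy), assuming the attribute values are already numeric. The bucket count, the shared bin edges and the z-score threshold are illustrative choices on my part, not fixed parts of the approach.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def bucket_boundaries(values, n_buckets=100):
    """Step 1: equal-size buckets via quantiles (roughly what NTILE gives you)."""
    quantiles = np.linspace(0, 1, n_buckets + 1)
    return np.quantile(np.asarray(values, dtype=float), quantiles)


def histogram_distance(load_a, load_b, n_buckets=100):
    """Steps 3-4: histogram both loads on a shared grid and compute the EMD."""
    edges = np.unique(bucket_boundaries(np.concatenate([load_a, load_b]), n_buckets))
    hist_a, _ = np.histogram(load_a, bins=edges)
    hist_b, _ = np.histogram(load_b, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2
    return wasserstein_distance(centers, centers, hist_a, hist_b)


def looks_anomalous(previous_distances, new_distance, z_threshold=3.0):
    """Step 5: crude outlier check against the distances of the previous days."""
    if new_distance == 0 or len(previous_distances) < 5:
        return False
    mean, std = np.mean(previous_distances), np.std(previous_distances)
    return std > 0 and abs(new_distance - mean) / std > z_threshold
```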

I’m curious whether it will work and deliver tangible results. In my opinion the approach is quite innovative, as it’s purely based on the distributions of attributes rather than on rules. Also, if it works as intended, it should be universally applicable thanks to the generic approach. On the other hand, step #2 (the string-to-number conversion) is quite outlandish, but I’d say it ain’t stupid if it works.
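
One conceivable way to do the string-to-number conversion of step #2 is to read the first few bytes of a string as digits of a base-256 fraction, which preserves the lexicographic order of the values. This is just a sketch of one option (the prefix length is an arbitrary choice), not a settled part of the approach.

```python
def string_to_number(s: str, prefix_len: int = 8) -> float:
    """Map a string to a float in [0, 1) by treating its first bytes as
    base-256 digits; lexicographically larger strings map to larger numbers
    (up to the chosen prefix length)."""
    value = 0.0
    for i, byte in enumerate(s.encode("utf-8")[:prefix_len]):
        value += byte / 256 ** (i + 1)
    return value
```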

What do you think about the approach? Feasible, or a stupid assumption? Are you aware of anything similar for data quality testing? I’m curious to hear what you think and will probably share the code once it’s in a usable state.
