Incomplete data are common and can undermine evidence-based policymaking. One example is the Business Longitudinal Analysis Data Environment (BLADE), an Australian government national data asset.
In this paper, we aimed to help BLADE practitioners achieve greater data coverage.
To do this, we developed PyImpuyte, a Python package that expands data coverage by using artificial intelligence and machine learning to impute missing values.
Using PyImpuyte, we carried out a series of repeated controlled machine learning experiments to predict missing values. We then evaluated each algorithm’s performance.
We found:
- the ensemble family of algorithms performed best
- the extra trees regressor was most accurate and efficient.
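The kind of repeated, controlled experiment described above can be sketched as follows. This is a minimal illustration, not PyImpuyte's actual API: the function names (`run_experiments`, `mean_imputer`, `knn_imputer`), the synthetic data and the two simple stand-in imputers are all assumptions made for the example. The idea is to hide a share of known values, predict them with each candidate method, and compare error and run time across repeats.

```python
import math
import random
import statistics
import time


def mean_imputer(train_x, train_y, test_x):
    """Baseline (hypothetical): predict the mean of the observed targets."""
    mean = statistics.fmean(train_y)
    return [mean for _ in test_x]


def knn_imputer(train_x, train_y, test_x, k=3):
    """Stand-in learner (hypothetical): average the k nearest observed rows."""
    predictions = []
    for row in test_x:
        nearest = sorted(
            zip(train_x, train_y),
            key=lambda pair: math.dist(pair[0], row),
        )[:k]
        predictions.append(statistics.fmean([y for _, y in nearest]))
    return predictions


def rmse(actual, predicted):
    """Root mean squared error between true and imputed values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(statistics.fmean(squared))


def run_experiments(features, target, imputers,
                    missing_rate=0.2, repeats=5, seed=0):
    """Repeatedly hide known targets, impute them, and score each imputer."""
    rng = random.Random(seed)
    n = len(target)
    n_hidden = max(1, int(n * missing_rate))
    scores = {name: [] for name in imputers}
    seconds = {name: 0.0 for name in imputers}

    for _ in range(repeats):
        # Treat a random subset of known values as "missing".
        hidden = sorted(rng.sample(range(n), n_hidden))
        hidden_set = set(hidden)
        train_x = [features[i] for i in range(n) if i not in hidden_set]
        train_y = [target[i] for i in range(n) if i not in hidden_set]
        test_x = [features[i] for i in hidden]
        test_y = [target[i] for i in hidden]

        for name, impute in imputers.items():
            start = time.perf_counter()
            predicted = impute(train_x, train_y, test_x)
            seconds[name] += time.perf_counter() - start
            scores[name].append(rmse(test_y, predicted))

    return {
        name: {
            "mean_rmse": statistics.fmean(scores[name]),
            "seconds": seconds[name],
        }
        for name in imputers
    }


# Synthetic data for illustration only (not BLADE data).
data_rng = random.Random(42)
features = [(data_rng.uniform(0, 10), data_rng.uniform(0, 10))
            for _ in range(200)]
target = [3.0 * a + 2.0 * b + data_rng.gauss(0, 0.5) for a, b in features]

results = run_experiments(
    features, target, {"mean": mean_imputer, "knn": knn_imputer}
)
for name, summary in results.items():
    print(name, round(summary["mean_rmse"], 3), round(summary["seconds"], 4))
```

Because each imputer is just a callable taking observed rows, observed targets and the rows to fill, other learners (for example an extra trees regressor from a machine learning library) could be wrapped in the same signature and compared under identical conditions.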
We are now applying these algorithms to enhance coverage in other data sets.
Authors: Marcus Suresh, Ronnie Taib, Yanchang Zhao and Warren Jin.