Generative data modelling for diverse populations in Africa: insights from South Africa

Simmons, S. S., Hagan Jr, J. E. & Schack, T. (2025). Generative data modelling for diverse populations in Africa: insights from South Africa. Information, 16(7). https://doi.org/10.3390/info16070612

Studies on the demography and health of racially diverse African populations are scarce, largely because of persistent data challenges. Generative data modelling has emerged as a valuable way to address this problem. This study therefore examined the efficacy of Conditional Tabular GAN (CTGAN), CopulaGAN, and Tabular Variational Autoencoder (TVAE) for generating synthetic but realistic demographic and health data. The study used the South African data (n = 4227) from Wave 1 of the World Health Organisation Study on global AGEing and adult health (SAGE). Information missing from SAGE Wave 1, including demographic (e.g., race, age) and health (e.g., hypertension, blood pressure) indicators, was imputed using Generative Adversarial Imputation Nets (GAIN). CopulaGAN, CTGAN, and TVAE, sourced from the sdv 1.24.1 Python library, generated 104,227 synthetic records based on the SAGE data constituents. The outcomes were assessed with similarity and machine learning (XGBoost) augmentation metrics (sourced from the sdmetrics 0.21.0 Python library), including column shapes, overall, and precision ratio scores. In general, the GAIN imputations produced data with no missing information and with properties comparable to the original. CTGAN's overall quality score (89.20%) was above those of CopulaGAN (88.45%) and TVAE (86.50%). These findings underscore the usefulness of generative data modelling in addressing data quality challenges in diverse populations to enhance actionable health research and policy implementation.
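
As an illustration of the column shapes idea mentioned in the abstract, the sketch below compares the marginal distribution of one categorical column in real and synthetic data. It uses the complement of the total variation distance, the statistic that sdmetrics documents for categorical columns. This is a minimal standard-library sketch, not the authors' code and not a call into sdmetrics; the function name `tv_complement` and the toy columns are hypothetical and are not SAGE data.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance between two categorical columns.

    A score of 1.0 means the category frequencies match exactly;
    0.0 means the two columns share no categories at all.
    """
    if not real or not synthetic:
        raise ValueError("both columns must be non-empty")
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    # Compare relative frequencies over every category seen in either column.
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


# Hypothetical toy columns for illustration only (assumed values, not study data).
real_hypertension = ["yes", "no", "no", "no"]
synthetic_hypertension = ["yes", "yes", "no", "no"]
score = tv_complement(real_hypertension, synthetic_hypertension)
```

Averaging such per-column scores over all columns gives one summary of how closely the synthetic marginals follow the real ones; numerical columns would need a different statistic, which this sketch does not cover.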

Published Version
Creative Commons: Attribution 4.0