Developing statistical methods to handle Structured Missingness in large complex databases

Project ID: 2228cd1430 (You will need this ID for your application)

UCL Lead department: Statistical Science

Lead Supervisor: Robin Mitra

Project Summary:

There is an urgent need to develop principled and well-understood methods that can process and analyse large complex databases. These databases are often combined at scale across different modalities and present unique missing data challenges. In particular, the phenomenon of Structured Missingness (SM) is increasingly encountered. SM arises when the missing values themselves have an underlying association, and it transcends the traditional framework used to characterise missing data mechanisms. As a result, SM impedes analysis of the data, and addressing it is crucial.
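As a purely illustrative sketch of what an association between missing values can look like, the Python snippet below builds a toy merged database in which two genomic variables are recorded only for records that came from a genomic source. The variable names (`age`, `gene_a`, `gene_b`), the sample sizes, and the use of a simple correlation between missingness indicators are all assumptions made for illustration; they are not part of the project or a proposed method.

```python
import random


def missingness_indicators(rows, columns):
    """Return one 0/1 list per row: 1 where the value is missing (None)."""
    return [[1 if row.get(col) is None else 0 for col in columns] for row in rows]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Hypothetical merged database: every record has clinical data (age), but
# only every second record was linked to a genomic source, so both genomic
# variables are missing together for the remaining records.
rows = []
for i in range(200):
    has_genomics = i % 2 == 0
    rows.append({
        "age": rng.gauss(50, 10),
        "gene_a": rng.gauss(0, 1) if has_genomics else None,
        "gene_b": rng.gauss(0, 1) if has_genomics else None,
    })

indicators = missingness_indicators(rows, ["gene_a", "gene_b"])
miss_a = [r[0] for r in indicators]
miss_b = [r[1] for r in indicators]
structured_corr = pearson(miss_a, miss_b)

# Contrast: each value goes missing independently with probability 0.5,
# so the missingness indicators carry no shared structure.
independent = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(2000)]
independent_corr = pearson([r[0] for r in independent],
                           [r[1] for r in independent])

print(f"block-wise missingness: corr = {structured_corr:.2f}")
print(f"independent missingness: corr = {independent_corr:.2f}")
```

In the block-wise case the two indicators coincide exactly, so their correlation is 1, whereas in the independent case it is close to 0. Real instances of SM are typically far less clean than this toy example.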

This project will establish the theoretical properties of SM and, in doing so, develop statistical methods to mitigate the challenges it poses. In particular, we will assess the potential for SM to introduce bias into analyses and inflate variances. We will consider developing bespoke Bayesian models, as well as state-of-the-art machine learning methods, to account for SM. We will also seek to address SM at the design (i.e. data collection) stage. Additionally, we will explore the paradigm-shifting perspective in which SM is treated as information in its own right, to be leveraged and incorporated into the analysis.

A key objective when using large databases is to develop good predictive models that also attach an appropriate level of uncertainty to their predictions. The presence of SM complicates this, and it is crucial to develop methods that both optimise models’ predictive performance when faced with incomplete data and reflect the appropriate level of uncertainty in the relevant prediction intervals. The project will start by taking this as its primary goal and will look to develop methods in particular for complex medical databases that combine information from clinical, genomic, and electronic health records. However, the methodology will be developed in a general way, with a view to being applicable to any database affected by SM.