One of the FFC blogs draws attention to the issue of missingness in the data, and suggests ways to address that - see http://www.fragilefamilieschallenge.org/missing-data/
Is anyone taking up this challenge and with any progress?
So far I've either ignored the issue (through listwise deletion) or treated missing values as different kinds of valid values (typically for the Don't Know and Refused codes). I know this is less than optimal -- but it's a lot quicker than writing/learning new routines. My favourite source on this issue is Paul Allison, Missing Data, in the Sage 'little green books' series.
Posted by: the_Brit @ April 23, 2017, 3:18 p.m.
That's interesting that you've been relying entirely on deletion and recoding of the missing values, particularly since we can see that your method works (nice scores you've got there)!
So far I've done all my analyses on multiply imputed datasets. There are a few tricky aspects to this:
- To the best of my knowledge, no package can impute the vast number of missing values in this dataset with only 4242 rows.
- Many columns need to be removed from the imputation analyses for not being meaningful, sure, but this is (1) time-consuming, and (2) even then there will probably be too many columns left to run the analysis.
- Some background variables we are interested in are coded in somewhat funny ways, and one would need to recode hundreds of columns to get them to impute correctly.
My approach has been to use a machine learning technique that can handle missing values to identify features of interest, and then I've imputed datasets containing just those features. Then my prediction models only use those variables. This has worked okay, but my sense is that I am still missing some features that would be worthwhile to have.
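To make that concrete, here's a rough Python sketch of the general pattern on toy data (not the actual FFC files). The correlation screen over pairwise-complete cases is just a stand-in for whatever missing-tolerant learner you use, and plain mean imputation stands in for a proper multiple-imputation run on the reduced dataset; all the names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the background data: 200 rows, 10 features, ~30% missing.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.5, size=n)
X[rng.random(size=(n, p)) < 0.3] = np.nan  # inject missingness

def screen_features(X, y, k):
    """Rank features by |corr(x_j, y)| over pairwise-complete cases
    (a missing-tolerant screen) and return the top-k column indices."""
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        obs = ~np.isnan(X[:, j])
        if obs.sum() > 2:
            scores[j] = abs(np.corrcoef(X[obs, j], y[obs])[0, 1])
    return np.argsort(scores)[::-1][:k]

selected = screen_features(X, y, k=3)

# Impute only the selected columns (column means here, as a crude
# stand-in for running the imputation model on the reduced data).
X_sel = X[:, selected]
col_means = np.nanmean(X_sel, axis=0)
X_imp = np.where(np.isnan(X_sel), col_means, X_sel)
```

The prediction model then sees only `X_imp`, so the imputation step stays tractable.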
Imputation methods also fall short as soon as you need to produce a single prediction value. Imputations tell you more about the uncertainty in your imputed variables, and R packages like mice or Amelia and Zelig let you fit models that account for that uncertainty. You're not really supposed to just take a mean or median, but at some point in the process one has to if one is taking this sort of approach. I don't know of a better way around that.
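For what it's worth, the pooling step looks something like this in a toy Python sketch. The stochastic-regression imputer below is a crude stand-in for what mice/Amelia actually draw, and the single prediction is just the across-imputation mean; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one predictor with 25% missing values, continuous outcome.
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
x_obs = x.copy()
x_obs[rng.random(n) < 0.25] = np.nan

def impute_once(x, y, rng):
    """One stochastic imputation: fit x ~ y on the observed cases and
    draw missing x from that conditional (stochastic regression
    imputation -- a crude stand-in for a mice/Amelia draw)."""
    out = x.copy()
    miss = np.isnan(out)
    A = np.column_stack([np.ones((~miss).sum()), y[~miss]])
    coef, *_ = np.linalg.lstsq(A, out[~miss], rcond=None)
    resid_sd = np.std(out[~miss] - A @ coef)
    out[miss] = coef[0] + coef[1] * y[miss] + rng.normal(scale=resid_sd,
                                                         size=miss.sum())
    return out

m = 20
x_new = np.array([0.0, 1.0, 2.0])  # points we need predictions for
preds = []
for _ in range(m):
    xi = impute_once(x_obs, y, rng)
    A = np.column_stack([np.ones(n), xi])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    preds.append(beta[0] + beta[1] * x_new)

# Pool: the point prediction is the across-imputation mean; the spread
# shows how much imputation uncertainty moves the predictions around.
point = np.mean(preds, axis=0)
spread = np.std(preds, axis=0)
```

Taking the mean across imputations is the "at some point one has to" step: it collapses the uncertainty that the m datasets were built to express.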
I suppose it isn't all that surprising that listwise deletion can be as successful as, or more successful than, imputation, though. Individuals in later waves, with more complete background data, are more likely to also appear in the age-15 data, so imputing loads of data to predict values for individuals who don't even appear in the test set may not add much value to one's predictions.
Posted by: dremalt @ April 23, 2017, 7:02 p.m.
Thanks for the interesting thoughts.
I guess we are coming up with predictions *conditional* on people being participants at the age-15 interview, but we don't know who participated and who did not in the background data. In the training data, 'NA' covers all eventualities, whether attrition from the study, a Don't Know, or another kind of missing data. I think we have 655/2121 cases (30.9%) with all values missing (presumably non-response, in most cases?), 1014 cases (47.8%) with no missing values, and 452 cases (21.3%) with some values missing and some not. Intuitively it seems likely that unit non-response (attrition) is driven by something different from item non-response (missing a particular question).
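For anyone wanting to reproduce that kind of breakdown, the pattern counts are easy to compute. Here's a Python sketch on simulated data -- the missingness rates below are made up, and the real counts of course come from the actual training file:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for the training outcomes: 2121 cases by six outcome
# columns, with NaN marking a missing value.
n_cases, n_outcomes = 2121, 6
Y = rng.normal(size=(n_cases, n_outcomes))
# Knock out whole rows (attrition-style) and single cells (item-style);
# the 30% and 5% rates are invented for illustration.
Y[rng.random(n_cases) < 0.30, :] = np.nan
Y[rng.random((n_cases, n_outcomes)) < 0.05] = np.nan

miss = np.isnan(Y).sum(axis=1)
all_missing = int((miss == n_outcomes).sum())
none_missing = int((miss == 0).sum())
partial = n_cases - all_missing - none_missing

for label, k in [("all missing", all_missing),
                 ("none missing", none_missing),
                 ("partial", partial)]:
    print(f"{label}: {k} ({100 * k / n_cases:.1f}%)")
```

Splitting the cases into those three groups is also a quick way to check whether the all-missing group really behaves like attrition.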
In terms of imputation methods, a lot of longitudinal studies use past values to help impute missing data -- so if a variable/feature is missing at this wave, start from the value at the previous wave. Of course, that relies on consistent questions, and ideally (!) consistent names of variables/features, over time.
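As a minimal illustration of the starting-value idea, carrying the previous wave's value forward looks like this (a toy Python sketch; in a real imputation model you'd more likely use the prior-wave value as a predictor rather than plain carry-forward):

```python
import numpy as np

# Toy wide-format data: one variable measured at waves 1-5 per row,
# NaN where the item is missing at that wave.
waves = np.array([
    [1.0, np.nan, 3.0, np.nan, np.nan],
    [np.nan, 2.0, np.nan, 4.0, 5.0],
])

def carry_forward(row):
    """Fill each missing wave with the most recent observed value
    (last observation carried forward); leading NaNs stay missing,
    since there is no earlier wave to borrow from."""
    out = row.copy()
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            out[t] = out[t - 1]
    return out

filled = np.array([carry_forward(r) for r in waves])
```

As noted above, this only works when the question (and ideally the variable name) is consistent across waves.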
Anyway, I'm teaching a lot next week, so won't be doing much if any analysis until May. Too many interesting questions, not enough time!
Posted by: the_Brit @ April 23, 2017, 7:54 p.m.
It's not obvious to me why multiple imputation is germane or preferable for the task in the challenge. I usually think of the advantage of multiple imputation over the best single imputation as lying in correct variance estimation. So my intuition would be that multiple imputation would be valuable for forecasting the uncertainty of predictions, but not for making the predictions that minimize the out-of-sample root MSE.
Posted by: jeremyfreese @ April 26, 2017, 4:45 p.m.