The questions used to measure materialHardship in the wave-6 data are replicated in the fifth-wave (year 9) data, so it made sense to me to create a version of the target variable based on the earlier wave. In my homespun R (see end of post) I created 'MH' as a replication of materialHardship at the earlier wave. I then ran a simple OLS model with MH converted into a series of 0/1 dummy variables (one-hot encoded features, as I think the data-science people would say). That model got an MSE on the leaderboard of 0.02664. Not a leading score, but in the top third or so of submissions, and one assumes unlikely to have any element of 'over-fitting'(!).
A naïve mean submission returns a score of 0.02880, so this improves on it by about 7.5%, or roughly half-way to the long-time leading score of 0.0431.
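The improvement figure quoted above can be reproduced directly from the two leaderboard numbers in the post; this is just the arithmetic, not anything from the challenge data itself.

```r
# Sketch: where the ~7.5% improvement figure comes from
# (both MSE values are taken from the post).
naive_mse <- 0.02880   # naive mean-prediction benchmark
model_mse <- 0.02664   # leaderboard MSE of the dummy-variable OLS
improvement <- (naive_mse - model_mse) / naive_mse
round(100 * improvement, 1)   # about 7.5 (percent)
```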
Example R code to do this:
# for material hardship: recode each wave-5 item (1 = yes, 2 = no) into a 0/1 dummy
backg$MH1[backg$m5f23a == 1] <- 1
backg$MH1[backg$m5f23a == 2] <- 0
# ... the same recode is repeated for m5f23b through m5f23j (MH2 to MH10) ...
backg$MH11[backg$m5f23k == 1] <- 1
backg$MH11[backg$m5f23k == 2] <- 0
# MH combines the eleven items (that step is not shown here)
backg$MH[is.na(backg$MH)] <- 0 # turn all missing into zero
# simple OLS model (m1 is the training data merged with the background data)
lm(formula = materialHardship ~ as.factor(MH), data = m1)
(The model has an R-squared of around 0.16.)
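For what it's worth, the item-by-item recode above can also be written as a single loop over all eleven wave-5 items (m5f23a through m5f23k, the item names from the post). This is only a hedged sketch: the toy 'backg' data frame below is purely illustrative, standing in for the real background data.

```r
# Recode all eleven wave-5 hardship items in one loop.
item_names <- paste0("m5f23", letters[1:11])   # "m5f23a" ... "m5f23k"

# Toy stand-in for the background data: each item coded 1/2 with some NAs.
backg <- as.data.frame(setNames(
  replicate(11, c(1, 2, 2, 1, NA), simplify = FALSE), item_names))

for (i in seq_along(item_names)) {
  x <- backg[[item_names[i]]]
  # 1 = hardship reported -> 1; 2 = not reported -> 0; anything else stays NA
  backg[[paste0("MH", i)]] <- ifelse(x == 1, 1, ifelse(x == 2, 0, NA))
}
```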
Treating the missings as zero is not great, of course. But if the wave-5 missings are also missing in wave-6, that ought to mitigate any issues (I thought).

Posted by: the_Brit @ May 10, 2017, 8:53 a.m.
My other advice would be to look at summary statistics of your predictions for material hardship before submission, as a couple of times I've had negative values produced by OLS and similar models. Since material hardship cannot be negative, a zero value must be closer to the true value than a negative one.

Posted by: the_Brit @ June 7, 2017, 12:35 p.m.
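The check-then-floor routine described above can be sketched in a few lines. This is a hedged illustration: 'fit', 'train', and the prediction inputs are hypothetical stand-ins, chosen so that the OLS line actually produces a negative prediction.

```r
# Inspect prediction summaries before submitting, then floor negatives at zero.
train <- data.frame(x = c(0, 1, 2, 3), y = c(0.0, 0.1, 0.2, 0.5))
fit   <- lm(y ~ x, data = train)

preds <- predict(fit, newdata = data.frame(x = c(-3, 0, 2)))
summary(preds)            # look for negative values here before submitting
preds <- pmax(preds, 0)   # material hardship can't be negative, so floor at zero
```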