MSE for the best on the leaderboard is currently 0.21176, or around 3.7% lower than just using the mean as a prediction (MSE=0.21997). My best submission returned 0.21364, which would place me at 5th if I re-submitted it now (my latest submission did worse). I hope it is OK to offer some thoughts.
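The "around 3.7% lower" figure can be sanity-checked from the two MSE values quoted (a quick arithmetic sketch, nothing more):

```python
# Relative improvement of the leaderboard best over the mean-prediction baseline:
# (baseline MSE - best MSE) / baseline MSE.
baseline_mse = 0.21997   # MSE from predicting the mean for everyone
best_mse = 0.21176       # current leaderboard best

improvement = (baseline_mse - best_mse) / baseline_mse
print(f"{improvement:.1%}")  # -> 3.7%
```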
A few points, to begin. First, in Britain we recognise the term 'grit' readily, but social scientists tend not to use it; probably the nearest label is 'resilience'. Second, I've been sticking to 'social science', which I know, rather than 'data science', which I know rather less well, so my best-scoring model was a simple OLS regression with only a small number of additive terms (i.e. no interactions). Third, I'm interested in trying to understand these things as much as, if not more than, predicting them.
Since grit is measured from the responses of the child, it seemed to me sensible to look at the responses of the child. I got the most traction from the k5 variables – particularly closeness to mother and attitudes to getting work done – plus the gender of the child. I haven't used anything relating to the mother or family … but it is still early days. I haven't tried mixing and matching those terms to see which combinations of variables score better, and I don't plan to.
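For anyone curious what "a small number of additive terms" looks like in practice, here is a minimal sketch of that kind of OLS model. The column names (the k5 closeness and work-attitude items, the child-gender flag) are illustrative placeholders, not the actual survey variable names, and the data here is synthetic:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real survey data.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "grit": rng.normal(3.0, 0.5, n),           # outcome on its survey scale
    "k5_close_mother": rng.integers(1, 5, n),  # hypothetical closeness-to-mother item
    "k5_finish_tasks": rng.integers(1, 5, n),  # hypothetical getting-work-done item
    "child_female": rng.integers(0, 2, n),     # gender of the child
})

# Additive terms only -- no interactions, as described in the post.
model = smf.ols(
    "grit ~ k5_close_mother + k5_finish_tasks + child_female", data=df
).fit()
print(model.params)
```

The appeal of a model this small is that every coefficient is directly interpretable, which suits the "understand rather than just predict" aim.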
It seems to me that grit is potentially quite an important concept. The relative lack of improvement compared to a mean prediction makes me question how easy it is to find its causes.
Anyway – happy modelling!
Thank you, the_Brit.
It is great to know that grit is mostly about the child's answers. Limiting the set of features to consider is really helpful for data-driven models. I had a similar observation about GPA: my best model uses features from wave 5 only, and I didn't observe any significant contribution from features in the earlier surveys.
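Restricting to one wave is easy if the column names encode the wave number. The sketch below assumes a prefix convention where the second character of a variable name gives the wave (e.g. k5..., p5...); that convention and the column names are illustrative, so adjust to the real codebook:

```python
import pandas as pd

# Toy stand-in for the challenge's feature matrix.
df = pd.DataFrame({
    "challengeID": [1, 2],
    "k5conf1": [3, 4],    # hypothetical wave-5 child item
    "m4intmon": [7, 2],   # hypothetical wave-4 mother item
    "p5l12a": [1, 0],     # hypothetical wave-5 caregiver item
})

# Keep only columns whose second character marks wave 5.
wave5_cols = [c for c in df.columns if len(c) > 1 and c[1] == "5"]
X = df[wave5_cols]
print(wave5_cols)  # -> ['k5conf1', 'p5l12a']
```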
I am wondering: is there a way to group children by location? I haven't seen any feature that indicates location. Do you have any suggestions?
Posted by: ovarol @ April 18, 2017, 3:15 p.m.
Thanks, ovarol.
I think the survey researchers have generally decided to conceal the locations of respondents, presumably to protect anonymity, so I don't think there are clear geographic identifiers/features anywhere. I'd be happy to be told otherwise, too. A partial exception is innatsm (in the national sample, rather than a top-up of other cities - or so I believe). I tried innatsm in a few early models, and it helped a tiny bit, but as I've developed ideas I've dropped it.
I'll have to look more closely at GPA, thanks, as my model for that is rather weak. I suspect it may be more amenable to data-driven approaches than some of the other outcomes -- though perhaps that's just a lame excuse on my part!