02 May Progress 8: Looks like Data Garbage to me.
(UPDATES – see DataLab posts via link at the bottom for an interesting challenge, my response to it, and other bits of analysis.)
I’ve been looking into the detail of Progress 8, following the DFE guidance that was published in March. Is it just me or is this a massive pile of data garbage? A house of cards of data validity dressed up to look rigorous? I’m seriously worried that we’re being duped by the wizards of Data Delusion. I’m also hugely disappointed. A while ago, I attended a Heads’ Roundtable gathering where DFE data-King Tim Leunig outlined the ideas behind Attainment 8 and Progress 8. It all sounded pretty sensible: measures that encompass a broad range of subjects (not just five), where every grade counts (not just C+) and where progress against prior attainment would take priority over narrow benchmarks.
I’ve waited to see the details behind the mechanism and now that I have, I’m horrified. It’s all so convoluted; so removed from what learning looks like, turning ‘Progress’ into some kind of absolute metric.
To begin with, there is an input measure – a fine sublevel – that is derived from the raw scores on two tests in different subjects. If you read my posts The Data Delusion or The Assessment Uncertainty Principle, you will see how far we move away from understanding learning even with raw marks. However, it appears that raw marks in different subjects are to be put through a convoluted mincing machine where 74 and 77 become 5.1. One number representing EVERYTHING a student has learned at KS2. On average.
Then, we take these fine sublevels and line them up against the pattern of outcomes at GCSE over time. Using the new Attainment 8 measure (uncontroversially adding up points scores for GCSE grades; arbitrarily giving extra weight to Maths and English), we get an estimated outcome for each KS2 fine sublevel. Sounds reasonable – but look at the table. FOUR SIGNIFICANT FIGURES????? For an Estimate??
Let’s spell this out. A student with a fine KS2 level of 4.7 might be expected to get an Attainment 8 score of 50.67, whereas a student whose data mangle yielded 4.8 might be expected to achieve a score of 52.84??? You have GOT TO BE KIDDING! For an Estimate?? It’s the nerdiest data joke ever – told on a national scale. Any A level scientist could tell you that you can’t exceed the accuracy of your measuring device simply by using it over and over again and you can’t ever increase accuracy beyond the limit of any one piece of data in a calculation. The best you could hope for in this mapping exercise is to say 4.7 maps to 51 and 4.8 maps to 53 (and that is assuming that, on average, linear progress for a whole cohort is actually ‘expected’). Actually ‘to the nearest 5’ might be more reasonable.
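To make the rounding point concrete, here is a minimal Python sketch. The two estimates are the ones quoted from the guidance table; the helper name `round_to` is my own, purely for illustration.

```python
def round_to(value, step):
    """Round value to the nearest multiple of step."""
    return step * round(value / step)

# KS2 fine level -> Attainment 8 estimate, as quoted from the guidance table
estimates = {4.7: 50.67, 4.8: 52.84}

for level, estimate in estimates.items():
    nearest_point = round_to(estimate, 1)
    nearest_five = round_to(estimate, 5)
    print(level, estimate, nearest_point, nearest_five)
```

To the nearest whole point the estimates come out as 51 and 53; to the nearest 5 they become 50 and 55.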
For me, the whole credibility of the Progress 8 is ripped to bits by this table; the people who have designed it appear to know no more about data than my eager Y12s who can’t bear it when you tell them to ignore all the dazzling decimal places on their calculators – because they aren’t real!
But there is hope. The DFE does recognise that Statistical Significance needs to be applied. There is a whole Appendix dedicated to it in the guidance explaining the concept of a 95% confidence interval and the use of error bars to highlight significance:
This all adds weight to the delusion. It says ‘we’re serious about data; we’re not dumbing it down; we know that errors matter’. There’s a handy graph to really spell it out:
But then we get to the crux. Despite all the four sig fig nonsense, we actually end up with an outcome, in the worked example, where Progress 8 is 0.3 +/- 0.2. In other words, it is 95% certain to fall somewhere between 0.1 and 0.5. (Coincidentally, these are the same numbers for my school.) What we end up with is a super-crude 1 significant figure number falling somewhere within a range that is bigger than the number itself. Essentially, the whole palaver divides the Progress measure into three categories: significantly above; average; significantly below. That’s it. The numbers actually don’t tell us anything significant at all.
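The three-category outcome can be sketched in a few lines of Python. This is only an illustration: the function name `classify` is hypothetical, and it assumes the national average sits at zero and that the margin is the half-width of the 95% confidence interval.

```python
def classify(score, margin):
    """Return the only distinction a score with a 95% interval supports.

    Assumes the national average is 0 and margin is the interval half-width.
    """
    lower, upper = score - margin, score + margin
    if lower > 0:
        return "significantly above average"
    if upper < 0:
        return "significantly below average"
    return "not distinguishable from average"

print(classify(0.3, 0.2))   # the worked example: interval 0.1 to 0.5
print(classify(0.1, 0.2))   # interval straddles zero
print(classify(-0.4, 0.2))  # interval entirely below zero
```

Whatever decimals go in, only one of three labels comes out.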
I suppose that, as long as we recognise this, we’ll be OK. However, I worry that people will not really understand much about this and they will assume that scores of 0.5, 0.4 and 0.3 are really different; people will assume that schools will have performed better than others even though, within the limits of confidence, that assumption doesn’t hold up. If the error bars overlap – essentially we have to assume that the data doesn’t tell us enough to tell the schools apart. Similarly, if one school ‘improves’ from Prog 8 0.1 to 0.2 from one year to the next, actually they’re kidding themselves. The error bars will overlap to the point that there’s actually a chance they did worse.
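Here is a rough Python sketch of the error-bar comparison. The margin of 0.2 for every school is an assumption borrowed from the worked example in the guidance; real margins depend on cohort size. The function name is hypothetical, and checking for overlap is the crude rule of thumb described above, not a formal significance test.

```python
def intervals_overlap(score_a, margin_a, score_b, margin_b):
    """True if the two confidence intervals share any values."""
    return abs(score_a - score_b) <= margin_a + margin_b

MARGIN = 0.2  # assumed half-width of the 95% interval for every school

# Two schools one tenth apart
print(intervals_overlap(0.4, MARGIN, 0.3, MARGIN))

# One school 'improving' from 0.1 to 0.2 between years
print(intervals_overlap(0.1, MARGIN, 0.2, MARGIN))

# A gap wide enough for the intervals to separate
print(intervals_overlap(0.8, MARGIN, 0.1, MARGIN))
```

With margins of this size, two scores need to differ by more than 0.4 before the bars stop overlapping.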
Will people listen? Of course not. We’ll get league tables of Progress 8 measures ranking schools; Governors and prospective parents across the land will be fretting about the school next door having a higher score – all based on the most convoluted algorithm founded on the data validity equivalent of thin air; a number that says nothing of substance about how much learning has taken place over the course of five years. Nothing.
Is there a better idea?
When we met Tim Leunig back in 2013, I asked why we couldn’t use a more nuanced cohort profiling system to compare outcomes to intake. Tim said that it was too complicated – that people wouldn’t understand it. They’d tested this out. Sigh. How ironic to have a measure of educational outcomes at GCSE where basic GCSE maths is regarded as ‘too complicated’ for people to understand. What I was talking about was moving away from the crude and misleading use of averages – using visual cohort profiles. Imagine that the intake data is divided into cohort deciles. This would remove spurious conversions from marks to levels, to average fine-levels; it would simply tell you the spread of students’ performance on any set of raw data against a national profile – where, by definition, 10% of the national cohort is in each decile. The intake profile could be compared; the outcome profile of Attainment 8 scores could be compared. Here are some examples:
School A has a top-end profile skew on intake. However, the outcomes show a move towards the lower deciles. This school would get a negative Progress 8 score in all likelihood but here you can see the dynamics of it without any averaging out.
School B has an intake skewed to the lower deciles but still has quite a broad range. The outcomes show a significant shift out of the lowest decile and strong improvement in moving students into the higher deciles. This school adds value and you can see where.
School C has an intake that is very close to the average national distribution. However the outcome profile is very different; the lower decile students make strong progress but the higher decile students fall back. This subtle picture would be lost in the averaging of Progress 8 that would probably be close to zero.
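For anyone wondering how a decile profile like these would be built, here is a minimal Python sketch. Everything in it is illustrative: the function name, the nine national cut points and the ten pupil scores are invented, not real data. The same function would serve for intake (KS2 raw marks) and outcomes (Attainment 8), each with its own set of national cut points.

```python
from bisect import bisect_right

def decile_profile(scores, national_cut_points):
    """Percentage of a school's pupils falling in each national decile.

    national_cut_points holds the nine score boundaries that split the
    national cohort into ten equal-sized groups (hypothetical input).
    """
    counts = [0] * 10
    for score in scores:
        # Index 0 is the lowest national decile, index 9 the highest
        counts[bisect_right(national_cut_points, score)] += 1
    return [100 * count / len(scores) for count in counts]

# Invented national boundaries and an invented school of ten pupils
cut_points = [28, 35, 40, 44, 48, 52, 56, 61, 68]
school_scores = [30, 33, 41, 45, 47, 50, 53, 58, 63, 70]

profile = decile_profile(school_scores, cut_points)
for decile, percent in enumerate(profile, start=1):
    print(decile, percent)
```

A school that mirrored the national picture would show 10% in every decile; plotting the intake and outcome profiles side by side gives the kind of comparison described for Schools A, B and C.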
It would be possible to assign a numerical measure that averaged out this decile shift effect; arguably it would be similar to Progress 8 although it would skip over all the arbitrary data-garbage nonsense of fine levels and four sig-fig estimates of outcomes! But what would be the point? Are we really that interested in comparing schools in this way? The data profile would help schools to look at where they’re succeeding and not succeeding relative to the national trend. Parents and inspectors would get a better idea of the profile of the intake and outcomes as well as a feel for the pattern of progress in a way that Prog 8 = 0.3 +/- 0.2 utterly fails.
Sadly, the Progress 8 train is already running down the tracks. Our best hope is to chuck as many pinches of salt in its path as we can! And we should all agree to sound the DATA GARBAGE KLAXON any time someone says School A, Prog 8 0.4 has outperformed School B, Prog 8 0.3.
UPDATE 1: Dr Becky Allen has posted an interesting response for DataLab tackling some of my criticisms. I have then written a response. Essentially I am saying that this is more a matter of principle, disputing the validity of the concept of producing a single data point to average out the very nebulous notion of ‘progress’ – not one of mathematical methodology.
I’m also interested in the twitter responses and comments below that suggest a statistical significance test is entirely misplaced because there isn’t a random sample involved. Which means what? I’m not sure. The idea that lurking in this fog is a ‘true value’ measuring progress is hilarious to me. We’re not talking about the height of plants over time… this is about as far from understanding learning as we can get.
Here is my comment on the DataLab blog:
This is an interesting response. It makes me realise that in highlighting my concerns via a discussion of the maths, I’ve given the impression that I dispute the maths itself. That’s my mistake. I totally understand that the fine levels would produce a correlation to GCSE outcomes that is virtually identical to raw marks; similarly I understand that keeping lots of sig. figs. in a mathematical calculation is useful to avoid overall rounding errors. The thrust of what I’m saying – or trying to say and possibly failing – is that at both the input and output stage we are assigning numbers to learning as if they are measurable to that degree and in that way. A fine level of 5.1 is really just a pure bell-curve marker; that’s it. Let’s not kid ourselves that the marks given for specific pieces of learning feed into that such that 5.2 signifies ‘better learning’ in a meaningful way. The same for Attainment 8 outcomes. Datalab has shown how huge the spread of outcomes is for any given starting point – and I don’t think averaging all this out to suggest a neat correlation as in that look-up table is healthy. It’s now a pure numbers game; we’re not talking about children and what they’ve learned. We certainly can’t be talking about school effectiveness surely.
Regarding the final point – Progress 8 doesn’t replace %5A*-CEM; it replaces the RAISEOnline VA score. At least %5A*-CEM is an actual thing – it measures what it says it does – regardless of the pernicious effects it has had. I’d suggest that Attainment 8 average totals (not reduced to approx grades eg C- which I find patronising) AND a profile showing the % of students scoring in certain ranges – or in deciles as I suggest in my blog – would be far better.
Let’s just run as far away as possible from this idea that one number can inform discussions about a school. My post was meant to shine a light on this – not to quibble about mathematical methodology per se. Once we forget that we’re talking about learning in complex human beings, we can do what we like with numbers. The least we can do is leave it at the maximum level of complexity we can stand so that we don’t squeeze out the meaning altogether.
UPDATE 2: Education Datalab produced this graph after I made an enquiry on twitter. It’s a plot of Attainment 8 vs KS2 fine level. It’s important to look at the scale of the spread. For example, the inter-quartile range (ie from the 25th percentile to the 75th percentile) appears to be about 15 for a KS2 fine score of 4.7. The scores for that middle 50% of students range from 45 to 60. Surely that puts the ‘precision’ of an average of 50.67 into perspective. Garbage? I think so. Of course there is a strong overall trend with limits for all to see – but the illusion of numerical precision needs to be fully recognised for what it is – an illusion.
UPDATE 3: Read this post from EduDataLab: http://www.educationdatalab.org.uk/Blog/May-2015/Why-do-pupils-at-schools-with-the-most-able-intake.aspx#.VU3LTNpVhBd
To me it is inevitable that this happens, because the same virtuous circles of parental support, student attitudes and teacher confidence that result in higher KS2 levels continue up to KS4. Progress is a geometric effect, not an arithmetic one; it’s exponential, not linear. However, note Datalab’s conclusion. “Is the progress ‘boost’ due to the pupils themselves or due to something the school has done? The honest answer is that we can’t tell from the data we have available. And because we can’t resolve this conundrum it means that Progress 8 is not a measure of school effectiveness.”
Take note people; take note. It is NOT a measure of school effectiveness.