Evaluation standards are the threshold points that we apply when making a judgement. Regardless of whether we are working with qualitative or quantitative data, evaluation standards shape the way we interpret the trends and patterns in our data.
For an outcome evaluation, the evaluation standards determine the nature and extent of change considered to be acceptable.
This is crucial for how we decide whether a result is, for example, ‘ok’, ‘good’ or ‘great’. It may determine whether a program is continued or extended to other settings.
Example: We are looking at a trend in our school over time. How much change would we need to see before we say 'that’s a big improvement' or 'that’s enough to keep going'?
If comparing to a standard or benchmark, it’s important to be clear about what the goals are.
Example: How much better do the outcomes need to be before we say that’s good enough? Are we seeking to lift performance to a similar level compared with statistically similar schools? Or are we striving to reach the top echelon?
Articulating standards as part of the evaluation planning phase helps avoid bias in the interpretation of the data down the track. Without articulating explicit standards, evaluators can find themselves consciously or unconsciously judging something to be a success or a disappointment when others might look at the same data and draw a different conclusion. This leaves evaluative judgements vulnerable to positivity bias, where everything is always a success, or negativity bias, where nothing is ever good enough. Learn more about cognitive bias.
Less or more?
Not all evaluation criteria work the same way. Some criteria can be assessed on a sliding scale where more is better, such as the proportion of students achieving in the top two NAPLAN bands. For others, fewer is better, such as the number of injuries on excursions.
Other criteria have an acceptable range in the middle of the scale, such as 'too much', 'about right', or 'not enough'. This could apply to everything from how much time students spend doing homework to the amount of professional learning teachers undertake in a year.
Other criteria are binary, for example yes/no or pass/fail. Compliance with policy often falls into this category.
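The four kinds of criteria above can be captured as simple rating rules. The sketch below is purely illustrative: the function names, thresholds and labels are hypothetical, not drawn from any departmental standard, but they show how articulating a standard up front turns a judgement into an explicit, repeatable rule.

```python
def rate_more_is_better(value, ok, good):
    """Sliding scale where more is better (e.g. % of students in top two NAPLAN bands)."""
    if value >= good:
        return "good"
    return "ok" if value >= ok else "below standard"

def rate_fewer_is_better(value, ok, good):
    """Sliding scale where fewer is better (e.g. injuries on excursions)."""
    if value <= good:
        return "good"
    return "ok" if value <= ok else "below standard"

def rate_range(value, low, high):
    """Acceptable band in the middle (e.g. weekly homework hours)."""
    if value < low:
        return "not enough"
    return "about right" if value <= high else "too much"

def rate_binary(compliant):
    """Yes/no judgements (e.g. compliance with policy)."""
    return "pass" if compliant else "fail"

# Hypothetical standards: 20% is acceptable, 30% is good
print(rate_more_is_better(28, ok=20, good=30))  # ok
print(rate_range(9, low=5, high=10))            # about right
```

Writing the thresholds down like this, during planning rather than after the data arrive, is exactly the safeguard against positivity and negativity bias described earlier.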
Statistical significance and high impact
When answering evaluation questions with quantitative data, it is important not to confuse statistical significance with high impact. Statistical significance indicates that an observed difference between groups or change over time is probably not the product of random chance.
When we see a difference, we need to think carefully about how meaningful the difference is from an educational perspective, not just how significant it is from a statistical perspective. This is because the mathematical calculation of statistical significance is heavily influenced by sample size:
- When working with system-level data that has thousands of students, a minor change over time will probably meet the threshold for statistical significance. This does not necessarily constitute ‘success’. It simply means that the change is probably not due to chance.
- When working with data from a single school, differences that are not statistically significant (due to the small sample size) may still be important for how we think about the impact of our work.
- Working at a single school level, we may also see differences that appear large when in fact they are just the result of random ‘noise’ effects that come from small sample sizes. For example, if there are 35 students in Stage 3, each student represents just under 3% of the student population in the stage. An increase of seven percentage points may mean that just two students have moved up from one category to another.
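The sample-size effect described in the points above can be made concrete with a standard two-proportion z-test. This is a sketch, not a recommended analysis workflow, and the cohort sizes and proportions are invented for illustration: the same 2-percentage-point rise clears the conventional significance threshold (|z| > 1.96, roughly p < 0.05) at system scale but is nowhere near it for a 35-student stage.

```python
from math import sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z statistic; |z| > 1.96 is roughly p < 0.05."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# System level: 50% -> 52% with 10,000 students per cohort
z_system = two_proportion_z(5000, 10_000, 5200, 10_000)
print(round(z_system, 2))  # |z| > 1.96, so 'statistically significant'

# Single school: a similar-sized rise with 35 students per cohort
z_school = two_proportion_z(17, 35, 18, 35)
print(round(z_school, 2))  # well under 1.96, not significant

# The 'noise' point: with 35 students, each student is 1/35 of the cohort,
# so one student changing category moves the percentage by about 2.9 points.
print(round(100 / 35, 1))
```

Note the converse also holds: the significant system-level z value says nothing about whether a 2-point rise is educationally meaningful.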
Resist quantifying qualitative data
When answering evaluation questions using qualitative data, we need to resist the temptation to quantify our findings.
If the evaluation question relates to the range of new teaching strategies used by teachers as the result of an innovative program, identification of the different strategies in use may be more important than counting the number of teachers using each one.
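As a minimal sketch of that distinction, assuming some hypothetical survey data (the teacher names and strategy labels below are invented), the evaluative finding is the set of distinct strategies in use, not a tally of how many teachers use each:

```python
# Hypothetical survey responses: strategies each teacher reported using
responses = {
    "Teacher A": ["peer feedback", "exit tickets"],
    "Teacher B": ["exit tickets"],
    "Teacher C": ["peer feedback", "think-pair-share"],
}

# The range of strategies in use (a qualitative finding) ...
strategies_in_use = sorted({s for used in responses.values() for s in used})
print(strategies_in_use)  # ['exit tickets', 'peer feedback', 'think-pair-share']
```

Counts per strategy could be computed from the same data, but they would answer a different, quantitative question.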