
How AI is transforming assessment in education

12/10/2022 | OneAdvanced PR

At bksb, we believe technology is vital in developing English and maths skills which is why we’re always pushing the boundaries of EdTech to ensure that our products are as effective as possible.

Utilising decades of data and industry leading expertise, the bksb system has been intelligently designed to assess, measure and promote learning.

Artificial Intelligence and Item Response Theory

We’ve integrated artificial intelligence and Item Response Theory (IRT) into our assessment system. This allows our software to intelligently assess the difficulty of a question as well as a learner’s improvement.

Measuring and Promoting Improvement

The bksb Assessment Engine uses complex algorithms based on probability and “best fit” calculations to produce highly accurate measurements.

These can be extremely precise and identify not only a level but also where a student sits in terms of the distance to the next level.  For example, a student could be regarded as E3.1 (a low Entry 3) through to E3.5 (midway through Entry 3) and on to E3.9 (almost reaching low Level 1).

The bksb Functional Skills measurements run from Pre-Entry through to a point above Level 2.
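
As a purely illustrative sketch of how a continuous ability score might be translated into these sub-level descriptors, the Python snippet below maps a numeric estimate onto labels such as E3.2 or L1.5. The numeric scale, band boundaries and function names are assumptions made for demonstration and are not the actual bksb scoring model.

# Minimal sketch (assumed scale): mapping a continuous ability score onto
# sub-level labels such as "E3.1" or "L1.5".
LEVELS = ["PE", "E1", "E2", "E3", "L1", "L2"]  # Pre-Entry through Level 2

def describe(ability: float) -> str:
    """Convert an ability score in [0, 6) to a label like 'E3.7'."""
    ability = max(0.0, min(ability, len(LEVELS) - 0.01))
    band = int(ability)            # which level band the score falls in
    fraction = ability - band      # position within that band
    return f"{LEVELS[band]}.{int(fraction * 10)}"

for score in (3.25, 3.45, 3.75, 4.12, 4.55):
    print(score, "->", describe(score))   # e.g. 3.25 -> E3.2, 4.55 -> L1.5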

Progress vs Improvement

Most computer-based learning systems for Functional Skills (and other courses) have tended to track “progress” by giving students tasks to undertake and then monitoring whether they have been completed. Unfortunately, the term “completed” is subjective, as merely accessing a resource can at times render it “complete”. Another problem with “progress” tracking is that it is merely a checklist: it tells us nothing about whether the prescribed tasks have actually improved the student’s ability. Most resources test only part of the relevant knowledge, so the data they provide is of limited use. A better approach, therefore, is to track “improvement” rather than “progress”, as that is the metric we are really interested in.

To create a reliable system for tracking improvement, there first needs to be accurate and precise measurement which allows for the detection of even small advances.  For example, a student could be measured as being E3.2 (fairly low Entry 3), and then as E3.4 (almost halfway), and then again as E3.7 (well on the way to Level 1).  Whilst the student has stayed at Entry 3, it is nonetheless clear that he or she is improving.  The data also suggests that it will not be long before the student is Level 1, and there is a good chance that they will achieve this upon the next measurement.  Without this level of precision, however, the measurement data would simply show results of Entry 3, Entry 3 and Entry 3, and it would be impossible to draw any conclusions from this.

Thankfully, the new bksb course structures make full use of the ability to measure accurately and detect incremental advances.  Using these measurements, the system will take an improvement in a module such as “Calculations” and apply it to both its parent element “Number” and its parent subject “Maths”.  As students then work through resources and take assessments, the platform will use these changes to create estimates as to what their overall ability for a subject might be.  For example, a “Subject” such as Maths can be broken down into “Elements” such as “Number”, “Measure, Shape and Space” and “Handling Data”.

Elements are subsequently broken down into “Modules” such as “Calculations” or “Measure”.  Using the “improvement tracking”, it is then possible to take a detected improvement in a module and estimate a new ability level for the related element.  Likewise, this improvement to an element can be applied to the overall subject.  Therefore, as a student completes resources and takes progress check assessments, the system is constantly applying these results across the entire subject to produce estimates of overall ability.
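
As a rough sketch of how an improvement detected in one module could be rolled up into new element and subject estimates, the snippet below uses a simple average over an invented hierarchy. The module names are taken from this article, but the scores, the averaging rule and the function name are illustrative assumptions rather than the actual bksb calculation.

from statistics import mean

# Subject -> Element -> Module -> current ability estimate (invented values)
maths = {
    "Number": {"Calculations": 3.2, "Fractions": 3.0},
    "Measure, Shape and Space": {"Measure": 3.4, "Shape": 3.1},
    "Handling Data": {"Charts": 3.3},
}

def update_module(subject, element, module, new_score):
    """Record an improved module score and re-estimate the parent levels."""
    subject[element][module] = new_score
    element_estimate = mean(subject[element].values())
    subject_estimate = mean(mean(modules.values()) for modules in subject.values())
    return element_estimate, subject_estimate

# A detected improvement in "Calculations" feeds the "Number" element
# and the overall Maths estimate.
number_level, maths_level = update_module(maths, "Number", "Calculations", 3.7)
print(f"Number is now about {number_level:.2f}, Maths about {maths_level:.2f}")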

Promoting Improvement

As the bksb system no longer places students on a single level, they are allowed to access content at all levels, with the data obtained from the Assessment Engine ensuring that this is done in a sensible and effective way. For example, where a student is identified as working at L1.1 (a low Level 1) they will be given access to both Level 1 and Level 2 content for that module, but the platform will recommend that they study at Level 1. When the same student improves to L1.5 (midway between Level 1 and Level 2), however, the recommendation will change to Level 2.  The system recommends content at the “closest” level, thereby introducing higher content when the student is ready for it. As the journey from one level to another is managed, students are “stretched and challenged” in a sensible and realistic way.

With the new course structure allowing for a more dynamic and realistic profile (students are not restricted to a single level and are likely to have different ability scores for different modules), it is likely that students will be recommended content at different levels for different modules.  This allows for weaknesses to be addressed properly with a realistic path towards the desired level, whilst also ensuring that strengths are improved and that students are given the opportunity to be challenged.
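
The “closest level” recommendation described above could be sketched as follows. The numeric scale (where 4.x corresponds to Level 1 and 5.x to Level 2) and the midpoint rule are assumptions made purely for illustration, not the bksb implementation.

# Assumed scale: 3.x = Entry 3, 4.x = Level 1, 5.x = Level 2.
LEVEL_NAMES = {3: "Entry 3", 4: "Level 1", 5: "Level 2"}

def recommend(ability: float) -> str:
    """Recommend the current level below the midpoint, the next level from the midpoint up."""
    band = int(ability)
    target = band + 1 if ability - band >= 0.5 else band
    return LEVEL_NAMES.get(target, f"level {target}")

print(recommend(4.1))   # L1.1 -> "Level 1"
print(recommend(4.5))   # L1.5 -> "Level 2"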

The end result is a system which is far better at developing students and managing their learning, whilst also providing more useful data for tutors in terms of a likely destination and the progress being made.

The motivations and technology behind the new bksb initial and diagnostic assessments

The new initial and diagnostic assessments from bksb have been created to ensure that each candidate is given a unique set of questions and is identified at an accurate level. During early development, the phrase “random questions” was frequently touted, but it quickly became apparent that this would be a hazardous route to take. That’s because there are two very simple truths which should immediately deter anyone from using an assessment based on “random questions” or a “randomised question bank”, namely:

  • no two questions are equal in terms of difficulty; and
  • where multiple-choice and free-text questions are present within the same assessment, candidates answering the former will receive an advantage.

As an example, if the questions 5×6 and 7×8 were given to candidates for whom multiplication was a challenge, they would not have equal pass rates: one would be more difficult than the other. Consequently, the random selection of questions would introduce an unacceptable element of “luck” into any assessment. And whilst the difference in difficulty between one question and another might well be small, when those differences accumulate across 20 or more questions, the overall effect would undoubtedly be significant.
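
To see how quickly those small differences add up, the invented simulation below hands two candidates of identical ability two different randomly selected 20-question papers and compares their expected scores. The logistic model, the item bank and every number here are assumptions for illustration only.

import math
import random

random.seed(1)

def p_correct(ability, difficulty):
    """Simple logistic model of the chance of answering a question correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_score(ability, difficulties):
    """Expected number of correct answers across a paper."""
    return sum(p_correct(ability, d) for d in difficulties)

bank = [random.uniform(-1.0, 1.0) for _ in range(200)]   # invented question difficulties
paper_a = random.sample(bank, 20)                         # one random paper
paper_b = random.sample(bank, 20)                         # another random paper

# Same ability, different papers: the expected scores differ purely because
# of which questions happened to be selected.
print(round(expected_score(0.0, paper_a), 1))
print(round(expected_score(0.0, paper_b), 1))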

Moreover, where an assessment blends multiple-choice questions which can be guessed with free-text questions which cannot, controls must be introduced to balance the value of all correct answers.

For instance, if candidates were offered one of the three questions below, it is clear that the increased chance of guessing question 2 would give those who received it a distinct advantage over those presented with questions 1 or 3.

Question 1: 8×9=?

Answer:    a)72    b)71    c)73    d)67

Question 2: 6×5=?

Answer:    a)32    b)31    c)30

Question 3: 7×4=?

Answer:

Furthermore, even where every candidate receives the same mixture of question types, such mixes still create problems when calculating results. Consider, for example, that the previous three questions are given together as a short assessment. If a mark is given for each, and 2 marks are required to pass, then a number of problems arise.

First, even though these questions are clearly unequal in terms of their difficulties, in the final calculation they are nonetheless awarded identical values. This means that there is roughly an 8% chance that a candidate can pass simply by guessing the first two questions. Secondly, the third question is surely a better indicator of ability than the other two. Should we therefore allot it 2 marks and raise the pass mark to 3? We may be able to agree that question 3 is more significant, but is it exactly twice as significant as the other two? Moreover, if we accept that question 3 is more significant, given that there is no opportunity to guess its answer, then to a lesser extent question 1 is more significant than question 2, because with more answer options there is a reduced chance of guessing it correctly. Beware, therefore, any assessment which features “randomised questions”.
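
For reference, the 8% figure above follows directly from the guessing probabilities. The short calculation below reproduces it, assuming a guesser has effectively no chance on the free-text question.

p_q1 = 1 / 4      # question 1: four options
p_q2 = 1 / 3      # question 2: three options
p_q3 = 0.0        # question 3: free text, assume guessing never succeeds

# With one mark per question and a pass mark of 2, a pure guesser must get
# both multiple-choice questions right.
p_pass_by_guessing = p_q1 * p_q2
print(f"{p_pass_by_guessing:.1%}")   # roughly 8.3%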

How, then, can question variables be balanced in order to produce an accurate and valid result?

The solution lies in candidates being given different questions depending upon how they perform within the assessment. Following several years of intensive research, bksb concluded that the only way that an adaptive assessment could be delivered was through the implementation of Item Response Theory (IRT).

Item Response Theory (IRT) has been in existence long enough to be critically appraised by leading psychometricians, statisticians and mathematicians, and the academic consensus is that IRT is by far the most reliable and accurate method for testing ability. Indeed, it is now accepted that it is impossible to create any form of adaptive assessment (any assessment where candidates are not all presented with the same questions) without implementing IRT algorithms.

Much of the intelligence behind this lies in the handling of multiple-choice questions. Whereas free-text questions have no guessing element, multiple-choice questions might offer 3, 4, 6 or even 8 options. As a result, there is a variable “chance” or “likelihood” of the answer being guessed, which in mathematical terms is referred to as probability.

If we consider that the probability of a high-ability candidate successfully answering a low-difficulty question is high, whilst the probability of a low-ability candidate answering a high-difficulty question is low, we should be able to calculate the probability of a candidate of any given ability answering a question of any given difficulty. Likewise, these probability calculations can be combined with the “guessing” probability so that the issues around multiple-choice questions are themselves resolved. From this, the responses that a candidate makes can be repeatedly tested against the “likelihood” that he or she fits a certain ability group until a “best fit” is determined. This in turn provides a sophisticated system for calculating ability that is immune to the problems caused by “randomisation” and by the inclusion of questions with differing likelihoods of “guessing”. Given the type of content, this is the perfect approach for measuring ability, as it produces the fairest, most accurate and most valid results.
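
In the IRT literature, this combination of ability, difficulty and guessing is commonly written as an item response function, for example the three-parameter logistic (3PL) model sketched below. The article does not state which model or parameterisation the bksb engine uses, so treat this as a generic illustration of the approach rather than its exact formula.

import math

def p_correct(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Three-parameter logistic (3PL) item response function: the probability
    that a candidate of the given ability answers an item of the given
    difficulty correctly, with a floor set by the guessing parameter
    (e.g. 0.25 for a four-option multiple-choice question)."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# A high-ability candidate on an easy free-text item, and a low-ability
# candidate on a hard four-option multiple-choice item.
print(p_correct(ability=2.0, difficulty=-1.0))                  # high, about 0.95
print(p_correct(ability=-2.0, difficulty=1.0, guessing=0.25))   # just above the 0.25 floor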

The necessary algorithms are hence deployed throughout the assessment, with constant adjustments made to a candidate’s “ability estimate” following every question. Rather than being random, the “best” question for each candidate is actually chosen after their previous responses have been analysed. Therefore, an assessment will contain 20 to 30 questions specifically selected for each candidate, with artificial intelligence used to create the most suitable experience for them.
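
One common way of choosing the “best” next question in an adaptive assessment is to pick the unused item whose difficulty sits closest to the current ability estimate, which is a simplification of the “maximum information” criterion used in computerised adaptive testing. The sketch below illustrates that general idea with an invented item bank; it is not the actual bksb selection logic.

# Invented item bank: each item carries a calibrated difficulty value.
item_bank = [
    {"id": "Q1", "difficulty": -1.5},
    {"id": "Q2", "difficulty": -0.5},
    {"id": "Q3", "difficulty": 0.0},
    {"id": "Q4", "difficulty": 0.8},
    {"id": "Q5", "difficulty": 1.6},
]

def next_item(ability_estimate, asked_ids):
    """Return the unasked item best matched to the current ability estimate."""
    candidates = [item for item in item_bank if item["id"] not in asked_ids]
    return min(candidates, key=lambda item: abs(item["difficulty"] - ability_estimate))

# After each response the ability estimate is updated and the next item chosen.
print(next_item(0.6, asked_ids={"Q3"})["id"])   # -> "Q4"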

How do we know how difficult a question is?

To tackle this, we use the responses submitted for each question to calculate the pass rate for each ability group. We then compare these pass rates to the probabilities predicted for every possible difficulty value, with a “best fit” algorithm suggesting which value most closely matches the behaviour pattern observed in the database. In contrast to competing systems, which merely total students’ scores and use grade boundaries to determine results, this use of artificial intelligence is a massive step forward. Learners and teaching staff can therefore be assured that the values attributed to each question are entirely valid, having been derived from large datasets containing hundreds of millions of student answers.
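
A simplified version of this calibration step might look like the sketch below: the observed pass rate for each ability group is compared with the pass rates a simple logistic model would predict for a range of candidate difficulties, and the difficulty with the smallest total error is taken as the “best fit”. The ability groups, pass rates and search grid are invented for illustration; the actual bksb algorithm and datasets are not reproduced here.

import math

def p_correct(ability, difficulty):
    """Simple logistic (Rasch-style) model of answering correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Invented observed pass rates for one question, grouped by candidate ability.
observed = {-1.0: 0.21, 0.0: 0.46, 1.0: 0.74, 2.0: 0.90}

def best_fit_difficulty(observed, step=0.05):
    """Scan candidate difficulties and return the one whose predicted pass
    rates most closely match the observed ones (least squared error)."""
    candidates = [i * step for i in range(-60, 61)]   # -3.0 to +3.0
    def error(difficulty):
        return sum((p_correct(ability, difficulty) - rate) ** 2
                   for ability, rate in observed.items())
    return min(candidates, key=error)

print(round(best_fit_difficulty(observed), 2))   # about 0.1 on this invented scale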