
Panelists

Matthew Heusser
Michael Larsen
Daniel Geater
Adam Byfield

Michael Larsen (INTRO):
Hello and Welcome to the Testing Show
Testing AI in the National Health Service
This show was recorded on Wednesday, July 2, 2025
In today’s episode, we discuss how AI is changing the shape and processes of the United Kingdom’s National Health Service, or NHS. Matt Heusser and Michael Larsen welcome Daniel Geater, VP of Delivery at Qualitest for the Europe, Middle East, and Africa (EMEA) region, as well as Adam Byfield, a principal technical assurance specialist with a unique journey from astrophysics to AI assurance. Adam leads a team that is adapting traditional software testing to help ensure that AI tools used by the NHS are safe, effective, and fair, with the goal of augmenting, not replacing, clinicians. Join us as we unpack what all of that means as we discover a bit more about AI in the NHS.
And with that, on with the show.
Matthew Heusser (00:00):
Well, welcome back to the show. This week we’ve got Daniel Geater. If you don’t know Daniel, you need to listen to the show more. He’s been on before. He’s a working consultant and a people leader at Qualitest, where he’s Vice President of Delivery for EMEA, which, remind me, what does EMEA stand for?
Daniel Geater (00:21):
That’s the Europe, Middle East, and Africa regions. So thank you very much for having me back on the show, guys.
Matthew Heusser (00:27):
I’ve had the opportunity to work with Daniel a few times when he came to the United States. He’s in Europe, so in England, and Michael already introduced you to Adam Byfield, but we’d like to learn a little bit more. So Adam, Principal Technical Assurance Specialist for the NHS. Can you tell us a little bit more about your team and what your team does and how they operate?
Adam Byfield (00:51):
Yeah, absolutely. Thanks Matt. My team is a community of practice, so it’s a small core team of permanent staff and then a much larger team of volunteer colleagues across the NHS. We actually sit within a slightly larger team, which is the Assurance Center of Expertise. That is where all our colleagues who focus on the technical assurance of software and systems sit. We are the sub-team within that that focuses specifically on AI. Between us, we’ve probably got more decades of experience than I would like to admit in testing and assuring traditional technology, but for the last four years we’ve been focusing specifically on AI in healthcare. Our role primarily is to develop new techniques by which to assure AI, and then support our colleagues throughout the NHS in applying those techniques.
Matthew Heusser (01:38):
And if we had to guess, the number of people in the NHS involved in the software testing process would be in the hundreds? In the thousands?
Adam Byfield (01:49):
Oh, certainly in the thousands. I mean, if you bear in mind that the NHS is the fifth largest employer in the world, the total number of staff is vast. Obviously that’s primarily clinicians and administrative staff, but there’s still a significant presence in terms of software. So certainly in the thousands, I would say.
Matthew Heusser (02:04):
So you’ve got thousands of people. You’ve got a smaller center of excellence. Within that, you’ve got an even smaller focus group, which you’re leading. How do you partner with teams? Classically, and you said center of expertise, not excellence, a classic center of excellence would say, here’s how to do it. We don’t know how to do it yet. How do you partner with teams? How do you work with them?
Adam Byfield (02:26):
Absolutely. So from the beginning, our primary approach has been to try and make use of that expertise in traditional software testing and adapt it to AI, so we’re not starting with a blank sheet. Primarily it has been project by project over the years. Certainly when we started four years ago, there wasn’t a huge amount of AI present, so it was possible to deal with things on a case-by-case basis. The main approach we’ve taken is to work through a project, work out how to do that, how to add value, and then disseminate that and put it on the shelf so that next time someone comes along with one of those, we’ve got something ready to go. The other aspect of that is we do have some university partners. We work with Birmingham City University, for example, so we do have a second thread where we’re doing some academic research into new techniques as well.
Michael Larsen (03:12):
First off, Adam, thanks so much for joining us today. I want to draw some attention to one of the papers that you have published, and it includes an interesting phrase that maybe many people might be familiar with, but some might not be. That phrase is MUBA, M-U-B-A. It’s about how you can look at the data that you’re seeing and determine whether it is going to give you a better answer. This is focused on machine learning techniques so that, as they get better, they can notice things in, say, imagery, or a collection of images that you’re looking at if you’re trying to diagnose something. I find this fascinating, but maybe part of it is I’m missing a little bit of the context of how this would work. Excuse me, MUBA is “Mixed Up Boundary Analysis”. How would that apply? Could you give maybe a 5,000-foot-view case example of how that would work?
Adam Byfield (04:10):
Absolutely, yeah. So MUBA is an example of something we worked on a couple of years ago, and we’ve moved it forward a little bit more recently. Primarily what we’re talking about is image classification AI. In a clinical setting, this might be a tool that looks at chest x-rays and perhaps says, yes, I think this is cancer, or no, I don’t think this is cancer. That’s the kind of technology that we’re considering. One thing we’d found coming into this space was that from a testing and assurance point of view, that was primarily being covered by either clinicians or data scientists, who obviously have a slightly different focus when they’re assuring these tools. Typically both those groups focus a lot on the happy path and in-distribution values. Obviously, coming from a test and assurance background, what we’re interested in is edge cases. What is going to break this? What is it going to get wrong?
(04:59):
So this was a really good example of deploying some traditional testing and assurance techniques on new technology, and specifically boundary analysis was the technique that we wanted to apply. The concept is that if we imagine an image classification AI like that, a simple binary classifier, it’s one or the other. What we imagined was that there will be some form of boundary between those classes. There will be a tipping point at which the tool changes its mind from A to B. From traditional software testing, we know that boundaries are common sources of defects, so we wanted to be able to explore that boundary. Now, obviously, in traditional tech, boundaries are defined in advance. In AI, with an image classifier like that, the AI develops the boundary itself during training; we don’t actually know where the boundary is. And so the first aspect of this piece of work was looking at, okay, how do we work out where that boundary is?
(05:51):
At what point does this tool change its mind from cancer to not cancer, for example? And then once we’ve established where that boundary is, we can get a clinician to tell us if they think it’s in the right place. So in order to try and determine where the boundary is for the image classifier, the initial way we approached that was by using the mixup technique. This is a means of combining two images to form a single mixed-up image. We generated a large cohort of carefully curated mixed-up images which were a certain proportion A and a certain proportion B, to varying degrees, ran those through the system, saw how they were classified, and used that to attempt to infer the location of this imagined boundary. So in relatively simple terms, that was what the MUBA work was.
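For readers who want a concrete picture of the mixup step Adam describes, here is a minimal sketch outside the clinical context: blend two labelled images in varying proportions, run the blends through a trained binary classifier, and note where the prediction flips. The file paths and the classifier’s `predict_proba`-style interface are illustrative assumptions, not the NHS tooling.

```python
# Sketch of mixup-based boundary probing (MUBA-style), assuming a trained
# binary image classifier with a predict_proba-like interface. Paths, model,
# and class names are placeholders for illustration only.
import numpy as np
from PIL import Image

def load_image(path, size=(224, 224)):
    """Load an image and scale pixel values to [0, 1]."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def mixup(img_a, img_b, lam):
    """Blend two images: lam * A + (1 - lam) * B."""
    return lam * img_a + (1.0 - lam) * img_b

def probe_boundary(img_a, img_b, classifier, steps=21):
    """Sweep the mixing ratio and record how the classifier responds."""
    results = []
    for lam in np.linspace(0.0, 1.0, steps):
        blended = mixup(img_a, img_b, lam)
        # Hypothetical interface: probability the blend belongs to class A.
        p_class_a = classifier.predict_proba(blended[np.newaxis, ...])[0, 1]
        results.append((float(lam), float(p_class_a)))
    # The mixing ratio closest to p = 0.5 approximates where the learned
    # decision boundary sits along the line between these two images.
    boundary_lam = min(results, key=lambda r: abs(r[1] - 0.5))[0]
    return results, boundary_lam

# Example usage (placeholder paths and model):
# curve, boundary = probe_boundary(load_image("a.png"), load_image("b.png"), trained_classifier)
```

The blends sitting near that flip point are the candidates a reviewer, a clinician in the clinical setting, or any colleague in the cat/dog proof of concept, would then be asked to judge.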
Matthew Heusser (06:38):
And I think this is a really great example of mixing AI and human. I imagine you have radiologists or doctors who review these images and say, yeah, that’s a tumor; yeah, that’s benign; we need to do some more testing on that. They probably couldn’t articulate to you where the boundaries are. It would be very difficult when you’re coming up with requirements for a classifier, but if you give them a thousand images and say, these have been assessed by software, they can say, that’s right, that’s wrong, that’s right, that’s wrong, that’s right, that’s right, that’s right. That’s fundamentally how you train a large language model or image generation model to do identification or detection. It’s a very similar process; we are using their expertise.
Adam Byfield (07:28):
Yep, that’s exactly right. This is kind of reverse engineering it, effectively, to work out if that boundary’s in the right place. An interesting point: what we discovered quite quickly was that it’s not possible to actually draw a line on a graph and say that is where the boundary is. Bearing in mind that most of these tools are multi-dimensional, that boundary’s not a line. It is a multidimensional object. So even if we could visualize it, we probably couldn’t conceptualize it; we probably couldn’t understand it. What we can do, however, just as you say, is create this library of images and we can say this is the side of that imaginary line on which they fell. And then we can ask a clinician to say if they agree. That’s not a perfect process. Bear in mind that obviously there’s the issue of what we call label noise. In the clinical space you could show the same scan to multiple different radiologists and they might all have a slightly different opinion.
(08:20):
So we have to remember, in terms of the ground truth, there isn’t necessarily a definitive right answer, so we do need to account for that. There is also the issue of how the clinicians actually feel about the technology. I know anecdotally we’d heard of a small-scale experiment that presented clinicians with a series of images, told the clinicians half the images had been diagnosed by AI and half by humans, and asked them for a second opinion. Those particular clinicians consistently agreed more with the ones they thought had been diagnosed by humans and disagreed more with the ones they thought had been diagnosed by AI. In fact, they’d all been diagnosed by people. That was a very small sample group, but among those particular clinicians, there was an obvious bias where they trusted the AI’s judgment less and so were more minded to disagree with it. Yes, we need to do it at scale because we need to flatten out these various biases that are in there.
Daniel Geater (09:13):
I think you hit on quite an interesting point there, which is basically automation bias. It’s the cognitive bias to either implicitly trust or distrust what the system has given you. We see the same thing; obviously a lot of Qualitest’s work involves AI in other use cases, for example, helping test teams optimize their testing, and one of the challenges we get is quite a familiar one: how do we know that the AI is prioritizing or deprioritizing, focusing or defocusing, the right areas of quality? What sort of approaches were you using to help your clinicians firstly understand when they were basically dealing with a cognitive bias for the system or against the system? And also, if you found a way to do it, how did you help educate them away from that bias? To say, actually, I know you don’t want to trust it because it’s a system, but actually it’s coming from a place of science and data. No, you can’t draw an exact line, but with a certain level of mathematical confidence, it is going to give you the right answer. How did you start to come around those things with the team?
Adam Byfield (10:11):
Yeah, it’s a really good question, Dan. So this work is still not finished. We have a nonclinical proof of concept and we’re now starting to move into the clinical space. We haven’t addressed that problem fully yet because we haven’t reached it yet, but it is a really good point. On the one hand, in theory we have access to a huge number of clinicians, being within the NHS. However, those clinicians are very busy, so their time is at a premium. What that tends to mean is that there’s a certain amount of self-selection. So the clinicians that often get involved in helping us with this kind of work tend to be the clinicians that are interested in technology and are probably more likely to be on the pro side of that bias than the negative. I think the answer is probably just down to standard statistical sampling. So I think when we get to the point of doing that at a larger scale, more clinicians is better.
(10:56):
That’s one way: try and make the group larger and make a specific attempt to vary them to cover those things. I mean, the automation bias is a really key point that you raise. I saw some research presented recently that had looked at a couple of big NHS trusts in the UK, and they’d done another one of these exercises, I think it was over a few months, where they were asking radiologists to agree or disagree with the AI tool. They had one cohort who were relatively consistent; they agreed and disagreed across several months at a pretty standard rate. There was another cohort who you could visibly see drifting upwards over time in terms of how often they were agreeing. The key difference between those cohorts was that the second cohort, who agreed more often, were more junior colleagues, so they had been clinicians for a shorter amount of time; they had less experience. The clinicians with more experience, the more senior clinicians, appeared to be more stable in how often they agreed. So that would be one axis that we would vary our clinicians along when we were pulling together that group, I think.
Daniel Geater (12:06):
Just one other quick question. Quite rightly, you point out there is no clear line, as there isn’t in most forms of classifier; there’s never a clear line of definitely cancer, definitely not cancer. Did you work with them to flip the sample around? For example, as you said, the edge cases are the main part. If it’s a hundred percent confident it’s not cancer, or a hundred percent confident it is cancer, it’s probably going to be right. It gets blurry in the middle, when it’s like 50-50. Did you sample for this, and what did you see, if anything, with those ones that are closer to the boundary of classify, not classify? Did you almost flip the sample to give the clinicians a lot of ones where it was a slightly more blurred line, and what did you start to see in terms of your actual model itself? Had your mixups worked, or had you not really found what you were looking for?
Adam Byfield (12:51):
Yeah, so there’s a couple of things there. I think in terms of the nonclinical proof of concept, that’s probably a good place to start. We did that internally. We built a cat/dog classifier as something that any of our colleagues could therefore label effectively. Obviously our focus was on the ones that are near this mythical boundary. That’s the focus. That’s what we’re interested in. A really interesting thing that came out of that was that we were actually able to identify specific features that caused an image to be closer to the boundary. With the cat/dog one, for example, there were plenty of images of dogs holding tennis balls in their mouths, and what we found was that a lot of the cat images that were very close to the boundary, the cat images that the tool was not clear about, were cats with tennis balls in their mouths.
(13:38):
That was super useful in terms of not only saying these are vaguely closer or vaguely further away, but actually being able to identify that there seem to be specific reasons why the AI is less confident. I don’t know yet, we haven’t seen it firsthand yet, but what I would very much hope is that when we move it into the clinical space, there will be a similar correlation there, so we’ll be able to say, actually, when you get x-rays with these particular features, that is where the tool is weaker, because ultimately our aim is to be able to get the best use out of the tool. If we can identify limitations, then we can mitigate them. The other thing in terms of mixup: we started with mixup, but over time we’ve moved away from that, only because mixup is basically just combining the pixel values to generate the image, and the concern over that was how lifelike is that? But also, as ever, we don’t know exactly what facets the AI is relying on to make its judgment, and so as the technology has advanced, we’ve been able to start using some of the generative LLMs instead. So rather than mixing up images, we can generate images from prompts, which is quicker for us and gives us a bit more flexibility about the types of images. So for example, the cat holding a tennis ball, we couldn’t have done that with mixup. We now can do that using a generative LLM.
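The shift from pixel mixup to prompt-driven generation could look something like the sketch below. Here `generate_image` is a hypothetical stand-in for whichever text-to-image service is actually used, and the feature list and confidence band are invented for illustration; the point is building targeted prompts around features suspected of confusing the classifier.

```python
# Sketch: generating targeted edge-case test images from prompts instead of
# mixing pixels. generate_image() is a hypothetical stand-in for a real
# text-to-image API; features, subjects, and thresholds are illustrative only.
from itertools import product

BASE_SUBJECTS = ["a cat", "a dog"]
CONFUSING_FEATURES = [
    "holding a tennis ball in its mouth",
    "partially hidden behind a blanket",
    "photographed in very low light",
]

def build_prompts():
    """Cross subjects with features suspected of pushing images toward the boundary."""
    return [
        f"A realistic photo of {subject} {feature}"
        for subject, feature in product(BASE_SUBJECTS, CONFUSING_FEATURES)
    ]

def generate_uncertain_set(generate_image, classifier):
    """Generate an image per prompt and keep the ones the classifier is unsure about."""
    uncertain = []
    for prompt in build_prompts():
        image = generate_image(prompt)                       # hypothetical API call
        p_class_a = classifier.predict_proba(image)[0, 1]    # same hypothetical interface as before
        if 0.4 <= p_class_a <= 0.6:                          # close to the decision boundary
            uncertain.append((prompt, p_class_a))
    return uncertain
```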
Michael Larsen (14:58):
That brings up an interesting point. I’ve seen some examples of this where, with generative AI, for example, you take an image, then you iterate on that image, and then you iterate on the iteration of that image, so you can get a very quick degradation of the quality of the data. Do you use any of those types of techniques to see if the AI model flags it or says, hey, something’s wrong here, something does not seem right here, what do we want to do with this? Especially if you’re looking at it as a training method for medical stuff, it seems that this is something you want to make sure you get right, and if there’s the danger of long-term degradation, how would you gate for that?
Adam Byfield (15:43):
That’s a really good point, Mike. Yeah, it’s a key challenge. I mean, in terms of that specific work, I think the context made it slightly easier. It’s a very narrow channel. If we want a chest x-ray, there’s not a huge amount of variety in that, so we don’t need to worry as much about it drifting away from that, and certainly if we’re augmenting real images as well, we’ve got a certain amount of control over that. In that use case, because it is a very narrow channel, there’s not a huge amount of room for variety. One place where I think that is going to become a much greater problem in the future, and it’s on my list of things to worry about, is basically an amazing application of generative AI in the clinical space: the potential for what are called digital twins.
(16:26):
This is an amazing application wherein, if a surgeon wants to practice a very, very rare surgery, until recently there’s not a lot you could do about that. If it’s a one-in-a-million surgery, you can watch videos of other people doing it, you can read about it, but really you just have to wait until that surgery comes along. With generative AI, it is potentially feasible to create a virtual version of a patient needing that surgery, and then the surgeon can practice it as many times as they want. It’s also feasible that we can have what we call digital twins, where we generate a virtual version of the patient, and so the clinicians can inspect that rather than the patient. That potentially reduces the need for exploratory surgery, stuff like that. What an amazing application. Definitely want to use that. From my point of view in assurance, how on earth do I ensure that those virtually generated images are sufficiently lifelike and accurate?
(17:18):
Because at that stage we are talking about there being such variety and such scope that the issues you raise could come into it. One thing you can do with that is manually check. We could get lots of clinicians to look at lots of virtual versions and compare them to the real version, manually check, and tell us if they think they’re close enough. That’s going to be massively labor intensive, and it’s not going to be super useful beyond that individual patient until you get to a big enough number of those. So I think the other option that we might have to consider to address this is basically AI as a judge. It may be that we need to use a second AI tool to judge the first, so that’s how we assess that quality. That brings with it its own risks, the kind of cascading risks through multiple LLMs, for example. It’s LLMs all the way down, effectively, but I honestly don’t see another way of doing it, so I think that is the circle we’re going to have to square in the very near future.
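The “AI as a judge” pattern Adam is weighing up is already common in LLM evaluation. A minimal sketch of its general shape follows, using the OpenAI Python client as an example backend; the model name, rubric, and pass threshold are assumptions for illustration, not anything the NHS has adopted.

```python
# Sketch of an AI-as-judge check: a second model grades the output of the first
# against a reference. Model name, rubric, and threshold are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = (
    "You are reviewing a generated clinical summary against a reference summary. "
    "Score factual agreement from 1 (contradicts or omits key facts) to 5 "
    "(fully consistent). Reply with the number only."
)

def judge(reference: str, candidate: str, model: str = "gpt-4o-mini") -> int:
    """Ask a second model to score the candidate output against the reference."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"Reference:\n{reference}\n\nCandidate:\n{candidate}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# A human would still review anything scoring below a chosen threshold,
# keeping the clinician in the loop rather than replacing them:
# if judge(reference_note, generated_note) < 4:
#     flag_for_clinical_review(generated_note)   # hypothetical downstream step
```

As Adam notes, the judge itself then needs assurance, which is exactly the cascading-risk problem he describes.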
Daniel Geater (18:15):
So I think you hit on a couple of interesting things there, and I want to come back to them. Obviously NHS and Qualitest first worked together about four or five years ago, fairly early in some of the NHS’s journey with AI assurance, and one of the things that we were looking at at the time was effectively how do you modernize approaches and do the foundation work for these topics? How do you deal with non-determinism and the usual fun questions that come with these systems? You know, fairness: is it bias, or is it just an imbalance problem baked into all of this? A lot of the work that we put together was around effectively baselining your data to a known set with a certain allowed amount of variance. No two chest x-rays containing this weakness, or no two MRIs with this particular illness, are going to look exactly the same, but they are close enough that a clinician will say this one and this one are the same, and the same applies to all forms of AI learning.
(19:07):
If I’m building financial modeling data, I can take my data set and I can augment it a little bit, a few decimal places here, a few decimal places there, and it won’t change the grand scheme of things. One of the key parts to this is, no, you can’t take your clinicians or your practitioners or your business users away from what they’re doing all the time, but you can do an investment piece with them to say, look, here is a canned data set with a certain amount of things in it, and this kind of feeds towards using your LLM, your LVM, your mixed-modal models as a judge. Again, you can ask, within a certain acceptable margin of error, is this answer right? If I’ve asked you to generate me a paragraph of text for this prompt, I can get another LLM, in the case of LLMs, that is graded to a similar level, that has been shown to perform the same on a standard industry benchmark, and ask, is that a good enough answer?
(19:56):
We can start with that and we can accept a certain margin of error. The fact that you say dark green and I say a slightly darker shade of green doesn’t change the fact that it’s probably the same kind of color. When you look at these things, you say, if I want to synthesize images, I can provide mathematical functions that will say, within reason, it can vary by this much, and there comes a point where this is no longer accurate. If I’m looking at paragraphs of text, I can use mathematical functions to say this piece of text is far more distant than it should be, and there are already frameworks that exist to start to do this for LLMs. They’ve still got a long way to go, they’re still growing. I think those same concepts will start to be applied to the vision models in the same way, to say you can vary by an amount, but you can’t vary by all of it.
(20:38):
I think the bigger challenge will be more conceptual. The more straightforward things, like, for example, does this x-ray have this particular kind of growth or tumor in it, are one thing. I think other challenges will also be, actually, how many ribs does it have, because I think we’ve all seen, for example, gen AIs that give everyone seven fingers and three hands. Those things are harder to deal with, but I think when you take one of those systems and you give it very specific prompts with very specific outputs, you do have the ability to use a certain amount of conditioning, saying this is within an acceptable margin, and then you can grade it. Put another way, you’re never going to get it absolutely perfect, but you could never get any piece of software perfect. We’ve all known that for a very long time. It’s about minimizing the margin of error, and there are still techniques we’re learning for all of that, but I think some of the groundwork does exist. We’ve already been doing some of it at Qualitest, inside and outside of our work with the NHS. The NHS themselves are doing it. There’s a long way to go, but I think we’re starting to get close to the point where we can do this.
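Daniel’s point about mathematical distance functions with an acceptable margin of error is, for text at least, commonly implemented with embeddings. Here is a small sketch under that assumption, using the sentence-transformers library; the model name and threshold are illustrative choices, not the framework either organisation actually uses.

```python
# Sketch: checking that a generated answer stays within an acceptable distance
# of a baseline answer, using embedding cosine similarity. The model name and
# threshold are illustrative choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def within_margin(baseline: str, candidate: str, threshold: float = 0.85) -> bool:
    """Return True if the candidate is semantically close enough to the baseline."""
    embeddings = model.encode([baseline, candidate], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold

# "Dark green" versus "a slightly darker shade of green" should land well inside
# the margin, while a genuinely different answer should fall outside it.
print(within_margin("The sample shows dark green discolouration.",
                    "The sample shows a slightly darker shade of green."))
```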
Matthew Heusser (21:34):
I’d like to piggyback on that. What I was going to ask was what’s the end game? Is the end game that you have this radiology scan and you get a null hypothesis from an AI tool that says, I think this circled area is problematic and then a human spends 30 seconds on it instead of 10 minutes? Is the end game to eliminate the human from the process? I’d really like to hear your thinking because there’s so much, there’s going to be some hype around AI that’s unrealistic and then we’re going to be disappointed with it and then eventually at some point who knows how long out we’re going to reach reality. I can almost imagine the same conversation 25 years ago. Calculators, people don’t trust calculators, man. Maybe 50 years ago. These computers are coming into our shop and they’re telling us how to place our advertisements on NBC and CBS and ABC and we don’t trust it man, and we do now. So how do you see that curve working out and what do you think the end game is for this particular application?
Daniel Geater (22:36):
I’d like to answer on the general one, but then I’d also love to get Adam’s take on the specifics of the medical setting. For me, it’s an interesting parallel that you draw, Matt, because actually one of the examples that I give, obviously you and I worked together way back in the dim and distant mists of testing, is that when I first started my career I was a test automation engineer, and I used to have an awful lot of conversations about why would I trust test automation when I’ve got testers that could do a perfectly good job. And then after a couple of years it was more a case of this test automation stuff seems to be going somewhere, but I’m not sure it’s right for me. And now, talk about test automation? Well, everybody’s got test automation; why wouldn’t you have test automation?
(23:11):
I think that a lot of those AI technologies will go the same way, but you’re going to have to ride out some of that disappointment. We know that even today, after 20 years of it, you can’t automate everything, because some things are not right to be automated, and one of the questions we have quite a lot when helping customers determine what to apply AI to is: is AI actually appropriate for this? There are still situations when a rule-based system is better for you. Are you just using AI for AI’s sake, or do you need to do this? So we have that, and what we do with a lot of our customers is more on the former part of what you’re saying, which is a case of, look, I don’t expect AI to make this problem go away right now, but I do think it will give me some acceleration so I can focus on things that need me.
(23:51):
Again, drawing on that parallel with automation, automation didn’t make my need for testing go away; it let me do more testing, so my testing engineers could focus on actual creative and dynamic testing and advanced combinations and so on. I see AI in a lot of contexts going the same way. It’s not here to replace what you’re doing. My personal belief is that the technology available to us today, even the top-tier models, is not ready to do that. Then you’re back into the philosophical questions about when AGI is going to be here. We’re not there yet. Right now, as a force magnifier and an ability to tighten focus, it is very, very powerful, more powerful than teams of people can be in certain ways of saying this data points me in this direction. But the actual making of the judgment, making the call, I would still see that for a long time residing with the practitioners in whatever subject matter of expertise, but particularly in something as life critical as medical care. But that’s my take on it. I’d love to hear Adam’s take on how the NHS sees these conversations evolving.
Adam Byfield (24:49):
Yeah, thanks Dan. Obviously I’ll speak from my own experience; I’m not going to set NHS policy here and now on this podcast, but certainly happy to give my own take on it. Ultimately, we’re going to have to make a decision. Is this a productivity tool or is it a replacement? It’s probably important to say that, certainly in the NHS at this point, there are no AIs making clinical decisions on their own. That is not happening at the moment. Any clinical related AI has a human in the loop, so that’s the stage we’re still at currently. I think the answer to your question about the end game is that it depends who you ask. Different stakeholders are going to have very different opinions. I suspect, and I’m not going to claim to speak for all clinicians, but I suspect there are a lot of clinicians who would say exactly as you said, Dan, that the test automation example is a really good one.
(25:30):
This should actually help them focus on the stuff where they can add the most value, so it is more of a productivity tool. I’m reminded of a conference that I attended where there was a neurologist and a data scientist who were working together on an AI tool that attempted to diagnose Parkinson’s from video. A patient walks across a room, and the AI tool, based on their gait, would attempt to diagnose Parkinson’s. At the end of their presentation, they were asked this question: will this eventually make the diagnosis for you? And I was really struck by the fact that they very publicly disagreed. It was quite fun. The neurologist said, yes, absolutely. Medical diagnosis is just pattern recognition; AI is really good at pattern recognition, so yeah, definitely we should use this tool to just make the diagnosis for us. The data scientist standing next to him very quickly countered that and said absolutely not, primarily on the basis of, I’m not taking liability for that as the creator of the tool. And I think liability is going to be the key driver that makes that decision.
(26:32):
In the healthcare space, liability is going to be the key thing. I could imagine it being feasible in the future that you might have a tool that is good enough to make decisions on its own, but we still wouldn’t let it, on the basis that we would be so nervous about what happens if it does get it wrong. I think in this instance the non-technical considerations are probably actually the bigger factor. All the assurance that we do is risk-based assurance. That’s the main technique that we use, and what we found is that works really well in the medical space because clinicians are really good at working with risk. It’s what they do day to day. So rather than being able to say this product is perfect or it’s not perfect, if we present the clinician with a risk profile, that’s quite accessible to a clinician because they do that every day, weighing up risk and making the least worst decision. They do it all the time, and that’s quite a nice fit for this.
Michael Larsen (27:27):
I want to pivot just a little bit here because I’m going back in time, so I had the benefit or disadvantage depending on how you want to look at it, of having a father who is a pediatrician. By virtue of that, actually when I was younger, I had toyed with the idea of going into medicine and getting involved with that, so I got a chance to go spend time with him during my youth where I would sometimes shadow him and join him on his rounds on the pediatric ward and sometimes he would do these long shifts. My dad would be, Hey, I’m doing the clinic shift for the week. I’m going to be staying at the hospital, so if you would like to come down and hang out with me, you’re welcome to do so. In fact, you could even stay one of the nights if you want to, and I did.
(28:09):
One of the reasons I’m bringing up this whole memory is that a lot of what my dad had to do, and some of the things that I remember him doing, he’d be talking to people, he’d be dealing with situations, I’d go on his rounds with him, and then he would come back and he would sit down with a dictaphone or a tape recorder and he would just go through papers, and he had this method of speaking. I didn’t even understand what he was saying half the time, but then I realized there’s this method of speaking about medical stuff so that it’s quickly transcribable. And now we’re looking at this idea of ambient voice tech, where you could have a recording device on you, you’re actually talking maybe in real time with a patient, and based on that conversation it says, okay, based on what you talked about, here are the main critical things you have to deal with. Boom, they’re already in your notes and you go on. How is that working into what these models are? How do you incorporate that? Is what I’m describing a good example, or am I missing the mark with this?
Adam Byfield (29:09):
Yeah, I think that’s a brilliant example. AVT is a huge topic currently in the NHS. I would say right now, of all the different forms of AI, AVT is the one that’s getting the most attention, and that’s primarily because it is seen as potentially the biggest time saver for clinicians. That’s certainly how it’s been sold. There is a significant amount of AVT already in use in the frontline NHS; the structure of the NHS means that frontline colleagues do have a certain amount of autonomy to procure their own tech, and people have gone ahead and done that. There is a range of application within AVT, so they are all recording consultations in some form, but there is then a range of application as to what you do with the transcript. In some instances it’s really just a transcriber, maybe a summarizer, but as you say, there are some tools that will actively point out or suggest maybe these are the meds you should prescribe.
(30:01):
Maybe these are the next steps. There are also some of those products that perform what we call clinical coding. Clinicians will write notes in English, for example; currently, human beings called clinical coders will then convert that text into a series of clinical codes, and some of these tools offer to do that as well. So it is an enormous potential time saver. It is the one area of AI on which the NHS has issued formal guidance, and that was relatively recently. As for some of the pitfalls: I think the big challenge at the moment is that all of those products still require a human in the loop. I’m not aware of any AVT product currently being offered in the UK that says you can use this for a clinical purpose and you don’t need to check it. The clinician is still required to read through afterwards and check that it hasn’t missed anything or got anything wrong.
(30:50):
That’s the main obstacle to this really taking off, because for some clinicians having to check it afterwards wipes out all the time saving; you lose all the time that you saved by doing it. There are a couple of other things, a couple of the weaknesses that need to be ironed out. Obviously most of these AVT products are built on top of remote LLMs such as OpenAI or similar. We’d heard anecdotally of one AVT product, which obviously I won’t name, where it was noted that if a clinician was talking to a patient about suicide or violence or amputations, for example, things like that, the AVT actually stopped recording. It silently failed and just broke. So we don’t know for sure, but our assumption there, our working theory, is that that is the result of a content filter in the remote LLM.
(31:41):
So because that LLM is not a clinical LLM, it’s just a normal commercial LLM, it has those content filters, and those are then kicking in in the clinical setting. So there are technical barriers to clear like that, but as I say, I think the key barrier to clear before this becomes really the silver bullet to save time is that issue of the need to check it afterwards. Errors are relatively easy to spot, but the big problem is omission. If it misses something, the only way the clinician can identify that it’s missed something is to actually go through it all and compare. So there is a huge amount of potential, it is already here and it is already being used, but there are still a few hurdles to clear before it becomes that silver bullet that it’s been sold as, I think.
Daniel Geater (32:24):
It’s an interesting one for me, very much not as a medical professional. From a technology perspective, it would be an interesting halfway house, because as a technology you are using the LLM family of things for what they’re better at: summarization, adaptation of speech. Assuming the quality of the transcription is good, then you are a little bit more in the wheelhouse of what LLMs are actually very good at, which is give me the summarization of this, as opposed to where you see a lot of people asking LLMs in use cases to basically think for themselves, which is not really what that technology is. It’s a very powerful language framework, but it’s not a conceptual framework. So I think “AI, summarize my notes and make sure that I’ve got the right annotations” could be a happy path, or a path of least resistance, I should say, to the NHS using those technologies.
(33:12):
Yes, it does require an initial investment from the practitioners, some saying I agree, I disagree, but back to what Matt was talking about earlier, that’s effectively learning via human feedback, and that’s also a pretty quick way to say this model is good at it. But you hit on another interesting point, which I think is an entire other podcast episode in itself, which is the actual value stream problem. If your model doesn’t do something, is it because of the software that you’ve wrapped it in? Is it because the foundation model can’t do it? Is it because of the guardrailing? Is it because of the data? Obviously some of the things that you and I have looked at in the past, Adam, are about how, in a medical setting in particular, sometimes it’s not bias, it’s just the way the problem is. In a clinical setting, this disease might be more prevalent in women than men, or in people of this demographic than that demographic, and there will be times and cases where it shouldn’t matter.
(33:57):
A broken arm is a broken arm, I assume; obviously I’m not a practitioner. So with that value stream argument, I think there’s going to be an evolution as an industry. A lot of government bodies, you see this in the European governments, the American government, the UK government, are all pushing hard on, we know we need to do something about regulatory burden across all of AI, but nobody’s really got the answer yet. I think we will come to it, but in the short term, I think there’ll be some very interesting lessons learned on the ambient voice tech, because whilst they’re bringing in a powerful AI backend, as a front end use case they’re not as safety critical as others. I think it will be really interesting to see how that evolves over the next couple of years.
Matthew Heusser (34:32):
So two quick questions on that. One is privacy concerns, because you basically have a little device that’s listening to everything. And then, is it good enough? Do we see progress being made toward, at the end of the session, taking all the audio, summarizing it into bullet points, having a person review those bullet points, and it’s right 95% of the time? I would imagine that might be a no, because we just don’t have the clinical expertise in there to know what words really matter.
Adam Byfield (34:58):
It’s a really good point. The issue of whether to keep the audio, what happens to that original audio? That is a big argument at the moment. On the one hand, we’ve got what we call information governance, so these are kind of data protection professionals. That’s a really big important part of the NHS. Those guys at the moment I think would advise that we shouldn’t keep the audio, we should keep it for as short a period of time as possible. Once we’ve done what we need to do with it, we should get rid of it because we have a legal position of we should only hold data if we need it. So there’s a big argument to say ditch the audio. There’s also a financial argument for that of it’s going to cost a lot to store it all. From a test and assurance point of view, it’s the exact opposite. That audio is key evidence. That’s my ground truth. If I want to go back and check whether this did it properly, I have to have that. So that is an ongoing debate.
Matthew Heusser (35:47):
Alright, well, we just got into two use cases. Maybe if things go well, we could have you back in a while, Adam, to get into some more, because we just barely scratched the surface. But unfortunately we are out of time. We usually end this with a quick around-the-horn discussion of lessons learned or key thoughts or what’s next. I think we’ll end with Adam. Balancing security, privacy, efficiency, and keeping the human in the loop, I think, is going to be key to all of those. Finding the niches where this stuff actually works is really exciting for me. There’s a lot where it doesn’t work yet, so this has been fun. Thank you. Michael, final thoughts?
Michael Larsen (36:27):
Well, yeah, again, it’s been interesting to me, especially from the perspective of working now in AI governance and getting to understand what the guardrails are and how you trust what you’re putting your faith and time into. It’s an interesting conversation, and it just feels like every time you think you have a handle on it, some new detail and some other stuff comes in. It’s interesting to hear this from a medical perspective, because, let’s face it, oftentimes a lot of the stuff that we’re looking at is, well, maybe this can help us write code better, or maybe this can help us make a marketing thing, or help us refresh a newsletter. We think of superficial stuff a lot of the time. This is a different aspect, though, and I really appreciate hearing this because it does remind us, oh yeah, this stuff comes into what we consider mission critical areas of our lives; our health and our well-being are some of the most mission critical. So I do find it really interesting. And Adam, thank you so much for sharing your insights on this. I’m very curious to see where we go from here.
Adam Byfield (37:33):
So yeah, thank you so much, guys. It’s been a pleasure. Thanks for having me on. I think the key thing for me at the moment in this space is that there’s a huge amount of hype around AI and there’s a big drive to use AI for everything for the sake of it. From my point of view, one of the key things from an assurance perspective is to keep that broad view in mind and actually look at where we should use AI, where it can really add value, and maybe where we shouldn’t, where it doesn’t. And so I think making that distinction is probably the key driver for assurance going forward.
Matthew Heusser (38:03):
Thanks boss. Daniel, you want the last word?
Daniel Geater (38:06):
Sure. Thank you very much, guys. Really good to be back on the show. As always, you can find out a little bit more about some of the stuff that I’ve been up to over on the Qualitest pages. I think the age of AI is growing; we know this, and like you said, Matt, there’s going to be the hype and there’s going to be the disillusionment and all those other phases. I think the technology is accelerating very fast and new use cases turn up every day, but actually I’m quite optimistic about the future of the assurance of that technology because of the drive from safety critical settings like medical. They will push and say, no, I need more confidence here than I might need in other domains and industries. And it goes in parallel with what other areas are doing. We’ve already been doing some work on a number of things to do with LLM evaluations and how you deal with this generative tech. I think there’s going to be a lot more of that coming, and I’m keen to see the hand in hand partnership between the tech companies building this capability and the more conservative, safety critical organizations saying, okay, but we need to keep it in check and, to Adam’s point, make sure we’re using it in the right way, not just because we can.
Matthew Heusser (39:06):
Thanks for everybody for coming and we’ll have to do this again soon.
Michael Larsen (39:10):
Thanks for having us, as always.
Matthew Heusser (39:11):
Bye-bye.
Adam Byfield (39:12):
Cheers guys.
Daniel Geater (39:13):
Thank you.