AI · 5 September 2023 · 3 min read

The AI hallucination problem is real. Here's how I'm thinking about it.

Everyone who uses AI tools runs into hallucinations eventually. The question isn't whether to use these tools; it's how to use them without getting burned.

by Matt Roberts

Let me tell you about a mistake I made.

I was preparing a customer presentation about Microsoft Entra ID, which had recently been renamed from Azure AD, and I wanted to make sure my terminology was current. I asked ChatGPT to summarise the key differences and what had changed. It gave me a confident, well-structured response, and I used it as the basis for a section of the presentation.

One of my bullet points was wrong. Not completely wrong, just plausible enough that I didn't catch it in my review. The customer's technical contact did.

That was my fault, not ChatGPT's. The tool didn't claim to be a source of truth. I treated it like one.

What hallucination actually means

When people talk about AI "hallucination," they mean the tendency of large language models to generate confident, fluent, plausible-sounding text that is factually incorrect. It's not a bug in the traditional sense; it's a direct consequence of how these models work. They're predicting likely next tokens based on training data, not retrieving verified facts from a database.

The insidious thing is that hallucinated content often looks indistinguishable from accurate content. The formatting is right. The tone is authoritative. The surrounding context is correct. But one specific claim (a date, a product name, a feature behaviour, a regulatory requirement) is wrong.

Why this is a specific problem for IT professionals

In our domain, specific factual claims matter in ways they might not in others. If ChatGPT tells you that Conditional Access policies work a certain way, you configure your tenant on that basis, and it turns out to be wrong, the consequences are real. If it gives you incorrect licensing information and you quote it to a customer, the consequences are real. If it hallucinates a PowerShell cmdlet parameter, your script fails.

The hallucination problem is not evenly distributed. Accuracy tends to be higher for well-documented topics that appeared frequently in training data, and lower for:

  • Recent events or releases (the model's training data has a cutoff)
  • Niche or highly specific technical questions
  • Questions that require precise numbers, dates, or version information
  • Questions where the correct answer has changed over time

That profile covers a lot of what we deal with in IT.

How I'm actually managing this

I haven't stopped using these tools. The utility is too high. But I've changed how I use them:

1. I categorise tasks by verification cost

For tasks where being wrong is cheap (drafting, brainstorming, summarising publicly available high-level information), I use AI outputs more liberally. For tasks where being wrong is expensive (specific technical guidance, anything with compliance implications, customer-facing technical claims), I treat AI as a starting point and verify against authoritative sources before I rely on it.
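
One way to make that split explicit is a small lookup that tags each kind of task with how costly a mistake would be. This is only a sketch: the category names, the `Cost` enum, and the `needs_verification` helper are my own illustration rather than a standard, and anything not listed falls into the cautious bucket by default.

```python
from enum import Enum


class Cost(Enum):
    CHEAP = "cheap"
    EXPENSIVE = "expensive"


# Hypothetical task categories; adjust these to your own work.
TASK_COST = {
    "drafting": Cost.CHEAP,
    "brainstorming": Cost.CHEAP,
    "high_level_summary": Cost.CHEAP,
    "technical_guidance": Cost.EXPENSIVE,
    "compliance": Cost.EXPENSIVE,
    "customer_facing_claim": Cost.EXPENSIVE,
}


def needs_verification(task_category: str) -> bool:
    """Return True when AI output for this kind of task should be checked
    against an authoritative source before it is relied on."""
    # Unknown categories are treated as expensive, so the default is to verify.
    return TASK_COST.get(task_category, Cost.EXPENSIVE) is Cost.EXPENSIVE


for category in ("brainstorming", "compliance", "licensing_quote"):
    action = "verify first" if needs_verification(category) else "use freely"
    print(f"{category}: {action}")
```

The point of the default is that a task I haven't thought about gets verified, not waved through.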

2. I ask for reasoning, not just answers

"Explain how Conditional Access policies evaluate device compliance" rather than "Does this policy configuration work?" When the model explains its reasoning, I can spot where the logic breaks down. It can also prompt more careful generation: a model that has to lay out its reasoning is often less likely to confabulate, although the explanation itself still needs checking.

3. I watch for confidence without evidence

Hallucinated claims are usually stated without any hedging. If an AI answer sounds very certain about something specific (a date, a version number, a precise product behaviour), I apply more scepticism, not less.
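
A rough way to build that habit is to pull the specific-looking tokens out of an answer so they become a checklist. The sketch below is a heuristic of my own, not a hallucination detector: the patterns and the `flag_specific_claims` name are assumptions, and it finds claims that look precise, not claims that are wrong.

```python
import re
from typing import Dict, List

# Illustrative patterns for the kinds of specifics worth double-checking.
SPECIFIC_CLAIM_PATTERNS = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "version": re.compile(r"\bv?\d+\.\d+(?:\.\d+)*\b"),
    # Verb-Noun shape used by PowerShell cmdlets, e.g. Get-Something.
    "cmdlet": re.compile(r"\b[A-Z][a-z]+-[A-Z][A-Za-z]+\b"),
}


def flag_specific_claims(answer: str) -> Dict[str, List[str]]:
    """Return the specific-looking tokens found in an AI answer, grouped by kind."""
    found: Dict[str, List[str]] = {}
    for kind, pattern in SPECIFIC_CLAIM_PATTERNS.items():
        matches = pattern.findall(answer)
        if matches:
            found[kind] = matches
    return found


sample = "Azure AD was renamed in 2023. Use Get-MgUser with version 2.0 of the module."
for kind, tokens in flag_specific_claims(sample).items():
    print(f"check {kind}: {', '.join(tokens)}")
```

Everything it prints still has to be checked by hand against the vendor documentation; the script only tells me where to look.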

4. I keep source documents in the context

Where possible, I paste in the actual documentation I want the model to work with, rather than relying on its training data. "Here's the Microsoft documentation on Intune compliance policies. Summarise what it says about X." That's a different task to "Tell me about Intune compliance policies."
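
In practice this is just careful prompt construction. Here is a minimal sketch, assuming a hypothetical `build_grounded_prompt` helper and marker strings of my own choosing; it only assembles the text, and you would paste the result into whichever tool you use. Grounding like this narrows what the model draws on, but it does not guarantee a faithful summary.

```python
def build_grounded_prompt(source_text: str, question: str) -> str:
    """Build a prompt that asks the model to answer only from the pasted text."""
    return (
        "Answer using only the documentation between the markers below. "
        "If the documentation does not cover the question, say so instead of guessing.\n\n"
        "--- DOCUMENTATION START ---\n"
        f"{source_text.strip()}\n"
        "--- DOCUMENTATION END ---\n\n"
        f"Question: {question.strip()}"
    )


# Placeholder text: replace with the documentation you actually copied.
documentation = "<paste the relevant documentation here>"
prompt = build_grounded_prompt(documentation, "Summarise what it says about X.")
print(prompt)
```

Asking the model to say when the documentation doesn't cover the question gives it an alternative to filling the gap from training data.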

Using these tools intelligently

The hallucination problem is real, and it will probably get better as models improve. By OpenAI's own reported evaluations, GPT-4 already hallucinates noticeably less than GPT-3.5. But "less" is not "none," and the improvement doesn't change the underlying nature of what these models are doing.

The responsible position is to use these tools for what they're genuinely good at, maintain appropriate verification processes for consequential claims, and not pretend the limitations don't exist.

That's not a reason to avoid them. It's a reason to use them intelligently.

#ai-hallucination #chatgpt #ai-reliability #it-professional