Think of an LLM as a very advanced autocomplete. Instead of finishing your sentence, it can write entire paragraphs, answer questions, or summarize complex documents.
How it works (simplified):
It’s a special type of machine learning model (a neural network) trained on massive amounts of text (books, articles, websites).
It learns patterns in language. It does not learn facts the way humans do.
When you give it a prompt, it predicts the most likely next word, over and over, until it forms a complete answer.
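To make the “predict the next word, over and over” idea concrete, here is a toy Python sketch. The lookup table is purely illustrative; a real LLM replaces it with a neural network that scores tens of thousands of possible next tokens.

```python
# Toy illustration of next-word prediction: the model repeatedly asks
# "what usually comes next?" and appends its best guess. The lookup table
# below is a stand-in for the neural network a real LLM uses.

NEXT_WORD = {
    "the patient": "presents",
    "presents": "with",
    "with": "chest pain",
    "chest pain": "<end>",
}

def generate(prompt: str, max_tokens: int = 10) -> str:
    text, last = prompt, prompt
    for _ in range(max_tokens):
        nxt = NEXT_WORD.get(last, "<end>")  # "most likely next word"
        if nxt == "<end>":                  # model decides it is finished
            break
        text += " " + nxt
        last = nxt
    return text

print(generate("the patient"))  # -> "the patient presents with chest pain"
```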
Key points:
LLMs don’t “know” things; they generate text based on patterns. Without safeguards, they can produce confident but wrong answers (hallucinations).
Depending on which model is used and what topic is being discussed, hallucinations can occur in a substantial share of answers, by some estimates up to 30%.
Healthcare analogy:
Imagine a resident who has read every textbook but never seen a patient. They can give plausible answers, but without checking the specific patient's exam, chart, or lab results, they might be dangerously wrong. Worse, they may present those answers as fact even when they are wrong.
Local LLM
Runs entirely on your own hardware (laptop, workstation, or server).
Pros:
Data never leaves your environment (good for HIPAA-sensitive material).
Works without internet.
Can be customized and fine‑tuned for your exact needs.
Cons:
Requires powerful hardware (especially GPUs with lots of VRAM).
More setup and maintenance.
Healthcare analogy: Like having an in‑house simulation lab; you control the equipment, the cases, and the schedule.
Non‑Local (Cloud) LLM
Hosted by a provider (e.g., OpenAI, Anthropic, Google, Microsoft) and accessed via the internet.
Pros:
No hardware to manage.
Largest and most powerful models.
Always up to date and scalable.
Cons:
Data leaves your environment (must be handled with compliance safeguards).
Ongoing subscription or per‑use costs.
Healthcare analogy: Like sending residents to an external simulation center; you get great resources without owning the gear, but you have less control.
VRAM (Video RAM)
Memory on your GPU (graphics card).
Super-fast and located right next to the GPU cores.
Holds the LLM’s parameters (“knowledge”) during use.
If the whole model fits in VRAM → fastest performance (which is why NVIDIA, a graphics card maker, is so rich).
Analogy: VRAM is the scrub nurse’s table, and all instruments are right there, ready to use.
RAM (System Memory)
Main memory for your CPU (central processing unit).
Larger but much slower.
If the model doesn’t fit in VRAM, it spills into RAM. It works, but it is much slower.
Analogy: RAM is the supply room down the hall, and you can get what you need, but it takes longer.
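A rough back-of-the-envelope check for whether a model fits in VRAM: multiply the parameter count by the bytes stored per parameter. The GPU size and model configurations below are illustrative assumptions, not recommendations.

```python
# Rule of thumb: billions of parameters x bytes per parameter ~= gigabytes needed.
# Quantization (e.g., 4-bit) shrinks each parameter, which is how large models
# are squeezed onto smaller GPUs. Figures are illustrative, not vendor specs.

def model_size_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param

VRAM_GB = 24  # example: a single high-end consumer GPU (assumption)

for name, params_b, bytes_pp in [
    ("7B model, 16-bit",           7, 2.0),
    ("7B model, 4-bit quantized",  7, 0.5),
    ("70B model, 4-bit quantized", 70, 0.5),
]:
    size = model_size_gb(params_b, bytes_pp)
    verdict = "fits in VRAM (fast)" if size <= VRAM_GB else "spills into system RAM (slow)"
    print(f"{name}: ~{size:.0f} GB -> {verdict}")
```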
Context Window
Definition: The maximum amount of text (measured in tokens) the model can “see” at once. This includes both your prompt and the model’s reply.
Tokens: Word fragments; on average, one token is about three-quarters of a word (roughly 4 characters of English text).
Why it matters:
Bigger context = can handle longer cases, full patient histories, or multi‑turn mock oral sessions without “forgetting” earlier details.
If you exceed it, older info gets dropped or must be summarized and restated to the LLM.
Analogy: It’s the model’s short-term memory, much like how much of a patient’s chart you can keep in your head before you have to flip back and re-read something.
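A minimal sketch of a context-window check, using the common “about 4 characters per token” approximation. The window size and expected reply length are assumptions for illustration; a real system counts tokens with the model’s own tokenizer.

```python
# Estimate whether a prompt (plus the expected reply) fits in the context window.
# ~4 characters per token is a rough average for English text.

CONTEXT_WINDOW_TOKENS = 8_000   # varies by model (assumption for illustration)

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, expected_reply_tokens: int = 1_000) -> bool:
    return estimate_tokens(prompt) + expected_reply_tokens <= CONTEXT_WINDOW_TOKENS

case_notes = "HPI: 62-year-old with postoperative fever and tachycardia... " * 200
print(estimate_tokens(case_notes), fits_in_context(case_notes))
```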
There are several practical strategies used to keep AI outputs accurate, relevant, and safe for our mock orals project:
Fine-Tuning:
Fine‑tuning is the process of taking an existing large language model and training it further on your own domain‑specific data so it learns your terminology, style, and priorities. By grounding the model in accurate, vetted examples from your field, fine‑tuning helps it generate responses that stay aligned with your source material, which can significantly reduce hallucinations.
Temperature Control:
Adjust the model’s “creativity” setting depending on the task. Lower temperatures are better for factual recall; higher ones can be used for brainstorming or scenario variation.
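As a concrete illustration, here is how the temperature knob appears in one common client library (the OpenAI Python SDK); most cloud and local LLM APIs expose a similar parameter. The model name and values are examples only.

```python
# Same question-asking helper, different temperature per task.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

def ask(question: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
    )
    return response.choices[0].message.content

factual = ask("List the steps of the ATLS primary survey.", temperature=0.1)              # factual recall: low
creative = ask("Suggest three variations of a trauma mock oral stem.", temperature=0.9)   # brainstorming: higher
```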
Prompt Engineering:
Ask clear, specific questions with constraints so the AI knows exactly what to focus on. For example: “Using only the retrieved case notes, summarize the diagnosis in three bullet points.”
Use Retrieval-Augmented Generation (RAG):
Always ground the AI’s answers in trusted, up-to-date sources. For us, that means pulling from a vetted mock oral case bank and protocols instead of the open internet.
Structured Data:
Feed the model information in a consistent, labeled format. Standardized case templates make it easier for the AI to extract and present the right details.
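For example, a standardized case template might look like the sketch below. The field names are illustrative, not a finalized APEX schema; the point is that every case carries the same labeled structure.

```python
# A consistent, labeled case format that both humans and the LLM can rely on.
from dataclasses import dataclass, asdict, field
import json

@dataclass
class MockOralCase:
    case_id: str
    stem: str                   # one-paragraph presentation read to the examinee
    vitals: dict
    key_findings: list
    expected_management: list
    pitfalls: list = field(default_factory=list)  # common errors the examiner should probe

case = MockOralCase(
    case_id="TRAUMA-014",
    stem="62-year-old restrained driver, hypotensive after a high-speed MVC.",
    vitals={"HR": 128, "BP": "84/50", "RR": 28, "SpO2": "94% on NRB"},
    key_findings=["Seatbelt sign", "FAST positive"],
    expected_management=["Two large-bore IVs", "Balanced resuscitation", "Emergent OR"],
    pitfalls=["Delaying the OR for CT in an unstable patient"],
)

print(json.dumps(asdict(case), indent=2))  # consistent, labeled input for the LLM
```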
Reasoning Models:
Reasoning models break problems into explicit steps, check intermediate logic, and ground answers in provided evidence, which helps catch contradictions and reduce hallucinations. Hallucinations can still occur, however, if the reasoning is built on faulty inputs.
Multi-LLM Workspaces:
Multi‑LLM workspaces reduce hallucinations by having multiple models with different training data and strengths cross‑validate answers, flag discrepancies, and use consensus or majority voting to filter out incorrect or fabricated content.
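A minimal sketch of the voting idea: ask several models the same question, normalize the answers, and accept only a simple majority, flagging disagreements for human review. The ask_model wrapper and its canned answers are hypothetical stand-ins for real provider calls.

```python
from collections import Counter

def ask_model(model_name: str, question: str) -> str:
    """Hypothetical wrapper: in practice, this calls each provider's API."""
    canned = {  # toy answers so the sketch runs end-to-end
        "model-a": "Cricothyroidotomy",
        "model-b": "cricothyroidotomy",
        "model-c": "tracheostomy",
    }
    return canned[model_name]

def consensus_answer(question: str, models: list):
    answers = [ask_model(m, question) for m in models]
    counts = Counter(a.strip().lower() for a in answers)
    best, votes = counts.most_common(1)[0]
    if votes > len(models) / 2:      # simple majority
        return best, answers
    return None, answers             # no consensus -> flag for human review

best, raw = consensus_answer(
    "Surgical airway of choice in a can't-intubate, can't-oxygenate scenario?",
    ["model-a", "model-b", "model-c"],
)
print(best, raw)  # majority answer, with the dissenting response visible for review
```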
Human-in-the-Loop Review:
Always have a subject matter expert review AI-generated outputs before they’re used in teaching. This ensures accuracy and alignment with our educational goals.
Regular Data Updates:
Keep the retrieval sources current; outdated protocols or case notes can lead to incorrect answers, so we may need to update our library on a set schedule.
In short: By combining grounded data, precise prompts, structured inputs, expert review, and regular updates, we can dramatically reduce the risk of hallucinations. Our goal, and major hurdle, is to make a system that can catch them before they reach learners.
What it is: Training an existing LLM on your own data so it learns your style, terminology, and priorities.
Types:
Full fine‑tuning: Adjusts all parameters. This is the most accurate, but expensive and hardware‑intensive.
Parameter‑efficient tuning (LoRA, adapters, prompt‑tuning): Adjusts only small parts of the model, which is cheaper, faster, and often “good enough.”
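A minimal sketch of what parameter-efficient tuning looks like in practice, using the Hugging Face peft library. The base model name, target modules, and hyperparameters are illustrative assumptions, not a finished recipe.

```python
# LoRA adds small, trainable adapter matrices to a frozen base model, so only a
# tiny fraction of the parameters are updated during fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-8B-Instruct"   # example open-weight model (assumption)
model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

lora = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (model-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the full model
# Training would then proceed on vetted, de-identified mock oral transcripts.
```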
Why we’d use it:
Teach the model APEX‑specific case formats.
Reduce hallucinations by reinforcing correct protocols.
Make it “speak” like an examiner in our mock orals.
Analogy: Like orienting a new attending to your hospital’s exact protocols. They already know medicine, but you train them on your way of doing things.
RAG is like giving a resident instant access to your hospital’s library, EMR, and latest guidelines before they answer.
How it works:
Retrieve: The system searches trusted sources (e.g., your protocols, case archives, research papers).
Augment: It feeds that retrieved info into the LLM.
Generate: The LLM uses both its general language skills and the retrieved facts to produce a grounded, context-specific answer.
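A minimal sketch of that three-step loop. The tiny in-memory case bank and keyword search stand in for a real document index over our approved sources, and ask_llm is a hypothetical call to whichever model we deploy.

```python
CASE_BANK = [
    "TRAUMA-014: Unstable pelvic fracture. Protocol: pelvic binder, balanced resuscitation, angio vs. OR.",
    "GEN-003: POD 5 fever after colectomy. Protocol: exam, cultures, CT abdomen/pelvis to rule out leak.",
]

def search_case_bank(question: str, top_k: int = 1) -> list:
    # 1. Retrieve: naive keyword overlap; a real system uses semantic (vector) search.
    words = question.lower().split()
    scored = sorted(CASE_BANK, key=lambda doc: -sum(w in doc.lower() for w in words))
    return scored[:top_k]

def ask_llm(prompt: str) -> str:
    """Hypothetical: send the prompt to the local or cloud LLM."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    context = "\n".join(search_case_bank(question))   # 1. Retrieve
    prompt = (                                        # 2. Augment
        "Answer using ONLY the sources below. If they do not contain the answer, say so.\n"
        f"SOURCES:\n{context}\nQUESTION: {question}"
    )
    return ask_llm(prompt)                            # 3. Generate: grounded, context-specific answer
```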
Why it matters for us:
Reduces hallucinations by grounding answers in our data.
Ensures responses reflect APEX protocols and mock oral standards.
Retrieval happens from approved, compliant sources.
LLMs are powerful but can hallucinate: don’t trust without verification.
RAG is one of our main tools for grounding answers in our trusted data.
Clear prompts + structured data + human review = safer, more accurate AI outputs.
Your feedback is essential to keep the system aligned with our educational goals.