MODULE_02 Foundation

What's Actually Happening
When You Prompt

~25-40 min active work · Socratic experiments · 1 artifact · Prereq: Modules 00-01

Stop blaming the AI. Start diagnosing your prompts using the actual mechanics underneath.

01 Explain what LLMs actually do when you type a prompt (predict tokens, not "understand")

02 Diagnose why a prompt failed using mechanical reasoning, not guesswork

03 Understand how temperature affects output and when to care about it

04 See your prompts as probability-shaping tools, not questions to a thinking machine

module-02-tokens-prediction/ ARTIFACT
prompt-diagnosis.md - your mechanical prompt failure diagnosis with before/after
session-notes.md - from the Learning Extraction Prompt

Inside the Prediction Machine

You typed a prompt last week. Something for work. Maybe a client email, a project plan, a piece of content. The output came back generic, off-target, or just... flat. You probably did what everyone does: blamed the AI, tried again with slightly different words, got a slightly different flavor of bad, and gave up.

Here's what actually happened. The AI didn't "misunderstand" your request in the way a human colleague would. It processed your words through a prediction engine, and your words pointed it in the wrong direction.

That distinction is the single most important thing you'll learn in this course.

The Prediction Machine

Every large language model (Claude, ChatGPT, Gemini, all of them) runs on one core mechanism: next-token prediction. Not the next word. The next token. (We'll get to what tokens are in a second.)

When you type "Write me a professional email about..." the model calculates probabilities. Given everything you just typed, what piece of text is most likely to come next? It picks one. Then it asks the same question again: given everything so far INCLUDING the piece it just generated, what comes next? Repeat a few hundred or a few thousand times, and you get a response.
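That loop is simple enough to sketch in a few lines. Here's a toy version, where `predict_next_token` and `END_OF_TEXT` are hypothetical stand-ins (a real model scores its entire vocabulary and samples from that distribution; this one just picks randomly):

```python
import random

END_OF_TEXT = -1  # hypothetical stop token

def predict_next_token(context):
    # Toy stand-in for the model's forward pass: a real LLM scores every
    # token in its vocabulary given the full context, then samples one.
    return random.choice(list(range(100)) + [END_OF_TEXT])

def generate(prompt_tokens, max_tokens=500):
    context = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next_token(context)  # "what comes next?"
        if next_token == END_OF_TEXT:             # model decides it's done
            break
        context.append(next_token)                # new token joins the input
    return context
```

The key detail is the append at the bottom of the loop: each generated token becomes part of the context, so every prediction is conditioned on everything generated so far.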

Now here's where it gets interesting...

That's the foundation, but modern models have built real reasoning capabilities on top of it. Claude has "extended thinking," where it works through problems step by step before answering. OpenAI introduced this approach with its o-series reasoning models (o3, o4-mini) and has since built it directly into GPT-5.2's "Thinking" mode - models specifically trained through reinforcement learning to pause, reflect, and reason through multi-step problems before generating output.

These aren't gimmicks. They measurably solve harder problems by spending more compute time thinking.

So can LLMs reason? Yes, in a meaningful sense. But the reasoning still runs on top of token prediction.

The model doesn't "understand" your prompt the way you understand a conversation. It processes patterns at massive scale, and when given space to think (extended thinking, chain-of-thought), it can chain those patterns into something that genuinely resembles, and sometimes matches, real reasoning.

Why does this matter for your actual work? Because the base mechanism, token prediction, is where your prompt has the most direct impact.

Your words shape the probability distributions that every layer of the system works with, including the reasoning layers. A vague prompt feeds vague probabilities into even the smartest reasoning process. A precise prompt gives the reasoning layers something sharp to work with.

Tokens: How the Machine Reads

The model doesn't read words. It reads tokens, which are chunks of text created by a process called tokenization.

Common English words usually map to a single token. Less common words get split into pieces. The word "tokenization" itself might become two to four tokens depending on the model.

On average, one English word burns about 1.3 tokens. But this varies wildly. Technical jargon, non-English languages, unusual formatting: all of these can inflate your token count.
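That 1.3 ratio is worth keeping as a quick mental calculator. A minimal sketch (the ratio is an average for plain English prose, not a guarantee):

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    # Rough average for plain English prose. Jargon, code, and
    # non-English text usually tokenize less efficiently, so treat
    # this estimate as a floor rather than a ceiling.
    return round(word_count * tokens_per_word)

estimate_tokens(1500)  # a 1,500-word report: roughly 1,950 tokens
```

For exact counts, paste your text into a tokenizer tool (you'll do exactly that in the exercise below).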

Andrej Karpathy, a founding member of OpenAI, made a point that stuck with me: almost every weird thing LLMs do traces back to tokenization.

Can't count the r's in "strawberry"? The model never sees individual letters, only token chunks. Bad at basic math? Digits get tokenized inconsistently: sometimes "42" is one token, sometimes "4" and "2" are separate tokens. Struggles with reversing a string? Same problem. The model works with chunks, not characters.

Token counts also determine what the model can "see" at once.

This is the context window. As of early 2026, Claude's standard context window is 200K tokens, with 1M available in beta via the API - now across the full model family including Opus 4.6. GPT-5 and its successor GPT-5.2 support up to 400K tokens via API, though ChatGPT caps this lower depending on your subscription tier. Gemini 3 Pro processes up to 1 million tokens.

These numbers change fast. But the principle stays the same: everything in the context window is what the model uses to predict the next token. Anything outside it doesn't exist for that conversation.

Your Prompt Shapes the Probability Landscape

Here's where this gets practical.

When you write a vague prompt like "help me with my marketing," you're creating a flat probability distribution.

Thousands of different next tokens have roughly similar probabilities. The model picks from this fog and produces something generic. It's not being lazy. It's doing exactly what you told it to: predicting the most likely response to a vague request. And the most likely response to a vague request is a vague answer.

When you write "Draft a 200-word LinkedIn post announcing my freelance design studio's new branding package, targeting startup founders who've outgrown Canva," you're creating a sharp distribution. The word "LinkedIn" activates patterns from millions of LinkedIn posts in the training data. "Freelance design studio" narrows it further. "Startup founders who've outgrown Canva" narrows it even further. By the time the model starts generating, the probability distribution is tight. The "right" tokens dominate. The output snaps into focus.

This is the mechanical explanation behind every "be specific" tip you've ever read. Specificity isn't a vague best practice. It's probability engineering.
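One way to see "flat vs. sharp" concretely is Shannon entropy, which measures how spread out a distribution is. The probabilities below are invented for illustration, not real model outputs:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: higher means flatter and more uncertain,
    # lower means one outcome dominates.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy 4-option distributions over candidate continuations.
vague_prompt    = [0.25, 0.25, 0.25, 0.25]  # "help me with my marketing"
specific_prompt = [0.85, 0.10, 0.03, 0.02]  # the LinkedIn-post prompt

entropy(vague_prompt)     # 2.0 bits: maximum uncertainty over 4 options
entropy(specific_prompt)  # under 1 bit: one continuation clearly dominates
```

Every constraint you add to a prompt is, in effect, pushing that entropy number down before generation even starts.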

Temperature: The Most Misunderstood Setting

Temperature controls how the model picks from the probability distribution. Low temperature (0 to 0.3) means the model almost always picks the highest-probability token. The output is predictable, repetitive, safe. High temperature (0.8 to 1.0+) flattens the distribution so lower-probability tokens get a real chance. The output becomes varied, surprising, occasionally incoherent.
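Mechanically, temperature divides the model's raw scores (logits) before they're converted to probabilities. A minimal sketch with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide each logit by the temperature, then apply softmax.
    # Low temperature exaggerates the gap between likely and unlikely
    # tokens; high temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                      # toy scores, not real model output
cold = softmax_with_temperature(logits, 0.2)  # top token gets nearly all the mass
hot  = softmax_with_temperature(logits, 1.5)  # runners-up get a real chance
```

At 0.2 the top token is all but guaranteed; at 1.5 the lower-probability tokens start winning often enough to produce the varied, occasionally incoherent behavior described above.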

Most chat interfaces don't let you change temperature. They set a default (usually somewhere around 0.7 to 1.0) and that's what you get. API access lets you control it directly. You can play with it inside Google AI Studio.

Here's my position on temperature: if your output is bad, temperature is almost never the fix. Temperature adjusts randomness. Your prompt adjusts relevance. These solve completely different problems. Tweaking temperature on a bad prompt is like adjusting the volume on a song you don't like. Louder or quieter, it's still the wrong song.

For factual work (reports, data analysis, research), lower temperature helps. For brainstorming and creative exploration, slightly higher can produce interesting variety. For everything else, the default is fine. Move on and focus on the prompt itself.

What This Changes

You walked into this module thinking of AI as a black box that sometimes "gets it" and sometimes doesn't. You're walking out with a clearer picture: a prediction engine that turns your input into probability distributions and samples from them, with reasoning capabilities layered on top that work best when those distributions are sharp.

That shift changes everything about how you prompt. You'll stop asking "how do I make the AI understand?" and start asking "how do I shape the probabilities?"

Module 3 takes this further: now that you know the model predicts tokens from distributions, you'll learn WHICH parts of your input the model pays the most attention to. That's where structure becomes a real weapon.

Two Prompts. Use One or Both.

This module has two prompts. Use one or both depending on where you're starting from.

Prompt 1 - Optional Warm-Up

Use this if: You're not confident you could explain what an LLM is to a colleague. If terms like "large language model," "training data," or "hallucination" feel fuzzy, start here.

Skip this if: You already have a solid grasp on what LLMs are, how they were trained, and why they sometimes generate wrong information. Jump straight to Prompt 2.

This prompt runs a conversational session where the AI: asks what you currently think is happening when you type a prompt, builds up from your phone's autocomplete to full LLM mechanics, explains how models were trained (pre-training, fine-tuning, reasoning training), covers why models are brilliant AND flawed, and connects everything to your actual work domain.

It takes about 15 minutes. No artifact is produced. The output is a mental model you carry into Prompt 2.
module-02-llm-explainer-prompt.xml
<role>
You are a science communicator who specializes in making machine learning intuitive for non-technical professionals. You never use jargon without immediately explaining it. You teach through analogies first, then mechanics. You are patient, curious, and calibrate your depth to the student in real time.
</role>

<injected_context>
Read the student's my-context.md, my-learning-style.md, and my-knowledge.md if attached.
Adapt all analogies and examples to their work domain and learning style.
This is an optional warm-up session - no artifact is produced. The goal is a solid mental model.
</injected_context>

<session_flow>
Step 1: Ask the student what they CURRENTLY think is happening when they type a prompt. Don't correct them yet. Just listen.
Step 2: Build from their phone's autocomplete to full LLM mechanics - same basic idea, massively scaled.
Step 3: Explain training: pre-training (predicting text across the internet), fine-tuning (aligning to instructions), reasoning training (reinforcement learning, chain-of-thought).
Step 4: Why models are brilliant AND flawed - hallucination, token blindness, context limits.
Step 5: Connect everything to their actual work domain with 2-3 concrete examples.
ONE question or concept at a time. Wait for their response before continuing.
</session_flow>

Prompt 2 - The Main Event

Before you paste

Make sure you're in your course project

Your my-context.md, my-learning-style.md, and my-knowledge.md files should be attached to the project

Have a prompt ready that disappointed you recently (a real one from your work, not a hypothetical)

After the exercise, run the Learning Extraction Prompt to update your knowledge file and save your session notes

For shortcut seekers: copy-paste and go. It will walk you through everything. For deep investors: read the concept section above first, then run this prompt with that mental model already loaded.

module-02-prompt.xml
<role>
You are a cognitive science instructor who spent a decade studying
how statistical models process language, and you teach by running
live experiments rather than giving lectures. You think in
probabilities and distributions, not abstractions. You never say
"AI understands" because you know it doesn't. You make students
SEE the mechanics by having them test, break, and rebuild their
own prompts. Socratic method is your default: you ask one question,
wait for the answer, then build on it.
</role>

<injected_context>
Read the student's my-context.md, my-learning-style.md,
and my-knowledge.md files. If any file is missing, STOP
and ask the student to attach it before continuing.

Adapt all explanations, analogies, and experiments to the
student's work domain (from my-context.md), preferred
learning mode (from my-learning-style.md), and current
knowledge level (from my-knowledge.md).

The student has completed Module 0 (created their 3 core
files) and Module 1 (learned the 3 rules: inject files,
go deep, extract and save). Build on this foundation.
Do not re-explain these concepts.
</injected_context>

<educational_philosophy>
- ONE question at a time. Wait for the student's response
  before continuing. Never stack questions.
- For each concept: set up the experiment first, let the
  student run it, THEN ask the Socratic question about
  what they observed.
- Never tell the student what they should see. Ask them
  what they DID see. Then explain the mechanics.
- Examples and experiments must use the student's actual
  work domain (pulled from my-context.md).
- Celebrate genuine "aha" moments. Push back hard on
  surface-level answers like "it was more specific" with
  follow-ups: "Why was it more specific? What mechanical
  change caused that?"
- Adapt depth based on my-learning-style.md: some students
  want the math, others want the analogy.
</educational_philosophy>

<phases>

Phase 1: Context Bridge and Prompt Failure Collection
- Greet the student by referencing their context file.
- Ask them to share a prompt they used recently that
  produced disappointing output. Could be from work, a
  side project, anything real.
- If they don't have one handy, ask them to describe a
  task they tried to get AI to do where the result felt
  generic, off-target, or useless.
- Save this prompt. You will return to it in Phase 5 to
  diagnose the mechanical failure.

Phase 2: Token Prediction (Core Mechanic)
- Explain the core idea in ONE sentence: "Every time you
  type something into an AI, the model predicts the most
  likely next piece of text based on patterns from its
  training data."
- Run Experiment 1: Ask the student to type a half-finished
  sentence related to their work into their AI and note
  what it completes. Then ask them to rephrase the same
  idea with different opening words and compare.
- Socratic question: "What changed in the output when you
  changed the opening words? Why would the exact same idea
  produce different completions?"
- Run Experiment 2: Ask the student to give the AI a very
  vague prompt about their work, then a very specific version
  of the same request. Compare outputs side by side.
- Socratic question: "The specific prompt produced better
  output. But WHY, mechanically? What did the extra words
  actually do to the prediction process?"

Phase 3: Tokenization (How the Machine Reads)
- Transition: "You now know the model predicts tokens.
  But what IS a token? It's not a word."
- Have the student go to Tiktokenizer (tiktokenizer.vercel.app)
  or the OpenAI tokenizer. Ask them to paste a paragraph
  from their own work and observe how it gets split.
- Socratic question: "What surprised you about how your
  text was split up? Did any words get broken into pieces
  you didn't expect?"
- Run Experiment 3: Have the student count how many tokens
  a paragraph of their work produces.

Phase 4: Temperature and Probability Sampling
- Explain temperature using one clear analogy adapted to
  the student's domain.
- Run Experiment 4: If the student has API access or a
  playground, have them run the same prompt at temperature
  0.2, 0.7, and 1.0. If they only have the chat interface,
  have them run the same prompt 3 times and observe variation.
- Socratic question: "What changed between the outputs?
  Which felt most useful for your work? Why?"
- Position: "If your output is bad, temperature is almost
  never the problem. The prompt is."

Phase 5: Artifact Production (Diagnose the Failure)
- Return to the failed prompt from Phase 1.
- Walk the student through a mechanical diagnosis and
  guide them to write prompt-diagnosis.md:
  the original prompt, what went wrong mechanically,
  what the model was actually optimizing for vs what
  the student wanted, and a revised version with reasoning.
- Have them test the revised prompt and document
  the before/after results.

</phases>

<mastery_gate>
After Phase 5, run these application checks ONE at a time.
The student must APPLY the mechanics, not just recall definitions.

1. Present a vague prompt from the student's domain. Ask:
   "What probability distribution does this create? Why will
   the output likely be generic? Rewrite it to narrow the
   distribution."
2. Show two versions - one with technical jargon, one in
   plain language. Ask: "How will tokenization differ between
   these? Which is likely to produce better output and why?"
3. Scenario: "Your colleague asks you to set temperature to
   maximum for a client report because they want it to be
   'more creative.' What do you tell them, and why?"
4. Give a prompt that produced a hallucinated statistic.
   Ask: "Using what you know about prediction mechanics,
   explain WHY the model generated a fake number."
5. Present a prompt failure from a domain adjacent to
   theirs. Ask: "Diagnose this failure mechanically."
6. Ask: "Your prompt produced decent output but included
   a paragraph that went completely off-topic halfway
   through. Using prediction mechanics, explain what
   likely happened."

Graduate only when they explain prompt failures in terms
of tokens, probabilities, and distributions - not by
anthropomorphizing the model.
</mastery_gate>

<completion>
When the student passes the mastery gate:

1. Congratulate them on building a mechanical mental model.
2. Remind them to:
   a. Run the Learning Extraction Prompt
   b. Update my-knowledge.md with the output
   c. Save session-notes.md to module-02-tokens-prediction/
   d. Save prompt-diagnosis.md to module-02-tokens-prediction/

"Next up: Module 3: Attention, Structure & Prompt
Architecture. You now know the model predicts tokens from
probability distributions. Module 3 answers the next
question: WHICH parts of your prompt does the model pay
the most attention to? That's where prompt structure
becomes a weapon."
</completion>

Save Your Work

Run the Learning Extraction Prompt to update my-knowledge.md with what you learned.

Save to module-02-tokens-prediction/

prompt-diagnosis.md - your mechanical prompt failure diagnosis with before/after

session-notes.md - from the Learning Extraction Prompt

Next: Module 3: Attention, Structure & Prompt Architecture.
You now know the model predicts tokens from probability distributions. Module 3 answers the question that follows naturally: which parts of your prompt does the model actually pay attention to?

Run This After Every Module

After completing the module prompt above, paste this into the same conversation. The AI reviews everything that just happened and extracts what you actually learned - not what was presented, but what you demonstrated.

learning-extraction-prompt.xml