Four hands-on AI projects you build together — then race each other through. Train, prompt, predict, deploy. Whoever wins the most rounds wins the season.
Teach the AI to tell two things apart by drawing examples. Then duel: who can draw clearer pictures?
Write sentences. Guess which one the AI will think is closest to your opponent's query. Read its mind.
Same challenge, two prompts. The model judges. Precision beats eloquence — every time.
Each player builds an agent with tools and memory. Same task. Side-by-side traces. Best agent wins.
Random rotation through all four projects. First to win 3 takes the season.
Co-train a classifier. Then take turns drawing test pictures — the AI judges yours, the cleaner picture wins the point.
When you press "+ add", the drawing is shrunk to a 16×16 grayscale grid — that's 256 numbers per drawing. Each label's gallery becomes a list of vectors.
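Here is a minimal sketch of that shrinking step in Python. It assumes the canvas is a square grid of grayscale values (0 to 255) and shrinks it by averaging blocks of pixels; the real app may resize differently, and the names `downsample`, `canvas` and `gallery` are made up for illustration.

```python
def downsample(pixels, size=16):
    """Shrink a square grayscale image (list of rows, values 0-255) to
    size x size by block averaging, then flatten it to size*size floats in [0, 1]."""
    n = len(pixels)
    if n % size != 0:
        raise ValueError("canvas side must be a multiple of the grid size")
    block = n // size
    vector = []
    for gy in range(size):
        for gx in range(size):
            total = 0
            for y in range(gy * block, (gy + 1) * block):
                for x in range(gx * block, (gx + 1) * block):
                    total += pixels[y][x]
            vector.append(total / (block * block * 255))
    return vector

# Hypothetical 64x64 canvas: empty background (0) with a filled square of ink (255).
canvas = [[255 if 16 <= x < 48 and 16 <= y < 48 else 0 for x in range(64)]
          for y in range(64)]
vec = downsample(canvas)

# Each label's gallery is just a list of these 256-number vectors.
gallery = {"cat": [vec]}
print(len(vec))
```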
When a new doodle comes in, the model computes its 256-number vector and finds the 3 nearest neighbours across both classes, measured by cosine distance. Majority vote wins; the margin is shown as "confidence". This is k-NN (k-nearest neighbours), one of the simplest classifiers there is, with no neural net.
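The whole classifier fits in a few lines. The sketch below is one way to write it in Python, not the app's actual code: the "confidence" formula (winning votes minus runner-up votes, divided by the number of votes) is an assumption about how the margin is computed, and the example uses tiny 4-number vectors instead of 256 so it stays readable.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # assumption: treat a blank drawing as maximally far away
    return 1.0 - dot / (na * nb)

def classify(query, gallery, k=3):
    """gallery maps label -> list of vectors. Returns (winning label, confidence)."""
    scored = [(cosine_distance(query, vec), label)
              for label, vectors in gallery.items() for vec in vectors]
    if not scored:
        raise ValueError("gallery is empty")
    scored.sort(key=lambda pair: pair[0])          # nearest first
    votes = Counter(label for _, label in scored[:k])
    ranked = votes.most_common()
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    confidence = (top - runner_up) / sum(votes.values())  # assumed margin formula
    return winner, confidence

# Hypothetical galleries with 4-number vectors.
gallery = {
    "cat": [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]],
    "dog": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.9, 0.1]],
}
winner, confidence = classify([1.0, 0.05, 0.0, 0.0], gallery)
print(winner, round(confidence, 2))
```

With k = 3 and two classes, the vote can only end 3-0 or 2-1, so under this assumed formula the confidence is either 1.0 or about 0.33.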
Bigger lesson: a clean, diverse set of examples beats a clever algorithm. If your "cat" doodles all look similar, the model will only recognise cats that look like yours.
Write sentences into a shared library. On your turn, write a query — your opponent must guess which library sentence the AI picks as #1. Read the AI's mind.
Every sentence gets turned into a 12-number vector by hashing each word into one of 12 bins (a tiny embedding). Sentences that share words end up with similar vectors.
To rank the library, the AI computes cos(query, sentence) for every sentence — that's the same cosine similarity from Day 7½ of the course. Highest cosine wins. We show all 10 numbers so you can see exactly why one beat another.
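A rough Python sketch of the hashing and ranking steps follows. The game's real hash function isn't documented here, so `zlib.crc32` stands in for it (Python's built-in `hash()` is salted per process, which would make the bins change between runs). Counting words per bin and the simple tokenizer are assumptions too.

```python
import math
import re
import zlib

BINS = 12

def embed(sentence):
    """Hash each word into one of 12 bins and count how many land in each bin."""
    vec = [0.0] * BINS
    for word in re.findall(r"[a-z0-9']+", sentence.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % BINS] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, library):
    """Return (score, sentence) pairs, highest cosine first."""
    q = embed(query)
    return sorted(((cosine(q, embed(s)), s) for s in library), reverse=True)

# Hypothetical shared library.
library = ["the cat sat on the mat", "dogs chase the ball", "rain fell all night"]
for score, sentence in rank("a cat on a mat", library):
    print(f"{score:.3f}  {sentence}")

# The same words in a different order give exactly the same vector.
print(embed("dog bites man") == embed("man bites dog"))
```

Because 12 bins is very few, unrelated words will sometimes collide in the same bin, so two sentences can score above zero without sharing a single word.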
Bigger lesson: this bag-of-words embedding doesn't care about word order, only which words are in the sentence. Real embedding models do care about order (that's what transformers fixed). The intuition you're learning here is the foundation.
Same challenge. Two prompts. One deterministic model judges both. The judge counts bullets, validates JSON, checks for keywords — eloquence won't save you, precision will.
The "model" here is a tiny rule engine — it inspects your prompt for keywords ("bullet", "haiku", "json", "pirate", explicit counts like "5 sentences") and renders text accordingly. It's deterministic: the same prompt always produces the same reply.
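To give a feel for what "inspects your prompt" could look like, here is a small Python sketch. It is a guess at the shape of such a rule engine, not the game's actual rules: the keyword list comes from the description above, while `parse_prompt`, the regex and the units it recognises are assumptions.

```python
import re

FORMATS = ("bullet", "haiku", "json", "pirate")

# Matches explicit counts such as "5 sentences" or "5 short bullets".
COUNT_RE = re.compile(r"(\d+)\s+(?:\w+\s+)?(sentence|bullet|line|word)s?")

def parse_prompt(prompt):
    """Return (format flags, explicit count or None). Pure function, so the
    same prompt always gives the same result."""
    text = prompt.lower()
    flags = {name: name in text for name in FORMATS}
    match = COUNT_RE.search(text)
    count = (int(match.group(1)), match.group(2)) if match else None
    return flags, count

flags, count = parse_prompt("Give me 5 short bullets as JSON")
print(flags, count)
```

There is no randomness anywhere in the function, which is what makes the reply repeatable for a given prompt.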
The judge then runs measurable checks: count bullets, count syllables (for haiku), JSON.parse for valid JSON, regex for required keywords. Each check is worth 1 point.
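Three of those checks can be sketched in Python as below. The game itself would use `JSON.parse`; `json.loads` is the Python counterpart. Syllable counting is left out because it needs a heuristic of its own. The function names, the bullet markers accepted and the choice to award one point per keyword are all assumptions.

```python
import json
import re

def count_bullets(reply):
    """Count lines that start with a bullet marker followed by some text."""
    return sum(1 for line in reply.splitlines()
               if re.match(r"\s*[-*•]\s+\S", line))

def is_valid_json(reply):
    try:
        json.loads(reply)
        return True
    except json.JSONDecodeError:
        return False

def judge(reply, want_bullets=None, want_json=False, keywords=()):
    """Each passed check is worth 1 point (assumed: one check per keyword)."""
    score = 0
    if want_bullets is not None and count_bullets(reply) == want_bullets:
        score += 1
    if want_json and is_valid_json(reply):
        score += 1
    for word in keywords:
        if re.search(r"\b" + re.escape(word) + r"\b", reply, re.IGNORECASE):
            score += 1
    return score

print(judge("- one\n- two\n- three", want_bullets=3))
print(judge('{"x": 1}', want_json=True))
print(judge("{x: 1}", want_json=True))
```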
Bigger lesson: real LLMs also reward specificity. "5 short bullets" beats "make a list". "Reply in valid JSON" beats "give me data". The skill that wins here is the same skill that wins in production.
Both players assemble an agent. Pick tools, write a system prompt, seed memory. Same task. Watch both agents run side by side. Judge by tool fit, requirements hit, and trace length.
Each agent goes through the THINK → ACT → OBSERVE loop. The trace branches based on which tools you gave it. No tool for the job → BLOCKED step (and lost points).
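The loop can be pictured with a short Python sketch. It uses a scripted plan rather than a real model deciding what to do next, and the tool names and task are hypothetical; the point is only to show how a missing tool turns into a BLOCKED step in the trace.

```python
def run_agent(plan, tools):
    """plan: list of (thought, tool_name, argument).
    tools: dict mapping tool name -> callable.
    Returns the trace as a list of (step kind, text) pairs."""
    trace = []
    for thought, tool_name, argument in plan:
        trace.append(("THINK", thought))
        if tool_name not in tools:
            # No tool for the job: record a BLOCKED step and move on.
            trace.append(("BLOCKED", f"no tool named {tool_name!r}"))
            continue
        trace.append(("ACT", f"{tool_name}({argument!r})"))
        trace.append(("OBSERVE", str(tools[tool_name](argument))))
    return trace

# Hypothetical agent that was given a calculator but no weather tool.
tools = {"calculator": lambda expr: sum(int(part) for part in expr.split("+"))}
plan = [
    ("I need the total.", "calculator", "2+3"),
    ("I should check the weather.", "weather", "Paris"),
]
trace = run_agent(plan, tools)
for kind, text in trace:
    print(f"{kind:8} {text}")
```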
Auto-judge scores each agent on: tool fit (did the picked tools match the task?), requirements hit (did the final answer mention the key things?), and trace efficiency (shorter is better — but not so short it skipped a needed step).
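One possible way to turn those three criteria into numbers is sketched below in Python. The formulas are assumptions chosen for illustration (set overlap for tool fit, fraction of required terms found, and a minimum-steps ratio for efficiency); the real auto-judge may weigh things differently.

```python
def score_agent(picked_tools, needed_tools, answer, required_terms,
                trace_len, min_steps):
    picked, needed = set(picked_tools), set(needed_tools)
    union = picked | needed
    # Tool fit: overlap between picked and needed tools (extra tools cost points).
    tool_fit = len(picked & needed) / len(union) if union else 1.0
    # Requirements hit: fraction of required terms mentioned in the final answer.
    hits = sum(1 for term in required_terms if term.lower() in answer.lower())
    requirements = hits / len(required_terms) if required_terms else 1.0
    # Trace efficiency: shorter is better, but too short means a step was skipped.
    efficiency = 0.0 if trace_len < min_steps else min_steps / trace_len
    return {
        "tool_fit": tool_fit,
        "requirements": requirements,
        "efficiency": efficiency,
        "total": tool_fit + requirements + efficiency,
    }

# Hypothetical task that needs a search tool and a calculator.
needed = ["search", "calculator"]
answer = "The flight costs 240 euros and lands at 9am."
terms = ["240", "9am"]

focused = score_agent(["search", "calculator"], needed, answer, terms,
                      trace_len=6, min_steps=6)
kitchen_sink = score_agent(["search", "calculator", "email", "maps", "notes"],
                           needed, answer, terms, trace_len=12, min_steps=6)
print(focused["total"], kitchen_sink["total"])
```

Under these assumed formulas the focused agent scores higher on both tool fit and efficiency, even though both agents give the same final answer.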
Bigger lesson: more tools doesn't mean a better agent. A focused agent with the right 2-3 tools and a sharp prompt beats a kitchen-sink agent every time.
Hand the laptop over. When you're ready, press start.