AI Fundamentals — The Complete Beginner's Guide
Who this is for: Someone joining an AI/QA team for the first time, with no prior AI background. Read top to bottom — every concept builds on the previous one. By the end you'll understand what AI, ML, LLMs, and RAG are, how they differ, and how each one is tested.
1. The AI Journey — How We Got Here
Artificial Intelligence is not one thing — it's a family of nested fields, each one a subset of the previous:
flowchart TB
AI["🧠 ARTIFICIAL INTELLIGENCE (1950s–)<br/>Any technique that makes machines mimic human intelligence"]
ML["📊 MACHINE LEARNING (1980s–)<br/>Machines that LEARN from data instead of being explicitly programmed"]
DL["🕸️ DEEP LEARNING (2010s–)<br/>ML using multi-layered neural networks — learns features automatically"]
GenAI["✨ GENERATIVE AI (2020s–)<br/>Deep learning models that CREATE new content — text, images, code"]
LLM["💬 LARGE LANGUAGE MODELS<br/>Generative AI specialised in understanding and producing language<br/>(GPT, Claude, Gemini, Llama)"]
AI --> ML --> DL --> GenAI --> LLM
The journey in one paragraph: Early AI (1950s–80s) was rule-based — humans wrote every rule ("IF temperature > 38 THEN fever"). This broke down for complex problems, so Machine Learning emerged: instead of writing rules, we show the machine thousands of examples and it learns the rules itself. Deep Learning supercharged this with brain-inspired neural networks that handle images, speech, and language. Generative AI flipped the direction — instead of just recognising patterns, models began creating new content. Large Language Models are the latest step: models trained on enormous amounts of text that can read, write, reason, and converse.
| Era | Approach | Example |
|---|---|---|
| 1950s–80s | Rule-based AI — humans write every rule | Chess programs, expert systems |
| 1980s–2010 | Machine Learning — learns rules from data | Spam filters, recommendation engines |
| 2010–2020 | Deep Learning — neural networks learn features | Face recognition, voice assistants, self-driving perception |
| 2020–now | Generative AI / LLMs — creates new content | ChatGPT, Claude, Copilot, Midjourney |
2. What Is Machine Learning?
Machine Learning (ML) is teaching computers to learn patterns from data rather than programming them with explicit rules.
Traditional Programming vs Machine Learning
flowchart LR
subgraph Traditional["Traditional Programming"]
direction TB
R[Rules] --> P1[Program]
D1[Data] --> P1
P1 --> A1[Answers]
end
subgraph MLP["Machine Learning"]
direction TB
D2[Data] --> T[Training]
A2[Answers / Labels] --> T
T --> M[Model = learned rules]
end
Traditional ~~~ MLP
In traditional programming, you give the computer rules + data and it produces answers. In machine learning, you give it data + answers and it learns the rules itself. The learned rules are called a model.
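A minimal sketch of that contrast, reusing the fever rule from section 1. The sample temperatures, the labels, and the `learn_threshold` helper are invented for illustration; real models learn far richer rules than a single cut-off.

```python
# Traditional programming: a human writes the rule.
def rule_based_fever(temp_c: float) -> bool:
    return temp_c > 38.0  # hand-written threshold


# Machine learning: we supply data + answers and the rule is learned.
def learn_threshold(temps: list[float], labels: list[bool]) -> float:
    """Pick the cut-off that classifies the most training examples correctly."""
    candidates = sorted(temps)
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((t >= threshold) == y for t, y in zip(temps, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical labelled examples (data + answers).
temps = [36.5, 36.9, 37.2, 37.8, 38.3, 38.9, 39.4]
labels = [False, False, False, False, True, True, True]

# In this toy case the whole "model" is one learned number.
threshold = learn_threshold(temps, labels)


def learned_fever(temp_c: float) -> bool:
    return temp_c >= threshold


print(threshold, learned_fever(39.0), learned_fever(37.0))
```

Nobody typed the cut-off into `learned_fever`; it came out of the examples. Change the examples and the learned rule changes with them, which is why ML testing pays so much attention to the data.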
The Three Types of Machine Learning
| Type | How it learns | Everyday example |
|---|---|---|
| Supervised Learning | From labelled examples — "this email is spam, this one isn't" | Spam filters, price prediction, medical diagnosis |
| Unsupervised Learning | Finds hidden patterns in unlabelled data | Customer segmentation, anomaly detection |
| Reinforcement Learning | Trial and error with rewards/penalties | Game-playing AI, robotics, recommendation tuning |
Key ML Vocabulary (you will hear these daily)
| Term | Plain-English meaning |
|---|---|
| Model | The "learned brain" — a file containing patterns extracted from data |
| Training | The process of showing data to the algorithm so it learns |
| Features | The input characteristics the model looks at (e.g. age, income, word frequency) |
| Labels | The correct answers used during training |
| Inference | Using a trained model to make a prediction on new data |
| Overfitting | Model memorised the training data — performs well in training, badly in real life |
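| Accuracy / Precision / Recall | Accuracy: share of all predictions that are correct. Precision: share of predicted positives that are truly positive. Recall: share of actual positives the model found |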
| Dataset split | Data divided into training (learn), validation (tune), test (final exam) sets |
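To make the three metrics in the table concrete, here is a small sketch that computes them by hand for a binary classifier. The labels and predictions are made-up numbers (1 = spam, 0 = not spam), not results from any real model.

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int, int]:
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical test set: 4 spam emails, 6 legitimate ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)   # how often the model is right overall
precision = tp / (tp + fp)           # when it says "spam", how often is it spam?
recall = tp / (tp + fn)              # of the real spam, how much did it catch?

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this example accuracy looks acceptable while recall shows that half the spam slipped through, which is why a single metric is rarely enough.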
3. What Is a Neural Network? (The Bridge to LLMs)
A neural network is an ML model loosely inspired by the brain — layers of simple mathematical units ("neurons") connected together. Each connection has a weight (a number). Training adjusts millions/billions of these weights until the network produces good outputs.
flowchart LR
subgraph Input["Input Layer"]
I1((x1)); I2((x2)); I3((x3))
end
subgraph Hidden["Hidden Layers (the 'deep' in deep learning)"]
H1((h)); H2((h)); H3((h)); H4((h))
end
subgraph Output["Output Layer"]
O1((y))
end
I1 --> H1 & H2; I2 --> H2 & H3; I3 --> H3 & H4
H1 --> O1; H2 --> O1; H3 --> O1; H4 --> O1
Why it matters: LLMs are gigantic neural networks, with billions of weights, built on a specific architecture called the Transformer (introduced in 2017). When people say a model "has 70 billion parameters", parameters = weights.
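As a rough illustration of "layers of weighted connections", the sketch below runs one input through a tiny fully connected network (3 inputs, 4 hidden neurons, 1 output) and counts its parameters. The weight values are arbitrary placeholders; in a real network they would be set by training, and there would be billions of them.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron computes a weighted sum of its inputs plus a bias."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Arbitrary example weights (one row per neuron). Training would adjust these.
W1 = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.2], [-0.3, 0.8, 0.5], [0.1, 0.1, 0.1]]
b1 = [0.0, 0.1, -0.1, 0.0]
W2 = [[0.6, -0.4, 0.3, 0.9]]
b2 = [0.05]
LAYERS = [(W1, b1), (W2, b2)]


def forward(x: list[float]) -> float:
    hidden = dense(x, W1, b1, relu)
    return dense(hidden, W2, b2, sigmoid)[0]


def count_parameters(layers) -> int:
    """Parameters = every weight plus every bias."""
    return sum(sum(len(row) for row in w) + len(b) for w, b in layers)


print("output:", forward([1.0, 2.0, 3.0]))
print("parameters:", count_parameters(LAYERS))
```

This toy network has 21 parameters (12 + 4 in the hidden layer, 4 + 1 in the output layer). An LLM performs the same kind of arithmetic at vastly larger scale.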
4. What Is an LLM (Large Language Model)?
An LLM is a neural network trained on massive amounts of text (books, websites, code) whose core skill is deceptively simple:
Given some text, predict the next word (strictly speaking, the next token).
That's it. But done at enormous scale, this single skill produces models that can answer questions, write essays, generate code, translate languages, and reason through problems.
How an LLM Works — Step by Step
flowchart LR
A["📝 Your prompt:<br/>'The capital of France is'"] --> B["🔢 Tokenisation<br/>Text split into tokens<br/>(word pieces → numbers)"]
B --> C["🧠 Transformer<br/>Billions of weights process<br/>the tokens in context"]
C --> D["🎲 Next-token prediction<br/>'Paris' = 97%<br/>'Lyon' = 1%<br/>'a' = 0.5%"]
D --> E["📤 Output token chosen,<br/>appended, repeat<br/>until answer complete"]
- Tokenisation — your text is split into tokens (chunks of ~4 characters / part-words) and converted to numbers.
- Processing — the Transformer network reads ALL tokens at once and computes how each token relates to the other tokens in the context ("attention").
- Prediction — it outputs a probability for every possible next token.
- Generation — one token is picked, appended to the text, and the process repeats — one token at a time — until the answer is complete.
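The four steps above can be sketched with a toy stand-in for the model. This is not a Transformer: it is a simple bigram counter that only looks at the previous token, it splits on whitespace rather than using a real subword tokeniser, and the tiny corpus is invented. What it does share with an LLM is the loop: predict a distribution, pick a token, append, repeat.

```python
from collections import Counter, defaultdict

# Hypothetical training text (a real LLM sees vastly more).
corpus = "the capital of france is paris . the capital of italy is rome ."


def tokenise(text: str) -> list[str]:
    """Toy tokeniser: lowercase and split on whitespace."""
    return text.lower().split()


def train_bigram(tokens: list[str]) -> dict:
    """Count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev: str) -> dict[str, float]:
    """Prediction step: a probability for each candidate next token."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}


def generate(counts, prompt: str, max_new_tokens: int = 5) -> str:
    tokens = tokenise(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(counts, tokens[-1])
        if not probs:          # nothing ever followed this token in training
            break
        best = max(probs, key=probs.get)  # greedy choice: most likely token
        tokens.append(best)               # append, then repeat
        if best == ".":
            break
    return " ".join(tokens)


counts = train_bigram(tokenise(corpus))
print(next_token_probs(counts, "is"))
print(generate(counts, "the capital"))
```

Because this toy model sees only the last token, "is" is followed by "paris" and "rome" with equal probability no matter which country came before. Attention is what lets a real LLM use the whole context to resolve that.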
Key LLM Vocabulary
| Term | Plain-English meaning |
|---|---|
| Token | A chunk of text (~¾ of a word). "Testing" might be 1 token; "internationalisation" might be 4 |
| Prompt | The input text you send to the model |
| Context window | The maximum amount of text the model can "see" at once (e.g. 200K tokens) |
| Temperature | Randomness dial. 0 = consistent/boring, 1+ = creative/unpredictable |
| Parameters / weights | The learned numbers inside the model (billions of them) |
| Hallucination | The model confidently states something false — its #1 weakness |
| Fine-tuning | Additional training on specialised data to adapt a base model |
| System prompt | Hidden instructions that set the model's behaviour and role |
| Inference | Running the model to generate a response |
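To show what the temperature dial does, the sketch below applies a temperature-scaled softmax to some made-up raw scores (logits) for three candidate tokens. The numbers are illustrative only; the function name is an assumption, not any vendor's API.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Temperature 0 is usually handled as "always pick the top token".
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical logits for the prompt "The capital of France is".
logits = {"Paris": 5.0, "Lyon": 1.0, "a": 0.5}

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

At low temperature almost all probability sits on the top token, so repeated runs agree. At high temperature the alternatives gain probability, so outputs vary more. This is one reason consistency tests should record the temperature they ran with.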
The Critical Limitation of LLMs
LLMs have frozen knowledge: they only know what was in their training data, which has a cutoff date. They also know nothing about your private data — your company's documents, policies, or databases. And when asked about things they don't know, they often hallucinate — invent plausible-sounding but wrong answers.
This limitation is exactly why RAG exists.
5. What Is RAG (Retrieval-Augmented Generation)?
RAG = giving the LLM an open-book exam instead of a closed-book one.
Instead of relying on the model's frozen memory, RAG retrieves relevant documents from your own knowledge base and hands them to the LLM along with the question. The LLM then generates an answer grounded in those documents.
- Retrieval — find the relevant documents
- Augmented — add them to the prompt
- Generation — LLM answers using them
The RAG Pipeline
flowchart TB
subgraph Ingestion["📥 INGESTION (offline, re-run when documents change)"]
Docs["📄 Your documents<br/>(PDFs, wikis, policies)"] --> Chunk["✂️ Chunking<br/>split into passages"]
Chunk --> Embed1["🔢 Embedding model<br/>each chunk → vector<br/>(list of numbers capturing meaning)"]
Embed1 --> VDB[("🗄️ Vector Database<br/>(pgvector, Pinecone, OpenSearch)")]
end
subgraph Query["🔍 QUERY TIME (every user question)"]
Q["❓ User question"] --> Embed2["🔢 Embed the question"]
Embed2 --> Search["🎯 Similarity search<br/>find chunks closest in meaning"]
VDB --> Search
Search --> Rerank["📊 Re-ranking<br/>order by true relevance"]
Rerank --> PromptBuild["📝 Build prompt:<br/>question + retrieved chunks"]
PromptBuild --> LLM2["💬 LLM generates answer<br/>grounded in the chunks"]
LLM2 --> Ans["✅ Answer + source citations"]
end
Ingestion ~~~ Query
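The pipeline in the diagram can be sketched end to end in a few lines, under heavy simplifications. Here the "embedding" is just a word-count vector, the "vector database" is a Python list, re-ranking is skipped, and the final LLM call is left out. The policy sentences are invented examples.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real systems use a trained embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Ingestion: chunks are embedded once and stored (hypothetical policy text).
chunks = [
    "Employees may work remotely up to three days per week.",
    "Expense claims must be submitted within 30 days of purchase.",
    "Annual leave requests require manager approval two weeks in advance.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, k: int = 2) -> list[str]:
    """Query time: embed the question and return the top-k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, retrieved: list[str]) -> str:
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved, start=1))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "How many days do I have to submit expense claims?"
print(build_prompt(question, retrieve(question, k=2)))
```

The prompt that comes out is what would be sent to the LLM. Note that word counts only match shared vocabulary; a trained embedding model is what allows matching by meaning when the wording differs.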
RAG Vocabulary
| Term | Plain-English meaning |
|---|---|
| Chunking | Splitting documents into bite-size passages (e.g. 500 tokens each) |
| Embedding | Converting text into a vector — a list of numbers representing its meaning |
| Vector database | A database that finds text by similarity of meaning, not keywords |
| Similarity search | "Find me the 5 chunks whose meaning is closest to this question" |
| Re-ranking | A second, more careful pass that reorders retrieved chunks by true relevance |
| Grounding | Forcing the LLM to base its answer on the retrieved documents |
| Citation | Pointing back to the source document for each claim |
| Top-k | How many chunks to retrieve (e.g. top-5 most similar) |
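Chunking is the vocabulary item that most often hides bugs, so here is a small sketch of fixed-size chunking with overlap. It counts whitespace-separated words rather than model tokens, and the sizes are far smaller than the 500-token example in the table so the output stays readable. The function name and parameters are illustrative, not taken from any particular library.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # this chunk reached the end
            break
    return chunks


# Hypothetical 20-word document: w0 w1 ... w19
text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(text, chunk_size=8, overlap=2)

for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk. A basic ingestion test is to check that every word of the source survives in some chunk.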
6. LLM vs RAG — The Key Differences
The most important mental model: an LLM is a component; RAG is a system that uses an LLM as one of its parts.
flowchart LR
subgraph LLMOnly["💬 LLM alone (closed book)"]
UQ1[Question] --> M1[LLM memory only] --> A1[Answer from<br/>training data]
end
subgraph RAGSys["📚 RAG system (open book)"]
UQ2[Question] --> Ret[Retrieve your documents] --> M2[LLM + documents] --> A2[Answer grounded<br/>in YOUR data + citations]
end
LLMOnly ~~~ RAGSys
| Dimension | LLM alone | RAG system |
|---|---|---|
| What it is | A single model | A pipeline: retriever + vector DB + LLM |
| Knowledge source | Frozen training data | Your live documents |
| Knowledge cutoff | Yes — stuck at training date | No — update documents anytime |
| Private/company data | Knows nothing about it | Built specifically for it |
| Hallucination risk | High when asked beyond its knowledge | Lower — grounded in retrieved text (but not zero!) |
| Citations | Cannot truly cite sources | Can cite the exact source chunk |
| Cost to update knowledge | Retraining/fine-tuning — expensive | Re-index documents — cheap |
| Analogy | Closed-book exam from memory | Open-book exam with the right pages found for you |
| Failure modes | Hallucination, outdated facts | Bad retrieval, bad chunking, ignoring context |
When to use which:
- LLM alone — general knowledge, creative writing, code generation, summarising text you paste in
- RAG — answering questions about your documents: company policies, product manuals, legal contracts, internal wikis
7. Testing an LLM vs Testing a RAG System
This is where QA comes in. The fundamental shift: traditional software is deterministic (same input → same output, pass/fail). LLMs are non-deterministic (same input → different valid outputs). You can't write assertEquals on an essay.
Traditional QA asks: "Does the output equal X?"
AI QA asks: "Is the output good enough, and how do we measure that?"
7a. Testing an LLM (the model itself)
You evaluate the quality of generation:
| What you test | Question it answers | How |
|---|---|---|
| Correctness / factuality | Are the facts right? | Golden datasets, LLM-as-Judge scoring |
| Hallucination rate | How often does it invent things? | Compare claims against known ground truth |
| Consistency | Same question twice → compatible answers? | Repeat runs, measure variance |
| Instruction following | Did it obey the prompt format/constraints? | Schema validation, rule checks |
| Safety / toxicity | Can it be made to say harmful things? | Red-teaming, adversarial prompts, jailbreak attempts |
| Prompt injection resistance | Can hidden instructions hijack it? | Inject malicious instructions in inputs |
| Bias | Does it treat demographic groups equally? | Counterfactual testing (swap names/genders, compare) |
| Performance | Latency, token cost per request | Standard perf metrics + cost tracking |
Key tools: DeepEval, promptfoo, Garak (security), custom pytest harnesses.
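Two rows of the table, instruction following and consistency, are easy to sketch in a custom harness. In the example below `fake_llm` is a stand-in that returns a canned reply so the code runs offline; in a real harness it would call your model. The required keys and the agreement measure are assumptions chosen for illustration.

```python
import json
from collections import Counter


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned JSON reply."""
    return json.dumps({"answer": "Paris", "confidence": 0.97})


# Hypothetical output contract the prompt asks the model to follow.
REQUIRED_KEYS = {"answer": str, "confidence": float}


def check_schema(raw: str) -> list[str]:
    """Instruction following: return a list of problems (empty list = pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for key: {key}")
    return problems


def consistency(prompt: str, runs: int = 5) -> float:
    """Consistency: share of runs that agree with the most common answer."""
    answers = [json.loads(fake_llm(prompt))["answer"].strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


prompt = "What is the capital of France? Reply as JSON."
print(check_schema(fake_llm(prompt)), consistency(prompt))
```

Neither check compares the reply to one exact string. The schema check asserts properties of the output, and the consistency score tolerates variation while still flagging a model that keeps changing its answer.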
7b. Testing a RAG System (the whole pipeline)
You test every stage — most RAG failures are retrieval failures, not LLM failures:
flowchart LR
T1["1️⃣ Ingestion testing<br/>chunks complete?<br/>no data loss?"] --> T2["2️⃣ Embedding testing<br/>similar texts close together?<br/>model regression on update?"]
T2 --> T3["3️⃣ Retrieval testing<br/>right chunks found?<br/>precision / recall@k"]
T3 --> T4["4️⃣ Re-ranking testing<br/>best chunk ranked first?<br/>MRR / NDCG"]
T4 --> T5["5️⃣ Generation testing<br/>faithful to the chunks?<br/>relevant to the question?"]
T5 --> T6["6️⃣ End-to-end testing<br/>golden Q&A datasets<br/>full pipeline + citations"]
| Stage | What can go wrong | Key metrics |
|---|---|---|
| Chunking | Sentences cut mid-thought; tables mangled; content lost | Chunk completeness, boundary checks |
| Embedding | New embedding model shifts all similarities | Similarity regression suite |
| Retrieval | Right answer exists but isn't retrieved | Context Recall, Context Precision, Recall@k |
| Re-ranking | Relevant chunk retrieved but ranked low | MRR (Mean Reciprocal Rank), NDCG |
| Generation | LLM ignores chunks or contradicts them | Faithfulness, Answer Relevancy |
| End-to-end | Everything individually fine but answer still wrong | Golden dataset pass rate, citation accuracy |
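Two of the retrieval metrics named in the table, Recall@k and MRR, are simple enough to compute by hand. The sketch below does so for three made-up queries; the chunk IDs and relevance judgements are hypothetical and would normally come from a golden dataset.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results: (retrieved chunk IDs in ranked order, truly relevant IDs).
runs = [
    (["c3", "c1", "c7"], {"c1"}),        # relevant chunk found at rank 2
    (["c2", "c9", "c4"], {"c2", "c4"}),  # relevant chunks at ranks 1 and 3
    (["c5", "c6", "c8"], {"c0"}),        # relevant chunk never retrieved
]

print("MRR:", mean_reciprocal_rank(runs))
print("Recall@2 per query:", [recall_at_k(r, rel, 2) for r, rel in runs])
```

The third query is the failure the table warns about: the answer exists but was never retrieved, so no amount of prompt tuning in the generation stage can fix it.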
The four core RAGAS metrics (RAGAS is a widely used open-source RAG evaluation framework):
| Metric | Question it answers |
|---|---|
| Faithfulness | Is the answer supported by the retrieved context? (anti-hallucination) |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | Of the chunks retrieved, how many were actually relevant, and are the relevant ones ranked near the top? |
| Context Recall | Of the relevant chunks that exist, how many were retrieved? |
Key tools: RAGAS, DeepEval, custom golden datasets, CI/CD threshold gates (e.g. faithfulness must be ≥ 0.85 or build fails).
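A CI/CD threshold gate can be as small as the sketch below. It assumes the metric scores have already been produced by an evaluation run and are passed in as a plain dictionary. Only the 0.85 faithfulness threshold comes from the text above; the other thresholds and the function names are placeholders to adapt.

```python
# Minimum acceptable score per metric (all except faithfulness are assumed values).
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}


def failed_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: missing score")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return failures


def gate(scores: dict[str, float]) -> int:
    """Return a process exit code: 0 lets the build pass, 1 fails it."""
    failures = failed_metrics(scores, THRESHOLDS)
    for failure in failures:
        print("FAIL", failure)
    return 1 if failures else 0


# Hypothetical scores from an evaluation run.
scores = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.91,
    "context_precision": 0.78,
    "context_recall": 0.82,
}
exit_code = gate(scores)
print("exit code:", exit_code)
```

In a pipeline the script would end with `sys.exit(gate(scores))` so that a low score stops the build. Treating a missing score as a failure prevents a broken evaluation step from passing silently.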
7c. LLM Testing vs RAG Testing — Summary
| Dimension | LLM testing | RAG testing |
|---|---|---|
| Scope | One model | Whole pipeline (6+ stages) |
| Biggest risk | Hallucination, unsafe output | Retrieval failure (right doc never found) |
| Where bugs hide | The model, the prompt | Chunking, embeddings, search, ranking, AND the model |
| Ground truth | General-knowledge golden sets | Golden Q&A pairs built from YOUR documents |
| Signature metrics | Hallucination rate, safety, consistency | Faithfulness, context recall/precision |
| Mindset | "Is the generation good?" | "Did we find the right evidence AND use it correctly?" |
8. Beyond RAG — Agents (The Next Step)
Once you understand LLMs and RAG, the next concept you'll meet is agents: LLMs given tools (search, code execution, APIs) and the autonomy to take multiple steps toward a goal.
flowchart LR
LLMBlock["💬 LLM<br/>answers in one shot"] --> RAGBlock["📚 RAG<br/>answers using your documents"] --> AgentBlock["🤖 AGENT<br/>plans → uses tools → acts → loops<br/>until the task is done"]
| | LLM | RAG | Agent |
|---|---|---|---|
| Does what | Generates a reply | Generates a grounded reply | Completes a multi-step task |
| Example | "Explain RBQM" | "What does our SOP say about RBQM?" | "Find all SOPs updated this month, summarise changes, email the team" |
| Testing focus | Output quality | Pipeline + output quality | Tool-call correctness, step tracing, task completion, safety boundaries |
Agent testing is covered in depth in LLM & Agent Evaluation Matrix and Autonomous QA Multi-Agent Pipeline.
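As a taste of what "tool-call correctness" and "step tracing" mean, here is a minimal sketch of an agent loop for the SOP example in the table. The planner is a scripted stand-in for the LLM's decision-making, and the three tools return canned values; every name here is hypothetical.

```python
from typing import Callable, Optional


# Stub tools (a real agent would call search, summarisation, and email services).
def search_sops(period: str) -> list[str]:
    return ["SOP-12", "SOP-31"]


def summarise(doc_ids: list[str]) -> str:
    return f"{len(doc_ids)} SOPs changed: " + ", ".join(doc_ids)


def send_email(body: str) -> str:
    return "sent"


TOOLS: dict[str, Callable] = {
    "search_sops": search_sops,
    "summarise": summarise,
    "send_email": send_email,
}


def scripted_planner(step: int, last_result) -> Optional[tuple]:
    """Stand-in for the LLM: returns (tool_name, argument), or None when done."""
    if step == 0:
        return ("search_sops", "this month")
    if step == 1:
        return ("summarise", last_result)
    if step == 2:
        return ("send_email", last_result)
    return None


def run_agent(planner, tools, max_steps: int = 5) -> list[tuple]:
    """Plan, call a tool, record the step, repeat until the planner stops."""
    trace = []
    last_result = None
    for step in range(max_steps):      # safety boundary: never loop forever
        action = planner(step, last_result)
        if action is None:
            break
        tool_name, argument = action
        if tool_name not in tools:     # safety boundary: only allowed tools
            raise ValueError(f"unknown tool: {tool_name}")
        last_result = tools[tool_name](argument)
        trace.append((tool_name, last_result))
    return trace


trace = run_agent(scripted_planner, TOOLS)
for tool_name, result in trace:
    print(tool_name, "->", result)
```

The trace is the test artefact. An agent test asserts on the sequence of tool calls and their results, not only on the final message.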
9. Glossary — One-Line Definitions
| Term | Definition |
|---|---|
| AI | Machines mimicking human intelligence |
| ML | Machines learning patterns from data instead of explicit rules |
| Deep Learning | ML with multi-layer neural networks |
| Neural network | Layers of weighted connections that learn by adjusting weights |
| Transformer | The neural architecture behind nearly all modern LLMs (2017) |
| LLM | A huge transformer trained on text; predicts the next token |
| Token | A chunk of text (~¾ word); the unit LLMs read and write |
| Prompt | The input you give an LLM |
| Context window | Max text an LLM can consider at once |
| Temperature | Randomness control for generation |
| Hallucination | Confident but false output |
| Fine-tuning | Further training a model on specialised data |
| Embedding | Text converted to a meaning-vector |
| Vector database | Stores embeddings; searches by similarity of meaning |
| Chunking | Splitting documents into passages for retrieval |
| RAG | Retrieve relevant docs → give to LLM → grounded answer |
| Faithfulness | Whether an answer is supported by its retrieved context |
| Context recall | Whether the retriever found all the relevant chunks |
| Golden dataset | Curated Q&A pairs with known-correct answers, used for evaluation |
| LLM-as-Judge | Using a strong LLM to score another LLM's outputs |
| Red-teaming | Deliberately attacking a model to find unsafe behaviour |
| Prompt injection | Hiding malicious instructions in input data to hijack a model |
| Agent | An LLM that plans and uses tools across multiple steps |
| MCP | Model Context Protocol — a standard for connecting LLMs to tools/data |
10. Where to Go Next
Recommended reading order in this portal:
- You are here — AI Fundamentals ✅
- QA Evolution — Testing Intelligence — why QA is changing
- RAG vs Agents vs Agentic RAG — architecture deep-dive
- LLM Testing Lifecycle — the full testing process
- RAG Automation Testing Roadmap — stage-by-stage RAG test plan
- Ragas FAQ and DeepEval FAQ — the tools
- Prompt Injection — Complete Guide — security testing