AI Fundamentals — The Complete Beginner's Guide

Who this is for: Someone joining an AI/QA team for the first time, with no prior AI background. Read top to bottom — every concept builds on the previous one. By the end you'll understand what AI, ML, LLMs, and RAG are, how they differ, and how each one is tested.

1. The AI Journey — How We Got Here

Artificial Intelligence is not one thing — it's a family of nested fields, each one a subset of the previous:

flowchart TB
    AI["🧠 ARTIFICIAL INTELLIGENCE (1950s–)<br/>Any technique that makes machines mimic human intelligence"]
    ML["📊 MACHINE LEARNING (1980s–)<br/>Machines that LEARN from data instead of being explicitly programmed"]
    DL["🕸️ DEEP LEARNING (2010s–)<br/>ML using multi-layered neural networks — learns features automatically"]
    GenAI["✨ GENERATIVE AI (2020s–)<br/>Deep learning models that CREATE new content — text, images, code"]
    LLM["💬 LARGE LANGUAGE MODELS<br/>Generative AI specialised in understanding and producing language<br/>(GPT, Claude, Gemini, Llama)"]

    AI --> ML --> DL --> GenAI --> LLM

The journey in one paragraph: Early AI (1950s–80s) was rule-based — humans wrote every rule ("IF temperature > 38 THEN fever"). This broke down for complex problems, so Machine Learning emerged: instead of writing rules, we show the machine thousands of examples and it learns the rules itself. Deep Learning supercharged this with brain-inspired neural networks that handle images, speech, and language. Generative AI flipped the direction — instead of just recognising patterns, models began creating new content. Large Language Models are the latest step: models trained on enormous amounts of text that can read, write, reason, and converse.

Era	Approach	Example
1950s–80s	Rule-based AI — humans write every rule	Chess programs, expert systems
1980s–2010	Machine Learning — learns rules from data	Spam filters, recommendation engines
2010–2020	Deep Learning — neural networks learn features	Face recognition, voice assistants, self-driving perception
2020–now	Generative AI / LLMs — creates new content	ChatGPT, Claude, Copilot, Midjourney

2. What Is Machine Learning?

Machine Learning (ML) is teaching computers to learn patterns from data rather than programming them with explicit rules.

Traditional Programming vs Machine Learning

flowchart LR
    subgraph Traditional["Traditional Programming"]
        direction TB
        R[Rules] --> P1[Program]
        D1[Data] --> P1
        P1 --> A1[Answers]
    end
    subgraph MLP["Machine Learning"]
        direction TB
        D2[Data] --> T[Training]
        A2[Answers / Labels] --> T
        T --> M[Model = learned rules]
    end
    Traditional ~~~ MLP

In traditional programming, you give the computer rules + data and it produces answers. In machine learning, you give it data + answers and it learns the rules itself. The learned rules are called a model.
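The contrast can be sketched in a few lines, reusing the fever rule from section 1. This is a toy illustration only: the function names, the temperature readings, and the "try every candidate threshold" learning method are assumptions made for the example, not how production ML libraries train models.

```python
# Traditional programming: a human writes the rule.
def rule_based_is_fever(temperature_c: float) -> bool:
    return temperature_c > 38.0


# Machine learning (toy version): the rule is learned from labelled examples.
def learn_threshold(temperatures: list[float], labels: list[bool]) -> float:
    """Pick the candidate threshold that misclassifies the fewest examples."""
    candidates = sorted(temperatures)
    best_threshold, best_errors = candidates[0], len(labels) + 1
    for threshold in candidates:
        errors = sum((t > threshold) != label for t, label in zip(temperatures, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Data + answers (labels) go in; the "model" that comes out is a single number.
temperatures = [36.5, 37.0, 37.8, 38.0, 38.4, 39.1, 40.0]  # hypothetical readings
labels = [False, False, False, False, True, True, True]    # hypothetical answers
model_threshold = learn_threshold(temperatures, labels)


def learned_is_fever(temperature_c: float) -> bool:
    """Inference: apply the learned rule to new data."""
    return temperature_c > model_threshold
```

Nobody typed the number 38 into the learning path; it falls out of the labelled examples. Change the examples and the learned rule changes with them.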

The Three Types of Machine Learning

Type	How it learns	Everyday example
Supervised Learning	From labelled examples — "this email is spam, this one isn't"	Spam filters, price prediction, medical diagnosis
Unsupervised Learning	Finds hidden patterns in unlabelled data	Customer segmentation, anomaly detection
Reinforcement Learning	Trial and error with rewards/penalties	Game-playing AI, robotics, recommendation tuning

Key ML Vocabulary (you will hear these daily)

Term	Plain-English meaning
Model	The "learned brain" — a file containing patterns extracted from data
Training	The process of showing data to the algorithm so it learns
Features	The input characteristics the model looks at (e.g. age, income, word frequency)
Labels	The correct answers used during training
Inference	Using a trained model to make a prediction on new data
Overfitting	Model memorised the training data — performs well in training, badly in real life
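Accuracy / Precision / Recall	Accuracy: share of all predictions that are correct. Precision: share of predicted positives that really are positive. Recall: share of actual positives the model found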
Dataset split	Data divided into training (learn), validation (tune), test (final exam) sets
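To make accuracy, precision, and recall concrete, here is a small sketch for a binary classifier such as a spam filter. The function name and the example labels are hypothetical; in practice a library such as scikit-learn would normally compute these.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal mail correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal mail wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical results on a held-out test set (1 = spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Here accuracy is 0.7, yet recall is only 0.5: half the spam got through. That gap is why a single accuracy number is rarely enough when judging a model.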

3. What Is a Neural Network? (The Bridge to LLMs)

A neural network is an ML model loosely inspired by the brain — layers of simple mathematical units ("neurons") connected together. Each connection has a weight (a number). Training adjusts millions/billions of these weights until the network produces good outputs.

flowchart LR
    subgraph Input["Input Layer"]
        I1((x1)); I2((x2)); I3((x3))
    end
    subgraph Hidden["Hidden Layers (the 'deep' in deep learning)"]
        H1((h)); H2((h)); H3((h)); H4((h))
    end
    subgraph Output["Output Layer"]
        O1((y))
    end
    I1 --> H1 & H2; I2 --> H2 & H3; I3 --> H3 & H4
    H1 --> O1; H2 --> O1; H3 --> O1; H4 --> O1

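Why it matters: LLMs are gigantic neural networks with billions of weights, built on a specific architecture called the Transformer (introduced in 2017). When people say a model has "a trillion parameters", parameters means the learned numbers: the weights, plus a much smaller set of bias values. Parameter counts for GPT-4 have not been officially published, so treat any figure you hear quoted as an estimate.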
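A forward pass through a tiny network can be sketched as follows. The weights are hand-picked, hypothetical values (a trained network would have learned them), the layers are fully connected for simplicity (the diagram above shows sparser wiring), and sigmoid is just one common choice of activation function.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs: list[float], weights: list[list[float]], biases: list[float]) -> list[float]:
    """One layer: each neuron takes a weighted sum of all inputs, adds a bias, applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical weights: 3 inputs -> 4 hidden neurons -> 1 output.
hidden_weights = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.6], [0.1, 0.1, 0.1]]
hidden_biases = [0.0, 0.1, -0.1, 0.2]
output_weights = [[0.7, -0.5, 0.3, 0.9]]
output_biases = [0.05]


def forward(inputs: list[float]) -> float:
    hidden = dense_layer(inputs, hidden_weights, hidden_biases)
    return dense_layer(hidden, output_weights, output_biases)[0]


def count_parameters(*layers: tuple[list[list[float]], list[float]]) -> int:
    """Parameters = every weight plus every bias."""
    return sum(len(biases) + sum(len(row) for row in weights) for weights, biases in layers)


prediction = forward([1.0, 0.5, -1.0])
n_params = count_parameters((hidden_weights, hidden_biases), (output_weights, output_biases))
```

This toy network has 21 parameters (12 + 4 in the hidden layer, 4 + 1 in the output layer). An LLM does the same kind of arithmetic with billions of them.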

4. What Is an LLM (Large Language Model)?

An LLM is a neural network trained on massive amounts of text (books, websites, code) whose core skill is deceptively simple:

Given some text, predict the next word (strictly speaking, the next token).

That's it. But done at enormous scale, this single skill produces models that can answer questions, write essays, generate code, translate languages, and reason through problems.

How an LLM Works — Step by Step

flowchart LR
    A["📝 Your prompt:<br/>'The capital of France is'"] --> B["🔢 Tokenisation<br/>Text split into tokens<br/>(word pieces → numbers)"]
    B --> C["🧠 Transformer<br/>Billions of weights process<br/>the tokens in context"]
    C --> D["🎲 Next-token prediction<br/>'Paris' = 97%<br/>'Lyon' = 1%<br/>'a' = 0.5%"]
    D --> E["📤 Output token chosen,<br/>appended, repeat<br/>until answer complete"]

Tokenisation — your text is split into tokens (chunks of ~4 characters / part-words) and converted to numbers.
Processing — the Transformer network reads ALL tokens at once and computes how each word relates to every other word ("attention").
Prediction — it outputs a probability for every possible next token.
Generation — one token is picked, appended to the text, and the process repeats — one token at a time — until the answer is complete.
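The four steps above can be sketched as a loop. Everything here is a stand-in: the probability table is hypothetical and keyed only by the last token, whereas a real LLM computes probabilities from the whole context with a Transformer; the whitespace tokeniser is likewise a simplification of real sub-word tokenisers. The probabilities for "is" copy the figures from the diagram and omit the remaining tokens.

```python
# Hypothetical next-token probabilities, keyed by the last token only.
NEXT_TOKEN_PROBS = {
    "is": {"Paris": 0.97, "Lyon": 0.01, "a": 0.005},
    "Paris": {".": 0.9, ",": 0.1},
    ".": {"<end>": 1.0},
}


def tokenise(text: str) -> list[str]:
    """Toy tokeniser: split on whitespace (real tokenisers split into sub-word pieces)."""
    return text.split()


def generate(prompt: str, max_new_tokens: int = 10) -> list[str]:
    """Predict, pick, append, repeat, until an end marker or the token limit."""
    tokens = tokenise(prompt)
    new_tokens: list[str] = []
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1], {"<end>": 1.0})
        next_token = max(probs, key=probs.get)  # greedy: always take the most likely token
        if next_token == "<end>":
            break
        tokens.append(next_token)
        new_tokens.append(next_token)
    return new_tokens


completion = generate("The capital of France is")
```

Greedy selection always takes the top token, so this loop is fully repeatable. Real systems usually sample from the probabilities instead, which is where run-to-run variation comes from.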

Key LLM Vocabulary

Term	Plain-English meaning
Token	A chunk of text (~¾ of a word). "Testing" might be 1 token; "internationalisation" might be 4
Prompt	The input text you send to the model
Context window	The maximum amount of text the model can "see" at once (e.g. 200K tokens)
Temperature	Randomness dial. 0 = consistent/boring, 1+ = creative/unpredictable
Parameters / weights	The learned numbers inside the model (billions of them)
Hallucination	The model confidently states something false — its #1 weakness
Fine-tuning	Additional training on specialised data to adapt a base model
System prompt	Hidden instructions that set the model's behaviour and role
Inference	Running the model to generate a response
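Temperature is easier to grasp with numbers. The sketch below shows the usual way it is applied: raw scores are divided by the temperature before being turned into probabilities. The scores are hypothetical, and treating temperature 0 as "always pick the top token" is a common convention assumed here, not a statement about any specific vendor's API.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumed convention: temperature 0 puts all probability on the top token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


logits = {"Paris": 5.0, "Lyon": 2.0, "a": 1.0}  # hypothetical raw scores
cold = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
```

As the temperature rises from `cold` to `hot`, probability moves away from "Paris" toward the less likely tokens, so sampled answers become more varied.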

The Critical Limitation of LLMs

LLMs have frozen knowledge: they only know what was in their training data, which has a cutoff date. They also know nothing about your private data — your company's documents, policies, or databases. And when asked about things they don't know, they often hallucinate — invent plausible-sounding but wrong answers.

This limitation is exactly why RAG exists.

5. What Is RAG (Retrieval-Augmented Generation)?

RAG = giving the LLM an open-book exam instead of a closed-book one.

Instead of relying on the model's frozen memory, RAG retrieves relevant documents from your own knowledge base and hands them to the LLM along with the question. The LLM then generates an answer grounded in those documents.

Retrieval — find the relevant documents
Augmented — add them to the prompt
Generation — LLM answers using them

The RAG Pipeline

flowchart TB
    subgraph Ingestion["📥 INGESTION (offline, re-run when documents change)"]
        Docs["📄 Your documents<br/>(PDFs, wikis, policies)"] --> Chunk["✂️ Chunking<br/>split into passages"]
        Chunk --> Embed1["🔢 Embedding model<br/>each chunk → vector<br/>(list of numbers capturing meaning)"]
        Embed1 --> VDB[("🗄️ Vector Database<br/>(pgvector, Pinecone, OpenSearch)")]
    end

    subgraph Query["🔍 QUERY TIME (every user question)"]
        Q["❓ User question"] --> Embed2["🔢 Embed the question"]
        Embed2 --> Search["🎯 Similarity search<br/>find chunks closest in meaning"]
        VDB --> Search
        Search --> Rerank["📊 Re-ranking<br/>order by true relevance"]
        Rerank --> PromptBuild["📝 Build prompt:<br/>question + retrieved chunks"]
        PromptBuild --> LLM2["💬 LLM generates answer<br/>grounded in the chunks"]
        LLM2 --> Ans["✅ Answer + source citations"]
    end

    Ingestion ~~~ Query
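The pipeline can be sketched end to end without any external service. This is a teaching toy under loud assumptions: word counts stand in for a real embedding model, a Python list stands in for the vector database, the policy text is invented, and the re-ranking and LLM call are left out.

```python
import math
import re
from collections import Counter


def chunk(text: str) -> list[str]:
    """Ingestion: split a document into sentence-sized passages."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real embedding model returns a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Similarity of two vectors: 1.0 = same direction, 0.0 = nothing in common."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Query time: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical policy document.
DOCUMENT = (
    "Employees may work remotely up to three days per week with manager approval. "
    "Expense reports must be submitted within thirty days of purchase. "
    "Annual leave requests require two weeks notice."
)
index = [(passage, embed(passage)) for passage in chunk(DOCUMENT)]  # the "vector database"

question = "How many days per week can I work remotely?"
top_chunks = retrieve(question, index, k=2)
prompt = build_prompt(question, top_chunks)  # this string is what would be sent to the LLM
```

Word-count matching only finds chunks that share literal words with the question. Real embedding models are used precisely because they can also match a question to a passage that is phrased differently.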

RAG Vocabulary

Term	Plain-English meaning
Chunking	Splitting documents into bite-size passages (e.g. 500 tokens each)
Embedding	Converting text into a vector — a list of numbers representing its meaning
Vector database	A database that finds text by similarity of meaning, not keywords
Similarity search	"Find me the 5 chunks whose meaning is closest to this question"
Re-ranking	A second, more careful pass that reorders retrieved chunks by true relevance
Grounding	Forcing the LLM to base its answer on the retrieved documents
Citation	Pointing back to the source document for each claim
Top-k	How many chunks to retrieve (e.g. top-5 most similar)

6. LLM vs RAG — The Key Differences

The most important mental model: an LLM is a component; RAG is a system that uses an LLM as one of its parts.

flowchart LR
    subgraph LLMOnly["💬 LLM alone (closed book)"]
        UQ1[Question] --> M1[LLM memory only] --> A1[Answer from<br/>training data]
    end
    subgraph RAGSys["📚 RAG system (open book)"]
        UQ2[Question] --> Ret[Retrieve your documents] --> M2[LLM + documents] --> A2[Answer grounded<br/>in YOUR data + citations]
    end
    LLMOnly ~~~ RAGSys

Dimension	LLM alone	RAG system
What it is	A single model	A pipeline: retriever + vector DB + LLM
Knowledge source	Frozen training data	Your live documents
Knowledge cutoff	Yes, stuck at training date	No for your documents: re-index anytime (the LLM's own training cutoff still applies)
Private/company data	Knows nothing about it	Built specifically for it
Hallucination risk	High when asked beyond its knowledge	Lower — grounded in retrieved text (but not zero!)
Citations	Cannot truly cite sources	Can cite the exact source chunk
Cost to update knowledge	Retraining/fine-tuning — expensive	Re-index documents — cheap
Analogy	Closed-book exam from memory	Open-book exam with the right pages found for you
Failure modes	Hallucination, outdated facts	Bad retrieval, bad chunking, ignoring context

When to use which:

LLM alone — general knowledge, creative writing, code generation, summarising text you paste in
RAG — answering questions about your documents: company policies, product manuals, legal contracts, internal wikis

7. Testing an LLM vs Testing a RAG System

This is where QA comes in. The fundamental shift: traditional software is deterministic (same input → same output, pass/fail), whereas LLM output usually is not. With sampling enabled, the same input can produce different, equally valid outputs. You can't write assertEquals on an essay.

Traditional QA asks: "Does the output equal X?"
AI QA asks: "Is the output good enough, measured how?"
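One way to test without exact-match assertions is to check properties that every acceptable answer must have. In the sketch below, `fake_llm` is a hypothetical stand-in for a real model call, and the three checks are illustrative choices rather than a standard rule set.

```python
import random


def fake_llm(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model call: returns one of several valid phrasings."""
    return rng.choice([
        "The capital of France is Paris.",
        "Paris is the capital of France.",
        "It's Paris.",
    ])


def check_answer(answer: str) -> list[str]:
    """Return the list of failed checks; an empty list means the answer is acceptable."""
    failures = []
    lowered = answer.lower()
    if "paris" not in lowered:
        failures.append("missing required fact: Paris")
    if "lyon" in lowered:
        failures.append("mentions a wrong city")
    if len(answer.split()) > 30:
        failures.append("answer longer than 30 words")
    return failures


rng = random.Random(0)  # seeded so the example is repeatable
answers = [fake_llm("What is the capital of France?", rng) for _ in range(20)]
results = [check_answer(a) for a in answers]
pass_rate = sum(1 for r in results if not r) / len(results)
```

The wording differs from run to run, but every variant passes the same checks. The test result becomes a pass rate over many runs instead of a single equal/not-equal verdict.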

7a. Testing an LLM (the model itself)

You evaluate the quality of generation:

What you test	Question it answers	How
Correctness / factuality	Are the facts right?	Golden datasets, LLM-as-Judge scoring
Hallucination rate	How often does it invent things?	Compare claims against known ground truth
Consistency	Same question twice → compatible answers?	Repeat runs, measure variance
Instruction following	Did it obey the prompt format/constraints?	Schema validation, rule checks
Safety / toxicity	Can it be made to say harmful things?	Red-teaming, adversarial prompts, jailbreak attempts
Prompt injection resistance	Can hidden instructions hijack it?	Inject malicious instructions in inputs
Bias	Does it treat demographic groups equally?	Counterfactual testing (swap names/genders, compare)
Performance	Latency, token cost per request	Standard perf metrics + cost tracking

Key tools: DeepEval, promptfoo, Garak (security), custom pytest harnesses.
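Two rows of the table, instruction following and consistency, are simple enough to sketch as a custom harness. The schema, the sample replies, and both function names are hypothetical; the tools listed above provide richer versions of these checks.

```python
import json


def validate_reply(raw: str, required: dict[str, type]) -> list[str]:
    """Instruction following: is the reply valid JSON with the required keys and types?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for key, expected_type in required.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            errors.append(f"wrong type for {key}: expected {expected_type.__name__}")
    return errors


def consistency_rate(answers: list[str]) -> float:
    """Consistency: share of repeated runs that agree with the most common answer."""
    normalised = [a.strip().lower() for a in answers]
    most_common = max(set(normalised), key=normalised.count)
    return normalised.count(most_common) / len(normalised)


SCHEMA = {"verdict": str, "confidence": float}          # hypothetical output contract
good_reply = '{"verdict": "pass", "confidence": 0.92}'
bad_reply = '{"verdict": "pass", "confidence": "high"}'
chatty_reply = "Sure! Here is the JSON you asked for."

good_errors = validate_reply(good_reply, SCHEMA)
bad_errors = validate_reply(bad_reply, SCHEMA)
chatty_errors = validate_reply(chatty_reply, SCHEMA)
rate = consistency_rate(["Paris", "paris ", "Lyon", "Paris"])
```

Exact string matching after normalisation is the crudest possible consistency measure. For free-text answers, teams typically compare meaning instead, for example with an LLM-as-Judge.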

7b. Testing a RAG System (the whole pipeline)

You test every stage, because many RAG failures trace back to retrieval rather than to the LLM:

flowchart LR
    T1["1️⃣ Ingestion testing<br/>chunks complete?<br/>no data loss?"] --> T2["2️⃣ Embedding testing<br/>similar texts close together?<br/>model regression on update?"]
    T2 --> T3["3️⃣ Retrieval testing<br/>right chunks found?<br/>precision / recall@k"]
    T3 --> T4["4️⃣ Re-ranking testing<br/>best chunk ranked first?<br/>MRR / NDCG"]
    T4 --> T5["5️⃣ Generation testing<br/>faithful to the chunks?<br/>relevant to the question?"]
    T5 --> T6["6️⃣ End-to-end testing<br/>golden Q&A datasets<br/>full pipeline + citations"]

Stage	What can go wrong	Key metrics
Chunking	Sentences cut mid-thought; tables mangled; content lost	Chunk completeness, boundary checks
Embedding	New embedding model shifts all similarities	Similarity regression suite
Retrieval	Right answer exists but isn't retrieved	Context Recall, Context Precision, Recall@k
Re-ranking	Relevant chunk retrieved but ranked low	MRR (Mean Reciprocal Rank), NDCG
Generation	LLM ignores chunks or contradicts them	Faithfulness, Answer Relevancy
End-to-end	Everything individually fine but answer still wrong	Golden dataset pass rate, citation accuracy
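The retrieval and re-ranking metrics in the table are straightforward to compute once each test question has a known set of relevant chunks. A minimal sketch follows, with hypothetical chunk IDs and the assumption that each retrieved list contains no duplicates.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the relevant chunks, what share appears in the top k results?"""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top k results, what share is relevant?"""
    top = retrieved[:k]
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top) if top else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 if none is retrieved)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


# Each run: (chunk IDs in ranked order, IDs a human marked as relevant).
runs = [
    (["c7", "c2", "c9", "c4", "c1"], {"c2", "c4"}),  # first relevant chunk at rank 2
    (["c3", "c8", "c5", "c6", "c0"], {"c3"}),        # first relevant chunk at rank 1
    (["c1", "c2", "c3", "c4", "c5"], {"c9"}),        # relevant chunk never retrieved
]
first_retrieved, first_relevant = runs[0]
r3 = recall_at_k(first_retrieved, first_relevant, k=3)
p3 = precision_at_k(first_retrieved, first_relevant, k=3)
mrr = mean_reciprocal_rank(runs)
```

The third run is the failure the table calls "right answer exists but isn't retrieved": it scores zero here, and no amount of prompt tuning downstream can recover it.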

The four core RAGAS metrics (RAGAS is a widely used open-source RAG evaluation framework):

Metric	Question it answers
Faithfulness	Is the answer supported by the retrieved context? (anti-hallucination)
Answer Relevancy	Does the answer actually address the question?
Context Precision	Of the chunks retrieved, how many were actually relevant, and are the relevant ones ranked near the top?
Context Recall	Of the relevant chunks that exist, how many were retrieved?

Key tools: RAGAS, DeepEval, custom golden datasets, CI/CD threshold gates (e.g. faithfulness must be ≥ 0.85 or build fails).
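A threshold gate is a small script that compares evaluation scores with agreed minimums and fails the build when any score falls short. The sketch below assumes the scores have already been produced by an evaluation run; the 0.85 faithfulness threshold comes from the example above, while the answer-relevancy threshold and the function names are hypothetical.

```python
THRESHOLDS = {
    "faithfulness": 0.85,      # from the example above
    "answer_relevancy": 0.80,  # hypothetical value for illustration
}


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def exit_code(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """0 lets the build continue; 1 fails it (pass the result to sys.exit in a CI script)."""
    return 1 if gate(scores, thresholds) else 0


passing = {"faithfulness": 0.91, "answer_relevancy": 0.88}  # hypothetical scores
failing = {"faithfulness": 0.80, "answer_relevancy": 0.88}
```

Treating a missing score as a failure is a deliberate choice: a gate that silently passes when the evaluation did not run gives false confidence.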

7c. LLM Testing vs RAG Testing — Summary

Dimension	LLM testing	RAG testing
Scope	One model	Whole pipeline (6+ stages)
Biggest risk	Hallucination, unsafe output	Retrieval failure (right doc never found)
Where bugs hide	The model, the prompt	Chunking, embeddings, search, ranking, AND the model
Ground truth	General-knowledge golden sets	Golden Q&A pairs built from YOUR documents
Signature metrics	Hallucination rate, safety, consistency	Faithfulness, context recall/precision
Mindset	"Is the generation good?"	"Did we find the right evidence AND use it correctly?"

8. Beyond RAG — Agents (The Next Step)

Once you understand LLMs and RAG, the next concept you'll meet is agents: LLMs given tools (search, code execution, APIs) and the autonomy to take multiple steps toward a goal.

flowchart LR
    LLMBlock["💬 LLM<br/>answers in one shot"] --> RAGBlock["📚 RAG<br/>answers using your documents"] --> AgentBlock["🤖 AGENT<br/>plans → uses tools → acts → loops<br/>until the task is done"]

	LLM	RAG	Agent
Does what	Generates a reply	Generates a grounded reply	Completes a multi-step task
Example	"Explain RBQM"	"What does our SOP say about RBQM?"	"Find all SOPs updated this month, summarise changes, email the team"
Testing focus	Output quality	Pipeline + output quality	Tool-call correctness, step tracing, task completion, safety boundaries

Agent testing is covered in depth in LLM & Agent Evaluation Matrix and Autonomous QA Multi-Agent Pipeline.
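As a taste of what "tool-call correctness" and "safety boundaries" mean in practice, the sketch below checks a recorded agent trace for the SOP example in the table. The tool names, the trace format (a list of dicts with a "tool" key), and the step limit are all assumptions made for illustration, not the format of any particular agent framework.

```python
ALLOWED_TOOLS = {"search_sops", "summarise", "send_email"}  # hypothetical tool names


def check_trace(trace: list[dict], expected_order: list[str], max_steps: int = 10) -> list[str]:
    """Return the list of problems found in an agent's recorded tool calls."""
    failures = []
    called = [step["tool"] for step in trace]
    if len(called) > max_steps:
        failures.append(f"too many steps: {len(called)} > {max_steps}")
    for tool in called:
        if tool not in ALLOWED_TOOLS:
            failures.append(f"disallowed tool: {tool}")
    # Expected tools must appear in this order; other allowed calls may sit in between.
    remaining = iter(called)
    for tool in expected_order:
        if tool not in remaining:  # 'in' advances the iterator until it finds a match
            failures.append(f"missing or out-of-order tool call: {tool}")
            break
    return failures


EXPECTED = ["search_sops", "summarise", "send_email"]
good_trace = [{"tool": "search_sops"}, {"tool": "summarise"}, {"tool": "send_email"}]
wrong_order = [{"tool": "search_sops"}, {"tool": "send_email"}, {"tool": "summarise"}]
unsafe = [{"tool": "search_sops"}, {"tool": "delete_files"}, {"tool": "summarise"}, {"tool": "send_email"}]
```

In `wrong_order` the agent emails the team before it has a summary to send, and in `unsafe` it reaches for a tool outside its allowed set. Neither problem is visible from the final reply alone, which is why agent testing inspects the steps as well as the output.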

9. Glossary — One-Line Definitions

Term	Definition
AI	Machines mimicking human intelligence
ML	Machines learning patterns from data instead of explicit rules
Deep Learning	ML with multi-layer neural networks
Neural network	Layers of weighted connections that learn by adjusting weights
Transformer	The neural architecture behind nearly all modern LLMs (introduced in 2017)
LLM	A huge transformer trained on text; predicts the next token
Token	A chunk of text (~¾ word); the unit LLMs read and write
Prompt	The input you give an LLM
Context window	Max text an LLM can consider at once
Temperature	Randomness control for generation
Hallucination	Confident but false output
Fine-tuning	Further training a model on specialised data
Embedding	Text converted to a meaning-vector
Vector database	Stores embeddings; searches by similarity of meaning
Chunking	Splitting documents into passages for retrieval
RAG	Retrieve relevant docs → give to LLM → grounded answer
Faithfulness	Whether an answer is supported by its retrieved context
Context recall	Whether the retriever found all the relevant chunks
Golden dataset	Curated Q&A pairs with known-correct answers, used for evaluation
LLM-as-Judge	Using a strong LLM to score another LLM's outputs
Red-teaming	Deliberately attacking a model to find unsafe behaviour
Prompt injection	Hiding malicious instructions in input data to hijack a model
Agent	An LLM that plans and uses tools across multiple steps
MCP	Model Context Protocol — a standard for connecting LLMs to tools/data

10. Where to Go Next

Recommended reading order in this portal:

You are here — AI Fundamentals ✅
QA Evolution — Testing Intelligence — why QA is changing
RAG vs Agents vs Agentic RAG — architecture deep-dive
LLM Testing Lifecycle — the full testing process
RAG Automation Testing Roadmap — stage-by-stage RAG test plan
Ragas FAQ and DeepEval FAQ — the tools
Prompt Injection — Complete Guide — security testing