
AI Fundamentals — The Complete Beginner's Guide

Who this is for: Someone joining an AI/QA team for the first time, with no prior AI background. Read top to bottom — every concept builds on the previous one. By the end you'll understand what AI, ML, LLMs, and RAG are, how they differ, and how each one is tested.


1. The AI Journey — How We Got Here

Artificial Intelligence is not one thing — it's a family of nested fields, each one a subset of the previous:

flowchart TB
    AI["🧠 ARTIFICIAL INTELLIGENCE (1950s–)<br/>Any technique that makes machines mimic human intelligence"]
    ML["📊 MACHINE LEARNING (1980s–)<br/>Machines that LEARN from data instead of being explicitly programmed"]
    DL["🕸️ DEEP LEARNING (2010s–)<br/>ML using multi-layered neural networks — learns features automatically"]
    GenAI["✨ GENERATIVE AI (2020s–)<br/>Deep learning models that CREATE new content — text, images, code"]
    LLM["💬 LARGE LANGUAGE MODELS<br/>Generative AI specialised in understanding and producing language<br/>(GPT, Claude, Gemini, Llama)"]

    AI --> ML --> DL --> GenAI --> LLM

The journey in one paragraph: Early AI (1950s–80s) was rule-based — humans wrote every rule ("IF temperature > 38 THEN fever"). This broke down for complex problems, so Machine Learning emerged: instead of writing rules, we show the machine thousands of examples and it learns the rules itself. Deep Learning supercharged this with brain-inspired neural networks that handle images, speech, and language. Generative AI flipped the direction — instead of just recognising patterns, models began creating new content. Large Language Models are the latest step: models trained on enormous amounts of text that can read, write, reason, and converse.

| Era | Approach | Example |
|---|---|---|
| 1950s–80s | Rule-based AI: humans write every rule | Chess programs, expert systems |
| 1980s–2010 | Machine Learning: learns rules from data | Spam filters, recommendation engines |
| 2010–2020 | Deep Learning: neural networks learn features | Face recognition, voice assistants, self-driving perception |
| 2020–now | Generative AI / LLMs: creates new content | ChatGPT, Claude, Copilot, Midjourney |

2. What Is Machine Learning?

Machine Learning (ML) is teaching computers to learn patterns from data rather than programming them with explicit rules.

Traditional Programming vs Machine Learning

flowchart LR
    subgraph Traditional["Traditional Programming"]
        direction TB
        R[Rules] --> P1[Program]
        D1[Data] --> P1
        P1 --> A1[Answers]
    end
    subgraph MLP["Machine Learning"]
        direction TB
        D2[Data] --> T[Training]
        A2[Answers / Labels] --> T
        T --> M[Model = learned rules]
    end
    Traditional ~~~ MLP

In traditional programming, you give the computer rules + data and it produces answers. In machine learning, you give it data + answers and it learns the rules itself. The learned rules are called a model.
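The contrast above can be sketched in a few lines of Python. This is a toy illustration, not a real ML algorithm: the temperature readings, the labels, and the function names are all made up for the example, and the "training" simply tries a handful of cut-offs and keeps the most accurate one.

```python
def has_fever_rule(temperature_c: float) -> bool:
    """Traditional programming: a human wrote this rule by hand."""
    return temperature_c > 38.0


def learn_threshold(temperatures: list[float], labels: list[bool]) -> float:
    """Toy 'training': try a cut-off between each pair of neighbouring
    readings and keep the one that classifies the most examples correctly."""
    points = sorted(set(temperatures))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def accuracy(threshold: float) -> float:
        hits = sum((t > threshold) == y for t, y in zip(temperatures, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy)


# Hypothetical labelled examples: temperature in Celsius, and whether it was a fever.
temperatures = [36.5, 36.9, 37.2, 37.8, 38.3, 38.9, 39.4, 40.1]
labels = [False, False, False, False, True, True, True, True]

# The learned threshold plays the role of the "model".
learned_threshold = learn_threshold(temperatures, labels)


def has_fever_learned(temperature_c: float) -> bool:
    """Inference: apply the learned rule to a new reading."""
    return temperature_c > learned_threshold


print(f"learned threshold: {learned_threshold:.2f}")
print(has_fever_learned(38.6), has_fever_learned(37.0))
```

Nobody typed "38" into the second version: the cut-off came from the data and the labels, which is the whole point of the diagram.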

The Three Types of Machine Learning

| Type | How it learns | Everyday example |
|---|---|---|
| Supervised Learning | From labelled examples ("this email is spam, this one isn't") | Spam filters, price prediction, medical diagnosis |
| Unsupervised Learning | Finds hidden patterns in unlabelled data | Customer segmentation, anomaly detection |
| Reinforcement Learning | Trial and error with rewards/penalties | Game-playing AI, robotics, recommendation tuning |

Key ML Vocabulary (you will hear these daily)

| Term | Plain-English meaning |
|---|---|
| Model | The "learned brain": a file containing patterns extracted from data |
| Training | The process of showing data to the algorithm so it learns |
| Features | The input characteristics the model looks at (e.g. age, income, word frequency) |
| Labels | The correct answers used during training |
| Inference | Using a trained model to make a prediction on new data |
| Overfitting | Model memorised the training data: performs well in training, badly in real life |
| Accuracy / Precision / Recall | Accuracy: share of all predictions that are right. Precision: of the items the model flagged, how many were truly positive. Recall: of the truly positive items, how many the model found |
| Dataset split | Data divided into training (learn), validation (tune), test (final exam) sets |
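To make accuracy, precision, and recall concrete, here is a small sketch that computes all three for a pretend spam filter. The ten labels and predictions are invented for illustration (1 = spam, 0 = not spam), and the function name is an assumption, not part of any library.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good mail wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good mail left alone
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical test set: 3 spam emails, 7 legitimate ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Notice that accuracy looks healthy here while precision and recall are noticeably lower. That gap is why testers rarely rely on accuracy alone when one class is rare.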

3. What Is a Neural Network? (The Bridge to LLMs)

A neural network is an ML model loosely inspired by the brain — layers of simple mathematical units ("neurons") connected together. Each connection has a weight (a number). Training adjusts millions/billions of these weights until the network produces good outputs.

flowchart LR
    subgraph Input["Input Layer"]
        I1((x1)); I2((x2)); I3((x3))
    end
    subgraph Hidden["Hidden Layers (the 'deep' in deep learning)"]
        H1((h)); H2((h)); H3((h)); H4((h))
    end
    subgraph Output["Output Layer"]
        O1((y))
    end
    I1 --> H1 & H2; I2 --> H2 & H3; I3 --> H3 & H4
    H1 --> O1; H2 --> O1; H3 --> O1; H4 --> O1

Why it matters: LLMs are gigantic neural networks (billions of weights) built on a specific architecture called the Transformer (introduced in 2017). When people say a model has "a trillion parameters", parameters = weights. Vendors rarely publish exact counts, so figures quoted for models such as GPT-4 are estimates.
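A minimal sketch of a forward pass through a network with 3 inputs, 4 hidden neurons, and 1 output may help make "weights" and "parameters" tangible. It assumes every neuron is connected to every neuron in the next layer, uses random untrained weights, and picks ReLU and sigmoid as activation functions; all of these are illustrative choices, not something the guide prescribes.

```python
import math
import random


def make_layer(n_inputs: int, n_outputs: int, rng: random.Random) -> dict:
    """One layer = a weight per connection plus one bias per neuron."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]
    return {"weights": weights, "biases": [0.0] * n_outputs}


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(layer: dict, inputs: list[float], activation) -> list[float]:
    """Each neuron computes a weighted sum of its inputs, then applies the activation."""
    outputs = []
    for neuron_weights, bias in zip(layer["weights"], layer["biases"]):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def count_parameters(layers: list[dict]) -> int:
    """Parameters = every weight and every bias in the network."""
    return sum(
        len(layer["biases"]) + sum(len(row) for row in layer["weights"])
        for layer in layers
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden_layer = make_layer(3, 4, rng)
output_layer = make_layer(4, 1, rng)

hidden_values = forward(hidden_layer, [0.5, -1.2, 3.0], relu)
prediction = forward(output_layer, hidden_values, sigmoid)[0]

print("parameters:", count_parameters([hidden_layer, output_layer]))
print("prediction:", round(prediction, 3))
```

This toy network has 21 parameters (12 + 4 in the hidden layer, 4 + 1 in the output layer). An LLM does the same kind of arithmetic with billions of them.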


4. What Is an LLM (Large Language Model)?

An LLM is a neural network trained on massive amounts of text (books, websites, code) whose core skill is deceptively simple:

Given some text, predict the next word (strictly speaking, the next token).

That's it. But done at enormous scale, this single skill produces models that can answer questions, write essays, generate code, translate languages, and reason through problems.

How an LLM Works — Step by Step

flowchart LR
    A["📝 Your prompt:<br/>'The capital of France is'"] --> B["🔢 Tokenisation<br/>Text split into tokens<br/>(word pieces → numbers)"]
    B --> C["🧠 Transformer<br/>Billions of weights process<br/>the tokens in context"]
    C --> D["🎲 Next-token prediction<br/>'Paris' = 97%<br/>'Lyon' = 1%<br/>'a' = 0.5%"]
    D --> E["📤 Output token chosen,<br/>appended, repeat<br/>until answer complete"]

  1. Tokenisation: your text is split into tokens (part-words of roughly 4 characters) and converted to numbers.
  2. Processing: the Transformer network reads all the tokens together and computes how each token relates to the others ("attention").
  3. Prediction: it outputs a probability for every possible next token.
  4. Generation: one token is picked and appended to the text, and the process repeats, one token at a time, until the answer is complete.
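
The predict-append-repeat loop can be sketched with a toy "language model" that simply counts which token follows each pair of tokens in a tiny corpus. This is a deliberate simplification: there is no Transformer here, the corpus is invented, and whitespace splitting stands in for a real sub-word tokeniser.

```python
from collections import Counter, defaultdict


def tokenise(text: str) -> list[str]:
    """Stand-in tokeniser: lowercase and split on spaces (real ones use sub-words)."""
    return text.lower().split()


def train(tokens: list[str], context_size: int = 2) -> dict:
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    return counts


def next_token_probabilities(counts: dict, context: list[str]) -> dict[str, float]:
    """Turn the raw counts for this context into probabilities."""
    options = counts.get(tuple(context), Counter())
    total = sum(options.values())
    return {tok: n / total for tok, n in options.items()} if total else {}


def generate(counts: dict, prompt: str, max_new_tokens: int = 10,
             context_size: int = 2, stop: str = ".") -> str:
    tokens = tokenise(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(counts, tokens[-context_size:])
        if not probs:
            break  # the toy model has never seen this context
        best = max(probs, key=probs.get)  # greedy: always take the most likely token
        tokens.append(best)
        if best == stop:
            break
    return " ".join(tokens)


corpus = "the capital of france is paris . the capital of italy is rome ."
model = train(tokenise(corpus))

print(next_token_probabilities(model, ["france", "is"]))
print(generate(model, "The capital of France is"))
```

A real LLM replaces the counting table with a Transformer and usually samples from the probabilities instead of always taking the top token, but the outer loop has the same shape.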

Key LLM Vocabulary

| Term | Plain-English meaning |
|---|---|
| Token | A chunk of text (~¾ of a word). "Testing" might be 1 token; "internationalisation" might be 4 |
| Prompt | The input text you send to the model |
| Context window | The maximum amount of text the model can "see" at once (e.g. 200K tokens) |
| Temperature | Randomness dial. 0 = consistent/boring, 1+ = creative/unpredictable |
| Parameters / weights | The learned numbers inside the model (billions of them) |
| Hallucination | The model confidently states something false; this is its #1 weakness |
| Fine-tuning | Additional training on specialised data to adapt a base model |
| System prompt | Hidden instructions that set the model's behaviour and role |
| Inference | Running the model to generate a response |
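
To see what the temperature dial does, here is a sketch of the usual softmax-with-temperature calculation applied to three candidate tokens. The raw scores are made up for the example, and treating temperature 0 as "always pick the top token" is a common convention that this sketch assumes rather than something every provider guarantees.

```python
import math
import random


def token_probabilities(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert raw model scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy decoding.
        best = max(scores, key=scores.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in scores}
    scaled = {tok: s / temperature for tok, s in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - top) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical raw scores for the next token after "The capital of France is".
scores = {"Paris": 5.0, "Lyon": 2.0, "a": 1.0}

for temperature in (0.0, 0.5, 1.0, 2.0):
    probs = token_probabilities(scores, temperature)
    print(temperature, {tok: round(p, 3) for tok, p in probs.items()})
```

Lower temperatures pile almost all the probability onto "Paris"; higher ones give the unlikely tokens a real chance. That is why test suites that need repeatable output usually run at a low temperature.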

The Critical Limitation of LLMs

LLMs have frozen knowledge: they only know what was in their training data, which has a cutoff date. They also know nothing about your private data — your company's documents, policies, or databases. And when asked about things they don't know, they often hallucinate — invent plausible-sounding but wrong answers.

This limitation is exactly why RAG exists.


5. What Is RAG (Retrieval-Augmented Generation)?

RAG = giving the LLM an open-book exam instead of a closed-book one.

Instead of relying on the model's frozen memory, RAG retrieves relevant documents from your own knowledge base and hands them to the LLM along with the question. The LLM then generates an answer grounded in those documents.

Retrieval: find the relevant documents
Augmented: add them to the prompt
Generation: the LLM answers using them

The RAG Pipeline

flowchart TB
    subgraph Ingestion["📥 INGESTION (offline, re-run when documents change)"]
        Docs["📄 Your documents<br/>(PDFs, wikis, policies)"] --> Chunk["✂️ Chunking<br/>split into passages"]
        Chunk --> Embed1["🔢 Embedding model<br/>each chunk → vector<br/>(list of numbers capturing meaning)"]
        Embed1 --> VDB[("🗄️ Vector Database<br/>(pgvector, Pinecone, OpenSearch)")]
    end

    subgraph Query["🔍 QUERY TIME (every user question)"]
        Q["❓ User question"] --> Embed2["🔢 Embed the question"]
        Embed2 --> Search["🎯 Similarity search<br/>find chunks closest in meaning"]
        VDB --> Search
        Search --> Rerank["📊 Re-ranking<br/>order by true relevance"]
        Rerank --> PromptBuild["📝 Build prompt:<br/>question + retrieved chunks"]
        PromptBuild --> LLM2["💬 LLM generates answer<br/>grounded in the chunks"]
        LLM2 --> Ans["✅ Answer + source citations"]
    end

    Ingestion ~~~ Query
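
The pipeline above can be sketched end to end in plain Python. Several stand-ins are assumed here: word counts play the role of a learned embedding model, a Python list plays the role of the vector database, re-ranking is skipped, and the final LLM call is left out so the sketch stops at the built prompt. The policy sentences are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag of word counts (real systems use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Ingestion: chunk the documents (already chunked here) and store their vectors.
chunks = [
    "Employees may carry over up to five unused vacation days into the next year.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The office is closed on public holidays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, index: list, top_k: int = 2) -> list[str]:
    """Query time: embed the question and return the top_k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Augmentation: put the retrieved chunks in front of the question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, start=1))
    return (
        "Answer using only the context below. Cite sources like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "How many vacation days can I carry over?"
prompt = build_prompt(question, retrieve(question, index))
print(prompt)  # this text is what would be sent to the LLM
```

Word-count matching only works when the question shares vocabulary with the document. Learned embeddings exist precisely so that "holiday allowance" can still find a chunk that says "vacation days".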

RAG Vocabulary

| Term | Plain-English meaning |
|---|---|
| Chunking | Splitting documents into bite-size passages (e.g. 500 tokens each) |
| Embedding | Converting text into a vector (a list of numbers representing its meaning) |
| Vector database | A database that finds text by similarity of meaning, not keywords |
| Similarity search | "Find me the 5 chunks whose meaning is closest to this question" |
| Re-ranking | A second, more careful pass that reorders retrieved chunks by true relevance |
| Grounding | Constraining the LLM to base its answer on the retrieved documents |
| Citation | Pointing back to the source document for each claim |
| Top-k | How many chunks to retrieve (e.g. top-5 most similar) |
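
Chunking is the easiest of these terms to show in code. The sketch below splits text into fixed-size chunks that overlap slightly, so a sentence cut at a boundary still appears whole in one of the two chunks. Counting words instead of tokens and the specific overlap value are simplifying assumptions; the overlap idea itself is a common practice rather than something stated in this guide.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap`
    words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


# Tiny demonstration with 10 made-up "words" so the boundaries are easy to see.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

An ingestion test can reuse the same idea in reverse: check that every word of the source document appears in at least one chunk, which catches silent data loss at chunk boundaries.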

6. LLM vs RAG — The Key Differences

The most important mental model: an LLM is a component; RAG is a system that uses an LLM as one of its parts.

flowchart LR
    subgraph LLMOnly["💬 LLM alone (closed book)"]
        UQ1[Question] --> M1[LLM memory only] --> A1[Answer from<br/>training data]
    end
    subgraph RAGSys["📚 RAG system (open book)"]
        UQ2[Question] --> Ret[Retrieve your documents] --> M2[LLM + documents] --> A2[Answer grounded<br/>in YOUR data + citations]
    end
    LLMOnly ~~~ RAGSys

| Dimension | LLM alone | RAG system |
|---|---|---|
| What it is | A single model | A pipeline: retriever + vector DB + LLM |
| Knowledge source | Frozen training data | Your live documents |
| Knowledge cutoff | Yes: stuck at training date | No: update documents anytime |
| Private/company data | Knows nothing about it | Built specifically for it |
| Hallucination risk | High when asked beyond its knowledge | Lower, because answers are grounded in retrieved text (but not zero!) |
| Citations | Cannot truly cite sources | Can cite the exact source chunk |
| Cost to update knowledge | Retraining/fine-tuning (expensive) | Re-index documents (cheap) |
| Analogy | Closed-book exam from memory | Open-book exam with the right pages found for you |
| Failure modes | Hallucination, outdated facts | Bad retrieval, bad chunking, ignoring context |

When to use which:

  • LLM alone — general knowledge, creative writing, code generation, summarising text you paste in
  • RAG — answering questions about your documents: company policies, product manuals, legal contracts, internal wikis

7. Testing an LLM vs Testing a RAG System

This is where QA comes in. The fundamental shift: traditional software is deterministic (same input → same output, pass/fail). LLMs are non-deterministic (same input → different valid outputs). You can't write assertEquals on an essay.

Traditional QA asks: "Does the output equal X?"
AI QA asks: "Is the output good enough, and how do we measure that?"
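
One way to picture the shift: instead of comparing the output to a single expected string, you check properties that every acceptable answer must have, and you report a pass rate across several runs. The sketch below assumes three sampled answers to the same question (hard-coded here, since no model is being called) and a hypothetical helper named `check_answer`.

```python
def check_answer(answer: str, must_mention: list[str], max_words: int = 40) -> dict[str, bool]:
    """Property checks that any acceptable answer should satisfy."""
    text = answer.lower()
    return {
        "not_empty": bool(answer.strip()),
        "mentions_required_facts": all(term.lower() in text for term in must_mention),
        "within_length_limit": len(answer.split()) <= max_words,
    }


# Hypothetical outputs from asking "What is the capital of France?" three times.
sampled_answers = [
    "Paris is the capital of France.",
    "The capital of France is Paris, which sits on the Seine.",
    "France's capital is Lyon.",
]

results = [check_answer(answer, must_mention=["Paris"]) for answer in sampled_answers]
passed = [all(checks.values()) for checks in results]
pass_rate = sum(passed) / len(passed)

for answer, ok in zip(sampled_answers, passed):
    print("PASS" if ok else "FAIL", "-", answer)
print(f"pass rate: {pass_rate:.2f}")
```

The first two answers are worded differently and both pass, which an exact-match assertion could never express. The third is fluent but wrong, and the fact check catches it.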

7a. Testing an LLM (the model itself)

You evaluate the quality of generation:

| What you test | Question it answers | How |
|---|---|---|
| Correctness / factuality | Are the facts right? | Golden datasets, LLM-as-Judge scoring |
| Hallucination rate | How often does it invent things? | Compare claims against known ground truth |
| Consistency | Same question twice → compatible answers? | Repeat runs, measure variance |
| Instruction following | Did it obey the prompt format/constraints? | Schema validation, rule checks |
| Safety / toxicity | Can it be made to say harmful things? | Red-teaming, adversarial prompts, jailbreak attempts |
| Prompt injection resistance | Can hidden instructions hijack it? | Inject malicious instructions in inputs |
| Bias | Does it treat demographic groups equally? | Counterfactual testing (swap names/genders, compare) |
| Performance | Latency, token cost per request | Standard perf metrics + cost tracking |

Key tools: DeepEval, promptfoo, Garak (security), custom pytest harnesses.
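
As an example of the "instruction following" row, here is a sketch of a schema check you might put in a custom pytest harness. It assumes the prompt asked the model to reply with a JSON object containing a `verdict` of "pass" or "fail" and a non-empty `reason`; that contract and the function name are hypothetical.

```python
import json

ALLOWED_VERDICTS = ("pass", "fail")  # hypothetical contract stated in the prompt


def validate_response(raw: str) -> list[str]:
    """Return a list of problems with the model's reply (empty list = valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    if data.get("verdict") not in ALLOWED_VERDICTS:
        errors.append(f"verdict must be one of {ALLOWED_VERDICTS}")
    reason = data.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        errors.append("reason must be a non-empty string")
    return errors


# Hypothetical model replies: one that obeys the format, two that do not.
replies = [
    '{"verdict": "pass", "reason": "matches the specification"}',
    "Sure! Here is the JSON you asked for...",
    '{"verdict": "maybe"}',
]
for reply in replies:
    print(validate_response(reply) or "OK")
```

Checks like this are deterministic even though the model is not, so they fit naturally into an ordinary unit-test suite.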

7b. Testing a RAG System (the whole pipeline)

You test every stage, because many RAG failures turn out to be retrieval failures rather than LLM failures:

flowchart LR
    T1["1️⃣ Ingestion testing<br/>chunks complete?<br/>no data loss?"] --> T2["2️⃣ Embedding testing<br/>similar texts close together?<br/>model regression on update?"]
    T2 --> T3["3️⃣ Retrieval testing<br/>right chunks found?<br/>precision / recall@k"]
    T3 --> T4["4️⃣ Re-ranking testing<br/>best chunk ranked first?<br/>MRR / NDCG"]
    T4 --> T5["5️⃣ Generation testing<br/>faithful to the chunks?<br/>relevant to the question?"]
    T5 --> T6["6️⃣ End-to-end testing<br/>golden Q&A datasets<br/>full pipeline + citations"]

| Stage | What can go wrong | Key metrics |
|---|---|---|
| Chunking | Sentences cut mid-thought; tables mangled; content lost | Chunk completeness, boundary checks |
| Embedding | New embedding model shifts all similarities | Similarity regression suite |
| Retrieval | Right answer exists but isn't retrieved | Context Recall, Context Precision, Recall@k |
| Re-ranking | Relevant chunk retrieved but ranked low | MRR (Mean Reciprocal Rank), NDCG |
| Generation | LLM ignores chunks or contradicts them | Faithfulness, Answer Relevancy |
| End-to-end | Everything individually fine but answer still wrong | Golden dataset pass rate, citation accuracy |
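
The retrieval and re-ranking metrics in the table are simple enough to compute by hand. The sketch below implements Recall@k, Precision@k, and MRR over lists of chunk IDs; the IDs and the "relevant" sets are invented, and NDCG is left out to keep the example short.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of all relevant chunks, what fraction appears in the top k results?"""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top k results, what fraction is relevant?"""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result (0 if none was retrieved)."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over several queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical golden data: what the retriever returned vs. what it should have found.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print("recall@3:", round(recall_at_k(retrieved, relevant, 3), 3))
print("recall@5:", round(recall_at_k(retrieved, relevant, 5), 3))
print("precision@5:", precision_at_k(retrieved, relevant, 5))
print("MRR:", mean_reciprocal_rank([(retrieved, relevant), (["c3", "c5"], {"c3"})]))
```

In this example chunk c8 is relevant but never retrieved, so recall cannot reach 1.0 no matter how good the LLM is. That is the "right answer exists but isn't retrieved" failure from the table.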

The four core RAGAS metrics (RAGAS is a widely used open-source evaluation framework for RAG):

| Metric | Question it answers |
|---|---|
| Faithfulness | Is the answer supported by the retrieved context? (anti-hallucination) |
| Answer Relevancy | Does the answer actually address the question? |
| Context Precision | Of the chunks retrieved, how many were actually relevant (and are they ranked near the top)? |
| Context Recall | Of the relevant chunks that exist, how many were retrieved? |

Key tools: RAGAS, DeepEval, custom golden datasets, CI/CD threshold gates (e.g. faithfulness must be ≥ 0.85 or build fails).
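
A CI/CD threshold gate can be as small as the sketch below. It assumes the metric scores have already been computed by your evaluation tool and loaded into a dictionary. The 0.85 faithfulness threshold comes from the example above; the other threshold, the scores, and the function names are made up for illustration.

```python
# Minimum acceptable score per metric. 0.85 is the example from the text;
# the answer_relevancy value is a hypothetical placeholder.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, tuple]:
    """Return every metric that is missing or below its threshold."""
    return {
        name: (scores.get(name), minimum)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return a process exit code: 0 lets the build continue, 1 fails it."""
    failures = failing_metrics(scores, thresholds)
    for name, (actual, minimum) in failures.items():
        print(f"FAIL {name}: got {actual}, need >= {minimum}")
    if not failures:
        print("All evaluation thresholds met.")
    return 1 if failures else 0


# Hypothetical scores from two evaluation runs.
good_run = {"faithfulness": 0.91, "answer_relevancy": 0.84}
bad_run = {"faithfulness": 0.80, "answer_relevancy": 0.84}

print("exit code:", gate(good_run, THRESHOLDS))
print("exit code:", gate(bad_run, THRESHOLDS))
```

In a pipeline script you would typically end with `sys.exit(gate(scores, THRESHOLDS))`, since most CI systems treat a non-zero exit code as a failed step. A missing metric counts as a failure here, so a broken evaluation run cannot pass silently.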

7c. LLM Testing vs RAG Testing — Summary

| Dimension | LLM testing | RAG testing |
|---|---|---|
| Scope | One model | Whole pipeline (6+ stages) |
| Biggest risk | Hallucination, unsafe output | Retrieval failure (right doc never found) |
| Where bugs hide | The model, the prompt | Chunking, embeddings, search, ranking, AND the model |
| Ground truth | General-knowledge golden sets | Golden Q&A pairs built from YOUR documents |
| Signature metrics | Hallucination rate, safety, consistency | Faithfulness, context recall/precision |
| Mindset | "Is the generation good?" | "Did we find the right evidence AND use it correctly?" |

8. Beyond RAG — Agents (The Next Step)

Once you understand LLMs and RAG, the next concept you'll meet is agents: LLMs given tools (search, code execution, APIs) and the autonomy to take multiple steps toward a goal.

flowchart LR
    LLMBlock["💬 LLM<br/>answers in one shot"] --> RAGBlock["📚 RAG<br/>answers using your documents"] --> AgentBlock["🤖 AGENT<br/>plans → uses tools → acts → loops<br/>until the task is done"]

| | LLM | RAG | Agent |
|---|---|---|---|
| Does what | Generates a reply | Generates a grounded reply | Completes a multi-step task |
| Example | "Explain RBQM" | "What does our SOP say about RBQM?" | "Find all SOPs updated this month, summarise changes, email the team" |
| Testing focus | Output quality | Pipeline + output quality | Tool-call correctness, step tracing, task completion, safety boundaries |
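
The plan, use tools, act, loop cycle can be sketched without any model at all. In the toy below, a scripted planner stands in for the LLM and three fake tools stand in for real integrations; every function name, tool name, and SOP ID is hypothetical. What matters is the recorded trace, because that is what agent tests assert against.

```python
def search_sops(month: str) -> list[str]:
    """Fake tool: pretend to look up SOPs updated in the given month."""
    return ["SOP-12", "SOP-31"]


def summarise(doc_ids: list[str]) -> str:
    """Fake tool: pretend to summarise the changes."""
    return f"{len(doc_ids)} SOPs changed: " + ", ".join(doc_ids)


def send_email(body: str) -> str:
    """Fake tool: pretend to email the team."""
    return "sent"


TOOLS = {"search_sops": search_sops, "summarise": summarise, "send_email": send_email}


def scripted_planner(history: list[dict]):
    """Stand-in for the LLM: decide the next tool call from what has happened so far."""
    if len(history) == 0:
        return ("search_sops", "this month")
    if len(history) == 1:
        return ("summarise", history[0]["result"])
    if len(history) == 2:
        return ("send_email", history[1]["result"])
    return None  # task complete


def run_agent(planner, tools: dict, max_steps: int = 5) -> list[dict]:
    """Loop: ask the planner for a step, run the tool, record it, repeat."""
    history = []
    for _ in range(max_steps):  # safety boundary: never loop forever
        decision = planner(history)
        if decision is None:
            break
        name, argument = decision
        if name not in tools:
            raise ValueError(f"planner requested unknown tool: {name}")
        result = tools[name](argument)
        history.append({"tool": name, "argument": argument, "result": result})
    return history


trace = run_agent(scripted_planner, TOOLS)
for step in trace:
    print(step["tool"], "->", step["result"])
```

With a trace like this, the testing focus from the table becomes concrete: assert that the right tools were called in the right order, that the step limit was respected, and that the final step completed the task.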

Agent testing is covered in depth in LLM & Agent Evaluation Matrix and Autonomous QA Multi-Agent Pipeline.


9. Glossary — One-Line Definitions

| Term | Definition |
|---|---|
| AI | Machines mimicking human intelligence |
| ML | Machines learning patterns from data instead of explicit rules |
| Deep Learning | ML with multi-layer neural networks |
| Neural network | Layers of weighted connections that learn by adjusting weights |
| Transformer | The neural architecture behind nearly all modern LLMs (2017) |
| LLM | A huge transformer trained on text; predicts the next token |
| Token | A chunk of text (~¾ word); the unit LLMs read and write |
| Prompt | The input you give an LLM |
| Context window | Max text an LLM can consider at once |
| Temperature | Randomness control for generation |
| Hallucination | Confident but false output |
| Fine-tuning | Further training a model on specialised data |
| Embedding | Text converted to a meaning-vector |
| Vector database | Stores embeddings; searches by similarity of meaning |
| Chunking | Splitting documents into passages for retrieval |
| RAG | Retrieve relevant docs → give to LLM → grounded answer |
| Faithfulness | Whether an answer is supported by its retrieved context |
| Context recall | Whether the retriever found all the relevant chunks |
| Golden dataset | Curated Q&A pairs with known-correct answers, used for evaluation |
| LLM-as-Judge | Using a strong LLM to score another LLM's outputs |
| Red-teaming | Deliberately attacking a model to find unsafe behaviour |
| Prompt injection | Hiding malicious instructions in input data to hijack a model |
| Agent | An LLM that plans and uses tools across multiple steps |
| MCP | Model Context Protocol, a standard for connecting LLMs to tools/data |

10. Where to Go Next

Recommended reading order in this portal:

  1. You are here — AI Fundamentals ✅
  2. QA Evolution — Testing Intelligence — why QA is changing
  3. RAG vs Agents vs Agentic RAG — architecture deep-dive
  4. LLM Testing Lifecycle — the full testing process
  5. RAG Automation Testing Roadmap — stage-by-stage RAG test plan
  6. Ragas FAQ and DeepEval FAQ — the tools
  7. Prompt Injection — Complete Guide — security testing