AI Concepts
Key concepts that matter when talking about AI. This page covers tokenisation, chunking, vectors, embeddings, retrieval, and generation.
In a nutshell:
Break text into smaller parts (tokenisation & chunking)
Map text into numbers that capture meaning (vectors & embeddings)
Use those numbers to either find similar text (retrieval) or create new text (generation)
Split Text
Break text into processable units. Two main approaches: tokenisation and chunking.
Subword tokens: work best for generation and low-resource scenarios. They handle unknown terms well. Individual tokens carry little meaning on their own.
Words: give precise units for tagging and entity recognition. They miss context and can be ambiguous.
Sentences: provide natural meaning units. Good for semantic search and Q&A. They require a sentence splitter and may miss wider document context.
Paragraphs: excel in long document retrieval and RAG systems. Rich context, fewer vectors. Trade precision for context.
Tokenisation
Tokenisation chops text into units your model recognises. Modern LLMs use subword tokenisation because:
Handles unknown words
Reduces vocabulary size
Enables generalisation to unseen words
Try https://tiktokenizer.vercel.app to see how it works.
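A minimal sketch of subword tokenisation, assuming OpenAI's tiktoken library is installed (the same tokeniser family the site above visualises):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("Tokenisation handles unknown words gracefully.")
print(token_ids)                             # integer IDs the model actually sees
print([enc.decode([t]) for t in token_ids])  # the subword pieces behind them
```

Common words usually stay whole; rarer words split into several subword pieces, which is how the tokeniser copes with vocabulary it has never seen.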
Chunking
Chunking breaks long text into meaningful spans. Think sentences, paragraphs, or sliding windows.
Larger chunks add context but dilute precision
Smaller chunks fragment meaning
Overlapping chunks improve recall but cost more
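As a sketch of the overlapping-window idea, here is a simple word-level chunker; the window and overlap sizes are illustrative assumptions you would tune per application:

```python
def chunk_words(text: str, window: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = window - overlap  # how far the window advances each time
    return [" ".join(words[i:i + window]) for i in range(0, len(words), step)]

demo = "one two three " * 10           # 30 words of toy text
chunks = chunk_words(demo, window=8, overlap=2)
print(len(chunks), chunks[0])           # 5 chunks, each sharing 2 words with the next
```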
Tokenisation vs Chunking
Purpose: tokenisation translates text to model units; chunking creates retrievable segments.
Driven by: tokenisation is driven by the model architecture; chunking by application logic.
Granularity: tokenisation works at subword or word level; chunking at sentence or paragraph level.
Flexibility: tokenisation is fixed once chosen; chunking is highly flexible.
Text to Numbers
Convert text to numbers for computational processing.
Vectors
A vector is a list of numbers. Example: [0.23, -1.7, 3.14, ...]
🤔 Are your vectors stable? Will the same text always produce the same embedding? (Answered in the FAQ below.)
Embeddings
An embedding is a vector trained to reflect meaning and capture semantic relationships.
Use cases: Semantic search, Recommendations, Clustering, Personalisation
Vector vs Embedding:
"Vector" = what it is (list of numbers)
"Embedding" = what it means (vector trained for semantic meaning)

Vector Spaces
Embedding Space = precomputed vectors optimised for similarity and retrieval.
Latent Space = dynamic internal representations during model inference. Ephemeral, not stored.
Two Main Use Cases
1. Retrieval Path (Embedding Model)
Finds similar documents or information.
Used for:
Semantic search
RAG systems
Clustering
Recommendation systems
Characteristics:
Stable, reusable
Optimised for similarity
Indexing:
Text
→ Tokenise
→ Embedding model
→ Fixed-size vector
→ Stored in vector database
Query processing:
→ User query
→ Embedding model
→ Vector
→ Compare (cosine similarity)
→ Find similar texts
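A minimal sketch of the compare step with plain NumPy; the document vectors here are toy values standing in for real embedding-model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy precomputed vectors standing in for a vector database
doc_vectors = {
    "refund policy":  np.array([0.9, 0.1, 0.2]),
    "shipping times": np.array([0.1, 0.8, 0.3]),
}
query = np.array([0.85, 0.15, 0.25])  # the embedded user query

ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query, doc_vectors[d]),
                reverse=True)
print(ranked)  # most similar document first
```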
2. Generation Path (LLM Context)
Produces coherent, fluent text.
Used for:
Original content creation
Conversational Q&A
Text summarisation
Creative writing
Characteristics:
Context-sensitive
Ephemeral (recomputed on every pass)
Not stable, reusable embeddings
Prompt
→ Tokenise
→ Map tokens to vectors
→ Process through transformer layers
→ Generate next tokens based on context
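As an illustrative sketch of the generation path using Hugging Face's transformers pipeline (gpt2 chosen only because it is small and public):

```python
from transformers import pipeline  # pip install transformers torch

generator = pipeline("text-generation", model="gpt2")
result = generator("Embeddings map text to", max_new_tokens=20)
print(result[0]["generated_text"])  # prompt plus freshly generated continuation
```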
FAQ
Are your vectors stable? Will the same text always produce the same embedding?
For text retrieval, we typically use static embedding models. These are deterministic: the same text always produces the same embedding vector—so we can precompute and store them in a vector database.
By contrast, contextual embeddings inside LLMs change based on surrounding words, making them unsuitable for direct retrieval.
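A quick check of that determinism, reusing the sentence-transformers sketch from earlier (the model name is an example):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
v1 = model.encode("the same text")
v2 = model.encode("the same text")
print(np.allclose(v1, v2))  # True: a static embedding model is deterministic
```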
Can we reuse embeddings trained for one task for another task?
It's tempting to think so, since embeddings all look like vectors of numbers. But embeddings are task-specific: they're trained to capture the relationships and patterns that matter for a particular goal.
Embeddings optimised for entity recognition cluster tokens by label patterns, not by broad semantic meaning. Semantic search embeddings, on the other hand, are trained to place semantically similar texts close together in vector space.
So while you can technically reuse them, the quality of your search results will likely suffer. It's usually best to use an embedding model trained (or fine-tuned) specifically for semantic similarity and retrieval.