AI Concepts

Key concepts that matter when talking about AI. This covers tokens, chunking, vectors, embeddings, retrieval, and generation.

In a nutshell:

  1. Break text into smaller parts (tokenisation & chunking)

  2. Map text into numbers that capture meaning (vectors & embeddings)

  3. Use those numbers to either find similar text (retrieval) or create new text (generation)

Split Text

Break text into processable units. Two main approaches: tokenisation and chunking.

Subword tokens: work best for generation and low-resource scenarios. They handle unknown terms well, but individual tokens carry little meaning on their own.

Words: give precise units for tagging and entity recognition, but they miss context and can be ambiguous.

Sentences: provide natural meaning units, good for semantic search and Q&A. They require a sentence-splitting step and may miss wider document context.

Paragraphs: excel in long-document retrieval and RAG systems, offering rich context with fewer vectors. They trade precision for context.

Tokenisation

Tokenisation chops text into units your model recognises. Modern LLMs use subword tokenisation because:

  • Handles unknown words

  • Reduces vocabulary size

  • Enables generalisation to unseen words

Try https://tiktokenizer.vercel.app to see how it works.
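To see the same idea in code, here is a minimal sketch using the tiktoken library ("cl100k_base" is just one common encoding; the exact token IDs and splits are illustrative):

    # Subword tokenisation with the tiktoken library.
    # "cl100k_base" is one common encoding; exact IDs and splits are illustrative.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenisation handles unfamiliar words gracefully."
    token_ids = enc.encode(text)                    # list of integer token IDs
    tokens = [enc.decode([t]) for t in token_ids]   # each ID mapped back to its subword

    print(token_ids)
    print(tokens)   # rarer words get split into several subword pieces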

Chunking

Chunking breaks long text into meaningful spans. Think sentences, paragraphs, or sliding windows.

  • Larger chunks add context but dilute precision

  • Smaller chunks fragment meaning

  • Overlapping chunks improve recall but cost more
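To make those trade-offs concrete, here is a minimal sketch of a sliding-window chunker; the chunk_size and overlap values are arbitrary illustrations, not recommendations:

    # Minimal sliding-window chunker: fixed-size chunks with overlap.
    # Sizes are counted in words for simplicity; real pipelines often count tokens.
    def chunk_text(text, chunk_size=100, overlap=20):
        words = text.split()
        step = max(1, chunk_size - overlap)
        chunks = []
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + chunk_size]))
            if start + chunk_size >= len(words):
                break
        return chunks

    # Larger chunk_size keeps more context per chunk; larger overlap improves
    # recall but produces more chunks to embed and store.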

Tokenisation vs Chunking

Aspect         Tokenisation                     Chunking
Purpose        Translate text to model units    Create retrievable segments
Driven by      Model architecture               Application logic
Granularity    Subword, word                    Sentence, paragraph
Flexibility    Fixed once chosen                Highly flexible

Text to Number

Convert text to numbers for computational processing.

Vectors

A vector is a list of numbers. Example: [0.23, -1.7, 3.14, ...]

🤔 Are your vectors stable? Will the same text always produce the same embedding?

Embeddings

An embedding is a vector trained to reflect meaning and capture semantic relationships.

Use cases: Semantic search, Recommendations, Clustering, Personalisation

Vector vs Embedding:

  • "Vector" = what it is (list of numbers)

  • "Embedding" = what it means (vector trained for semantic meaning)


Vector Spaces

Embedding Space = precomputed vectors optimised for similarity and retrieval.

Latent Space = dynamic internal representations during model inference. Ephemeral, not stored.

Two Main Use Cases

1. Retrieval Path (Embedding Model)

Finds similar documents or information.

Used for:

  • Semantic search

  • RAG systems

  • Clustering

  • Recommendation systems

Characteristics:

  • Stable, reusable

  • Optimised for similarity

Indexing:
    Text
    → Tokenise
    → Embedding model
    → Fixed-size vector
    → Stored in vector database

Query processing:
    User query
    → Embedding model
    → Vector
    → Compare (cosine similarity)
    → Find similar texts
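A minimal sketch of this comparison step, again assuming sentence-transformers for the embeddings and numpy for cosine similarity (the documents and query are placeholders):

    # Retrieval sketch: embed documents once, embed the query, rank by cosine similarity.
    # Assumes sentence-transformers and numpy; texts are placeholders.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "Reset your password from the account settings page.",
        "Our office is closed on public holidays.",
        "Contact support if you are locked out of your account.",
    ]
    doc_vecs = model.encode(docs)             # in practice: precomputed and stored in a vector DB

    query_vec = model.encode(["I forgot my password"])[0]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [cosine(query_vec, d) for d in doc_vecs]
    best = int(np.argmax(scores))
    print(docs[best], round(scores[best], 3))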

2. Generation Path (LLM Context)

Produces coherent, fluent text.

Used for:

  • Original content creation

  • Conversational Q&A

  • Text summarisation

  • Creative writing

Characteristics:

  • Context-sensitive

  • Ephemeral (exist only during inference)

  • Do not yield stable, reusable embeddings

Prompt
  → Tokenise
  → Map tokens to vectors
  → Process through transformer layers
  → Generate next tokens based on context
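A hedged sketch of this path using the Hugging Face transformers pipeline; gpt2 is only a small illustrative model, and sampled output will differ from run to run:

    # Generation sketch: the pipeline tokenises the prompt, maps tokens to vectors,
    # runs them through transformer layers, and samples next tokens.
    # "gpt2" is only a small illustrative model choice.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    result = generator(
        "Embeddings are useful because",
        max_new_tokens=30,
        do_sample=True,      # sampling makes output vary run to run
    )
    print(result[0]["generated_text"])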

FAQ

Are your vectors stable? Will the same text always produce the same embedding?

For text retrieval, we typically use static embedding models. These are deterministic: the same text always produces the same embedding vector—so we can precompute and store them in a vector database.

By contrast, contextual embeddings inside LLMs change based on surrounding words, making them unsuitable for direct retrieval.
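A quick way to check the determinism claim in practice, assuming the same sentence-transformers model as in the earlier sketches (minor floating-point differences can occur across hardware or library versions, hence np.allclose rather than exact equality):

    # Determinism check: encoding the same text twice yields (numerically) the same vector.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    v1 = model.encode("The same text always maps to the same vector.")
    v2 = model.encode("The same text always maps to the same vector.")

    print(np.allclose(v1, v2))   # expected: True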

Can we reuse embeddings trained for one task for another task?

It's tempting to think so, since embeddings all look like vectors of numbers. But embeddings are task-specific: they're trained to capture the relationships and patterns that matter for a particular goal.

Embeddings optimised for entity recognition cluster tokens by label patterns, not by broad semantic meaning. Semantic search embeddings, on the other hand, are trained to place semantically similar texts close together in vector space.

So while you can technically reuse them, the quality of your search results will likely suffer. It's usually best to use an embedding model trained (or fine-tuned) specifically for semantic similarity and retrieval.
