πŸ” What You’ll Learn Today

  • How unstructured text is converted into structured sequences

  • The role of tokenization, vocabulary, and n-grams in NLP pipelines

  • Hands-on examples with a link to an interactive Colab notebook

  • πŸ”— Explore More AI/ML Blogs

    Visit SlayItCoder Blogsite for more practical AI/ML tutorials, weekend logs, and hands-on coding guides.


🧩 1. From Raw Text to Sequence

Think of raw text like uncut marble β€” beautiful, but unstructured. Machines can’t make sense of it until we break it down.

Example:

"The cat sat on the mat." β†’ ["the", "cat", "sat", "on", "the", "mat"] β†’ [0, 1, 2, 3, 0, 4]

🎯 Try this: Take a sentence from your favorite movie and break it into words. Then assign numbers to each word!


βœ‚οΈ 2. Tokenization: Slicing Sentences Into Symbols

Tokenization breaks text into manageable pieces called tokens.

Types of Tokenization:

  • Word: "I love NLP" β†’ ["I", "love", "NLP"]

  • Subword (BPE): "unhappiness" β†’ ["un", "happiness"] (exact splits depend on the trained vocabulary)

  • Character: "NLP" β†’ ["N", "L", "P"]

🎯 Try this: Use Python to tokenize a string into characters, words, and subwords. Which gives more flexibility?


πŸ“– 3. Vocabulary: Teaching Words to a Machine

A vocabulary maps each unique token to a unique ID.

Example:

Corpus: ["I love NLP", "NLP is fun"] Vocabulary: {"I": 0, "love": 1, "NLP": 2, "is": 3, "fun": 4}

🎯 Try this: Build a vocabulary from your recent chat or email thread.


πŸ“Š 4. N-Grams: Short-Term Memory in Text

N-grams capture local context by grouping consecutive words together.

Example:

Sentence: "I love NLP"

  • Unigrams: ["I", "love", "NLP"]

  • Bigrams: ["I love", "love NLP"]

  • Trigrams: ["I love NLP"]
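
A small sketch that generates all three of the lists above from the same sentence; the helper function name is just an illustration:

```python
def ngrams(tokens, n):
    """Return all n-grams from a token list, each joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love NLP".split()

print(ngrams(tokens, 1))  # ['I', 'love', 'NLP']
print(ngrams(tokens, 2))  # ['I love', 'love NLP']
print(ngrams(tokens, 3))  # ['I love NLP']
```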

🎯 Try this: Create bigrams manually from any sentence. Notice how each bigram carries more local context than a single word.


πŸ” Summary Flow

Raw Text β†’ Cleaned β†’ Tokenized β†’ Vocabulary IDs β†’ N-Grams β†’ Model Input
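
As a rough end-to-end sketch of that flow (the cleaning rules and toy sentence are assumptions for illustration, not a production pipeline):

```python
import re

raw = "The cat sat on the mat."

# Clean: lowercase and strip everything except letters and spaces
cleaned = re.sub(r"[^a-z\s]", "", raw.lower())

# Tokenize: word-level split
tokens = cleaned.split()

# Vocabulary IDs
vocab = {}
for t in tokens:
    vocab.setdefault(t, len(vocab))
ids = [vocab[t] for t in tokens]

# Bigrams over the ID sequence: a simple form of model input
bigrams = list(zip(ids, ids[1:]))

print(tokens)   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)      # [0, 1, 2, 3, 0, 4]
print(bigrams)  # [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4)]
```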


🎯 Try It Yourself in Colab

We’ve bundled all the code into a hands-on Colab notebook for you to explore.

πŸ‘‰ Open Notebook: Tokenization & N-Grams ➜


πŸ“Œ Action Checklist

  • Tokenize a tweet using word-level and character-level tokenization

  • Build a vocabulary from any document

  • Generate bigrams and trigrams from your own sentence

  • Compare how sequence length and vocabulary size change with different token types


πŸ“š Weekend Read

Speech and Language Processing by Jurafsky & Martin


πŸ“Œ Coming Next: Autoregressive Models & Time Series Forecasting

We’ve structured the text. Next, let’s learn how machines predict the next value or word in a sequence.

πŸ‘‰ Autoregressive Models Explained