How to Detect Forbidden Words in Text (Without Slowing Down)

In our previous blog post, we adopted the Trie (Prefix-tree) data structure to address the use case of detecting forbidden words in texts. This structure enabled us to achieve faster execution times for this task compared to the initial solution of using Ruby’s Array.includes? method for large datasets. In the worst case, the array-based solution required searching through the entire collection of forbidden words to find a match, whereas the Trie structure enabled searches based on the length of the word being queried.

However, like everything in life, the Trie structure has its drawbacks. Its main disadvantage arises in situations where storage space for the dataset is limited, and we cannot afford to store each character of every word in a separate node of the tree. To address this, in this article, we will explore how to implement the Radix Tree, a tree-based data structure that solves the limited storage space issue by compressing the tree.

How Does the Radix Tree Work?

The Radix Tree is a compressed version of the Trie. While a standard Trie uses one node for each character in a word (e.g., “c” → “a” → “s” → “a” for “casa”), the Radix Tree is more efficient: it can combine entire sequences of characters (substrings) into a single node as long as there are no branches. This results in lower memory usage, making it ideal for storing large sets of words, such as in API route collections or autocomplete systems.

Think of the Radix Tree as a family tree for words: instead of repeating shared parts (e.g., “Rom” in “Romane” and “Romulus”), it stores these shared parts only once and branches only when the words diverge.

To better understand, let’s look at the illustration below to see how the Radix Tree is structured.

As we can see, the nodes represent branching points or word endings. If a path has no branches, it is compressed into a single node.

General Structure

First branch: The edge “ROM” leads to a central node (all words start with “ROM”).
Branch to “AN” (left): Represents words starting with “ROMAN”.
- Edge “E” → End of the word “ROMANE” (full path: ROM-AN-E).
- Edge “US” → End of the word “ROMANUS” (ROM-AN-US).
Edge “ULUS” (right) → End of the word “ROMULUS” (ROM-ULUS).

This compression avoids creating separate nodes for each character, unlike the Trie discussed previously. For example, in “Romulus,” “ulus” is a single edge, not r-o-m-u-l-u-s. If a path has no branches, it is compressed into a single edge, reducing the number of nodes and, consequently, memory usage.

Show Me the Code

Before we dive in, it’s worth noting that implementing a Radix Tree is somewhat more complex than a Trie. However, I hope to make the understanding of this structure’s functionality at the code level clearer throughout this section.

For the Radix Tree implementation, we’ll use Ruby, just like with the Trie, and implement the same methods: insert and search. The concepts are language-agnostic and can be applied in languages like JavaScript or Java.

The code has two main parts: a class for the tree’s “nodes” (RadixNode) and another for the entire tree (Radix). Let’s break it down.

The `RadixNode` Class: The Building Blocks of the Tree

class RadixNode
  def initialize(prefix = "")
    @children = {}
    @prefix = prefix
    @is_end_of_word = false
  end

  attr_accessor :children, :prefix, :is_end_of_word
end

Each node is like a leaf on the tree. Think of a real tree: the root, branches, and leaves. Here’s what we have:

def initialize(prefix = ""): Constructor that creates a new node.
- @children = {}: A dictionary (like a map) to store the “children” (next nodes). The key is the first character of the next segment of the word.
- @prefix = prefix: The “prefix” stored in this node, which can be a sequence of letters (e.g., “rom”).
- @is_end_of_word = false: A flag indicating whether this node marks the end of a complete word.

The `Radix` Class: The Complete Tree

This class handles grouping all the nodes together.

class Radix
  def initialize(words = nil)
    @root = RadixNode.new

    words&.each { |word| insert(word) }
  end

  def insert(word)
    insert_recursive(@root, word)
  end

  def search(word)
    search_recursive(@root, word)
  end
  # ...
end

def initialize(words = nil): Creates the tree.
- @root = RadixNode.new: Starts with an empty root node.
- If you pass a list of words (e.g., ["romane", "romanus"]), it inserts them all at once.
def insert(word): Adds a word to the tree. It calls a recursive helper (insert_recursive) to do the heavy lifting.
def search(word): Checks if a word exists in the tree. It uses a recursive helper (search_recursive) that returns true if found, false otherwise.

Now, let’s look at the private methods that actually handle insertion and searching in the tree.

The Helper Method: `common_prefix_length`

def common_prefix_length(str1, str2)
  min_length = [str1.length, str2.length].min
  i = 0
  i += 1 while i < min_length && str1[i] == str2[i]
  i
end

This method finds how many characters at the beginning are the same between two strings. Example: For “romane” and “romanus,” the common prefix is “roman” (5 letters). It compares characters one by one until it finds a difference or reaches the end of the shorter string.

The key to “compression”: the Radix Tree combines identical segments into a single node.

Inserting Words: `insert_recursive(node, word)`

This method is recursive because it “descends” through the tree until the word is fully inserted. Let’s imagine inserting “romane,” “romanus,” and “romulus” as an example (classic words to demonstrate a Radix Tree).

  def insert_recursive(node, word)
    return if word.empty?

    if node.prefix.empty?
      first_char = word[0]
      if node.children.key?(first_char)
        insert_recursive(node.children[first_char], word)
      else
        new_node = RadixNode.new(word)
        new_node.is_end_of_word = true
        node.children[first_char] = new_node
      end

      return
    end

    common_len = common_prefix_length(node.prefix, word)
    common = node.prefix[0, common_len]

    if common_len == node.prefix.length
      # Current node's prefix is fully matched
      remaining = word[common_len..-1] || ""
      if remaining.empty?
        node.is_end_of_word = true
      else
        first_char = remaining[0]
        if node.children.key?(first_char)
          insert_recursive(node.children[first_char], remaining)
        else
          new_node = RadixNode.new(remaining)
          new_node.is_end_of_word = true
          node.children[first_char] = new_node
        end
      end
    else
      # Split the current node
      remaining_node_prefix = node.prefix[common_len..-1]
      remaining_word = word[common_len..-1] || ""

      # Update current node to common prefix
      new_node = RadixNode.new(common)
      new_node.is_end_of_word = false
      new_child = RadixNode.new(remaining_node_prefix)
      new_child.is_end_of_word = node.is_end_of_word
      new_child.children = node.children

      # Replace the current node's content
      node.prefix = common
      node.is_end_of_word = false
      node.children = {}
      node.children[remaining_node_prefix[0]] = new_child

      # Add the remaining word if it exists
      unless remaining_word.empty?
        new_word_node = RadixNode.new(remaining_word)
        new_word_node.is_end_of_word = true
        node.children[remaining_word[0]] = new_word_node
      end
    end
  end

Base case: If the word is empty (word.empty?), the execution stops. Nothing to do.
If the current node has an empty prefix (usually the root at the start):
- Takes the first letter of the word (e.g., ‘r’ from “romane”).
- If there’s already a child with that letter, it descends to it and inserts the rest of the word.
- Otherwise, it creates a new node with the entire word as the prefix, marks it as the end of a word, and adds it as a child.
If the node has a prefix:
- Calculates the common length between the node’s prefix and the word.
- If the common part equals the node’s entire prefix (e.g., node has “rom,” word is “romane” → common=3):
- Takes the remaining part of the word (“ane”).
- If the remainder is empty, mark the node as the end of a word.
- Otherwise, descends to the child with the first letter of the remainder (or creates a new one).
- If the common part is less than the node’s prefix (e.g., node has “roman,” word is “romulus” → common=3 “rom”):
- This is the “split”: The node must be broken to accommodate the difference.
- Updates the current node to only the common part (“rom”).
- Creates a new child for the remaining part of the old prefix (e.g., “an” from “roman”).
- Transfers the old children to this new child.
- Adds the remaining part of the new word as another child (e.g., “ulus” for “romulus”).
- Marks word endings correctly.

Practical Example with Words: Let’s simulate inserting [“romane”, “romanus”, “romulus”]. In the end, the tree looks like this:

Root (empty)
- Child ‘r’: Prefix “rom” (common to all)
- Child ‘a’: Prefix “an” (for “romane” and “romanus”)
  - Child ‘e’: Prefix “e” (end of “romane”)
  - Child ‘u’: Prefix “us” (end of “romanus”)
- Child ‘u’: Prefix “ulus” (end of “romulus”)

Notice how “rom” is shared, and it only branches where the words differ. This saves space!

Searching for Words: `search_recursive(node, word)`

Similar to insertion, but it only checks if the word exists.

  def search_recursive(node, word)
  # Base case: if word is empty, check if current node marks the end of a word
    return node.is_end_of_word if word.empty?

    if node.prefix.empty?
      first_char = word[0]
      return node.children.key?(first_char) ? search_recursive(node.children[first_char], word) : false
    end

    # Find length of common prefix
    common_len = common_prefix_length(node.prefix, word)
    return false if common_len != node.prefix.length

    # If the entire word matches the prefix
    if word.length == common_len
      return node.is_end_of_word
    end

    # Recurse into the child with a matching first character
    remaining = word[common_len..-1]
    first_char = remaining[0]
    return false unless node.children.key?(first_char)
    search_recursive(node.children[first_char], remaining)
  end

Base case: If the word is empty, checks if the node marks the end of a word.
If the node’s prefix is empty:
- Takes the first letter of the word, descends to the child if it exists, or returns false.
If there’s a prefix:
- Calculates the common part.
- If the common part isn’t the entire prefix, returns false (no match).
- If the word ends there, checks if it’s marked as an end.
- Otherwise, takes the remainder and descends to the child with the first letter of the remainder.

Search Examples (using the tree above):

“romane”: Descends root → “rom” (matches) → “an” (matches) → “e” (matches and end) → true.
“romanus”: Similar, goes to “us” → true.
“romulus”: Goes to “rom” → “ulus” → true.
“roman”: Goes to “an,” but it’s not marked as an end → false.
“rom”: Goes to “rom,” but it’s not an end → false.

So, why is this structure more efficient than a regular list? In a normal list, searching for a word in 1 million entries, using something like Ruby’s Array#include?, takes time proportional to the list size. With a Radix Tree, the search time is proportional to the word’s length (e.g., 10 letters = 10 steps), regardless of how many words are stored.

Therefore, try implementing this search structure in your preferred language to apply the concepts we’ve covered today.

Measuring Performance: Benchmarking

Similar to the comparison of Trie vs Array.includes?, in the last post, this analysis statistically evaluates the performance of the Radix structure against the Trie structure for a specific use case. A test battery of 500 iterations was conducted, with each iteration involving a search for forbidden words within a text, such as a tweet, containing 280 words.

Forbidden Words	Trie Total Time (ms)	Trie Avg Time per Run (ms)	Radix Total Time (ms)	Radix Avg Time per Run (ms)
10	69.753	0.13906	23.217	0.046126
100	84.677	0.169022	64.445	0.128772
500	82.211	0.164256	87.6	0.175104
1000	68.511	0.136888	94.772	0.189464
5000	64.647	0.12916	122.062	0.244004
10000	68.439	0.136688	132.731	0.265344

As shown in the chart below, initially, we tested with a reduced number of forbidden words (10). In this scenario, a slight advantage was observed in using the Radix structure compared to the Trie. However, as shown in the table and chart above, in the other cases with a larger number of forbidden words, the Trie exhibited lower and nearly constant search times, even with a considerable increase in the number of words. However, at the end of the day, the differences between the two structures are minor, and the choice often depends on the amount of space you’re willing to allocate or save.

Note: the benchmarks were conducted on an Apple M3 Pro, which significantly impacts the results. To replicate these benchmarks on your machine, clone the repository for our use case and follow the instructions in the README.md file.

Wrapping up

We’ve seen throughout this reading that the Radix Tree offers a smart solution for optimizing memory usage in scenarios where storage efficiency is essential, leveraging the strengths of the Trie data structure while simultaneously addressing its main limitation: excessive memory consumption. By compressing common prefixes into single nodes, the Radix Tree significantly reduces the number of nodes required to store a set of words, making it an ideal choice for applications such as autocomplete systems, API route matching, or blacklist detection in large datasets. However, this comes at the cost of greater implementation complexity compared to a standard Trie.

And as we can see, performance benchmarks reveal that the Radix Tree excels in scenarios with a smaller number of blacklisted words, demonstrating faster search times in these cases. However, as the dataset grows, the Trie often maintains more consistent and slightly faster search performance, although the differences are minimal. Ultimately, the choice between a Radix Tree and a Trie depends on your specific application requirements—whether you prioritize memory efficiency with a Radix Tree or a simpler implementation and consistent performance with a Trie.

We want to work with you. Check out our Services page!

How to Detect Forbidden Words in Text (Without Slowing Down) – Part II

How Does the Radix Tree Work?

General Structure

Show Me the Code

The `RadixNode` Class: The Building Blocks of the Tree

The `Radix` Class: The Complete Tree

The Helper Method: `common_prefix_length`

Inserting Words: `insert_recursive(node, word)`

Searching for Words: `search_recursive(node, word)`

Measuring Performance: Benchmarking

Wrapping up

Related

Gabriel Quaresma

How Does the Radix Tree Work?

General Structure

Show Me the Code

The RadixNode Class: The Building Blocks of the Tree

The Radix Class: The Complete Tree

The Helper Method: common_prefix_length

Inserting Words: insert_recursive(node, word)

Searching for Words: search_recursive(node, word)

Measuring Performance: Benchmarking

Wrapping up

Related

Gabriel Quaresma

You might also like

Reprogramming My Career – Engineer to EngineerHow I turned my mechanical engineering background into a future in software engineering

How I turned my mechanical engineering background into a future in software engineering

Alternative for low-end machines to develop mobile applications

The Day Github Almost CrowdStriked Us All (Again)

The `RadixNode` Class: The Building Blocks of the Tree

The `Radix` Class: The Complete Tree

The Helper Method: `common_prefix_length`

Inserting Words: `insert_recursive(node, word)`

Searching for Words: `search_recursive(node, word)`

Reprogramming My Career – Engineer to Engineer
How I turned my mechanical engineering background into a future in software engineering