Avoiding forbidden words. Using a tree 🌳!

During our workday or free time, we use apps and tools that heavily rely on data structures like trees (binary, red-black, AVL, etc.). Whether we’re using a text corrector to send a WhatsApp message or performing a complex query in a PostgreSQL database, for example, some type of tree is used to make the retrieval of information faster and more optimized.

In this post, we will explore the tree: Trie (prefix-tree), which I found particularly interesting and powerful during my studies, solving a problem of efficiently identifying forbidden words in a given text, such as filtering offensive content in social media comments, a robust and scalable solution is required.

The Problem: Identifying Forbidden Words

Imagine building a REST API to flag offensive words in user-generated text, such as comments on a social media platform. The API must efficiently identify forbidden words, even with large word lists and high-traffic scenarios. We will use the Ruby on Rails framework.

The API interface:

  • Endpoint: [POST] /flag_words
  • Payload: A JSON object containing the text to analyze.
    {
        "text": "bla bla bla..."
    }
  • Response: A JSON array of forbidden words found in the text.

The Rails controller handles requests by invoking a service to process the text:

class FlagWordsController < ApplicationController
  def flag
    service = FlagWordsService.new(text: params[:text], forbidden_words:)

    render json: { flag_words: service.flag_words }, status: :ok
  end

  private

  # TIP: Add many more disallowed words in the `forbidden_words` method.
  # You can use a database to store such words or make an external call to a preferred API.
  def forbidden_words
    %w[foo bar baz]
  end
end

The controller initializes FlagWordsService with the input text and a list of forbidden words (e.g., ["foo", "bar", "baz"]). For a text like “foo is not bar,” the API returns ["foo", "bar"]. The challenge lies in ensuring efficiency when the forbidden words list grows to thousands and the API handles frequent requests.

Initial Approach and Limitations

In software engineering, the problems we encounter daily can be solved in various ways. As engineers, our role is to choose and apply the best solution for each situation. The challenge is to perform this search efficiently, especially when the list of forbidden words is large (e.g., thousands of words) and the text is processed frequently, as in a high-traffic application, for example.

You can think that using a method like Ruby’s include? to check each word in the text against the forbidden words list. However, this approach has limitations, particularly in terms of performance, as we’ll explore below.

Initial Solution: Using “include?”

Let’s start defining a service with a basic implementation using an array and the include? method. The FlagWordsService class on the initialize constructor method splits the input text into words and checks each word against the forbidden words list on the flag_words:

class FlagWordsService
  def initialize(text:, forbidden_words:)
    @forbidden_words = forbidden_words
    @words = text.split(' ').map(&:downcase)
  end

  attr_accessor :trie, :words, :forbidden_words

  def flag_words
    words.select { |word| forbidden_words.include?(word) }
  end
end

At first glance, this solution might seem adequate. However, upon closer analysis of how the include? method is implemented in Ruby, by accessing the language’s source code, written C, we find the following implementation:

VALUE
rb_ary_includes(VALUE ary, VALUE item)
{
  long i;
  VALUE e;

  for (i=0; i<RARRAY_LEN(ary); i++) {
    e = RARRAY_AREF(ary, i);
    if (rb_equal(e, item)) {
      return Qtrue;
    }
  }
  return Qfalse;
}

Code Source
We can see that the implementation is indeed simple, but this simplicity can lead to a high computational cost. It works well for small lists of forbidden words. However, the include? method has a time complexity of O(n) for each word checked, where n is the number of forbidden words. For a text with m words and a forbidden words list of size n, the total complexity is O(m * n). As the forbidden words list grows, the performance degrades linearly, making this approach inefficient for large datasets.

A Better Solution: Using a Trie

To address the performance bottleneck, consider a Trie (prefix-tree), a tree-like data structure designed for efficient string searches. A Trie organizes words by their prefixes, allowing for rapid lookup times regardless of the number of words stored. This makes it ideal for our use case, where we must check multiple words against a large set of forbidden words.

Why a Trie is a Good Fit

A Trie offers two key advantages:

  • Efficient Search Time: Searching for a word in a Trie has a time complexity of O(m), where m is the length of the word being searched. This is independent of the number of words in the Trie, unlike the O(n) complexity of include?. For a text with multiple words, the Trie’s performance remains consistent even as the forbidden words list grows.
  • Memory Optimization: Words with common prefixes share nodes in the Trie, reducing memory usage compared to storing each word independently in an array. For example, words like “art” and “arc” share the nodes for “a” and “r,” making the structure compact.

These properties make the Trie particularly suitable for applications like autocomplete, spell-checking, and our forbidden words API, where both speed and scalability are critical.

How a Trie Works

Before diving into coding, understanding how this type of tree works will help us write better code.

We can think of a Trie as a tree where its leaves represent letters that can form words and share letters with each other. As shown in the image below:

Trie (Prefix-tree)

Words that share the same initial letters (prefixes) follow the same path in the tree until they diverge to complete the final word. To make this clearer, let’s simulate adding the words “art,” “arc,” and “port”:

Adding “art” would follow this flow:

  • From the root, you add an “a” node;
  • From “a” you add an “r” node;
  • From “r” you add a “t” node;
  • Mark “t” as the end of a word (like placing a flag to indicate “art” is complete).

Now, adding “arc“:

  • Start at the root again. The “a” node already exists, so reuse it;
  • The “r” node also exists, so reuse that too;
  • Now add a “c” node after “r” and mark “c” as the end of “arc”;
  • Notice how “art” and “arc” share the “a” and “r” nodes, saving space!

Finally, adding “port“:

  • Back to the root. No “p” node exists yet, so add a “p” node;
  • From “p,” add an “o” node;
  • From “o,” add an “r” node;
  • From “r,” add a “t” node and mark “t” as the end of “port”.

So, the tree ends up with two main branches: one starting with “a” (for “art” and “arc”) and one starting with “p” (for “port”). It’s like a filing system where similar words are grouped. This structure allows efficient searches by traversing a single path for each word, and the shared prefixes reduce memory usage.

Implementing the Trie on your own

Let’s get to practice! For the implementation of this tree, we will use the Ruby programming language. However, it’s worth noting that you can also choose another language of your preference.

First of all, our tree is composed of nodes, so let’s start by implementing the TrieNode class:

class TrieNode
    def initialize
        @children = {}
        @is_end_of_word = false
    end

    attr_accessor :children, :is_end_of_word
end

As you can see, there’s no mystery: our TrieNode class is created and initialized at the moment of its instantiation with an empty children hash and a flag variable called is_end_of_word, which indicates whether the node marks the end of a given word.

Now, let’s see how our tree will be implemented. Initially, we need two main methods: insert and search. The show method is solely intended to display how the data is distributed in our tree.

But before diving into the methods, our Trie class, when instantiated, will have an @root attribute, which is an empty TrieNode instance. Optionally, if provided, it can already store words in our Trie.

class Trie
    def initialize(words = nil)
        @root = TrieNode.new

        words&.each { |word| insert(word) }
    end
# ...
end

Insert

The insert method, as the name suggests, inserts words into the tree. Here’s how it works:

  • It starts at the root node (@root).

For each letter in the word:

  • If the character does not exist in the children of the current node, it creates a new node;
  • If it already exists, it uses the existing node;
  • At the end, it marks the last node as the end of a word.

For example, as mentioned earlier, when adding the word “art”:

  • First, it adds ‘a’;
  • Then ‘r’;
  • Finally ‘t’ and marks it as the end of a word.
def insert(word)
    node = @root
    word.each_char do |char|
        node = node.children[char] ||= TrieNode.new
    end
    node.is_end_of_word = true
end

Search

The search method checks whether a given word exists in the tree, returning true or false. Let’s break it down step by step:

  • It starts at the root node (@root) of the tree.

For each character in the word being searched:

  • It checks if there is a path for that character;
  • If it doesn’t exist, it returns false (word not found);
  • If it exists, it continues navigating to the next node;
  • At the end, it checks if we reached a valid word’s end (is_end_of_word).

For example, if we search for the word “art” in the Trie, it will:

  • Check if there is a node for ‘a’;
  • Then check if there is an ‘r’ connected to ‘a’;
  • Finally, check if there is a ‘t’ connected to ‘r’;
  • And verify if that ‘t’ marks the end of a word.
def search(word)
    node = @root
    word.each_char do |char|
        return false unless (node = node.children[char])
    end
    node.is_end_of_word
end

Show

Finally, the show method is a simple function that displays the current structure of the Trie. It works as follows:

  • For each letter (char) and node (node) in the list of children of the root node:
  • It prints the character first;
  • Then it shows all the keys (letters) of the child nodes of that character.
    def show
    @root.children.each do |char, node|
        print "#{char}: "
        print node.children.keys
        print "\n"
    end
    end

And that’s how our implemented Trie class looks simple, isn’t it?

If you do not have time, this is the full code:

class Trie
    def initialize(words = nil)
        @root = TrieNode.new

        words&.each { |word| insert(word) }
    end

    def insert(word)
        node = @root
        word.each_char do |char|
            node = node.children[char] ||= TrieNode.new
        end
        node.is_end_of_word = true
    end

    def search(word)
        node = @root
        word.each_char do |char|
            return false unless (node = node.children[char])
        end
        node.is_end_of_word
    end

    def starts_with(prefix)
        node = @root
        prefix.each_char do |char|
            return false unless node.children[char]
            node = node.children[char]
        end
        true
    end

    def show
        @root.children.each do |char, node|
            print "#{char}: "
            print node.children.keys
            print "\n"
        end
    end
end

Updating the FlagWordsService with Trie

Now, let’s update the FlagWordsService to use the Trie for searching forbidden words, while keeping the include?-based method for comparison:

class FlagWordsService
  def initialize(text:, forbidden_words:)
    @forbidden_words = forbidden_words
    @trie = Trie.new(forbidden_words).freeze
    @words = text.split(' ').map(&:downcase)
  end

  attr_accessor :trie, :words, :forbidden_words

  def flag_words
    words.select { |word| trie.search(word) }
  end

  def flag_words_without_trie
    words.select { |word| forbidden_words.include?(word) }
  end
end

The flag_words method now uses the Trie’s search method, which is more efficient for large forbidden words lists.

Measuring Performance: Benchmarking

To statistically demonstrate the performance gains of using the Trie structure in the addressed use case, a test battery consisting of 500 iterations was conducted. Each test involved searching for forbidden words in a text, like a tweet, containing 280 words.

Forbidden WordsTrie Total Time (s)Trie Avg Time per Run (s)Array include? method Total Time (s)Array include? method Avg Time per Run (s)
100.0726930.0001444560.0126800.000025004
1000.0659590.0001317800.1107490.000221368
5000.0653440.0001303720.6176190.001234460
10000.0703930.0001405841.0434150.002086660
50000.0657930.0001313865.2216070.010442896
100000.0752210.00015030810.4729220.020945244

trie vs array include

Initially, we tested with a reduced number of forbidden words (10). In this scenario, a slight advantage was observed in using the include? method compared to the Trie. However, as shown in the table and chart above, in the other cases with a larger number of forbidden words, the Trie exhibited significantly lower and nearly constant search times, even with a considerable increase in the number of words.

The benchmarks were conducted on an Apple M3 Pro, which significantly impacts the results. To replicate these benchmarks on your machine, clone the repository for our use case and follow the instructions in the README.md file.

Wrapping up

The exploration of the Trie data structure highlights its remarkable efficiency and versatility in handling text-based operations, such as searching for forbidden words, autocompletes, and spell-checking. Unlike traditional methods like Ruby’s include?, which can become computationally expensive with larger datasets due to its linear search complexity (O(n)), the Trie offers a significant performance advantage with its O(m) complexity, where m is the length of the searched word. This makes it particularly well-suited for scenarios involving large sets of words, as demonstrated in our benchmarks, where the Trie maintained nearly constant search times even as the number of forbidden words increased.

By implementing a Trie in a practical application, such as our Flag Forbidden Words API, we showcased its ability to optimize real-world tasks, like filtering offensive content in texts. The structure’s ability to share prefixes among words not only reduces memory usage but also accelerates search operations, making it a powerful tool for software engineers tackling performance-critical problems. Whether you’re building a social media moderation tool or enhancing a text editor’s autocomplete feature, the Trie stands out as a robust and scalable solution.

So, if you enjoyed diving into the Trie and its real-world applications, stay tuned for our next post, where we’ll explore the Radix Tree—a smart variation of the Trie, also known as a compressed Trie, that pushes efficiency even further. Get ready for more insights into optimizing your code with advanced data structures!

We want to work with you. Check out our "What We Do" section!