
Remember the Comma Separated List of Doom, despite your ORM.

Last Update - 01:00 CEST - Monday September 1st, 2025

SQL ORM Performance

Refresher: What is the Comma Separated List of Doom?

If you’ve ever built a web application where users can search through lots of data, you’ve probably made the same mistake I once did: storing multiple values in a single SQL column as a comma-separated list. Storing multiple values this way feels convenient until you try to search or scale. If you run a public site where users filter results (for example: products, videos, whatever) by tags, the old CSLoD anti-pattern can be the difference between a solid user experience and a tedious one.

Why comma-separated lists are a problem

Apart from the obvious fact that your database model is not properly normalized, this anti-pattern has three negative consequences.

Bad Performance

Relational engines use indexes efficiently when your predicates are sargable (Search ARGument ABLE): an index can be used to locate candidate rows without scanning the whole table. I'm sure that if you use a comma-separated list for storing tags in your tables, you use the LIKE operator with wildcards to filter the results. When you put a wildcard on both sides of a string in a LIKE query (for example WHERE book_tags LIKE '%fiction%'), the database can't use a normal B-tree index efficiently. That's because those indexes rely on the leftmost part of the value being fixed in order to seek into the tree. With a leading wildcard the engine has no way to jump to a starting point, so it ends up scanning every row and checking the condition manually. If the wildcard is only at the end ('foo%') of a value, the index can be used for a range scan. Unfortunately, that's still only fast if the prefix is selective: on large datasets with many matches (like 'a%'), it can still mean scanning a huge portion of the table, which is far from optimal. Some SQL databases like PostgreSQL have trigram indexes (pg_trgm), which could make querying a comma-separated list slightly faster. However, they're not meant for this kind of usage. A normalized model will still beat them in terms of performance.

TLDR: The usual B-tree index is useless on a column when it's queried using the LIKE operator with wildcards on both sides.
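You can see this on any engine that exposes its query plan. A minimal sketch using SQLite's in-memory database (the table and sample row are invented for illustration):

```python
import sqlite3

# A sketch of the anti-pattern: one 'tags' column holding a comma-separated list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, tags TEXT)")
conn.execute("CREATE INDEX idx_tags ON books(tags)")  # the index we'd hope to use
conn.execute("INSERT INTO books VALUES (1, 'Dune', 'fiction,scifi')")

# With wildcards on both sides, the planner cannot seek into the B-tree.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE tags LIKE '%fiction%'"
).fetchall()
print(plan[0][-1])  # the plan reports a full scan of books, not an index seek
```

The same experiment on PostgreSQL (EXPLAIN) or MySQL (EXPLAIN) shows a sequential/full table scan for the double-wildcard predicate.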

Maintainability & Data Integrity Issues

Adding/removing a tag means string surgery, which makes it easy to introduce duplicates (fiction,fiction) or broken punctuation. Referential integrity (e.g., ensuring tags come from a canonical list) is impossible. In a normalized schema, (book_id, tag_id) can be made unique, preventing duplicates by construction.
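As a quick illustration, a sketch with SQLite (the duplicate pair is invented) showing how a composite primary key rejects duplicates at the database level:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The composite primary key makes each (book_id, tag) pair unique by construction.
conn.execute("""
    CREATE TABLE book_tags (
      book_id INTEGER NOT NULL,
      tag     TEXT NOT NULL,
      PRIMARY KEY (book_id, tag)
    )
""")
conn.execute("INSERT INTO book_tags VALUES (1, 'fiction')")

duplicate_rejected = False
try:
    # A second 'fiction' tag for the same book violates the primary key.
    conn.execute("INSERT INTO book_tags VALUES (1, 'fiction')")
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # → True
```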

Unreliable, Messy Queries that are hard to read

Substring search can mismatch. For example, WHERE tags LIKE '%man%' matches "human", "romance", and "truman". You'll need awkward boundary logic (commas, start/end of string) to avoid false positives. If the criterion is to match each tag exactly, a query can quickly turn into this mess:

sql
WHERE tags LIKE 'fiction,%'
   OR tags LIKE '%,fiction'
   OR tags LIKE '%,fiction,%'
   OR tags = 'fiction'  
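To make the false positives concrete, here is a small sketch with SQLite (titles and tags invented) comparing the naive substring search with the four-branch boundary check:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, tags TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    (1, "Sapiens", "human,history"),       # 'human' contains 'man'
    (2, "Pride and Prejudice", "romance,classic"),  # 'romance' contains 'man'
    (3, "Batman", "man,comics"),           # actually tagged 'man'
])

# Naive substring search: matches all three rows.
naive = [r[0] for r in conn.execute(
    "SELECT id FROM books WHERE tags LIKE '%man%' ORDER BY id")]
print(naive)  # → [1, 2, 3]: two false positives

# The boundary logic only matches the exact tag.
exact = [r[0] for r in conn.execute("""
    SELECT id FROM books
    WHERE tags LIKE 'man,%' OR tags LIKE '%,man'
       OR tags LIKE '%,man,%' OR tags = 'man'
    ORDER BY id
""")]
print(exact)  # → [3]
```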

The normalized ideal

On to how it SHOULD be. Suppose we have an online bookstore (I know, very original) where users can browse books, and filter them based on tags. Instead of having a books table with a tag column that is a comma separated list, we could create a books and a book_tags table like this:

sql
CREATE TABLE books (
  id   BIGINT PRIMARY KEY,
  title VARCHAR(200) NOT NULL,
  -- other fields...
);

CREATE TABLE book_tags (
  book_id BIGINT NOT NULL,
  tag VARCHAR(100) NOT NULL,
  PRIMARY KEY (book_id, tag),
  FOREIGN KEY (book_id) REFERENCES books(id)
);

CREATE INDEX idx_book_tag ON book_tags(tag);    

Notice the index on the tag column in the book_tags table. As long as we query exact matches for tags, the index will greatly improve performance.
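A quick way to confirm the index is actually used is to ask the engine for its plan again. A sketch with SQLite, using the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE book_tags (
      book_id INTEGER NOT NULL,
      tag     TEXT NOT NULL,
      PRIMARY KEY (book_id, tag)
    )
""")
conn.execute("CREATE INDEX idx_book_tag ON book_tags(tag)")
conn.execute("INSERT INTO book_tags VALUES (1, 'fiction')")

# An exact-match predicate is sargable: the planner seeks into idx_book_tag.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT book_id FROM book_tags WHERE tag = 'fiction'"
).fetchall()
print(plan[0][-1])  # reports a SEARCH using idx_book_tag instead of a scan
```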

A small, clear and fast query

Let's expand on the scenario of filtering books by tags. Suppose we want to search for books that are tagged with both "fiction" and "thriller". Our query could look like this:

sql
SELECT books.*
FROM books
JOIN (
    SELECT book_id
    FROM book_tags
    WHERE tag IN ('fiction', 'thriller')
    GROUP BY book_id
    HAVING COUNT(*) = 2
) tags ON tags.book_id = books.id

This query performs well, given that we filter on exact matches instead of using the LIKE operator. Basically, the engine looks up the tags "fiction" and "thriller" and returns those rows including the corresponding book ids. Only the book ids that occur as many times as the number of tags filtered on will be returned.
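Here is the whole thing end to end, as a sketch with SQLite (the sample books are invented). The parameter list is built dynamically so the HAVING count always matches the number of tags:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE book_tags (
      book_id INTEGER NOT NULL,
      tag     TEXT NOT NULL,
      PRIMARY KEY (book_id, tag),
      FOREIGN KEY (book_id) REFERENCES books(id)
    );
    INSERT INTO books VALUES (1, 'Gone Girl'), (2, 'The Hobbit'), (3, 'The Martian');
    INSERT INTO book_tags VALUES
      (1, 'fiction'), (1, 'thriller'),
      (2, 'fiction'), (2, 'fantasy'),
      (3, 'fiction'), (3, 'thriller'), (3, 'scifi');
""")

wanted = ["fiction", "thriller"]
placeholders = ",".join("?" * len(wanted))  # one placeholder per tag
rows = conn.execute(f"""
    SELECT books.id, books.title
    FROM books
    JOIN (
        SELECT book_id
        FROM book_tags
        WHERE tag IN ({placeholders})
        GROUP BY book_id
        HAVING COUNT(*) = ?
    ) tags ON tags.book_id = books.id
    ORDER BY books.id
""", [*wanted, len(wanted)]).fetchall()
print(rows)  # → [(1, 'Gone Girl'), (3, 'The Martian')]
```

'The Hobbit' is excluded because it only matches one of the two tags; extra tags like 'scifi' on 'The Martian' don't hurt, since they never enter the IN list.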

Your ORM might be the reason you forgot about this.

Maybe you're thinking "I knew this already, this is nothing new" right now. You're right; as I said, this anti-pattern was addressed long ago. But the old blogs that talk about it hardly pop up in search results anymore. And if you really hadn't forgotten that this is an anti-pattern and how to solve it, you would have stopped reading a while ago.

I am partially blaming ORMs for this memory-hole. ORMs get increasingly intelligent, but they don't fix a bad database model. Far too often I see developers assume that using an ORM means they don't have to think about optimizing queries. The standard Spring CRUD repository isn't always enough. For your convenience, I will show two examples of how to get an ORM to perform this query below.

Writing this query in Spring Data.

I assume you know how to make entities in Spring. The example below shows how to create a repository with a method that performs this query.

java
public interface BookRepository extends JpaRepository<Book, Long> {
  @Query(value = """
      SELECT books.*
      FROM books
      JOIN (
          SELECT book_id
          FROM book_tags
          WHERE tag IN (:tags)
          GROUP BY book_id
          HAVING COUNT(*) = :size
      ) tags ON tags.book_id = books.id
      """, nativeQuery = true)
  List<Book> findBooksMatchingAllTags(
      @Param("tags") List<String> tags,
      @Param("size") long size
  );
}  

Surprisingly, it's pretty straightforward in Spring, because we can just write the raw query while the ORM maps the result to the Book entity.

Writing this query in Laravel's ORM.

Laravel's ORM is a good example of why you sometimes need to look a little deeper into the docs. The code is not too complex, but it might take a little more time to find functions like joinSub.

php
function filterBooksByTags($tags){
    $booksQuery = DB::table('books')->select(['books.*']);

    $booksSql = DB::table('book_tags')->select('book_id')
        ->whereIn('tag', $tags)
        ->groupBy('book_id')
        ->havingRaw('count(*) = ?', [count($tags)]);

    $booksQuery = $booksQuery->joinSub($booksSql, 'tags', function($join) {
        $join->on('books.id', '=', 'tags.book_id');
    });

    return $booksQuery->get();
}