Bonjour readers👋
I have been wrestling with a peculiar issue this week while using ClickHouse to manage image data stored as Base64 strings (I know storing images in a database is not recommended, but for a particular reason I have to).
I could happily INSERT multi-hundred-KB images without a hitch using the clickhouse_driver in Python. It was working fine and I was happy 😀.
Then came the need to update an existing image. Suddenly, for images smaller than the ones I had inserted, I started getting the "Max query size exceeded" error. 🤔 Inserts succeeded with 500 KB images, yet updates with smaller images, say 250 KB, failed with this error.
After banging my head for a while, I realized it’s a classic case of how ClickHouse processes different types of queries under the hood.
The Plot Twist: INSERT vs. ALTER TABLE UPDATE
The key difference lies in how ClickHouse processes INSERT and ALTER TABLE UPDATE statements.
When you INSERT data using the Python driver, the driver is pretty smart. It sends the relatively short INSERT command first, and then it efficiently streams the actual data to the server in a separate binary format. The server's max_query_size setting applies to the SQL command text itself, not to this separate data stream. This is why those large inserts sailed through and my images of around 500 KB were inserted successfully.
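For context, here is a minimal sketch of the kind of insert that was working (the table name, column names, and file path are placeholders for my actual schema):

from clickhouse_driver import Client
import base64

client = Client(host='localhost')

# Read and encode the image; the Base64 text easily reaches several hundred KB.
with open('photo.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('ascii')

# Only this short INSERT statement counts as query text; the row data travels
# separately over the native protocol, so max_query_size never sees the payload.
client.execute(
    'INSERT INTO images (image_id, image_base64) VALUES',
    [('img_001', image_b64)]
)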
Now let's turn to ALTER TABLE UPDATE. In ClickHouse, an UPDATE is actually a mutation performed via an ALTER TABLE command.
The SQL looks something like this:
ALTER TABLE your_images_table
UPDATE image_base64 = 'my_very_long_base64_string'
WHERE image_id = 'some_unique_id';
The important point here is that the entire Base64-encoded string of your new image gets embedded directly into the SQL query text itself!
When your Python driver constructs this command, it creates one massive string. ClickHouse then checks the size of this entire string against max_query_size (which defaults to 256 KiB). Base64 encoding also inflates the data by roughly a third, so even a 250 KB image becomes around 330 KB of text and easily pushes the query string over the limit. So, Boom! "Max query size exceeded." 💥
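To make that concrete, here is roughly what gets built and sent as a single query string (the file name, table, and IDs are placeholders):

import base64

with open('photo_250kb.jpg', 'rb') as f:
    # Base64 turns ~250 KB of bytes into roughly 330 KB of text.
    image_b64 = base64.b64encode(f.read()).decode('ascii')

query = (
    "ALTER TABLE your_images_table "
    f"UPDATE image_base64 = '{image_b64}' "
    "WHERE image_id = 'some_unique_id'"
)

# The whole Base64 payload is now part of the query text, so its length
# blows past the default max_query_size of 262144 bytes (256 KiB).
print(len(query))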
So, it's not about the size of the new image data itself being too big, but rather the size of the SQL query string containing that data.
The Fix:
We've got a couple of ways to tackle this, one more "ClickHouse way" than the other.
Option 1: Embrace the ReplacingMergeTree (The Recommended, ClickHouse-native Way)
ClickHouse, being an analytical database, isn't really designed for frequent in-place updates. The more ClickHouse-native approach is to use its table engines, specifically ReplacingMergeTree.
The core idea is to treat updates as new insertions. You insert a new version of the row with the updated image data, and ReplacingMergeTree then intelligently deduplicates rows based on a specified versioning column during its background merge processes, keeping only the latest version.
Here's how it's done:
1. Recreate your table using ReplacingMergeTree, including a versioning column (like a timestamp):
CREATE TABLE images_replacing (
    image_id String,
    image_base64 String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY image_id;
2. To "update" an image, simply INSERT a new row with the same image_id and a newer updated_at timestamp. You can continue using your efficient INSERT mechanism (see the Python sketch after these steps).
3. When querying, to get the latest version of the row, use the FINAL modifier:
SELECT * FROM images_replacing FINAL WHERE image_id = 'some_image_id';
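A minimal sketch of that insert-as-update flow with clickhouse_driver (the file path and image_id are placeholders):

from clickhouse_driver import Client
from datetime import datetime
import base64

client = Client(host='localhost')

# "Updating" is just another INSERT: same image_id, newer updated_at.
with open('new_photo.jpg', 'rb') as f:
    new_b64 = base64.b64encode(f.read()).decode('ascii')

client.execute(
    'INSERT INTO images_replacing (image_id, image_base64, updated_at) VALUES',
    [('some_image_id', new_b64, datetime.now())]
)

# FINAL collapses duplicate versions at read time, returning only the latest row.
rows = client.execute(
    'SELECT image_base64, updated_at FROM images_replacing FINAL '
    'WHERE image_id = %(id)s',
    {'id': 'some_image_id'}
)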
This method aligns perfectly with ClickHouse's strengths and avoids the max_query_size limitation altogether. Inserts stay as fast as before; FINAL adds some overhead at read time, but background merges do most of the deduplication for you.
Option 2: Increase max_query_size (Use with Caution)
As a quick (but generally less recommended) fix, you can increase the max_query_size setting for your ClickHouse user or connection. While this might solve your immediate problem, it's more of a temporary workaround: allowing excessively large queries can impact server performance down the line.