
Intern Tasks 2025/2026

Open alexey-milovidov opened this issue 4 months ago • 66 comments

This is the list of proposed tasks. It is to be extended. You can propose more tasks. You can also find the previous lists here:

  • 2024/2025: https://github.com/ClickHouse/ClickHouse/issues/71175
  • 2023/2024: https://github.com/ClickHouse/ClickHouse/issues/58394
  • 2022/2023: https://github.com/ClickHouse/ClickHouse/issues/42194
  • 2021/2022: https://github.com/ClickHouse/ClickHouse/issues/29601
  • 2020/2021: https://github.com/ClickHouse/ClickHouse/issues/15065
  • 2019/2020: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
  • 2018/2019: https://gist.github.com/alexey-milovidov/6735d193762cab1ad3b6e6af643e3a43
  • 2017/2018: https://gist.github.com/alexey-milovidov/26cc3862eb87e52869b9dac64ab99156

The tasks should:

  • not be too hard (doable within about a month), but usually not take less than a week;
  • not alter core components of the system;
  • be mostly isolated, not requiring full knowledge of the system;
  • be somewhat interesting to implement, or have some point of research;
  • not be on the critical path of our roadmap (ok to be thrown away after a year);
  • be mostly for C++ developers, though there should also be tasks for frontend developers, or tools/research that only require Go/Python/whatever;
  • in some cases allow teamwork;
  • cover various skills, e.g., system programming, algorithm knowledge, etc.

SLM-based custom autocomplete

Implement and train a small language model (~10 MB), using a custom tokenizer, for auto-completion of ClickHouse queries. The tokenizer should correspond to the ClickHouse Lexer and support its syntax elements as well as keywords. However, for identifiers and string literals, it should represent items as lookbehind copies (references to previously seen values) instead of encoding the values literally. We aim to use a GPT-based model.

The goal is to implement a small and fast model, running on CPU with a minimum of resources, that could be embedded in the ClickHouse binary by default, as well as exported to WebAssembly. It should be largely customizable, so we can do sampling based on the syntactic correctness of the query and on the available dictionaries of tables.
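To illustrate the lookbehind-copy idea, here is a minimal Python sketch; the token names `<VALUE>`/`<COPY:n>` and the regex-based lexer are assumptions for illustration, not ClickHouse's actual Lexer:

```python
import re

# Illustrative sketch only: identifiers and string literals are emitted as
# <VALUE> on first sight and as <COPY:n> (a lookbehind copy) on repetition,
# so the model never has to predict literal values.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "AND", "LIMIT"}
TOKEN_RE = re.compile(r"\s+|'[^']*'|\w+|[^\s\w]")

def tokenize_with_lookbehind(query):
    seen = {}            # literal value -> index of its first occurrence
    out = []
    for match in TOKEN_RE.finditer(query):
        tok = match.group()
        if tok.isspace():
            continue
        is_value = tok.startswith("'") or (tok.isidentifier() and tok.upper() not in KEYWORDS)
        if is_value:
            if tok in seen:
                out.append(f"<COPY:{seen[tok]}>")   # lookbehind copy of an earlier value
            else:
                seen[tok] = len(seen)
                out.append("<VALUE>")               # fresh identifier or string literal
        else:
            out.append(tok.upper())                 # keywords and operators stay literal
    return out

print(tokenize_with_lookbehind("SELECT user_id FROM t WHERE user_id = 'x'"))
# → ['SELECT', '<VALUE>', 'FROM', '<VALUE>', 'WHERE', '<COPY:0>', '=', '<VALUE>']
```

This keeps the vocabulary small and lets sampling reconstruct concrete values from the query prefix.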

Porting ClickHouse parser to WASM

Compile a minimal version of ClickHouse to run in the browser. The filesystem and network interaction should be abstracted so it can request file uploads from the user and interact with remote HTTP endpoints.

Negative and fractional LIMIT values

Already done by @nihalzp and @0xgouda!

https://github.com/ClickHouse/ClickHouse/issues/28913 https://github.com/ClickHouse/ClickHouse/issues/81892

Functions for rolling hashes and content-defined chunking

https://github.com/ClickHouse/ClickHouse/issues/81183

Aggregate functions for Geo types

https://github.com/ClickHouse/ClickHouse/issues/80186

Aggregation with approximate and spatial clustering (+ strings)

Add a modifier WITH CLUSTER to a GROUP BY element, defining a maximum distance between groups. Groups located close to each other will be merged.

Example:

SELECT user_id, min(event_time), max(event_time) FROM table GROUP BY user_id, event_time WITH CLUSTER 1800

This query will group together sessions of events for every user_id, where a session is a sequence of events with the distance between nearest events no greater than 1800 seconds (half an hour).
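A minimal Python sketch of the one-dimensional case: a single pass over sorted values starts a new group whenever the gap to the previous value exceeds the threshold (single-linkage clustering in 1D):

```python
def cluster_1d(values, max_distance):
    """Group sorted values so that, within a group, each value is within
    max_distance of its nearest neighbor (single-linkage clustering in 1D)."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_distance:
            groups[-1].append(v)   # close enough to the previous value: same group
        else:
            groups.append([v])     # gap exceeds the threshold: start a new group
    return groups

# Event timestamps split into two sessions with an 1800 s gap threshold:
print(cluster_1d([100, 900, 1500, 5000, 5100], 1800))
# → [[100, 900, 1500], [5000, 5100]]
```

The multidimensional and metric-space cases no longer admit this sorted single pass and need a genuine clustering algorithm.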

It can be extended to the multidimensional case:

Example:

SELECT user_id, count() FROM table GROUP BY user_id, (x, y) WITH CLUSTER 10

This query will group together points (x, y) with L2-distance less than 10.

A more interesting extension is clustering of arbitrary dimensional metric spaces, such as strings with Levenshtein distance.

Query Rewrite Rules

https://github.com/ClickHouse/ClickHouse/issues/80084

In-progress, almost finished by @hinata34

API for query rewrite and formatting

https://github.com/ClickHouse/ClickHouse/issues/75033

Key-Value data marts

https://github.com/ClickHouse/ClickHouse/issues/33581

User authentication with wildcard names

https://github.com/ClickHouse/ClickHouse/issues/78065

Query cache for partial results

Implement the possibility of caching at different stages of the query pipeline, such as before LIMIT, before ORDER BY, or intermediate results of GROUP BY. The most important use case is partial caching of time-series queries, where the aggregation of the older data should be reused and merged with the newly arriving data.

This requires certain efforts, such as remembering and preserving the split of the work across data parts for further queries, so that the results of aggregation can be cached in the form of virtual projections corresponding to data parts.

AI functions

Implement scalar and aggregate AI functions in ClickHouse similar to https://www.vldb.org/pvldb/vol18/p5415-mhedhbi.pdf

As extensions of this, add AI functions to:

  • translate texts,
  • summarize texts,
  • compute a similarity score between texts
  • compute the sentiment (happy, sad, etc.) of a text,
  • generate texts based on a prompt (GenAI),
  • fix spelling and grammar in a text,
  • classify a text in one of N given categories

You can use this PR as a starting point: https://github.com/ClickHouse/ClickHouse/pull/85442

Support for altering index granularity and primary key

The minimal implementation will allow different data parts to have a different primary key. The index analysis will run on these sets of data parts differently, and these data parts will never be merged together. So after the change of the primary key, newer data parts will use the new primary key, while the old ones remain.

Further implementation will reorder data in the older data parts according to the new primary key.

https://github.com/ClickHouse/ClickHouse/issues/147

Combinatorics functions

Various functions on arrays to obtain combinations and permutations.

https://github.com/ClickHouse/ClickHouse/issues/43175
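For reference, Python's itertools shows the expected semantics of such functions (the eventual SQL function names and signatures are to be decided in the linked issue):

```python
from itertools import combinations, permutations

# k-combinations: order-insensitive subsets of size k.
print(list(combinations([1, 2, 3], 2)))
# → [(1, 2), (1, 3), (2, 3)]

# k-permutations: ordered arrangements of size k.
print(list(permutations([1, 2, 3], 2)))
# → [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
```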

Pipelined SQL syntax

https://github.com/ClickHouse/ClickHouse/issues/69364

Approximate vector search with LeanNN

If you like data structures, math, and low-level programming:

https://github.com/unum-cloud/usearch/issues/662

Can be implemented in the vector similarity index.

Import and Export of Substrait plans

Substrait is an emerging representation of relational algebra. Its purpose is to decouple and modularize complex data systems. For example, it enables ClickHouse to execute computation on data described by a Substrait plan generated by another system. Similarly, ClickHouse might push down calculations into dependent databases using Substrait plans (e.g., Sirius) or use external optimizers for query optimization (e.g., optd). The task is to produce Substrait plans (similar to today's EXPLAIN PLAN) and consume Substrait plans (using a new function runSubstrait).

Improvements for data formats: HDF5, Vortex, SQLite

Vortex is a new data format, an alternative to Parquet. HDF5 is an old one, often used as a container for scientific data. SQLite is already supported as a database engine, table engine, and table function, but we can also add it as a data format, to make it easy to read and write the data in one go; additionally, we should support SQLite files on remote filesystems.

Streaming inserts and routed inserts

One part is #41439

Another part is to implement a handler capable of inserting data into multiple tables, where the table name is extracted from the data (e.g., from a certain JSON field). It can create tables automatically using an inferred or pre-defined schema.
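A minimal Python sketch of the routing idea, assuming newline-delimited JSON and a hypothetical routing field named `table` (the real handler would issue one INSERT per batch and create missing tables from an inferred or pre-defined schema):

```python
import json

def route_rows(ndjson: str, routing_field: str = "table"):
    """Split newline-delimited JSON rows into per-table batches, taking the
    target table name from a field of each row. Sketch only: field name,
    function name, and the dict-of-batches return shape are assumptions."""
    batches: dict[str, list[dict]] = {}
    for line in ndjson.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        table = row.pop(routing_field)           # route by the JSON field's value
        batches.setdefault(table, []).append(row)
    return batches  # a real handler would run one INSERT per table here

data = '{"table": "logs", "msg": "a"}\n{"table": "metrics", "v": 1}\n{"table": "logs", "msg": "b"}'
print(route_rows(data))
# → {'logs': [{'msg': 'a'}, {'msg': 'b'}], 'metrics': [{'v': 1}]}
```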

TOTALS, BY and ORDER BY combinators for aggregate functions

https://github.com/ClickHouse/ClickHouse/issues/34156

Schema Inference For Freeform Text Formats

Either https://github.com/ClickHouse/ClickHouse/issues/34006 Or https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf Or any method you'd like for extracting the structure of free-form text files.

Statistical aggregate functions

Aggregate or window functions for checking for process stationarity, breaking points, as well as generating predicted values.

User-Defined types

https://github.com/ClickHouse/ClickHouse/issues/18568

Extending ORDER BY with natural sorting and topological sorting and the SHUFFLE operator

We will add the SHUFFLE clause for SQL queries, and SHUFFLE ... LIMIT, which will work faster than ORDER BY rand(): in O(n) instead of O(n * log(n)).
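One way to reach O(n) is reservoir sampling followed by an O(k) shuffle of the sample; this is a sketch of the idea, not the planned implementation:

```python
import random

def shuffle_limit(stream, k, seed=0):
    """SHUFFLE ... LIMIT k in one O(n) pass: reservoir sampling (Algorithm R)
    picks k items uniformly at random, then an O(k) shuffle randomizes their
    order, avoiding the O(n log n) full sort that ORDER BY rand() implies."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)   # keep every prefix uniformly sampled
            if j < k:
                reservoir[j] = item
    rng.shuffle(reservoir)             # randomize output order within the sample
    return reservoir

print(shuffle_limit(range(1_000_000), 5, seed=42))
```

Plain SHUFFLE without a LIMIT degenerates to an O(n) Fisher-Yates shuffle of the whole result.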

As a bonus, we can add support for topological sorting: https://github.com/ClickHouse/ClickHouse/issues/34343

Additionally, implement natural sorting: https://github.com/ClickHouse/ClickHouse/issues/14003

Integration with AWS SQS and AWS Kinesis

Make https://github.com/ClickHouse/ClickHouse/pull/80177 production-ready.

Framing formats for query results

Support "multiplexing formats", also known as "framing formats", which will allow sending the results of multiple queries, their errors, profile events, and logs in a single connection.

This will not only solve structured output of errors and the sending of auxiliary data, such as query metrics, but also allow multi-statements in the HTTP interface.

Eventually, it will allow creating a unified query pipeline for multiple queries, where, say, we can read the data once if multiple queries read the same data.

Functions for changing dimensionality

Provide functions for changing the dimension of vector embeddings with small changes in distances, which is done by random projections. Typically this is used for lowering the dimension, but upscaling to a higher number of dimensions is also needed if the subsequent operation is binary quantization of the vectors.

Provide functions for multiplying vectors by a constant rectangular matrix, and for creating these constant random matrices, for the following cases:

  • a random matrix with normally-distributed values;
  • a random orthogonal matrix, created by an orthogonal decomposition of a random matrix;
  • a sparse random matrix with fixed negative/positive values, to accelerate the calculations on CPU;
  • a Hadamard/Walsh/weighing matrix.

The functions should be accelerated for AVX-512, AMX, and possibly for GPU.
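A plain-Python sketch of the first case, a dense Gaussian random projection scaled per the Johnson-Lindenstrauss lemma (a real implementation would be vectorized and SIMD-accelerated; function names here are illustrative):

```python
import math
import random

def random_projection_matrix(d_in, d_out, seed=0):
    """Dense Gaussian random projection matrix, scaled by 1/sqrt(d_out) so
    that L2 distances are approximately preserved (Johnson-Lindenstrauss)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_out)] for _ in range(d_in)]

def project(vec, matrix):
    """Multiply a row vector of length d_in by the d_in x d_out matrix."""
    d_out = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(d_out)]

m = random_projection_matrix(128, 16)
print(len(project([1.0] * 128, m)))   # → 16
```

The orthogonal and sparse variants trade statistical quality against construction and multiplication cost.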

Functions for econometric calculations

https://github.com/ClickHouse/ClickHouse/issues/87209

Improvements for aggregate functions

https://github.com/ClickHouse/ClickHouse/issues/59832 https://github.com/ClickHouse/ClickHouse/issues/56723

Improvements for projections

Support for projections with where, distinct, array join, support for compression, and secondary indices. #74234

Set and Join tables as indices

Allow a MergeTree table to be configured with an index for fast key-value lookups.

Binary files visualization and statistical analysis

Examples:

https://codisec.com/binary-visualization-explained/ http://binvis.io/#/

In this project, we will apply binary visualization to a large corpus of binaries, such as:

  • every Debian package,
  • every Python package,
  • every popular Docker image.

It could be combined with statistical analysis of machine instructions and register usage.

Extensions of probabilistic filters for skip indexes

The skip indexes in ClickHouse are based on min-max statistics, bloom filters, or unique sets. Cuckoo filters provide an interesting trade-off: they are more compact and precise than normal bloom filter indexes. The task is to implement a cuckoo-filter skip index based on this paper.

Consider implementing binary-fuse filters as well.
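A minimal Python sketch of the cuckoo filter idea; the bucket count, fingerprint size, and hash choices here are illustrative, not what a production skip index would use:

```python
import random

class CuckooFilter:
    """Minimal cuckoo filter sketch: each item stores an 8-bit fingerprint in
    one of two candidate buckets; on collision, an existing fingerprint is
    evicted and relocated to its alternate bucket."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        assert num_buckets & (num_buckets - 1) == 0  # power of two, for the XOR trick
        self.buckets = [[] for _ in range(num_buckets)]
        self.n = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks

    def _fingerprint(self, item):
        return (hash(("fp", item)) & 0xFF) or 1      # nonzero 8-bit fingerprint

    def _index(self, item):
        return hash(("ix", item)) % self.n

    def _alt_index(self, index, fp):
        # XOR with a hash of the fingerprint is symmetric, so either bucket
        # can recover the other without knowing the original item.
        return index ^ (hash(("alt", fp)) % self.n)

    def add(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))                  # evict-and-relocate loop
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                                 # filter is considered full

    def __contains__(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

f = CuckooFilter()
for word in ("clickhouse", "cuckoo", "filter"):
    f.add(word)
print("clickhouse" in f)   # → True (cuckoo filters have no false negatives)
```

Unlike a bloom filter, this structure also supports deletion, since the stored fingerprint can be removed from its bucket.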

ALP Compression for floating point data

Implement the ALP compression codec, which achieves much higher compression rates than the existing floating point codecs in ClickHouse (FPC and Gorilla) while being more than an order of magnitude faster (compression + decompression) than ZSTD.

Please re-implement the codec from scratch and do not integrate the original implementation.

See: #60533

Large scale Internet topology analysis

This is a task for data analysis: a research project exploring IPv4 routing and connectivity dynamics by finding patterns that emerge in a large dataset of ICMP responses across the IPv4 range from multiple vantage points. These patterns could include connectivity problems and instabilities, anycast addresses, suboptimal and indirect traffic routes, estimates of the global topological structure of the Internet, and differences between various types of networks, such as infrastructure, residential, and corporate.

Note: we already collected around 10 trillion data points.

Large scale Internet technology survey

This is for a full-stack engineer or data analyst.

This task does not require C++ knowledge and is implemented mostly in SQL.

Given a dataset of 1 TB of HTML pages (ClickHouse's Minicrawl) or 200 TB of HTML pages (CommonCrawl), do research on the technologies used on the web and prepare a dataset with the extracted metadata. Similar examples: https://w3techs.com/, https://builtwith.com/, https://httparchive.org/

You can check the preparation and current results in this task: #18842

The goal is to build a mini web app. The app will allow answering questions like: which JavaScript frameworks are the most popular on the Internet (the correct answer is jQuery), and how does their popularity change depending on the website's popularity, country, SSL certificate vendor, etc. There is a boatload of information, but take care: if this topic is not interesting to you, it's unlikely you will make a good app.

FSST

FSST[1, 2] is a relatively new lightweight compression method for strings. What makes it different from others is that it allows random access to individual compressed strings, thus allowing queries to be executed on the compressed data.

In ClickHouse, we have an ICompressionCodec.h interface for implementing compression methods. However, it doesn't support pushing down filters into the decompressor for data skipping, a feature needed to fully leverage FSST's capabilities. A potential solution would be to create a new serialization method based on SerializationString.h. There might be other, more elegant solutions (like expanding ICompressionCodec).

This project involves researching FSST's compression approach, exploring ClickHouse's architecture, comparing different approaches to integrating lightweight compression techniques such as FSST, and implementing an FSST integration that fully exploits its capabilities.
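To illustrate why FSST-style compression permits random access, here is a simplified Python sketch of symbol-table encoding. The real FSST builds its table iteratively from a data sample and matches symbols via a perfect hash; here the table is simply given, and all names are illustrative:

```python
# Simplified sketch: a shared, static symbol table maps frequent substrings
# (1-8 bytes) to single-byte codes. Each string is encoded independently, so
# individual compressed strings remain randomly accessible.

def build_codec(symbols):
    assert len(symbols) <= 255                # code 255 is reserved as escape
    enc = {s: bytes([i]) for i, s in enumerate(symbols)}
    return enc, list(symbols)

def compress(s: bytes, enc) -> bytes:
    out, i = bytearray(), 0
    while i < len(s):
        for length in range(8, 0, -1):        # longest matching symbol wins
            sym = s[i:i + length]
            if sym and sym in enc:
                out += enc[sym]
                i += len(sym)
                break
        else:
            out += bytes([255]) + s[i:i + 1]  # escape: emit the literal byte
            i += 1
    return bytes(out)

def decompress(c: bytes, table) -> bytes:
    out, i = bytearray(), 0
    while i < len(c):
        if c[i] == 255:
            out += c[i + 1:i + 2]; i += 2     # escaped literal byte
        else:
            out += table[c[i]]; i += 1        # expand the symbol
    return bytes(out)

enc, table = build_codec([b"http://", b"www.", b".com", b"click", b"house"])
c = compress(b"http://www.clickhouse.com", enc)
print(len(c), decompress(c, table))
# → 5 b'http://www.clickhouse.com'
```

Because decoding is a per-byte table expansion with no cross-string state, a filter can be evaluated directly over individual compressed values.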

See: #34246 (and https://github.com/ClickHouse/ClickHouse/pull/62670 for more context) Contact: @rienath Taken by @uladzislauNestsiaruk

alexey-milovidov avatar Sep 29 '25 14:09 alexey-milovidov

-ArgMin, -ArgMax as combinators

Done by AmosBird already https://clickhouse.com/docs/sql-reference/aggregate-functions/combinators#-argmax

UnamedRus avatar Sep 29 '25 15:09 UnamedRus

Ok, removed.

alexey-milovidov avatar Sep 29 '25 15:09 alexey-milovidov

I’d like to work on "Functions for changing dimensionality"

krsv066 avatar Sep 29 '25 16:09 krsv066

My name is Mikhail Artemov, I'm a 3rd year student in Applied Mathematics and Informatics. I'm looking for a topic for my course project and found two very interesting themes:

  • Negative and fractional LIMIT values
  • Combinatorics functions

ArtemovMichael avatar Sep 29 '25 16:09 ArtemovMichael

Hi @alexey-milovidov , I'd like to work on "ALP Compression for floating point data".

ylw510 avatar Sep 29 '25 16:09 ylw510

I’d like to work on “Query cache for partial results”

sociopate-sss avatar Sep 29 '25 16:09 sociopate-sss

@alexey-milovidov, I'd like to work on "Improvements for Vortex data format"

m7kss1 avatar Sep 29 '25 17:09 m7kss1

I'd like to work on "Pipelined SQL syntax" issue

maVovk avatar Oct 01 '25 11:10 maVovk

Hi @alexey-milovidov, if the topic is still available, I would like to contribute to "Extensions of probabilistic filters for skip indexes".

zheguang avatar Oct 02 '25 03:10 zheguang

Hi @alexey-milovidov, I'd like to work on #80177 "Integration with AWS SQS and AWS Kinesis". Could you please explain in more detail what is needed to make this PR production-ready?

blotzartdd avatar Oct 02 '25 11:10 blotzartdd

Hi @alexey-milovidov , I'd like to work on "Negative and fractional LIMIT values".

0xgouda avatar Oct 06 '25 01:10 0xgouda

Hi @alexey-milovidov, I'd like to take "Improvements for projections" task as my graduation project

Fulldis avatar Oct 07 '25 19:10 Fulldis

Hi @alexey-milovidov , i'd like to work on "Aggregation with approximate and spacial clustering (+ strings)" or "Extensions of probablistic filters for skip indexes" if @zheguang wants to split the papers :-)

bannayeva avatar Oct 09 '25 18:10 bannayeva

"Extensions of probablistic filters for skip indexes" if @zheguang wants to split the papers :-)

Sure :)

zheguang avatar Oct 09 '25 19:10 zheguang

Hi @alexey-milovidov , I'd like to work on "FSST" issue.

uladzislauNestsiaruk avatar Oct 13 '25 11:10 uladzislauNestsiaruk

Hi. I'd like to work on "Approximate vector search with LeanNN" as a 3rd year course project.

valerikk2005 avatar Oct 13 '25 21:10 valerikk2005

Hi. I'd like to work on "Functions for rolling hashes and content-defined chunking" as a 3rd year course project

aobelski avatar Oct 14 '25 16:10 aobelski

Hello @alexey-milovidov , I would like to work on the "Aggregate functions for Geo types" issue as my course project for my junior year, if it is still available or @zheguang wants to split the papers. I have completed a relevant course on computational geometry, which I believe gave me the necessary skills to implement such functions in ClickHouse.

zhemalb avatar Oct 14 '25 16:10 zhemalb

Hi @alexey-milovidov, if the topic is still available, I would like to work on "FSST" issue

dfdigger avatar Oct 15 '25 14:10 dfdigger

Hi @alexey-milovidov I would like to work on porting the ClickHouse parser to WASM, if that is an acceptable 3rd year course project

shevlfs avatar Oct 16 '25 11:10 shevlfs

Hey @alexey-milovidov I would like to work on "Improvements for data formats: HDF5, Vortex, SQLite".

Dev79844 avatar Oct 20 '25 08:10 Dev79844

I'd like to work on user defined types

Vaneyk1973 avatar Oct 20 '25 09:10 Vaneyk1973

@alexey-milovidov hi! i'd like to work on

  • Streaming inserts and routed inserts
  • Extending ORDER BY with natural sorting and topological sorting and the SHUFFLE operator

andriibeee avatar Oct 22 '25 21:10 andriibeee

Hi, @alexey-milovidov, I would like to work on "Improvements for data formats: HDF5, Vortex, SQLite" also. On which format does @Dev79844 wish to work?

lubsanovdmitry avatar Oct 23 '25 14:10 lubsanovdmitry

Hey! I am interested in vortex and sqlite

Hi, @alexey-milovidov, I would like to work on "Improvements for data formats: HDF5, Vortex, SQLite" also. On which format does @Dev79844 wish to work?

Dev79844 avatar Oct 23 '25 16:10 Dev79844

hi, @Dev79844 @lubsanovdmitry. The task "Support for Vortex format" is already taken for my uni graduation project

@alexey-milovidov Is it possible to create a separate task and assign it to me?

m7kss1 avatar Oct 24 '25 18:10 m7kss1

@Dev79844 @m7kss1 I will take HDF5, then.

lubsanovdmitry avatar Oct 25 '25 13:10 lubsanovdmitry

Hey @lubsanovdmitry @m7kss1 I will take up sqlite.

Dev79844 avatar Oct 25 '25 13:10 Dev79844

Hi, @alexey-milovidov. I'd like to work on "Support for altering index granularity and primary key".

fenol-form avatar Oct 25 '25 14:10 fenol-form

Hi! I would like to work on "User authentication with wildcard names" as a bachelor thesis project

yistarostin avatar Oct 26 '25 09:10 yistarostin