
Losing data when inserting into a bitmap table via clickhouse-jdbc

Open · victortony opened this issue 4 years ago · 8 comments

Hi, I have two tables in my system: dm_user_set_mapping_di and dm_user_set_mapping_di_tmp.

CREATE TABLE profile.dm_user_set_mapping_di (
    date UInt32,
    userset_key String,
    uv AggregateFunction(groupBitmap, UInt64)
) ENGINE = Distributed('perftest_3shards_1replicas', 'profile', 'dm_user_set_mapping_di_local', cityHash64(userset_key))

CREATE TABLE profile.dm_user_set_mapping_di_tmp (
    date UInt32,
    userset_key String,
    u_id UInt64
) ENGINE = Distributed('perftest_3shards_1replicas', 'profile', 'dm_user_set_mapping_di_tmp_local', intHash64(u_id))

First, my system loads data into dm_user_set_mapping_di_tmp via clickhouse-jdbc, and then executes the following SQL to load data into dm_user_set_mapping_di. The problem is that dm_user_set_mapping_di_tmp is accurate, but dm_user_set_mapping_di loses data when the SQL is run through clickhouse-jdbc. When I execute the same insert SQL in the clickhouse client, without clickhouse-jdbc, the data in dm_user_set_mapping_di is accurate.

insert into profile.dws_mifi_loan_user_profile_df (date, label_name, label_value, uv)
select date, label_name, label_value, groupBitmapState(toUInt64(u_id)) as uv
from profile.dws_mifi_loan_user_profile_df_tmp
where date=20210423
group by date, label_name, label_value

I suspect the data had not been fully updated when the first table finished loading. Should I flush or refresh the first table's data before loading the second table, even if I create a new JDBC connection?

ClickHouse client version 19.16.3.6. Clickhouse-jdbc version 0.2.6.

victortony · Apr 26 '21 06:04

@victortony, the SQL you provided is for populating DWS table dws_mifi_loan_user_profile_df, which is different from the dimension table dm_user_set_mapping_di.

If my understanding is correct, this is not related to the JDBC driver: the INSERT ... SELECT runs entirely on the server without involving the driver, provided you're certain there is no data loss in the temp table.

ClickHouse 19.16 and 20.3 are no longer supported, so maybe you should try 21.3, or at least 20.8? Moreover, if you can upgrade the JDBC driver to 0.3.0, you no longer need the temp table, because you can insert a RoaringBitmap directly into the AggregateFunction column; please refer to the example here.
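
For illustration, a minimal sketch of such a direct insert, assuming the ClickHouseBitmap helper that ships with the 0.3.0 driver (the connection URL and the inserted values are placeholders; the target table is the dimension table from above):

import java.sql.PreparedStatement;

import org.roaringbitmap.longlong.Roaring64NavigableMap;

import ru.yandex.clickhouse.ClickHouseConnection;
import ru.yandex.clickhouse.ClickHouseDataSource;
import ru.yandex.clickhouse.domain.ClickHouseDataType;
import ru.yandex.clickhouse.util.ClickHouseBitmap;

public class BitmapInsert {
    public static void main(String[] args) throws Exception {
        // placeholder URL; point it at your own server
        ClickHouseDataSource ds =
            new ClickHouseDataSource("jdbc:clickhouse://localhost:8123/profile");
        try (ClickHouseConnection conn = ds.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "insert into dm_user_set_mapping_di(date, userset_key, uv) values(?, ?, ?)")) {
            // build the bitmap on the client instead of staging raw ids in a temp table
            Roaring64NavigableMap bitmap = Roaring64NavigableMap.bitmapOf(1L, 2L, 3L);
            ps.setInt(1, 20210423);
            ps.setString(2, "key_1");
            // wrap the bitmap so the driver serializes it as a groupBitmap(UInt64) state
            ps.setObject(3, ClickHouseBitmap.wrap(bitmap, ClickHouseDataType.UInt64));
            ps.executeUpdate();
        }
    }
}

With that, the server-side aggregation step and the temp table go away entirely.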

zhicwu · Apr 26 '21 09:04


Sorry, I wrote the wrong table names in the SQL. I can confirm that when the bitmap data is inserted through the clickhouse client, it is complete. Following your suggestion I upgraded to 21.3.8. My cluster has two nodes; after the upgrade, one of them became very slow (even show tables takes half a minute), and inserting bitmaps still loses whole rows of data. I don't know what is going on. The log keeps reporting the following error:

2021.04.28 10:54:35.954359 [ 364594 ] {} <Error> ServerErrorHandler: Code: 101, e.displayText() = DB::NetException: Unexpected packet from client, Stack trace (when copying this message, always include the lines below):

0. DB::TCPHandler::receiveHello() @ 0xf657442 in /usr/bin/clickhouse
1. DB::TCPHandler::runImpl() @ 0xf6503f9 in /usr/bin/clickhouse
2. DB::TCPHandler::run() @ 0xf6633e9 in /usr/bin/clickhouse
3. Poco::Net::TCPServerConnection::start() @ 0x11d1732f in /usr/bin/clickhouse
4. Poco::Net::TCPServerDispatcher::run() @ 0x11d18d41 in /usr/bin/clickhouse
5. Poco::PooledThread::run() @ 0x11e4f469 in /usr/bin/clickhouse
6. Poco::ThreadImpl::runnableEntry(void*) @ 0x11e4b2ca in /usr/bin/clickhouse
7. start_thread @ 0x7dc5 in /usr/lib64/libpthread-2.17.so
8. clone @ 0xf773d in /usr/lib64/libc-2.17.so
(version 21.3.8.76 (official build))

victortony · Apr 28 '21 07:04

I don't think it's related to the JDBC driver; it's more likely something else in your cluster. The script below worked for me on a standalone 21.3 server. Perhaps you can consult on Telegram/Slack, or create an issue here with detailed information such as steps to reproduce?

drop database if exists testbm;
create database testbm;

CREATE TABLE testbm.dm_user_set_mapping_di (
	date UInt32, upsert_key String, uv AggregateFunction(groupBitmap, UInt64)
) ENGINE = MergeTree() partition by date order by (date, upsert_key);
create table testbm.dm_user_set_mapping_di_tmp (
	date UInt32, upsert_key String, u_id UInt64
) ENGINE = MergeTree()
partition by date order by (date, upsert_key, intHash64(u_id))
sample by intHash64(u_id);

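-- populate the temp table with 10 million rows spread across ~2333 keys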
insert into testbm.dm_user_set_mapping_di_tmp
select 20210423, concat('key_', toString(rand64() % 2333)), number
from system.numbers limit 10000000;

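-- aggregate the raw ids into bitmap states, mirroring the production insert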
insert into testbm.dm_user_set_mapping_di(date, upsert_key, uv)
select date, upsert_key, groupBitmapState(u_id)
from testbm.dm_user_set_mapping_di_tmp
where date=20210423
group by date, upsert_key;

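-- verification: any row returned means the bitmaps and the raw data disagree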
select a.date, a.upsert_key,
	bitmapCardinality(a.uv) - ifNull(b.cnt, 0) as cnt_diff,
	b.bm is null ? 0 : bitmapHasAll(a.uv, b.bm) as has_all
from testbm.dm_user_set_mapping_di a
left join (
	select date, upsert_key, uniqExact(u_id) as cnt, groupBitmapState(u_id) as bm
	from testbm.dm_user_set_mapping_di_tmp
	group by date, upsert_key
	order by upsert_key
) b on a.date = b.date and a.upsert_key = b.upsert_key
where b.date is null or cnt_diff != 0 or has_all != 1
union all
select date, upsert_key, toInt64(bitmapCardinality(uv)) as cnt_diff, toUInt8(0) as has_all
from testbm.dm_user_set_mapping_di
where (date, upsert_key) not in (select distinct date, upsert_key from testbm.dm_user_set_mapping_di_tmp)

zhicwu · Apr 28 '21 08:04

Thank you very much.

victortony · Apr 28 '21 09:04


Hi, after upgrading to 21.3.8 the problem is still there. The data in my tmp table is correct, but when the insert that generates the bitmaps is executed through JDBC, the resulting bitmap counts are always short by a large margin; when the same statement is executed in the clickhouse client, not a single row is missing.

insert into profile.dws_mifi_loan_user_profile_df (date, label_name, label_value, uv)
select date, label_name, label_value, groupBitmapState(toUInt64(u_id)) as uv
from profile.dws_mifi_loan_user_profile_df_tmp
where date=20210423
group by date, label_name, label_value

Code for importing the data:

/**
 * Create a connection.
 * @return
 */
def getConnection(): ClickHouseConnection = {
  val dataSource = new ClickHouseDataSource(CH_URL)
  dataSource.getConnection(CH_USERNAME, CH_PASS)
}

/**
 * Import data.
 * @param iterator
 * @param sqlTemplate
 * @param date
 */
def batchImportData(iterator: Iterator[Row], sqlTemplate: String, date: String) = {
  var index = 1
  val conn = getConnection()
  val psmt = conn.prepareStatement(sqlTemplate)
  while (iterator.hasNext) {
    val row: Row = iterator.next()
    var fieldIndex = 1
    row.schema.fields.foreach(field => {
      field.dataType match {
        // Row access is 0-based while JDBC parameters are 1-based
        case StringType  => psmt.setString(fieldIndex, row.getAs[String](fieldIndex - 1))
        case LongType    => psmt.setLong(fieldIndex, row.getAs[Long](fieldIndex - 1))
        case IntegerType => psmt.setInt(fieldIndex, row.getAs[Int](fieldIndex - 1))
        case _           => println(s"other type: ${field.dataType}")
      }
      fieldIndex += 1
    })
    psmt.addBatch()
    if (index % BATCH_COUNT == 0) {
      psmt.executeBatch()
      psmt.clearBatch()
    }
    index = index + 1
  }
  psmt.executeBatch()
  // psmt.clearBatch()
  // close the connection
  psmt.close()
  conn.close()
}

Code for generating the bitmaps:

def buildBitmap(date: String): Unit = {
  val conn = getConnection()
  val stmt = conn.createStatement()
  val sql = BITMAP_GENERATION_SQL_TEMPLATE.format(date)
  stmt.executeQuery(sql)
  stmt.close()
  conn.close()
}

The main method:

val insertSql = buildPrepareSqlTemplate(df)
df.repartition(PARTITION_NUM).foreachPartition(iterator => {
  batchImportData(iterator, insertSql, date)
})
logger.info("begin to generate bitmap...")
buildBitmap(date)
logger.info("after generate bitmap...")

victortony · May 13 '21 10:05

Not sure if it's related to #655. Does the latest code on the develop branch work for you?

zhicwu · May 13 '21 11:05

Hi, has this been merged into any released version?

victortony · May 18 '21 06:05


Not yet. You'll need to check out the code from the develop branch and build a snapshot package by running mvn clean package.

zhicwu · May 18 '21 07:05