We know the restrictions of joins when using KSQL. Two sources must be partitioned by the same key (join key) and have the same number of partitions.
But when it still does not work, that join does not produce correct result so that join does not happen, and no error message is produced. What should I look for? I will share my experience as I didn’t find useful resource.
No new data
I did go through https://www.confluent.io/blog/troubleshooting-ksql-part-1 and https://docs.confluent.io/current/ksql/docs/troubleshoot-ksql.html.
For me the important takeaways is to ensure new data comes in to the source you want to check, and use
SET ‘auto.offset.reset’ = ‘earliest’;
is very convenient. Besides that, when you create table or stream, it would be very useful to check you can see some data:
select * from MY_STREAM limit 10;
select * from MY_TABLE limit 10;
Now I still don’t see any join result after running the join query, what happened?
Join key? rowkey?
Basically the rowkey should be the same, because it is the key used to join. What here means is the rowkey should be the same and you can see it.
select rowkey from clicks;
select rowkey from events;
What I experienced is the first one returns
“1”
“2”
“1”
and the following for the second
1
1
2
So the problem is the key of the first one should not have “”. Due to this difference, the join fail to produce correct result. The error comes from the key of the sample data used incorrectly, it should not have “”. I will share some test input here:
“1”, {“price”: 1}
is wrong, and it should be
1, {“price”:1}
The other thing worth mentioning is
CREATE STREAM … WITH(key=’id’)
does not repartition. If you check
select rowkey from YOUR_STREAM;
you will see in the query result, the key is what is produced to kafka.
You have to use persistent query that
CREATE STREAM … AS SELECT * FROM YOUR_STREAM PARTITION BY YOUR_JOIN_KEY;
And the above would create a topic that is partition by the key that used by join.
As a best practice, I would recommend to always use ROWKEY as join key, because using other field name, it does not mean it will work, and ROWKEY is the only approach to make it work.