Background
I have a Delta table motor
which has too many columns.
Due to AWS Glue Catalog schema size 400KB limit, I have to split the table.
To simplify the question, for example, before I have columns timestamp
, col0
, col1
, col2
, col3
, … , col79
.
After splitting to 8 tables, I have tables
- motor_0:
timestamp
,col0
,col1
, …,col9
- motor_1:
timestamp
,col10
,col11
, …,col19
- …
- motor_7:
timestamp
,col70
,col71
, …,col79
Note the same timestamp
column is reserved across all tables.
Question
Now, I am using Trino SQL to join these tables. Here is my current SQL:
with t0 as (
select * from delta.my_delta_db.motor_0
order by timestamp desc limit 1000
),
t1 as (
select * from delta.my_delta_db.motor_1
order by timestamp desc limit 1000
),
t2 as (
select * from delta.my_delta_db.motor_2
order by timestamp desc limit 1000
),
t3 as (
select * from delta.my_delta_db.motor_3
order by timestamp desc limit 1000
),
t4 as (
select * from delta.my_delta_db.motor_4
order by timestamp desc limit 1000
),
t5 as (
select * from delta.my_delta_db.motor_5
order by timestamp desc limit 1000
),
t6 as (
select * from delta.my_delta_db.motor_6
order by timestamp desc limit 1000
),
t7 as (
select * from delta.my_delta_db.motor_7
order by timestamp desc limit 1000
),
select * from t0
join t1 on t0.timestamp = t1.timestamp
join t2 on t0.timestamp = t2.timestamp
join t3 on t0.timestamp = t3.timestamp
join t4 on t0.timestamp = t4.timestamp
join t5 on t0.timestamp = t5.timestamp
join t6 on t0.timestamp = t6.timestamp
join t7 on t0.timestamp = t7.timestamp
order by t0.timestamp desc limit 1000
The reason I am adding order by timestamp desc limit 1000
in each with
block is to make each Trino sub-task output data size small. This helps avoid joining the total number of rows for each join.
I am wondering if there is a way to make this query shorter but also not lose performance?
I found Trino has a WITH RECURSIVE, but seems not applying to my case (?)
Any guide would be appreciate. Thanks!
It sounds like you’re trying to deal with a poor or possibly denormalised design, have you created them with a clustered index on the timestamp column? What do you mean by “make the query shorter” ? Less text? The written size of a query has no correlation with its performance once compiled.
Thanks @Stu for reminding! There would be z-index on
timestamp
column in this Delta table. Yeah, just hopefully less text so that I don’t have to repeat same text. But if can further improve performance in another SQL, that will be even better! 😃I’m not familiar with
trino
so can’t suggest and product specific tuning, however if you can use temporary tables perhaps insert your 1000 rows fromt0
and then join with the temp table? Normally with a clustered index seek you won’t need to use CTEs like this just join the tables.