Optimize HashJoin Query Performance

When performing queries that join large tables, the SQL optimizer often chooses the HashJoin operator. HashJoin builds a hash table based on the join key for matching, which can lead to memory access and disk I/O bottlenecks. RuntimeFilter is an optimization technique that generates a filter in real-time during the execution of a HashJoin operation. It pre-filters data before the HashJoin, allowing for faster execution. In some scenarios, RuntimeFilter optimization can double the execution efficiency.

HashJoin is commonly used for joining a small table with a large table. When executing a HashJoin operation, SynxDB typically builds a hash table based on the smaller of the two tables to be joined. It then iterates through the tuples of the larger table, looking for matching tuples in the hash table to perform the join. The smaller table used to build the hash table is called the Inner Table, and the other table used for iterative matching is called the Outer Table.

The HashJoin operator mainly has the following execution bottlenecks:

Memory access: For each tuple in the outer table, it needs to find a matching tuple in the hash table, involving one or more memory accesses.
Disk I/O: If the inner table is too large to fit entirely in memory, it needs to be processed in batches using disk, resulting in significant disk I/O.

To address these bottlenecks, enabling the RuntimeFilter optimization allows SynxDB to build a corresponding RuntimeFilter while building the hash table. This filters the tuples of the large table before the HashJoin execution. During execution, the RuntimeFilter is implemented using a Bloom Filter, a data structure that occupies much less memory space than a hash table. When it can fit entirely within the L3 cache, a Bloom Filter’s filtering efficiency is twice that of a HashJoin, significantly reducing memory access overhead.

This optimization decides whether to generate a RuntimeFilter operator based on the selectivity of the HashJoin’s join condition and the size of the inner table. During actual execution, if SynxDB finds that the data volume deviates too much from the estimated result, it will also stop using the RuntimeFilter in a timely manner.

Use RuntimeFilter in the Postgres optimizer

This section describes the basic usage of RuntimeFilter, applicable to the Postgres query optimizer.

Applicable scenarios

If your scenario meets all the following conditions, consider using the RuntimeFilter optimization in HashJoin operations:

The inner table of the HashJoin contains a large amount of valid data on a single segment.
The inner table of the HashJoin has fewer than 16 million rows of valid data on a single segment.
The original HashJoin’s join key has a selectivity of less than 60%, meaning the size of the result set satisfying the hash join condition is less than 60% of the outer table. This can also be understood as a filtering rate greater than 40%.

In the scenarios described above, the Bloom Filter created by SynxDB via RuntimeFilter is less than 2 MB in size and can fit entirely in the L3 cache (typically 16 MB). Because the original inner table has a large amount of data, the directly built hash table cannot fit entirely in the L3 cache. Consequently, RuntimeFilter can filter out 40% of the outer table tuples with minimal overhead, yielding a positive return. In some scenarios, if the selectivity of the HashJoin’s join key is below 10%, the RuntimeFilter optimization can double the execution efficiency.

Usage limitations

Currently, RuntimeFilter is only enabled when the estimated number of rows in the inner table is less than 16 million. This limitation is to prevent the Bloom Filter from consuming too much memory, which could lead to slow execution or low filtering efficiency. Future updates may support using RuntimeFilter for very large inner tables to avoid the disk I/O overhead caused by batch processing.

Usage example

This optimization is effective only in the PostgreSQL optimizer. Therefore, before enabling it, you need to disable the GPORCA optimizer and manually enable the GUC parameter gp_enable_runtime_filter.

-- Preparations
SET optimizer TO off; -- Disables ORCA optimizer, use PostgreSQL optimizer.
SET gp_enable_runtime_filter TO on; -- Enables RuntimeFilter optimization.

-- Creates tables.
DROP TABLE IF EXISTS fact, dim;
CREATE TABLE fact (fid int, did int, val int);
CREATE TABLE dim (did int, proj_id int, filter_val int);

-- Generates test data, where 80% of fact.did and dim.did overlap.
INSERT INTO fact SELECT i, i % 8000 + 1, i FROM generate_series(1, 100000) s(i);
INSERT INTO dim SELECT i, i % 10, i FROM generate_series(1, 10000) s(i);
ANALYZE fact, dim;

-- Views the execution plan.
EXPLAIN (COSTS OFF) SELECT COUNT(*) FROM fact, dim
    WHERE fact.did = dim.did AND proj_id < 2;

The execution plan output is as follows, where the RuntimeFilter operator appears:

                                QUERY PLAN
---------------------------------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Partial Aggregate
               ->  Hash Join
                     Hash Cond: (fact.did = dim.did)
                     ->  RuntimeFilter
                           ->  Seq Scan on fact
                     ->  Hash
                           ->  Broadcast Motion 3:3  (slice2; segments: 3)
                                 ->  Seq Scan on dim
                                       Filter: (proj_id < 2)
 Optimizer: Postgres query optimizer
(12 rows)

To print more execution-related information, you can use EXPLAIN ANALYZE:

->  RuntimeFilter (actual time=0.047..5.976 rows=6682 loops=1)
       Bloom Bits: 1048576
       Extra Text: (seg1)   Inner Processed: 2000, Flase Positive Rate: 0.000000

Optimization effect example

Using the tables from the “Usage Example”, join the fact and dim tables on the did column with the condition that proj_id is less than 2. The execution time shows the effect of using RuntimeFilter:

-- Without RuntimeFilter enabled.
EXPLAIN ANALYZE SELECT COUNT(*) FROM fact, dim
    WHERE fact.did = dim.did AND proj_id < 2;

Execution Time: 35956.436 ms

-- With RuntimeFilter enabled.
SET gp_enable_runtime_filter TO on;
EXPLAIN ANALYZE SELECT COUNT(*) FROM fact, dim
    WHERE fact.did = dim.did AND proj_id < 2;

Execution Time: 18276.112 ms

The above is a partial result of EXPLAIN ANALYZE. In the full results, the scan of the fact table returns 100 million tuples. After passing through the RuntimeFilter, 22 million tuples remain. Finally, after the HashJoin, the expected 20 million tuples are left. The pre-filtering effect is significant, and the time consumed is reduced by approximately 50%.

Push down filters to dynamic scans in GPORCA

The RuntimeFilter feature has been extended to the Dynamic Sequential Scan operator in the GPORCA optimizer. When querying partitioned tables, this enhancement allows the runtime filter to be pushed down from the HashJoin operator to the DynamicSeqScan operator. This filters data before scanning, effectively reducing the amount of data to be processed and significantly improving the join query performance for partitioned tables.

Applicable scenarios

You can consider enabling this feature when your query meets all the following conditions:

You are using the GPORCA optimizer (optimizer=on).
The query involves scanning a Partitioned Table.
The join operation is a HashJoin.

Usage example

This feature is controlled by the GUC parameter gp_enable_runtime_filter_pushdown, which is disabled by default. The following example demonstrates the impact of gp_enable_runtime_filter_pushdown on the performance of queries on partitioned tables.

-- Prepares test tables and data.
CREATE TABLE t1 (c1 INT, c2 INT) DISTRIBUTED BY (c1) PARTITION BY RANGE (c2) (START (1) END (100) EVERY (50));
CREATE TABLE t2 (c1 INT, c2 INT) DISTRIBUTED REPLICATED;
INSERT INTO t1 SELECT generate_series(1, 99), generate_series(1, 99);
INSERT INTO t1 SELECT * FROM t1;
INSERT INTO t1 SELECT * FROM t1;
INSERT INTO t1 SELECT * FROM t1;
INSERT INTO t1 SELECT * FROM t1;
INSERT INTO t2 SELECT generate_series(1, 5), generate_series(1, 5);
INSERT INTO t2 SELECT generate_series(51, 51), generate_series(51, 51);
ANALYZE;

-- Confirms GPORCA is being used.
SET optimizer TO on;

-- Views the execution plan without pushdown enabled.
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM t1, t2 WHERE t1.c2 = t2.c2;

The execution plan output is as follows:

                       QUERY PLAN
--------------------------------------------------------------------------
Gather Motion 2:1  (slice1; segments: 2) (actual rows=96 loops=1)
   ->  Hash Join (actual rows=64 loops=1)
         Hash Cond: (t1.c2 = t2.c2)
         Extra Text: (seg0)   Hash chain length 1.0 avg, 1 max, using 6 of 524288 buckets.
         ->  Dynamic Seq Scan on t1 (actual rows=832 loops=1)
               Number of partitions to scan: 2 (out of 2)
               Partitions scanned:  Avg 2.0 x 2 workers.  Max 2 parts (seg0).
         ->  Hash (actual rows=6 loops=1)
               Buckets: 524288  Batches: 1  Memory Usage: 4097kB
               ->  Partition Selector (selector id: $0) (actual rows=6 loops=1)
                     ->  Seq Scan on t2 (actual rows=6 loops=1)
Optimizer: GPORCA
(12 rows)

Now, enable the runtime filter pushdown feature:

SET gp_enable_runtime_filter_pushdown TO on;

Check the execution plan again:

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM t1, t2 WHERE t1.c2 = t2.c2;

After enabling it, the execution plan changes to:

                        QUERY PLAN
--------------------------------------------------------------------------
Gather Motion 2:1  (slice1; segments: 2) (actual rows=96 loops=1)
   ->  Hash Join (actual rows=64 loops=1)
         Hash Cond: (t1.c2 = t2.c2)
         Extra Text: (seg0)   Hash chain length 1.0 avg, 1 max, using 6 of 524288 buckets.
         ->  Dynamic Seq Scan on t1 (actual rows=64 loops=1)
               Number of partitions to scan: 2 (out of 2)
               Partitions scanned:  Avg 2.0 x 2 workers.  Max 2 parts (seg0).
         ->  Hash (actual rows=6 loops=1)
               Buckets: 524288  Batches: 1  Memory Usage: 4097kB
               ->  Partition Selector (selector id: $0) (actual rows=6 loops=1)
                     ->  Seq Scan on t2 (actual rows=6 loops=1)
Optimizer: GPORCA
(12 rows)

As you can see, in the Dynamic Seq Scan on t1 operator, the number of returned rows decreased from 832 rows to 64 rows. This indicates that most of the data was filtered out during the scan phase, reducing the amount of data that the subsequent HashJoin operator needs to process, thereby improving query efficiency.

-- Cleans up the environment
RESET gp_enable_runtime_filter_pushdown;
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
SET optimizer TO off;

Push down runtime filters to Table Access Method (AM)

The Runtime Filter feature supports pushing down filters to the Table Access Method (AM) layer. This optimization primarily targets tables using the PAX storage format. When a scan on a PAX table occurs, a pushed-down runtime filter can leverage the min/max statistics of columns in the PAX files to skip data files that do not meet the conditions before reading the data. This approach moves the filtering operation to the storage layer, greatly reducing I/O and the amount of data that subsequent operators need to process, significantly improving the performance of join queries involving PAX tables.

Applicable scenarios

You can benefit from this feature when your query meets all the following conditions:

The query includes a HashJoin operation.
The outer table of the HashJoin (the table being scanned and filtered) is in PAX storage format.
The PAX table has minmax statistics enabled on the join key (specified via WITH (minmax_columns=...)).
The runtime filter feature is enabled (gp_enable_runtime_filter = on).

Usage example

This optimization is triggered automatically and requires no additional configuration. The following example demonstrates how this feature can improve query performance.

-- Preparations
SET optimizer TO off; -- Disables ORCA optimizer, use PostgreSQL optimizer.
SET gp_enable_runtime_filter TO on; -- Enables RuntimeFilter optimization.

-- Creates a PAX table as a fact table, and enables minmax statistics for the join key 'did'.
CREATE TABLE fact_pax (fid int, did int, val int) USING PAX WITH (minmax_columns='did');
-- Creates a dimension table.
CREATE TABLE dim (did int, proj_id int, filter_val int);

-- Generates test data.
-- The fact_pax table has 10 million rows.
INSERT INTO fact_pax SELECT i, i % 100000 + 1, i FROM generate_series(1, 10000000) s(i);
-- The data range of the dim table is much smaller than that of the fact_pax table.
INSERT INTO dim SELECT i, i % 10, i FROM generate_series(1, 10000) s(i);
ANALYZE fact_pax, dim;

-- Executes a join query. The dim table is the inner table, and fact_pax is the outer table.
-- The filter condition on the dim table, proj_id < 2, will be passed to the scan process of the fact_pax table via RuntimeFilter.
EXPLAIN ANALYZE SELECT COUNT(*) FROM fact_pax, dim
    WHERE fact_pax.did = dim.did AND proj_id < 2;

Optimization effect analysis

In the EXPLAIN ANALYZE results of the query above, you will observe that the number of rows returned by the Seq Scan operator on the fact_pax table is far less than the total number of rows in the table. This is because after the dim table is filtered by proj_id < 2, a runtime filter is generated for the range of its join key did. This filter is pushed down to the PAX storage layer of fact_pax. PAX uses the min/max statistics of the did column to directly skip entire data files where the did values are outside the filter’s range during the scan.

Compared to the case without pushdown to AM, this optimization avoids reading a large amount of irrelevant data from disk into memory for filtering by upper-level operators, thus achieving significant performance improvements.