v4.1.0 Release Notes

Release date: September 2025

Version: v4.1.0

v4.1.0 is a feature and performance enhancement release that improves query processing efficiency, data lake integration, system high availability, and overall stability.

This version greatly enhances the query optimizer, improving the efficiency of complex analytical queries through advancements in parallel processing, memory management, and locking mechanisms. Data lake integration is expanded with enhanced direct access to mainstream data formats and external data sources, along with increased flexibility in data synchronization. The new version also includes built-in support for common database extensions, simplifying development and management. For system high availability and operations, an automatic failover mechanism for critical nodes has been introduced, and deployment flexibility has been enhanced.

In addition, this release includes many bug fixes, comprehensively improving system stability and reliability.

New features

Database

The following table lists some of the new features in SynxDB v4.1.0. For a full list of features, see the sections below the table.

| Category | Features | User documents |
| --- | --- | --- |
| Query processing and optimization | The GPORCA optimizer introduces a two-phase optimization for window functions to address performance bottlenecks caused by partition-key data skew in window function queries. | Two-phase window function optimization |
| Query processing and optimization | The GPORCA optimizer now supports the Append operator as an alternative for scanning partitioned tables, addressing the limitation where the vectorized executor does not support multi-threaded metadata reading with Dynamic Scan. | Support using Append operator to scan partitioned tables |
| Query processing and optimization | Implements parallel processing for UNION operations by introducing a “Parallel-oblivious Append” mechanism, which allows multiple worker processes to handle data independently without sharing resources. | Execute UNION operations in parallel |
| Query processing and optimization | The runtime filtering feature in the GPORCA optimizer has been extended to the DynamicSeqScan operator, allowing runtime filters to be pushed down from the HashJoin operator to the partitioned table scan phase. | Push down filters to dynamic scans in GPORCA |
| Query processing and optimization | Supports parallelizing DISTINCT operations in an MPP architecture by redistributing tuple data among multiple worker processes using the Motion operator. | Execute DISTINCT operations in parallel |
| Query processing and optimization | Adds detailed hash aggregation statistics to the EXPLAIN ANALYZE output, including the number of hash tables, group and batch processing details, number of overflow partitions, disk space used, and hash collision chain length. | Interpret the EXPLAIN ANALYZE output for hash aggregation |
| Query processing and optimization | Introduces a lock optimization for SELECT ... FOR UPDATE and similar queries, which downgrades a relation-level lock to a tuple-level lock when certain conditions are met. | Lock optimization for SELECT … FOR UPDATE, enable_lock_optimization parameter |
| Storage | Comprehensively optimizes memory management for the Hash Aggregate operator when processing large-scale data. | Hash aggregation memory management optimization, hash_mem_multiplier parameter |
| Data loading and external tables | datalake_fdw external tables now support custom HDFS user access via CREATE USER MAPPING, replacing the default gpadmin system role. | External table supports HDFS user mapping |
| Data loading and external tables | Adds the sync_hive_partition_table function, which synchronizes a single partition of a Hive table rather than the entire table. | Use sync_hive_partition_table to synchronize partitioned Hive tables |
| Data loading and external tables | Supports direct access to Apache Iceberg tables stored in Amazon S3 or compatible object storage, without relying on external metadata catalogs (such as Hive Metastore or REST Catalog). | Load Iceberg table data from S3 (without external metadata service) |
| Tools and utilities | Adds support for the PL/Container extension, allowing users to run PL/Python user-defined functions (UDFs) securely within Docker containers. | PL/Container extension |
| Tools and utilities | Adds built-in support for the pg_partman extension, which automates the creation and management of partitioned tables based on time or sequence. | Use pg_partman to manage partitioned tables |

Query processing and optimization

  • Two-phase window function optimization: The GPORCA optimizer introduces a two-phase optimization for window functions to address performance bottlenecks caused by partition-key data skew in window function queries. When a window function with a highly selective filter condition (for example, rank() < N) is detected, the computation is automatically split into two phases: local pre-computation and global computation after redistribution, minimizing data transfer. This optimization is completely transparent to users, delivering performance improvements without requiring any SQL statement modifications.

    For details, see Two-phase window function optimization.
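
    As a minimal sketch (assuming a hypothetical orders table), this is the query shape that benefits; no SQL changes are required:

    ```sql
    -- The highly selective filter on the window function result (r < 10)
    -- lets the optimizer pre-compute ranks locally before redistribution.
    SELECT order_id, customer_id, amount
    FROM (
        SELECT order_id, customer_id, amount,
               rank() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS r
        FROM orders
    ) t
    WHERE r < 10;
    ```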

  • ORCA supports using the Append operator to scan partitioned tables: The GPORCA optimizer now supports the Append operator as an alternative for scanning partitioned tables, addressing the limitation where the vectorized executor does not support multi-threaded metadata reading with Dynamic Scan. Users can enable this feature via the GUC parameter optimizer_disable_dynamic_table_scan, allowing the vectorized executor to leverage its performance advantages in partitioned table scenarios.

    For details, see Support using Append operator to scan partitioned tables.
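
    For example (a sketch; the table is hypothetical and the setting is shown at session level):

    ```sql
    -- Disable Dynamic Scan so GPORCA plans an Append over the leaf
    -- partitions, which the vectorized executor can read with multiple threads.
    SET optimizer_disable_dynamic_table_scan = on;

    EXPLAIN SELECT count(*) FROM sales WHERE sale_date >= date '2025-01-01';
    ```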

  • Parallelize UNION operations: Implements parallel processing for UNION operations by introducing a “Parallel-oblivious Append” mechanism, which allows multiple worker processes to handle data independently without sharing resources. This optimization improves the performance and reliability of queries containing UNION, and users can observe parallel execution operators such as HashAggregate and Redistribute Motion in the execution plan.

    For details, see Execute UNION operations in parallel.
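
    A sketch with hypothetical tables (it assumes parallel query execution is enabled in your environment):

    ```sql
    -- With the parallel-oblivious Append path, each branch can be processed
    -- by independent workers; look for HashAggregate and Redistribute Motion
    -- operators in the resulting plan.
    EXPLAIN (ANALYZE)
    SELECT region FROM sales_2024
    UNION
    SELECT region FROM sales_2025;
    ```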

  • Push down runtime filters to dynamic SeqScan: The runtime filtering feature in the GPORCA optimizer has been extended to the DynamicSeqScan operator, allowing runtime filters to be pushed down from the HashJoin operator to the partitioned table scan phase. By filtering data before it is scanned, this feature effectively reduces the amount of data to be processed, greatly improving join query performance on partitioned tables. This feature can be enabled using the GUC parameter gp_enable_runtime_filter_pushdown.

    For details, see Push down filters to dynamic scans in GPORCA.
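
    A minimal sketch, assuming a hypothetical partitioned fact table and dimension table:

    ```sql
    -- Enable runtime filter pushdown from HashJoin to DynamicSeqScan.
    SET gp_enable_runtime_filter_pushdown = on;

    -- The filter built from the dimension-side hash table is applied while
    -- scanning the partitioned fact table, reducing the rows processed.
    EXPLAIN (ANALYZE)
    SELECT f.amount
    FROM fact_sales f
    JOIN dim_customer d ON f.customer_id = d.customer_id
    WHERE d.segment = 'enterprise';
    ```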

  • Parallelize DISTINCT operations: Supports parallelizing DISTINCT operations in an MPP architecture by redistributing tuple data among multiple worker processes using the Motion operator. A multi-stage processing strategy is employed, including pre-deduplication via Streaming HashAggregate, data redistribution based on the DISTINCT expression, and a final deduplication stage. This greatly enhances DISTINCT query performance with large data volumes.

    For details, see Execute DISTINCT operations in parallel.
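
    A sketch (with a hypothetical table) showing the kind of query that benefits:

    ```sql
    -- The plan pre-deduplicates with a Streaming HashAggregate, redistributes
    -- tuples by the DISTINCT expression via a Motion operator, then performs
    -- the final deduplication.
    EXPLAIN (ANALYZE)
    SELECT DISTINCT customer_id FROM orders;
    ```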

  • Enhance EXPLAIN ANALYZE details for hash aggregation: Adds detailed hash aggregation statistics to the EXPLAIN ANALYZE output, including key metrics such as the number of hash tables, group and batch processing details, number of overflow partitions, disk space used, and hash collision chain length. These details help users better understand the memory usage and performance characteristics of hash aggregation operations, facilitating the diagnosis of query performance issues and targeted optimization.

    For details, see Interpret the EXPLAIN ANALYZE output for hash aggregation.
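
    For example, a grouping query whose HashAggregate node now reports these metrics (the table name is hypothetical):

    ```sql
    EXPLAIN (ANALYZE)
    SELECT customer_id, sum(amount)
    FROM orders
    GROUP BY customer_id;
    -- The HashAggregate node in the output now includes hash table counts,
    -- batch and overflow partition details, disk usage, and hash collision
    -- chain lengths.
    ```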

  • Optimize locks for SELECT ... FOR UPDATE: Introduces a new lock optimization feature for SELECT ... FOR UPDATE and similar queries, which downgrades a relation-level lock to a tuple-level lock when certain conditions are met. Controlled by the enable_lock_optimization parameter, enabling this feature can considerably improve concurrent transaction performance and reduce lock wait times. The parameter is false by default to maintain compatibility.

    For details, see Lock optimization for SELECT … FOR UPDATE, enable_lock_optimization parameter.
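
    A minimal sketch (hypothetical table; session-level setting shown):

    ```sql
    -- Opt in to the tuple-level lock downgrade (off by default).
    SET enable_lock_optimization = on;

    BEGIN;
    -- When the downgrade conditions are met, only the matching tuples are
    -- locked rather than the whole relation, reducing lock contention.
    SELECT * FROM accounts WHERE id = 42 FOR UPDATE;
    COMMIT;
    ```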

  • ORCA deduplicates superset conditions in AND clauses: Optimizes GPORCA’s logic for deduplicating superset conditions when processing AND clauses, such as pruning redundant conditions in subquery scenarios, improving the efficiency of query optimization.

  • ORCA prunes unused columns in CTEs: Optimizes GPORCA’s handling of Common Table Expressions (CTEs) to prune unused column references (ColRefs) in both the producer and consumer.

  • ORCA inlines basic wrapper functions: The GPORCA optimizer now inlines some basic wrapper functions, reducing the overhead of function calls and further improving query optimization performance.

  • Optimize PLT linking method: By statically linking instead of dynamically linking against libpostgres.so, procedure linkage table (PLT) function-call overhead has been reduced, resulting in a 5-8% performance improvement in TPC-DS benchmarks.

  • Optimize AQUMV system: The AQUMV (Answer Query Using Materialized Views) system now directly stores and uses the original SQL when processing materialized views, avoiding the overhead of rule parsing and thus improving query performance.

  • Optimize path generation for UNION operations: Optimizes the path generation logic for UNION operations by skipping mismatched partial paths, which improves the efficiency of query plan generation.

Storage

  • Optimize hash aggregation memory management: Comprehensively optimizes memory management for the Hash Aggregate operator when processing large-scale data. By raising the default value of the hash_mem_multiplier parameter to 2.0 and eliminating the extra 16-byte memory overhead of NumericAggState, memory utilization is greatly improved, effectively reducing the risk of data spilling to disk due to insufficient memory. This optimization is completely transparent to users and can substantially boost query performance on large datasets or queries with many grouping keys.

    For details, see Hash aggregation memory management optimization, hash_mem_multiplier parameter.
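
    For reference (the override below is illustrative, not a recommendation):

    ```sql
    SHOW hash_mem_multiplier;   -- 2 by default in v4.1.0
    -- Hash-based operations may use up to hash_mem_multiplier times the base
    -- work memory before spilling; raise it per session if large
    -- aggregations still spill to disk.
    SET hash_mem_multiplier = 4.0;
    ```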

  • Gracefully degrades on AO table WAL compression failure: When WAL compression for AO/AOCO tables fails, the system gracefully degrades by logging the event and falling back to the uncompressed WAL format, thus avoiding a PANIC error and improving system stability.

  • Unifies catalog table paths: To avoid path inconsistencies between QD (query dispatcher) and QE (query executor) nodes, catalog table paths are now unified using relid instead of relfilenode, ensuring metadata consistency.

Data loading and external tables

  • External table supports HDFS user mapping: datalake_fdw external tables now support custom HDFS user access via CREATE USER MAPPING, replacing the default gpadmin system role. This enables fine-grained access control over HDFS resources, which is particularly useful for permission management in multi-tenant or multi-user environments; users can configure a different HDFS access user for each external table as needed.

    For details, see External table supports HDFS user mapping.
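
    A sketch, assuming an existing datalake_fdw server named hdfs_server; the option key (user) is an assumption, so check the linked document for the exact option name:

    ```sql
    -- Access HDFS as etl_user instead of the default gpadmin role.
    CREATE USER MAPPING FOR CURRENT_USER
        SERVER hdfs_server
        OPTIONS (user 'etl_user');
    ```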

  • Support for single-partition synchronization of Hive partitioned tables: Adds the sync_hive_partition_table function, which synchronizes a single partition of a Hive table rather than the entire table. The function synchronizes the data under the partition corresponding to the specified value of the top-level partition key (the first key in the partition column definition), improving synchronization efficiency for large partitioned tables. This is particularly suitable for incremental synchronization scenarios: users can synchronize specific partitions as needed and avoid the overhead of full-table synchronization.

    For details, see Use sync_hive_partition_table to synchronize partitioned Hive tables.
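
    An illustrative call shape only; the argument names and order here are assumptions, so see the linked document for the actual signature:

    ```sql
    -- Hypothetical: synchronize only the dt=2025-09-01 partition of a
    -- Hive table instead of the entire table.
    SELECT sync_hive_partition_table(
        'hive_cluster',      -- Hive data source name (assumed)
        'default',           -- Hive database (assumed)
        'sales',             -- Hive table (assumed)
        'dt=2025-09-01'      -- top-level partition value to synchronize (assumed)
    );
    ```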

  • Direct read of Iceberg tables (S3 storage): Supports direct access to Apache Iceberg tables stored in Amazon S3 or compatible object storage, without relying on external metadata catalogs (such as Hive Metastore or REST Catalog). By configuring datalake_fdw and s3.conf, users can quickly run read-only queries and analysis over existing Iceberg data, with both partitioned and non-partitioned tables supported. This feature is particularly suitable for quick data exploration and analysis in data lake scenarios.

    For details, see Load Iceberg table data from S3 (without external metadata service).
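
    A sketch only; the server name, column list, and option keys below are assumptions for illustration, so see the linked document for the exact datalake_fdw and s3.conf configuration:

    ```sql
    -- Hypothetical read-only foreign table over an Iceberg table in S3.
    CREATE FOREIGN TABLE iceberg_events (
        event_id   bigint,
        event_time timestamp
    )
    SERVER datalake_server                 -- assumed datalake_fdw server
    OPTIONS (format 'iceberg',
             location 's3://my-bucket/warehouse/db/events/');

    SELECT count(*) FROM iceberg_events;
    ```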

Tools and utilities

  • PL/Container: Adds support for the PL/Container extension, allowing users to run PL/Python user-defined functions (UDFs) securely within Docker containers. Container isolation ensures that user code cannot access the host operating system or file system, effectively preventing security risks. It supports Python 2.7, Python 3.6+, and R, and is built into SynxDB with no additional installation required.

    For details, see PL/Container extension.
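
    A minimal sketch following the upstream PL/Container convention; it assumes a Python container runtime (here named plc_python_shared) has already been registered with the plcontainer utility:

    ```sql
    CREATE EXTENSION IF NOT EXISTS plcontainer;

    -- The "# container:" directive selects which configured container
    -- runtime executes the function body.
    CREATE OR REPLACE FUNCTION py_add(a int, b int) RETURNS int AS $$
    # container: plc_python_shared
    return a + b
    $$ LANGUAGE plcontainer;

    SELECT py_add(1, 2);  -- returns 3, executed inside the container
    ```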

  • pg_partman partition management extension: Adds built-in support for the pg_partman extension, which automates the creation and management of partitioned tables based on time or sequence. Users do not need to compile and install it manually; simply enabling the extension in the database simplifies the maintenance of partitioned tables.

    For details, see Use pg_partman to manage partitioned tables.
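
    A sketch following the upstream pg_partman 4.x API (the table is hypothetical and must already be declared partitioned):

    ```sql
    CREATE SCHEMA IF NOT EXISTS partman;
    CREATE EXTENSION pg_partman SCHEMA partman;

    -- Assumes public.events was created with PARTITION BY RANGE (event_time).
    -- Daily partitions are then created and retired automatically.
    SELECT partman.create_parent(
        p_parent_table => 'public.events',
        p_control      => 'event_time',
        p_type         => 'native',
        p_interval     => 'daily'
    );
    ```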

  • Improve pg_dump lock table logic: Optimizes the lock table logic of the pg_dump tool, reducing communication overhead between the frontend and backend, and improving backup efficiency.

Interactive manager DBCC

Note

Starting from v4.1.0, the interactive manager CBCC has been renamed to DBCC.

| Feature | User document |
| --- | --- |
| Supports coordinator automatic failover. | DBCC Server Coordinator automatic failover |
| Supports binding a virtual IP. | Bind VIP |
| Supports modifying user passwords. | Install DBCC Server |
| Supports customizing service ports. | Customize ports |
| Supports disabling multi-language support. | Disable multi-language support |

Product change information

Configuration parameters

  • gp_appendonly_insert_files parameter default value change: The default value of the system parameter gp_appendonly_insert_files has been changed to 0. Users who need to prepare multiple AO/AOCO files for parallel queries must now set this parameter manually before inserting data, as shown below.
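
    For example (the value and table names are illustrative):

    ```sql
    -- Default is now 0; set it explicitly before loading if you want
    -- multiple AO/AOCO segment files prepared for parallel queries.
    SET gp_appendonly_insert_files = 4;
    INSERT INTO my_ao_table SELECT * FROM staging_table;
    ```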

Installation and deployment

  • The environment script greenplum_path.sh has been officially renamed to cloudberry-env.sh to be consistent with Apache Cloudberry™ (Incubating). See Deploy on multi-node.

  • PAX build optimization: To simplify dependencies and build processes, local source code has replaced the cpp-stub submodule.

Data lake and external integration

  • Default data lake agent server address set to localhost: The default binding address of the datalake_agent service has been changed to localhost, ensuring that the proxy only accepts connections from the local database instance. This security design effectively prevents the proxy service port from being exposed on the network, enhancing data lake access security.

  • The default port of the data lake agent datalake_agent has been changed from 5888 to 3888 to resolve the port conflict with PXF.

  • Default gpfdist compression level adjustment: The default compression level of the gpfdist tool has been adjusted from 1 to 3. The new default value aims to achieve a better balance between CPU overhead and network traffic, thereby improving the performance of most network-intensive ETL tasks.

  • Optimizes gpfdist performance by automatically increasing the socket buffer size when compression is enabled, improving data loading efficiency.

Removed features

  • Remove QuickLZ compression algorithm support: Due to potential security risks and lower compression efficiency, support for the QuickLZ compression algorithm has been removed. Users are recommended to migrate to more modern and efficient compression algorithms such as Zstandard (zstd) or zlib.

Bug fixes

  • Fixed the pg_dump tool to correctly handle AO/AOCO tables with appendonly and orientation options, thereby supporting backup operations on these tables.

  • Fixed the internal database version check logic of pg_upgrade.

  • Fixed the issue where metadata in the gp_matview_aux table was not synchronized when renaming materialized views.

  • Fixed possible system crashes and incorrect calculation results when calling pg_relation_size on PAX partitioned tables, caused by improper handling of the root node access method (AM).

  • Optimized the storage space usage of TOAST data in AO row-storage tables.

  • The initdb tool now prohibits an empty username (-U "") and reports an explicit error, improving the robustness of initialization.

  • Fixed the symbol export conflict between the postgres binary file and the libpq library, improving system stability.

  • Optimized the exception handling logic of Interconnect, fixed the stack overflow issue caused by recursive calls, and improved system stability in exceptional scenarios.

  • Fixed a potential invalid write in MotionConn during explicit data redistribution (Motion), improving query execution stability.

  • Fixed the issue where parallel path generation might result in consider_parallel being disabled due to improper handling of the NULL bitmap.

  • Fixed the issue where parallel degree calculation might be inaccurate due to incorrect activeTaskCount statistics in dynamic parallelism.

  • Fixed the issue where COUNT(*) and COUNT(column) might be confused with each other in the vectorized executor, ensuring accurate aggregation results.

  • Fixed the conversion issues for Decimal and Timestamp-Millis types when reading Parquet and ORC files, improving data lake query accuracy.

  • Fixed the compatibility issues that may arise due to the ICU library not supporting certain database encodings.

  • Fixed the system crash (core dump) issue that might occur when executing data lake related queries in a single Coordinator node deployment environment.

  • Fixed incorrect query results in GPORCA caused by aggregation number (aggno, aggtransno) generation errors, which might occur in HAVING clause scenarios.

  • Fixed the invalid attribute mapping issue that might occur when Runtime Filter Pushdown is enabled.

  • Fixed a recovery failure caused by erroneously ported code in the gprecoverseg tool.

  • Improved the gpfdist tool's compatibility with the libevent 2.0+ library, and enhanced its stability in high-concurrency scenarios by using a thread-safe event_base instead of event_init.

  • Enhanced problem diagnosis: when the writer (Writer Gang) is lost, the system now prints detailed stack trace information.

  • Fixed the issue where materialized view status might be incorrect when processing partitioned tables.

  • Fixed a limitation that might occur when processing a large number of object OIDs.