Load Data from Object Storage and HDFS
You can use the datalake_fdw extension to load data from object storage (such as Amazon S3), HDFS, and ORC tables in Hive into SynxDB for query and access.
Currently, supported data formats are CSV, TEXT, ORC, and PARQUET.
Note
datalake_fdw does not support loading data in parallel.
Attention
The default port for the datalake_agent has been changed from 5888 to 3888 to avoid conflicts with PXF.
The datalake_agent is a background service that must run on each SynxDB node (coordinator and all segments). By default, the agent binds to localhost and listens on port 3888, meaning it only accepts connections from the local database instance. This secure design ensures that the agent's service port is not exposed to the network.
For information on how to load tables from Hive into SynxDB, see Load Data from Hive Data Warehouse.
Install the extension
To install the datalake_fdw extension in the database, execute the following SQL statement:
CREATE EXTENSION datalake_fdw;
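To confirm that the extension is installed, you can optionally query the standard PostgreSQL catalog; the query below is a quick sanity check and returns one row when the extension is present:
SELECT extname, extversion FROM pg_extension WHERE extname = 'datalake_fdw';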
Instructions
This section explains how to use datalake_fdw to load data from object storage and HDFS into SynxDB.
To load data using datalake_fdw, you need to create a foreign data wrapper (FDW) first. This includes creating an FDW server and user mapping.
Load data from object storage
You can load data from Amazon S3 or other object storage services into SynxDB. Follow these steps:
Create a foreign data wrapper (FOREIGN DATA WRAPPER). Note that the SQL statement below must be executed exactly as provided, without changing any options.
CREATE FOREIGN DATA WRAPPER datalake_fdw HANDLER datalake_fdw_handler VALIDATOR datalake_fdw_validator OPTIONS ( mpp_execute 'all segments' );
Create an external server foreign_server.
CREATE SERVER foreign_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (host 'xxx', protocol 's3b', isvirtual 'false', ishttps 'false');
The options in the above SQL statement are explained as follows:
host
Description: Sets the host information for accessing the object storage.
Required: Must be set.
Example: Host for a private cloud: 192.168.1.1:9000

protocol
Description: Specifies the cloud platform for the object storage.
Required: Must be set.
Options: s3b (Amazon Cloud, uses v2 signature), s3 (Amazon Cloud, uses v4 signature)

isvirtual
Description: Whether to use virtual-host-style or path-host-style to parse the host of the object storage.
Required: Optional.
Options: true (uses virtual-host-style), false (uses path-host-style)
Default value: false

ishttps
Description: Whether to use HTTPS to access the object storage.
Required: Optional.
Options: true (uses HTTPS), false (does not use HTTPS)
Default value: false

Create a user mapping.
CREATE USER MAPPING FOR gpadmin SERVER foreign_server OPTIONS (user 'gpadmin', accesskey 'xxx', secretkey 'xxx');
The options in the above SQL statement are explained as follows:
user
Description: Specifies the user for foreign_server.
Required: Yes

accesskey
Description: The access key needed to access the object storage.
Required: Yes

secretkey
Description: The secret key needed to access the object storage.
Required: Yes
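To confirm that the server and user mapping have been created, you can optionally check the standard PostgreSQL catalog views (output columns may vary by version):
SELECT srvname FROM pg_foreign_server WHERE srvname = 'foreign_server';
SELECT usename, srvname FROM pg_user_mappings WHERE srvname = 'foreign_server';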
Create the foreign table example. After creating it, the data on the object storage is loaded into SynxDB, and you can query this table.
CREATE FOREIGN TABLE example( a text, b text ) SERVER foreign_server OPTIONS (filePath '/test/parquet/', compression 'none', enableCache 'false', format 'parquet');
The options in the SQL statement above are explained as follows:
filePath
Description: Sets the specific path for the target foreign table.
Required: Must be set.
Setting: The path format should be /bucket/prefix.
Example: If the bucket name is test-bucket and the path is bucket/test/orc_file_folder/, containing files such as 0000_0, 0001_1, and 0002_2, then to access the file 0000_0, set filePath to filePath '/test-bucket/test/orc_file_folder/0000_0'. To access all files in test/orc_file_folder/, set filePath to filePath '/test-bucket/test/orc_file_folder/'.
Note: filePath is parsed in the format /bucket/prefix/. Incorrect formats might lead to errors, such as:
filePath 'test-bucket/test/orc_file_folder/'
filePath '/test-bucket/test/orc_file_folder/0000_0'

compression
Description: Sets the write compression format. Currently supports gzip, zstd, and lz4.
Required: Optional.
Options: none (supports CSV, ORC, TEXT, PARQUET), gzip (supports CSV, TEXT, PARQUET), zstd (supports PARQUET), lz4 (supports PARQUET)
Default value: none, which means no compression. Leaving this option unset also means no compression.

enableCache
Description: Specifies whether to use the Gopher cache.
Required: Optional.
Options: true (enables the Gopher cache), false (disables the Gopher cache)
Default value: false
Note: Deleting the foreign table does not automatically clear its cache. To clear the cache, you need to manually run a specific SQL function, such as:
select gp_toolkit._gopher_cache_free_relation_name(text);

format
Description: The file format supported by the FDW.
Required: Must be set.
Options: csv (read, write), text (read, write), orc (read, write), parquet (read, write)
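If you want a physical copy of the object-storage data inside a regular SynxDB table (for example, to avoid repeated remote reads), a common pattern is to follow the foreign-table definition with an INSERT ... SELECT. The sketch below is a minimal example; the bucket path, table names, and column layout are hypothetical:
-- Hypothetical foreign table over CSV files in a hypothetical bucket path.
CREATE FOREIGN TABLE sales_ext (
    order_id int,
    amount numeric
) SERVER foreign_server
OPTIONS (filePath '/test-bucket/sales/csv/', format 'csv');

-- Local table that receives a copy of the data.
CREATE TABLE sales_local (order_id int, amount numeric) DISTRIBUTED BY (order_id);
INSERT INTO sales_local SELECT * FROM sales_ext;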
Load Iceberg table data from S3 (without an external metadata service)
This section describes how to configure SynxDB to directly load Apache Iceberg tables stored on Amazon S3 or other compatible object storage without depending on an external metadata catalog (such as Hive Metastore or a REST Catalog).
This feature is primarily intended for quick, read-only querying and analysis of existing Iceberg data.
Prerequisites: Correct Iceberg table metadata
This feature requires that the metadata of your Iceberg tables on object storage is S3 path-aware. This means the Iceberg tables must be generated directly by tools (such as Apache Spark) that are configured with an S3 warehouse.
Incorrect approach: Generating Iceberg tables on a local file system and then manually uploading their folders to object storage. This results in metadata files containing invalid local file system paths (for example, file:/...), causing errors during queries.
Correct approach: Ensure that the catalog or warehouse configuration of your data generation tool (for example, Spark) points directly to an S3 path (for example, s3a://your-bucket/your-warehouse). This ensures that the generated metadata contains the correct s3a:// protocol paths.
Preparations: Configure the S3 connection file
To enable this feature, you need to create a configuration file named s3.conf on the coordinator node of your SynxDB cluster. This file provides the underlying datalake_agent with the necessary connection and authentication information to access S3.
The s3.conf file must be placed in the data directory of the coordinator node. You can find the exact path of this directory by executing the echo $COORDINATOR_DATA_DIRECTORY command on the coordinator node. For example, the path might be /data0/coordinator/gpseg-1.
Example:
s3_cluster:
fs.s3a.endpoint: http://127.0.0.1:8000
fs.s3a.access.key: admin
fs.s3a.secret.key: password
fs.s3a.path.style.access: true
fs.defaultFS: s3a://
fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
Procedures
Create a foreign data wrapper. You can skip this step if it already exists.
CREATE FOREIGN DATA WRAPPER datalake_fdw HANDLER datalake_fdw_handler VALIDATOR datalake_fdw_validator OPTIONS (mpp_execute 'all segments');
Create a foreign server pointing to the S3 service. This is a standard S3 server definition.
CREATE SERVER s3_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (host 'your_s3_host', protocol 's3');
host: Specifies the host information for accessing the object storage.
protocol: For S3 or compatible storage, set this to s3.
Create a user mapping to associate a user with the foreign server.
Note
This step is mandatory for establishing the association between the user and the foreign server within the database, but you do not need to provide any authentication-related OPTIONS. The actual authentication process is controlled by the s3.conf file.
CREATE USER MAPPING FOR gpadmin SERVER s3_server;
Create a foreign table to map to the Iceberg data on S3.
CREATE FOREIGN TABLE iceberg_s3_table (
    -- Define the table columns here. The structure must match the schema of the Iceberg table.
    id int,
    name text,
    create_date date
    -- If it is a partitioned table, the partition key must also be defined as a column.
) SERVER s3_server
OPTIONS (
    format 'iceberg',
    catalog_type 's3',
    server_name 's3_cluster',
    filePath '/your_bucket/path/to/warehouse/',
    table_identifier 'your_db.your_table'
);
format: Specifies the file format. For this scenario, it is fixed to 'iceberg'.
catalog_type: Specifies the catalog type. For a catalog-less S3 scenario, it is fixed to 's3'.
server_name: Specifies the name of the cluster configuration defined in the s3.conf file. In this example, it is 's3_cluster'.
filePath and table_identifier: These two parameters jointly determine the access path to the table. To ensure the table is located correctly, the following configuration pattern is recommended. Assume your table structure on S3 is bucket-name/warehouse-name/database-name/table-name/, for example my-bucket/iceberg_warehouse/db1/orders/. The parameters should then be set as follows:
filePath: '/my-bucket/iceberg_warehouse/' (points to the warehouse root path that contains the database directories)
table_identifier: 'db1.orders' (in the format 'database_name.table_name')
Examples
Example 1: Query a non-partitioned table. Assume the path to the Iceberg table on S3 is s3a://ossext-ci-test/warehouse/iceberg/warehouse/default/simple_table.
Create the foreign table iceberg_simple:
CREATE FOREIGN TABLE iceberg_simple ( id int, name text ) SERVER s3_server OPTIONS ( filePath '/ossext-ci-test/warehouse/iceberg/warehouse/', catalog_type 's3', server_name 's3_cluster', table_identifier 'default.simple_table', format 'iceberg' );
Query the data:
SELECT * FROM iceberg_simple WHERE id = 1;
Example 2: Query a partitioned table. Assume the Iceberg table partitioned_table on S3 is partitioned by the create_date field, and its path is s3a://ossext-ci-test/warehouse/iceberg/warehouse/testdb/partitioned_table.
Create the foreign table iceberg_partitioned. Note that the partition key create_date must be included in the column definitions.
CREATE FOREIGN TABLE iceberg_partitioned ( id int, name text, age int, department text, create_date date ) SERVER s3_server OPTIONS ( filePath '/ossext-ci-test/warehouse/iceberg/warehouse/', catalog_type 's3', server_name 's3_cluster', table_identifier 'testdb.partitioned_table', format 'iceberg' );
Query the data:
SELECT name, department FROM iceberg_partitioned WHERE create_date = '2025-05-20';
Limitations and notes
Read-only operations: Iceberg foreign tables created with this method only support SELECT queries. Write operations such as INSERT, UPDATE, and DELETE are not supported.
Authentication method: This feature uses only the s3.conf file for authentication. Although the database still requires creating a USER MAPPING to associate a user with the foreign server, you do not need to provide authentication options in the CREATE USER MAPPING command for this scenario.
Path concatenation: Ensure that filePath and table_identifier are set correctly. The system locates the table data by concatenating filePath and table_identifier. Typically, filePath should point to the warehouse root directory that contains one or more database directories.
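Because these Iceberg foreign tables are read-only, a common workaround when you need to modify the data locally is to materialize it into a regular SynxDB table first. The statement below is a minimal sketch, assuming the iceberg_simple table from Example 1; the local table name and distribution key are illustrative choices:
-- Materialize the Iceberg data into a local, writable table.
CREATE TABLE simple_table_local AS
SELECT id, name FROM iceberg_simple
DISTRIBUTED BY (id);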
Read Iceberg tables on S3 via Polaris Catalog
This section explains how to query Apache Iceberg tables stored on Amazon S3 or other compatible object storage in SynxDB by connecting to a Polaris Catalog service.
This feature allows you to use an external, centralized metadata service to manage Iceberg tables while using the powerful query capabilities of SynxDB for data analysis. Iceberg foreign tables created with this method currently only support SELECT queries; write operations like INSERT, UPDATE, and DELETE are not supported.
Core concepts
Unlike accessing the filesystem directly, accessing Iceberg tables via a catalog service requires SynxDB to communicate with two separate external systems:
Polaris Catalog Service: A service for storing and managing Iceberg table metadata (such as schema, partition information, and snapshots).
S3 Object Storage Service: An external service for storing the actual data files (for example, parquet files).
Therefore, you need to create two independent sets of SERVER and USER MAPPING objects to configure and authenticate the connections for these two services respectively.
Prerequisites
Network connectivity:
Ensure that the SynxDB cluster has network access to the host address of the external S3 service. This may require configuring appropriate firewall outbound rules or network policies. The requirements for accessing S3 are the same as for standard S3 foreign tables.
Ensure that the SynxDB cluster can access the Polaris Catalog service.
Credentials:
Prepare the authentication credentials (accesskey and secretkey) required to access the S3 service.
Prepare the OAuth2 authentication credentials (client_id and client_secret) required to access the Polaris Catalog service.
Procedure to read Iceberg tables on S3
Create the FOREIGN DATA WRAPPER datalake_fdw. You can skip this step if it already exists.
CREATE EXTENSION IF NOT EXISTS datalake_fdw;
CREATE FOREIGN DATA WRAPPER datalake_fdw HANDLER datalake_fdw_handler VALIDATOR datalake_fdw_validator OPTIONS (mpp_execute 'all segments');
Configure the connection and authentication for the S3 service. Create a SERVER object and a corresponding USER MAPPING for the external S3 service.
-- 1. Create a server object for the S3 service.
CREATE SERVER s3_data_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (host 'your_s3_host:port', protocol 's3', ishttps 'false');
-- 2. Create a user mapping for the S3 server to provide authentication credentials.
CREATE USER MAPPING FOR gpadmin SERVER s3_data_server OPTIONS (user 'gpadmin', accesskey 'YOUR_S3_ACCESS_KEY', secretkey 'YOUR_S3_SECRET_KEY');
Configure the connection and authentication for the Polaris Catalog service. Similarly, create a dedicated SERVER object and USER MAPPING for the Polaris Catalog service.
-- 1. Create a server object for the Polaris Catalog service.
CREATE SERVER polaris_catalog_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (polaris_server_url 'http://polaris-service-url:8181/api/catalog', polaris_server_realm 'POLARIS');
-- 2. Create a user mapping for the Polaris server to provide OAuth2 authentication credentials.
CREATE USER MAPPING FOR gpadmin SERVER polaris_catalog_server OPTIONS (client_id 'your_client_id', client_secret 'your_client_secret', scope 'PRINCIPAL_ROLE:ALL');
Create a foreign table to map to the Iceberg table managed by the Polaris Catalog.
CREATE FOREIGN TABLE my_iceberg_table (
    name text,
    score decimal(16, 2)
) SERVER s3_data_server  -- Note: The SERVER here points to the S3 data server.
OPTIONS (
    format 'iceberg',
    catalog_type 'polaris',
    table_identifier 'polaris.testdb.mytable',
    server_name 'polaris_catalog_server',  -- [Key] Specifies which server to get metadata from.
    filePath '/your-bucket/warehouse/polaris'  -- [Key] Still need to specify the data root path on S3.
);
OPTIONS parameter details:
format: Specifies the file format. For this scenario, it is fixed to 'iceberg'.
catalog_type: Specifies the catalog type, fixed to 'polaris'.
table_identifier: The full identifier of the table in the Polaris Catalog, in the format <catalog_name>.<db_name>.<table_name>.
server_name: [Key] Specifies the name of the Polaris Catalog server used for fetching metadata, which is polaris_catalog_server created in Step 3.
filePath: [Key] The root or warehouse path on S3 where the Iceberg data files are stored. This parameter is still required.
Complete example
-- Step 1: Create FDW.
CREATE EXTENSION IF NOT EXISTS datalake_fdw;
CREATE FOREIGN DATA WRAPPER datalake_fdw HANDLER datalake_fdw_handler VALIDATOR datalake_fdw_validator OPTIONS ( mpp_execute 'all segments' );
-- Step 2: Configure S3 access.
CREATE SERVER s3_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (host '192.168.50.102:8002', protocol 's3', ishttps 'false');
CREATE USER MAPPING FOR gpadmin SERVER s3_server OPTIONS (user 'gpadmin', accesskey 'xxx', secretkey 'xxx');
-- Step 3: Configure Polaris Catalog access.
CREATE SERVER polaris_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (polaris_server_url 'http://192.168.50.102:8181/api/catalog', polaris_server_realm 'POLARIS');
CREATE USER MAPPING FOR gpadmin SERVER polaris_server OPTIONS (client_id 'root', client_secret 'secret', scope 'PRINCIPAL_ROLE:ALL');
-- Step 4: Create foreign table and query.
CREATE FOREIGN TABLE iceberg_rest_table (
name text,
score decimal(16,2)
)
SERVER s3_server
OPTIONS (
filePath '/your-actual-bucket/warehouse/polaris',
catalog_type 'polaris',
table_identifier 'polaris.testdb1.table27',
server_name 'polaris_server',
format 'iceberg'
);
-- Query data
SELECT * FROM iceberg_rest_table LIMIT 10;
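As an optional sanity check for the two-server setup, you can list both server definitions and their options from the standard PostgreSQL catalog; the options shown reflect whatever you actually configured:
SELECT srvname, srvoptions FROM pg_foreign_server WHERE srvname IN ('s3_server', 'polaris_server');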
Load data from HDFS
You can load data from HDFS into SynxDB. The following sections explain how to load data from an HDFS cluster without authentication and how to load data from an HDFS cluster with Kerberos authentication. SynxDB also supports loading data from an HDFS HA (High Availability) cluster, which is also explained below.
Load HDFS data without authentication
Load data from HDFS in the simple mode, which is the basic HDFS mode without using complex security authentication. For details, see the Hadoop documentation: Hadoop in Secure Mode. The steps are as follows:
Create a foreign data wrapper (FOREIGN DATA WRAPPER). Note that the SQL statement below must be executed exactly as provided, without changing any options.
CREATE FOREIGN DATA WRAPPER datalake_fdw HANDLER datalake_fdw_handler VALIDATOR datalake_fdw_validator OPTIONS ( mpp_execute 'all segments' );
Create an external server. In this step, you can create an external server for a single-node HDFS or for HA (High Availability) HDFS.
Create an external server foreign_server for a single-node HDFS:
CREATE SERVER foreign_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS ( protocol 'hdfs', hdfs_namenodes '192.168.178.95', hdfs_port '9000', hdfs_auth_method 'simple', hadoop_rpc_protection 'authentication');
The options in the above SQL statement are explained as follows:
protocol
Description: Specifies the Hadoop platform.
Required: Must be set.
Setting: Fixed as hdfs, meaning the Hadoop platform; cannot be changed.
Default value: hdfs

hdfs_namenodes
Description: Specifies the namenode host for accessing HDFS.
Required: Must be set.
Example: hdfs_namenodes '192.168.178.95:9000'

hdfs_auth_method
Description: Specifies the authentication mode for accessing HDFS.
Required: Must be set.
Options: simple (uses simple authentication to access HDFS), kerberos (uses Kerberos authentication to access HDFS)
Note: To access HDFS in simple mode, set the value to simple, for example, hdfs_auth_method 'simple'.

hadoop_rpc_protection
Description: Configures the authentication mechanism for setting up a SASL connection.
Required: Must be set.
Options: authentication, integrity, privacy
Note: This option must match the hadoop.rpc.protection setting in the HDFS configuration file core-site.xml. For more details, see the Hadoop documentation for core-site.xml.
Create an external server for a multi-node HA cluster. The HA cluster supports node failover. For more information about HDFS high availability, see the Hadoop documentation HDFS High Availability Using the Quorum Journal Manager.
To load an HDFS HA cluster, you can create an external server using the following template:
CREATE SERVER foreign_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS ( protocol 'hdfs', hdfs_namenodes 'mycluster', hdfs_auth_method 'simple', hadoop_rpc_protection 'authentication', is_ha_supported 'true', dfs_nameservices 'mycluster', dfs_ha_namenodes 'nn1,nn2,nn3', dfs_namenode_rpc_address '192.168.178.95:9000,192.168.178.160:9000,192.168.178.186:9000', dfs_client_failover_proxy_provider 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider');
In the above SQL statement, protocol, hdfs_namenodes, hdfs_auth_method, and hadoop_rpc_protection are the same as in the single-node example. The HA-specific options are explained as follows:

is_ha_supported
Description: Specifies whether to access the HDFS HA (high availability) service.
Required: Must be set.
Setting: Set to true.

dfs_nameservices
Description: When is_ha_supported is true, specifies the name of the HDFS HA service to access.
Required: Must be set if using an HDFS HA cluster.
Setting: Matches the dfs.nameservices item in the HDFS configuration file hdfs-site.xml.
Note: For example, if dfs.nameservices is mycluster, set this option as dfs_nameservices 'mycluster'.

dfs_ha_namenodes
Description: When is_ha_supported is true, specifies the accessible namenodes for HDFS HA.
Required: Must be set if using an HDFS HA cluster.
Setting: Matches the value of the dfs.ha.namenodes.mycluster item in the HDFS configuration file hdfs-site.xml.
Note: For example, dfs_ha_namenodes 'nn1,nn2,nn3'.

dfs_namenode_rpc_address
Description: When is_ha_supported is true, specifies the RPC addresses of the high-availability namenodes in HDFS HA.
Required: Must be set if using an HDFS HA cluster.
Setting: Refer to the dfs.namenode.rpc-address.<nameservice>.<namenode> entries in the HDFS hdfs-site.xml file; the node addresses must match the namenode addresses in that configuration.
Note: For example, if dfs.ha.namenodes.mycluster lists three namenodes named nn1, nn2, and nn3, find their addresses in the HDFS configuration file and enter them in this option: dfs_namenode_rpc_address '192.168.178.95:9000,192.168.178.160:9000,192.168.178.186:9000'

dfs_client_failover_proxy_provider
Description: Specifies the failover proxy provider class used by the HDFS client when HA failover is enabled.
Required: Must be set if using an HDFS HA cluster.
Setting: Set to the default value: dfs_client_failover_proxy_provider 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'.
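If the HA namenode addresses change later, the server definition can be updated in place with standard PostgreSQL ALTER SERVER syntax instead of being dropped and recreated. A minimal sketch; the address list is a placeholder:
ALTER SERVER foreign_server OPTIONS (SET dfs_namenode_rpc_address '192.168.178.95:9000,192.168.178.160:9000,192.168.178.186:9000');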
Note
When Kerberos authentication is enabled, you need to renew the Kerberos ticket periodically. If the ticket is not renewed, database performance will gradually degrade until the database eventually crashes and becomes unavailable. Add a renewal command to an operating-system scheduler (such as crontab) or to your orchestration platform, for example:
kinit -kt /yourpath/greenplum.keytab bdms_greenplum/dev@BDMS.COM
In the command above, /yourpath/greenplum.keytab is the path to your keytab file. The principal bdms_greenplum/dev@BDMS.COM is an example, where bdms_greenplum is the user, dev is the instance, and BDMS.COM is the REALM. Replace these with your actual values.
It is recommended to add the renewal script to monitoring to ensure that the ticket remains valid. For more examples of ticket-renewal configuration, see the Greenplum documentation: <https://techdocs.broadcom.com/us/en/vmware-tanzu/data-solutions/tanzu-greenplum-platform-extension-framework/6-7/gp-pxf/pxf_kerbhdfs.html>.
Create a user mapping.
CREATE USER MAPPING FOR gpadmin SERVER foreign_server OPTIONS (user 'gpadmin');
In the above statement, the user option specifies the specific user for foreign_server and must be set.
Create the foreign table example. After creating it, the data from HDFS is loaded into SynxDB, and you can query this table.
CREATE FOREIGN TABLE example( a text, b text ) SERVER foreign_server OPTIONS (filePath '/test/parquet/', compression 'none', enableCache 'false', format 'parquet');
The options in the above SQL statement are explained as follows:
filePath
Description: Sets the specific path of the target foreign table.
Required: Must be set.
Setting: The path format should be /hdfs_prefix.
Example: If the hdfs_prefix to be accessed is named test-example, the access path is /test-example/. If there are multiple files in the path (for example, 0000_0, 0001_1, 0002_2), you can set the filePath of the 0000_0 file as filePath '/test-example/0000_0'. To access all the files in /test-example/, set the filePath as filePath '/test-example/'.
Note: filePath should follow the /hdfs_prefix/ format. Incorrect formats might lead to errors, such as:
filePath 'test/test/orc_file_folder/'
filePath '/test/test/orc_file_folder/0000_0'

compression
Description: Sets the compression format for writing. Currently supports gzip, zstd, and lz4.
Required: Optional.
Options: none (supports CSV, ORC, TEXT, PARQUET), gzip (supports CSV, TEXT, PARQUET), zstd (supports PARQUET), lz4 (supports PARQUET)
Default value: none, which means no compression. Leaving this option unset also means no compression.

enableCache
Description: Specifies whether to use the Gopher cache.
Required: Optional.
Options: true (enables the Gopher cache), false (disables the Gopher cache)
Default value: false
Note: Deleting a foreign table does not automatically clear its cache. To clear the cache for the table, you need to manually run a specific SQL function, for example:
select gp_toolkit._gopher_cache_free_relation_name(text);

format
Description: The file format currently supported by the FDW.
Required: Must be set.
Options: csv (readable, writable), text (readable, writable), orc (readable, writable), parquet (readable, writable)
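If you want a physical copy of the HDFS data in a regular SynxDB table, you can follow the foreign-table definition with an INSERT ... SELECT. A minimal sketch using the example table above; the local table name and distribution policy are illustrative choices:
-- Local table with the same columns as the foreign table 'example'.
CREATE TABLE example_local (a text, b text) DISTRIBUTED RANDOMLY;
INSERT INTO example_local SELECT * FROM example;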
Load HDFS data with Kerberos authentication
If the target HDFS cluster uses Kerberos for authentication, you can follow these steps to load data from HDFS to SynxDB.
Create a foreign data wrapper (FOREIGN DATA WRAPPER). Note that the following SQL statement must be executed exactly as provided, without changing any options.
CREATE FOREIGN DATA WRAPPER datalake_fdw HANDLER datalake_fdw_handler VALIDATOR datalake_fdw_validator OPTIONS ( mpp_execute 'all segments' );
Create an external server. In this step, you can create an external server for a single-node HDFS or for an HA (High Availability) HDFS.
Create an external server foreign_server for a single-node HDFS:
DROP SERVER foreign_server;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (hdfs_namenodes '192.168.3.32', hdfs_port '9000', protocol 'hdfs', hdfs_auth_method 'kerberos', krb_principal 'gpadmin/hdw-68212a9a-0@GPADMINCLUSTER2.COM', krb_principal_keytab '/home/gpadmin/hadoop.keytab', hadoop_rpc_protection 'privacy' );
The options in the above SQL statement are explained as follows:
hdfs_namenodes
Description: Specifies the namenode host for accessing HDFS.
Required: Must be set.
Example: hdfs_namenodes '192.168.178.95:9000'

protocol
Description: Specifies the Hadoop platform.
Required: Must be set.
Setting: Fixed to hdfs, meaning the Hadoop platform; cannot be changed.
Default value: hdfs

hdfs_auth_method
Description: Specifies the authentication method for accessing HDFS, which is Kerberos in this scenario.
Required: Must be set.
Setting: kerberos, to access HDFS using Kerberos authentication.

krb_principal
Description: Specifies the principal (primary user) set in the HDFS keytab.
Required: Must be set.
Setting: Must match the user information in the keytab. Check the relevant user information and set this value accordingly.

krb_principal_keytab
Description: Specifies the path to the HDFS keytab file.
Required: Must be set.
Setting: The value must match the actual path of the keytab file.

hadoop_rpc_protection
Description: Configures the authentication mechanism for establishing a SASL connection.
Required: Must be set.
Options: authentication, integrity, privacy
Note: This option must match the hadoop.rpc.protection setting in the HDFS configuration file core-site.xml. For more details, see the Hadoop core-site.xml documentation.
Create an external server for a multi-node HA cluster. The HA cluster supports failover. For more information about HDFS high availability, see the Hadoop documentation HDFS High Availability Using the Quorum Journal Manager.
To load an HDFS HA cluster, you can use the following template to create an external server:
CREATE SERVER foreign_server FOREIGN DATA WRAPPER datalake_fdw OPTIONS (hdfs_namenodes 'mycluster', protocol 'hdfs', hdfs_auth_method 'kerberos', krb_principal 'gpadmin/hdw-68212a9a-0@GPADMINCLUSTER2.COM', krb_principal_keytab '/home/gpadmin/hadoop.keytab', hadoop_rpc_protection 'privacy', is_ha_supported 'true', dfs_nameservices 'mycluster', dfs_ha_namenodes 'nn1,nn2,nn3', dfs_namenode_rpc_address '192.168.178.95:9000,192.168.178.160:9000,192.168.178.186:9000', dfs_client_failover_proxy_provider 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider' );
In the above SQL statement, the explanations for hdfs_namenodes, protocol, hdfs_auth_method, krb_principal, krb_principal_keytab, and hadoop_rpc_protection are the same as for the single-node example. The HA-specific options are explained as follows:

is_ha_supported
Description: Specifies whether to access the HDFS HA (high availability) service.
Required: Must be set.
Setting: Set to true.

dfs_nameservices
Description: When is_ha_supported is true, specifies the name of the HDFS HA service to access.
Required: Must be set if using an HDFS HA cluster.
Setting: Should match the dfs.nameservices item in the HDFS configuration file hdfs-site.xml.
Note: For example, if dfs.nameservices is mycluster, set this parameter to dfs_nameservices 'mycluster'.

dfs_ha_namenodes
Description: When is_ha_supported is true, specifies the accessible namenodes for HDFS HA.
Required: Must be set if using an HDFS HA cluster.
Setting: Should match the value of the dfs.ha.namenodes.mycluster item in the HDFS configuration file hdfs-site.xml.
Note: For example, dfs_ha_namenodes 'nn1,nn2,nn3'

dfs_namenode_rpc_address
Description: When is_ha_supported is true, specifies the RPC addresses of the high-availability namenodes in HDFS HA.
Required: Must be set if using an HDFS HA cluster.
Setting: Refer to the dfs.namenode.rpc-address.<nameservice>.<namenode> entries in the HDFS hdfs-site.xml file; the node addresses should match the namenode addresses in that configuration.
Example: If dfs.ha.namenodes.mycluster lists three namenodes nn1, nn2, and nn3, find the addresses from entries such as dfs.namenode.rpc-address.mycluster.nn1 in the HDFS configuration and fill them in: dfs_namenode_rpc_address '192.168.178.95:9000,192.168.178.160:9000,192.168.178.186:9000'

dfs_client_failover_proxy_provider
Description: Specifies the failover proxy provider class used by the HDFS client when HA failover is enabled.
Required: Must be set if using an HDFS HA cluster.
Setting: Set to the default value dfs_client_failover_proxy_provider 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'.
Create a user mapping.
CREATE USER MAPPING FOR gpadmin SERVER foreign_server OPTIONS (user 'gpadmin');
In the above statement, the user option specifies the specific user for foreign_server and must be set.
Create the foreign table example. After creating it, the data from HDFS is loaded into SynxDB, and you can query this table.
CREATE FOREIGN TABLE example( a text, b text ) SERVER foreign_server OPTIONS (filePath '/test/parquet/', compression 'none', enableCache 'false', format 'parquet');
The options in the above SQL statement are explained as follows:
filePath
Description: Sets the specific path of the target foreign table.
Required: Must be set.
Setting: The path format should be /hdfs_prefix.
Example: If the hdfs_prefix to be accessed is named test-example, the access path is /test-example/. If there are multiple files in the path (for example, 0000_0, 0001_1, 0002_2), you can set the filePath of the 0000_0 file as filePath '/test-example/0000_0'. To access all the files in /test-example/, set the filePath as filePath '/test-example/'.
Note: filePath should follow the /hdfs_prefix/ format. Incorrect formats might lead to errors, such as:
filePath 'test/test/orc_file_folder/'
filePath '/test/test/orc_file_folder/0000_0'

compression
Description: Sets the compression format for writing. Currently supports gzip, zstd, and lz4.
Required: Optional.
Options: none (supports CSV, ORC, TEXT, PARQUET), gzip (supports CSV, TEXT, PARQUET), zstd (supports PARQUET), lz4 (supports PARQUET)
Default value: none, which means no compression. Leaving this option unset also means no compression.

enableCache
Description: Specifies whether to use the Gopher cache.
Required: Optional.
Options: true (enables the Gopher cache), false (disables the Gopher cache)
Default value: false
Note: Deleting a foreign table does not automatically clear its cache. To clear the cache, you need to manually run an SQL function, for example:
select gp_toolkit.__gopher_cache_free_relation_name(text);

format
Description: The file format currently supported by the FDW.
Required: Must be set.
Options: csv (readable, writable), text (readable, writable), orc (readable, writable), parquet (readable, writable)
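Because the csv, text, orc, and parquet formats are writable through the FDW, data can also be pushed from SynxDB out to HDFS by inserting into the foreign table. A minimal sketch, assuming the example foreign table above and a hypothetical local table example_local with matching columns:
-- Write local rows out to the HDFS path configured in the foreign table's filePath option.
INSERT INTO example SELECT a, b FROM example_local;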