Load Data Using the file:// Protocol

The file:// protocol is a SynxDB protocol that allows you to load data from a local segment host server file or a coordinator host file into SynxDB.

The file:// protocol is used in a URI that specifies the location of an operating system file. External tables that you create that specify the file:// protocol are read-only tables.

The URI includes the host name, port, and path to the file. By default, each file must locate on a segment host in a location accessible by the SynxDB superuser (gpadmin). The host name used in the URI must match a segment host name registered in the gp_segment_configuration system catalog table. In addition, you can also load files from the coordinator host using the ON COORDINATOR clause.

The LOCATION clause can have multiple URIs, as shown in Usage examples.

The number of URIs you specify in the LOCATION clause is the number of segment instances that will work in parallel to access the external table. For each URI, SynxDB assigns a primary segment on the specified host to the file. For maximum parallelism when loading data, divide the data into as many equally sized files as you have primary segments. This ensures that all segments participate in the load. The number of external files per segment host cannot exceed the number of primary segment instances on that host. For example, if your array has 4 primary segment instances per segment host, you can place 4 external files on each segment host. Tables based on the file:// protocol can only be readable tables.

The system view pg_max_external_files shows how many external table files are permitted per external table. This view lists the available file slots per segment host when using the file:// protocol. The view is only applicable for the file:// protocol. For example:

SELECT * FROM pg_max_external_files;

Usage examples

Load multiple files in CSV format with header rows:

Creates a readable external table, ext_expenses, using the file protocol. The files are CSV format and have a header row.

=# CREATE EXTERNAL TABLE ext_expenses ( name text,
   date date,  amount float4, category text, desc1 text )
   LOCATION ('file://filehost/data/international/*',
             'file://filehost/data/regional/*',
             'file://filehost/data/supplement/*.csv')
   FORMAT 'CSV' (HEADER);
=# CREATE EXTERNAL TABLE ext_expenses (
   name text, date date, amount float4, category text, desc1 text )
   LOCATION ('file://host1:5432/data/expense/*.csv',
               'file://host2:5432/data/expense/*.csv',
               'file://host3:5432/data/expense/*.csv')
   FORMAT 'CSV' (HEADER);

Load files from the coordinator host

You can use the ON COORDINATOR clause to load files from the coordinator host. This is useful for files that only exist on the coordinator node. When using ON COORDINATOR, all data is loaded through the coordinator node, which is a single-process operation.

Create a readable external table ext_nation_on_coordinator to load data from the coordinator node:

CREATE EXTERNAL TABLE ext_nation_on_coordinator (
    N_NATIONKEY  INTEGER,
    N_NAME       CHAR(25),
    N_REGIONKEY  INTEGER,
    N_COMMENT    VARCHAR(152)
)
LOCATION ('file://coordinator-host/data/nation.tbl')
ON COORDINATOR
FORMAT 'text' (delimiter '|');

Query the table and check the gp_segment_id to see that the data is loaded by the coordinator (with an ID of -1):

SELECT gp_segment_id, * FROM ext_nation_on_coordinator ORDER BY N_NATIONKEY DESC LIMIT 5;

The query result shows that gp_segment_id is -1, indicating that the data comes from the coordinator node:

gp_segment_id | n_nationkey |          n_name           | n_regionkey |                                                   n_comment
---------------+-------------+---------------------------+-------------+----------------------------------------------------------------------------------------------------------------
            -1 |          24 | UNITED STATES             |           1 | y final packages. slow foxes cajole quickly. quickly silent platelets breach ironic accounts. unusual pinto be
            -1 |          23 | UNITED KINGDOM            |           3 | eans boost carefully special requests. accounts are. carefull
            -1 |          22 | RUSSIA                    |           3 |  requests against the platelets use never according to the quickly regular pint
            -1 |          21 | VIETNAM                   |           2 | hely enticingly express accounts. even, final
            -1 |          20 | SAUDI ARABIA              |           4 | ts. silent requests haggle. closely express packages sleep across the blithely
(5 rows)