pg_search Full Text Search

pg_search is an extension for SynxDB that provides powerful full-text search capabilities for database tables using the BM25 algorithm. It is built on Tantivy, a search engine library implemented in Rust, and integrates with SynxDB via the pgrx framework.

The main application scenarios and benefits of pg_search include:

High-performance full-text search: Provides efficient search capabilities for text data in the database, surpassing traditional LIKE queries.
Seamless integration with the database: Runs as an extension and integrates seamlessly with existing databases and table structures without requiring data migration.
Rich query functionality: Supports complex boolean logic queries, phrase queries, and aggregate function pushdown, which can accelerate data analysis and report generation.

Enable the pg_search extension

Follow these steps to enable the pg_search extension:

Modify the database configuration: Add pg_search to shared_preload_libraries. This is a mandatory step; otherwise, creating an index will cause the database connection to crash or hang.
```
gpconfig -c shared_preload_libraries -v pg_search
```
After modifying the configuration, restart the database cluster.
Enable the pg_search extension: Connect to the database and execute the following SQL command to enable the pg_search extension:
```
CREATE EXTENSION pg_search;
```

How to use

The following example demonstrates how to use pg_search with a product information table.

Create a sample table and insert some data:

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    description TEXT,
    rating INTEGER,
    category TEXT,
    price NUMERIC,
    in_stock BOOLEAN
);

INSERT INTO products (description, rating, category, price, in_stock) VALUES
    ('Laptop with fast processor', 5, 'Electronics', 999.99, true),
    ('Gaming laptop with RGB', 5, 'Electronics', 1299.99, true),
    ('Toy laptop for kids', 3, 'Toys', 499.99, false),
    ('Wireless keyboard and mouse', 4, 'Electronics', 79.99, true),
    ('Mechanical keyboard RGB', 5, 'Electronics', 149.99, true),
    ('Running shoes for athletes', 5, 'Sports', 89.99, true),
    ('Winter jacket warm', 4, 'Clothing', 129.99, true),
    ('Summer jacket light', 3, 'Clothing', 59.99, true);

Create a full-text search index: Use the USING bm25 clause to create a pg_search index for the table.
```
CREATE INDEX products_idx ON products 
USING bm25 (id, description, rating, category, price)
WITH (
    key_field='id',
    text_fields='{"description": {}, "category": {"fast": true}}',
    numeric_fields='{"rating": {"fast": true}, "price": {"fast": true}}'
);
```
Parameters in the WITH clause:
- key_field: Specifies the primary key field of the table.
- text_fields: Defines the text fields to be indexed for full-text search.
  - "fast": true: Marks the field as a “fast field,” which allows direct access to field values in scenarios like aggregate functions, resulting in higher performance.
- numeric_fields: Defines the numeric fields to be indexed. These can also be configured as “fast fields.”
Execute queries. Use the @@@ operator to perform full-text searches. The syntax is similar to PostgreSQL’s LIKE operator but offers better performance.
- Basic query. Query for products where the description field contains “laptop”:
```
SELECT * FROM products WHERE description @@@ 'laptop';
```
- Use boolean logic. Query for products where description contains “laptop” or “keyboard”, and the rating is greater than or equal to 4:
```
SELECT * FROM products 
WHERE ((description @@@ 'laptop') OR (description @@@ 'keyboard')) 
AND (rating >= 4);
```

Use aggregate functions

pg_search supports pushing down aggregate functions (COUNT, SUM, AVG, MIN, MAX) to the index level for execution, which significantly improves query performance. This feature is only effective for columns configured as “fast fields.”

Example: Count the number of matching products and calculate their total price:

-- Query the count and sum of price for products where description contains 'laptop'
EXPLAIN (FORMAT TEXT, COSTS OFF, TIMING OFF, VERBOSE)
SELECT COUNT(*), SUM(price) FROM products WHERE description @@@ 'laptop';

The EXPLAIN output shows that the query plan uses a Custom Scan (ParadeDB Scan), indicating that the aggregation operation has been successfully pushed down.

Considerations and limitations

shared_preload_libraries configuration: For SynxDB, you must add pg_search to shared_preload_libraries and restart the database. This is a prerequisite for the extension to work correctly.
Cloud database services: On managed cloud database services (such as Amazon RDS), you may not be able to install pg_search unless the provider explicitly supports the extension.
Aggregate function limitations:
- Aggregate pushdown only works for fields marked as "fast": true in the index.
- DISTINCT aggregates are not supported. Queries containing DISTINCT will fall back to PostgreSQL’s native execution plan and will not benefit from the performance advantages of index acceleration.
Background process: pg_search relies on a background worker process to handle index writes. If this process does not start correctly, it may lead to database connection issues.
Query limitations: pg_search currently only supports single-table queries and does not support multi-table joins.
Partitioned tables: pg_search does not support partitioned tables.