Graph Queries Across Billions of Rows of Scattered Data with Postgres and Apache AGE™

Postgres can now reach far more data than it could before. With extensions like pg_lake, users can connect Postgres to large collections of files in object storage, in formats such as CSV, JSON, Apache Parquet™, and Apache Iceberg™. However, having access to data in object storage is not the same as being able to aggregate data in object storage. This article looks at Apache AGE™, a Postgres extension that makes large data sets more usable through graph relationships.

Why graph matters for data lakes

Consider a healthcare network made up of providers, patients, facilities, and referral chains. The first analytical questions are straightforward:

What is the total billed amount per region?
Which patients incur the highest spending?
What is the average claim amount by specialty?

SQL on Iceberg handles these queries well. A harder question is: “Which in-network providers are referring patients to out-of-network providers through chains of intermediaries, and what is the financial impact?” Answering it takes two things: a graph traversal (to identify referral chains) and an analytical aggregation (to sum the costs). Neither a pure graph database nor a standalone analytical engine can answer this question alone; the two have to work together.

This is why graph queries belong next to the data lake.
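The two-part shape of the referral question can be sketched in a few lines of plain Python. This is only an illustration of the idea (traverse first, then aggregate), not how Postgres or AGE executes it. The provider names, referral edges, claim amounts, and the three-hop limit are all made-up assumptions, and for brevity the sketch also counts direct referrals, not only chains with intermediaries.

```python
from collections import defaultdict

# Hypothetical sample data; none of these names or numbers come from the article.
providers = {
    "A": {"in_network": True},
    "B": {"in_network": True},
    "C": {"in_network": False},
    "D": {"in_network": False},
}
referrals = [("A", "B"), ("B", "C"), ("A", "D")]  # (referring provider, referred-to provider)
claims = [("C", 1200.0), ("C", 300.0), ("D", 450.0), ("B", 80.0)]  # (provider, billed amount)


def referral_chains(start, edges, max_hops=3):
    """Yield every simple referral path from `start` that has 1..max_hops hops."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            yield path
        if len(path) - 1 < max_hops:
            for nxt in adjacency[path[-1]]:
                if nxt not in path:  # skip cycles
                    stack.append(path + [nxt])


def out_of_network_impact(providers, referrals, claims, max_hops=3):
    """Map each in-network provider to the total billed by the
    out-of-network providers that its referral chains reach."""
    # Aggregation step: total billed amount per provider.
    billed = defaultdict(float)
    for provider, amount in claims:
        billed[provider] += amount

    impact = {}
    for origin, attrs in providers.items():
        if not attrs["in_network"]:
            continue
        # Traversal step: distinct out-of-network endpoints reachable from origin.
        reached = {
            path[-1]
            for path in referral_chains(origin, referrals, max_hops)
            if not providers[path[-1]]["in_network"]
        }
        if reached:
            impact[origin] = sum(billed[p] for p in reached)
    return impact


impact = out_of_network_impact(providers, referrals, claims)
print(impact)
```

In the sample data, provider A reaches C through B and reaches D directly, so its total combines the claims billed by both. Collecting endpoints into a set before summing avoids counting the same provider's claims once per path.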

Why Apache AGE

Apache AGE is a PostgreSQL extension that adds openCypher graph query support directly to Postgres. Other graph databases exist, such as Neo4j, Amazon Neptune, and TigerGraph, but AGE stands out for modern data platforms because it runs inside PostgreSQL.

Customers are increasingly adopting Apache AGE for several reasons:

No data movement: Your Iceberg tables and your graph live in the same database, so there is no need to extract, transform, and load (ETL) data into a separate graph database. You build the graph directly from your lake tables.
SQL + Cypher together: AGE graph queries return standard PostgreSQL result sets, so you can wrap a Cypher query in a Common Table Expression (CTE) and join it with Iceberg tables in a single statement. Graph output is just another subquery.
One connection, one transaction: Operations are simpler because application connections, security, and backups are all handled in one place.
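As a rough sketch of the "SQL + Cypher together" point, the snippet below assembles the text of one statement that wraps an AGE cypher() call in a CTE and joins it to a lake table. It is written in Python only so the statement can be built and inspected without a database. The graph name healthcare, the Provider label, the REFERS_TO edge type, the claims table, and every column name are hypothetical. It also assumes the session has already run LOAD 'age' and put ag_catalog on the search_path, and that the installed AGE version can cast agtype to bigint.

```python
def cypher_cte(name, graph, cypher, columns):
    """Return the text of a CTE that wraps an AGE cypher() call.

    `columns` maps each output column name to its SQL type; AGE returns
    agtype values, so agtype is the usual choice.
    """
    column_defs = ", ".join(f"{col} {sql_type}" for col, sql_type in columns.items())
    return (
        f"{name} AS (\n"
        f"    SELECT * FROM cypher('{graph}', $$\n"
        f"        {cypher}\n"
        f"    $$) AS ({column_defs})\n"
        f")"
    )


# Hypothetical graph, label, edge type, and property names.
referral_cte = cypher_cte(
    name="referred",
    graph="healthcare",
    cypher=(
        "MATCH (a:Provider)-[:REFERS_TO*1..3]->(b:Provider) "
        "WHERE a.in_network = true AND b.in_network = false "
        "RETURN DISTINCT a.provider_id, b.provider_id"
    ),
    columns={"source_id": "agtype", "target_id": "agtype"},
)

# `claims` stands in for an Iceberg-backed table; the ::bigint casts assume
# provider_id is an integer and that the agtype-to-bigint cast is available.
statement = (
    f"WITH {referral_cte}\n"
    "SELECT r.source_id::bigint AS source_id,\n"
    "       sum(c.billed_amount) AS out_of_network_billed\n"
    "FROM referred AS r\n"
    "JOIN claims AS c ON c.provider_id = r.target_id::bigint\n"
    "GROUP BY 1;"
)
print(statement)
```

The printed statement could then be sent over any ordinary Postgres connection. RETURN DISTINCT is used so that two different referral paths between the same pair of providers do not double the joined claim totals.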

Graph example: A healthcare network

To make this concrete, consider a healthcare platform whose data is stored as Iceberg tables on Amazon S3: claims, providers, patients, facilities, referrals, insurance plans, and regions. This is a typical data set in which files land in object storage from multiple upstream systems.

To get started, load the necessary Postgres extensions. Note that pg_lake runs alongside a sidecar process, pgduck_server, to support Iceberg read and write operations.
