pg_lake comes to Snowflake Postgres: What it means for open standards

The pg_lake extension, which debuted in the open-source community last November, has now been integrated into Snowflake Postgres, the fully managed PostgreSQL service offered by Snowflake. pg_lake is a collection of around 15 extensions that let users run PostgreSQL as a data lakehouse, supporting both fast analytical queries and transactional workloads through Apache Iceberg tables. Organizations can also use pg_lake with data stored in object stores.

With Snowflake Postgres, users can unify analytical and transactional workloads and gain direct access to PostgreSQL within the Snowflake environment. The integration allows reading and writing open table formats with SQL, while pg_lake hides the complexity of managing diverse data types and formats.
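As a sketch of what that looks like in practice, the statements below create and query an Iceberg-backed table from plain SQL. The `USING iceberg` table access method reflects pg_lake's published interface, but the table name, columns, and data are purely illustrative:

```sql
-- Illustrative only: an Iceberg-backed table created through pg_lake's
-- table access method, then used like any other Postgres table.
CREATE TABLE events (
    event_id   bigint,
    event_kind text,
    created_at timestamptz
) USING iceberg;

INSERT INTO events VALUES (1, 'signup', now());

-- Reads go through the same SQL surface; the Iceberg layout stays hidden.
SELECT count(*) FROM events WHERE created_at > now() - interval '1 day';
```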

In addition to the pg_lake integration, Snowflake has expanded its Snowflake Horizon Catalog, enhancing its capabilities in data discovery, metadata context, and governance. This expansion now includes the management of data assets accessed through external query systems, such as Apache Spark. Moreover, Snowflake’s open data sharing features, which span various clouds and availability zones, have been extended to encompass data in Delta Lake and Iceberg tables.

A noteworthy new feature is the integration with Microsoft OneLake, which enables secure, bi-directional reads of data residing in both Snowflake and Microsoft Fabric. The interoperability afforded by these features, along with pg_lake’s inclusion in Snowflake Postgres, improves compatibility across data types, formats, clouds, and regions, as well as OLAP, OLTP, and advanced machine learning applications.

Craig Kerstiens, director of software engineering for Postgres at Snowflake, remarked, “You don’t even have to know that it’s Iceberg under the covers, if you’re a customer. If you just want Postgres, it’s just Postgres. But, it has these superpower capabilities under the covers, with this unified Iceberg catalog, this kind of vectorized engine embedded in, and this easy ability to move data from one to the other.”

pg_lake’s extensions

pg_lake has practical consequences across the application development and deployment lifecycle. Integrated with Snowflake Postgres, it lets that lifecycle move smoothly between analytical and transactional workloads, including workloads that involve open table formats. The extensions within pg_lake provide dedicated functionality for:

  • Version management: This is crucial for updates and maintenance. Users of Snowflake Postgres often deal with different operating systems, PostgreSQL versions, and libraries. Managing updates without causing unnecessary downtime can be challenging. Kerstiens noted, “We’ve got an extension inside pg_lake that’s responsible for managing all the other extensions. So, when you upgrade library B it won’t conflict with library C.”
  • Data type and format conversions: Utilizing SQL as a common language for interacting with various data types in Postgres, including geospatial data and vector embeddings, alongside Iceberg tables, can be complex. However, pg_lake simplifies this process for joins, reads, and writes, allowing users to focus on writing SQL queries without needing to navigate the underlying complexities.
  • Caching: Effective caching strategies are vital for the performance of low-latency applications. Kerstiens explained that pg_lake includes mechanisms that manage caching automatically. For instance, when writing Iceberg files, the system can cache newer files as they arrive, optimizing joins and queries with pushdowns. These features contribute to Kerstiens’ characterization of Snowflake Postgres as “production-grade, enterprise ready.”
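The "SQL as a common language" point above can be sketched as a single query that joins a regular heap table against an Iceberg table. The schema is hypothetical, and the `USING iceberg` clause assumes pg_lake's table access method:

```sql
-- Hypothetical schema: a transactional heap table and a lakehouse table.
CREATE TABLE customers (id bigint PRIMARY KEY, name text);              -- row-based, OLTP
CREATE TABLE orders (customer_id bigint, total numeric) USING iceberg;  -- Iceberg, analytical

-- One SQL query spans both; pg_lake handles the format differences.
SELECT c.name, sum(o.total) AS lifetime_spend
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
```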

Open data governance

Snowflake’s capabilities for Iceberg and open standards have been enhanced with the new functionalities in Snowflake Horizon Catalog, which can now manage Iceberg tables similarly to those in Snowflake’s native storage format. This allows external query engines, such as Trino, to read and write to tables managed by the catalog. Prasanna Krishnan, head of apps, collaboration, and Horizon at Snowflake, emphasized the advantages of this approach: “They get a single source of truth of their data. They don’t have to create copies of it.”

Additionally, users can extend Snowflake Horizon Catalog’s access controls to these query engines. Organizations can query data directly through Snowflake, with the catalog’s access controls filtering the results. For example, a masking policy applied to a column can let privileged users see its raw values while other analysts see a masked version; alternatively, row access policies can restrict which rows are visible to specific roles.
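In Snowflake SQL, those two policy types look roughly like this; the table, column, and role names are hypothetical:

```sql
-- Masking policy: privileged roles see raw values, everyone else a placeholder.
CREATE MASKING POLICY email_mask AS (val string) RETURNS string ->
  CASE WHEN CURRENT_ROLE() IN ('PRIVILEGED_ANALYST') THEN val
       ELSE '***MASKED***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row access policy: only matching rows are returned for non-admin roles.
CREATE ROW ACCESS POLICY emea_only AS (region string) RETURNS boolean ->
  CURRENT_ROLE() = 'ADMIN' OR region = 'EMEA';

ALTER TABLE customers ADD ROW ACCESS POLICY emea_only ON (region);
```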

Open data sharing

Snowflake’s data sharing capabilities, which require no data duplication, are another notable feature. Previously limited to objects and tables in Snowflake’s proprietary format, sharing now extends to Delta Lake and Iceberg tables. This is particularly useful when the parties involved use different clouds or sit in different geographic regions.

Through Snowflake’s open data sharing, users can access the same data without needing to move or replicate it. Krishnan noted, “Consumers aren’t spending egress costs moving data to a different region every time you query it. Under the hood, we use secure, cross-cloud auto-movement to ensure the data is available in the region where users are querying it.”
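Sharing without copying follows Snowflake's standard share workflow; the database, table, and account identifiers below are hypothetical:

```sql
-- Provider: expose a table (now including Iceberg tables) through a share.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_org.partner_account;

-- Consumer: mount the share as a read-only database; no data is copied.
CREATE DATABASE sales_from_partner FROM SHARE provider_org.provider_account.sales_share;
```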

Extensible, open standards

Snowflake’s expansion of its open data sharing and Horizon Catalog features to include open table formats and external query engines underscores its commitment to open standards. The interoperability provided by these advancements, along with the support of pg_lake, simplifies the process for developers to work with the tools and formats that are most relevant to their applications, rather than those dictated by vendors.

This is particularly evident in the data lakehouse capabilities of pg_lake, which enhance the analytical and transactional support offered by Snowflake Postgres. Kerstiens remarked, “OLTP and OLAP have always lived in different worlds. One is row-based; one is columnar. One is point lookups; the other is large scans. They live in different worlds, and now we’re starting to bridge that gap and unify them.”
