Open Sourcing OpenHouse: A Control Plane for Managing Tables in a Data Lakehouse

Last year, we unveiled OpenHouse, a control plane that gives end-users an interface to managed tables in our open source data lakehouse deployments.

Today, we're excited to open source OpenHouse, now available on GitHub at https://github.com/linkedin/openhouse for everyone to use and contribute to. We hope this will empower organizations of all sizes to benefit from and build upon OpenHouse's data lakehouse management framework. 

At LinkedIn thus far, we've implemented more than 3,500 managed OpenHouse tables in production, serving more than 550 daily active users and catering to a broad spectrum of use cases. Notably, OpenHouse has streamlined the time-to-market for LinkedIn’s dbt implementation on managed tables, slashing it by over 6 months. Concurrently, by onboarding LinkedIn’s go-to-market systems to OpenHouse, we’ve achieved a 50% reduction in the end-user toil associated with data sharing. Furthermore, we have onboarded more than 1,000 datasets to OpenHouse from AI use cases, including Large Language Models (LLMs), bolstering governance during model training. Overall, since rolling out OpenHouse, we’ve seen a drastic reduction in operational toil for data infra teams, an improved developer experience for data infra customers, and enhanced governance for LinkedIn’s data.

In this blog post, we'll begin by delving into the inspiration behind OpenHouse. From there, we'll dive deeper into the foundational building blocks and highlight the key features supported in the open source code release. We'll then discuss the importance of the control plane's pluggability, which facilitates its operation in diverse environments. Lastly, we'll conclude by shifting our focus to the future outlook for OpenHouse.

Inspiration for OpenHouse

In the world of big data management, there's always a struggle between control and flexibility. Cloud data warehouse solutions maintain tight control over their systems, which allows data infrastructure teams to ensure proper governance, data integrity, security, and performance. However, these systems often suffer from a lack of flexibility and scale. That's where open source data lake(house) systems come into play.

At LinkedIn, we deploy an open source data lakehouse to take advantage of the flexibility and scalability benefits. However, we have faced challenges in providing a managed experience for our end-users. Not having a managed experience often means our end-users have to deal with low-level infrastructure concerns like managing the optimal layout of files on storage, expiring data based on TTL to avoid running out of quota, replicating data across geographies, and managing permissions at a file level. Spending time on all of these activities dilutes our end-users’ core product focus. Moreover, our data infra teams are left with little control on the system we operate, making it harder for us to regulate proper governance and optimization.

Control that liberates is a paradoxical idea: a well-implemented control mechanism can empower data infra teams to regain control so that they can govern better, while liberating end-users from low-level infra concerns by providing them a fully managed offering.

OpenHouse is the materialization of this idea!

Inside OpenHouse

The core of OpenHouse's control plane is a Catalog, a RESTful table service designed to offer secure and scalable table provisioning and declarative metadata management. Additionally, the control plane encompasses Data Services, which can be customized to seamlessly orchestrate table maintenance jobs. Figure 2 illustrates the integration of OpenHouse within wider open source data lakehouse deployments.

Figure 2: The OpenHouse control plane

Key features of OpenHouse

In this section, we delve into the key features bundled with the open source code.

Fundamental Catalog Operations

The catalog service facilitates the creation, retrieval, updating, and deletion of OpenHouse tables. It’s seamlessly integrated with Apache Spark, so that end-users can utilize standard engine syntax, SQL queries, and the DataFrame API to execute these operations. Standard supported syntax includes, but is not limited to: SHOW DATABASES, SHOW TABLES, CREATE TABLE, ALTER TABLE, SELECT FROM, INSERT INTO, and DROP TABLE.
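The table lifecycle these statements drive can be modeled with a toy in-memory catalog. This is purely illustrative — the class and method names below are ours, not OpenHouse's, and the real catalog is a RESTful service backed by Iceberg metadata:

```python
# Toy in-memory model of the catalog's table lifecycle (illustrative only).
class ToyCatalog:
    def __init__(self):
        self.databases = {}  # db name -> {table name -> list of rows}

    def create_table(self, db, table):
        # CREATE TABLE: provision an empty table, refusing duplicates.
        self.databases.setdefault(db, {})
        if table in self.databases[db]:
            raise ValueError(f"table {db}.{table} already exists")
        self.databases[db][table] = []

    def insert_into(self, db, table, rows):
        # INSERT INTO: append rows to an existing table.
        self.databases[db][table].extend(rows)

    def select_from(self, db, table):
        # SELECT FROM: return a copy of the table's rows.
        return list(self.databases[db][table])

    def show_tables(self, db):
        # SHOW TABLES: list table names in a database.
        return sorted(self.databases.get(db, {}))

    def drop_table(self, db, table):
        # DROP TABLE: remove the table and its data.
        del self.databases[db][table]
```

A typical session walks the same create → insert → select → drop path an end-user would drive through Spark SQL.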

Retention Management

The catalog service allows users to establish retention policies on time-partitioned OpenHouse tables. Through these configured policies, data services automatically identify and delete partitions older than the specified threshold. End-users can employ extended SQL syntax tailored for OpenHouse, as shown below:

[1] ALTER TABLE openhouse.db.table SET POLICY (RETENTION=30d);

[2] ALTER TABLE openhouse.db.table SET POLICY (RETENTION=30d ON COLUMN ts WHERE pattern='yyyy-MM-dd');

For tables partitioned on a strongly typed timestamp column, users can employ syntax [1]. Alternatively, when timestamps are captured in a string-typed column, users must also specify the timestamp format, as shown in syntax [2].
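The partition-selection step a retention job might perform can be sketched as follows. This is an illustrative simplification — the actual data services operate on Iceberg partition metadata, and the function name is ours:

```python
# Sketch: pick string-typed partition values older than the retention threshold.
# Note the Java-style pattern 'yyyy-MM-dd' from the SQL corresponds to
# Python's "%Y-%m-%d" strptime format.
from datetime import datetime, timedelta

def expired_partitions(partitions, retention_days, now, fmt="%Y-%m-%d"):
    """Return the partition values that fall outside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [p for p in partitions if datetime.strptime(p, fmt) < cutoff]
```

With a 30-day policy evaluated on 2024-03-01, a `2024-01-01` partition is selected for deletion while `2024-02-25` is retained.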

Sharing

The catalog service offers users the ability to share an OpenHouse table. Below are sample queries demonstrating how to execute table sharing:

ALTER TABLE openhouse.db.table SET POLICY (SHARING=true);

GRANT {SQL_PRIVILEGE} ON {DATABASE|TABLE} openhouse.db.{table} TO {USER};

REVOKE {SQL_PRIVILEGE} ON {DATABASE|TABLE} openhouse.db.{table} FROM {USER};

Below is a table detailing the available “SQL privileges to OpenHouse role” mapping, along with corresponding data and metadata privileges and resource granularity in the catalog service:

SQL_PRIVILEGE | OpenHouse Role | Privileges                                           | Resource Granularity
------------- | -------------- | ---------------------------------------------------- | --------------------
ALTER         | TABLE_ADMIN    | Metadata (read/write/share), Data (read/write/share) | Table
SELECT        | TABLE_VIEWER   | Metadata (read), Data (read)                         | Table
MANAGE GRANTS | ACL_EDITOR     | Metadata (share)                                     | Table and Database
CREATE TABLE  | TABLE_CREATOR  | Metadata (table creation)                            | Database

Note that the implementation of these privileges may vary depending on the environment. Thus, the bundled code includes the API specification for the SQL syntax and catalog APIs. A complete implementation of the specification is expected to segregate metadata ACLs from data ACLs: metadata ACLs can be persisted in a custom database or a framework like Open Policy Agent (OPA), while data ACLs should leverage the access control provided by the underlying storage.

Governance

Column Tagging

The catalog service enables end-users to assign tags to their columns. Users can employ Spark SQL customized for OpenHouse to execute this task. Subsequently, OpenHouse transmits this metadata to downstream services for compliance enforcement.

ALTER TABLE openhouse.db.tb MODIFY COLUMN col1 SET TAG = (PII, HC);

Observability

OpenHouse incorporates instrumentation to audit events, recording essential user activities pertaining to tables. This recorded information serves multiple purposes, including auditing, debugging, and metadata propagation from the operational catalog (in this case, OpenHouse) to discovery catalog platforms (e.g., DataHub). The following key audit events are prepared for emission:

  1. ServiceAuditEvent: This event audits HTTP requests and responses at the service layer.
  2. TableAuditEvent: This event audits table operations such as create, read, and insert.
  3. ScanReport: Available for Iceberg tables, this audits column-level access for query predicates and projections.
  4. CommitReport: Also for Iceberg tables, this audits transactional commits happening on a table.
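A minimal sketch of what emitting one of these audit events might look like. The event name comes from the list above; the payload fields and the emitter class are our illustrative assumptions:

```python
# Sketch: a table audit event and an in-memory emitter (illustrative only;
# real emission targets would be a message bus or a discovery catalog feed).
from dataclasses import dataclass, asdict

@dataclass
class TableAuditEvent:
    operation: str  # e.g. "CREATE", "READ", "INSERT"
    database: str
    table: str

class InMemoryEmitter:
    """Collects emitted events; a stand-in for a real audit sink."""
    def __init__(self):
        self.events = []

    def emit(self, event):
        self.events.append(asdict(event))
```

Downstream consumers (auditing, debugging, metadata propagation to a discovery catalog such as DataHub) would subscribe to the sink rather than read it in memory.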

Replication

We extended the Apache Gobblin framework by contributing cross-geography replication functionality tailored for Iceberg tables. IcebergDistcp, a component within this framework, ensures high availability for Iceberg tables, allowing users to execute critical workflows from any geographic location. OpenHouse classifies tables as either primary or replica table types, with replica tables being read-only for end-users. Update and write permissions are exclusively granted to the distcp job and the OpenHouse system user. Leveraging IcebergDistcp, replication occurs at snapshot granularity on a scheduled basis, ensuring snapshot consistency across geographies, as depicted in Figure 3.

Figure 3: Iceberg snapshot replication
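The snapshot-granularity replication above reduces to a diff of snapshot histories: ship only the snapshots the replica is missing, in commit order. The sketch below is our simplification; IcebergDistcp also copies the underlying data and metadata files:

```python
# Sketch: compute which snapshots a replication run needs to ship
# (illustrative; real snapshot IDs come from Iceberg table metadata).
def snapshots_to_replicate(primary_snapshots, replica_snapshots):
    """Given snapshot IDs in commit order on each side, return the IDs to copy."""
    have = set(replica_snapshots)
    return [s for s in primary_snapshots if s not in have]
```

Because replication is driven by complete snapshots, the replica is always at a consistent point in the primary's history, never mid-commit.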

Iceberg Maintenance

OpenHouse offers support for Apache Iceberg as a table format for organizing data on distributed storage. Since Iceberg supports versioning, it necessitates ongoing maintenance to uphold compliance and ensure optimal performance by regularly purging older versions and promptly deleting orphan files. Expecting end-users to manage these maintenance tasks is unrealistic. Therefore, the OpenHouse scheduler conducts regular scans of the namespace and executes the following maintenance operations:

  1. Snapshot Expiration with Time-to-Live (TTL): This process periodically expires snapshots older than a system-defined TTL.
  2. Orphan File Handling: This process relocates orphan files from their current location to the designated .trash directory.
  3. Orphan Directory Cleanup: Addressing unregistered directories resulting from failed transactions of Create Table As Select, this task identifies orphan directories, preparing their contents for deletion.
  4. Staged File Cleanup: Files within the .trash directory older than the TTL are systematically removed.
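The staged file cleanup step (item 4) can be sketched as a TTL filter over the `.trash` directory. This is an illustrative simplification with names of our choosing; the real job scans distributed storage:

```python
# Sketch: select .trash entries older than the TTL for permanent removal.
from datetime import datetime, timedelta

def files_to_purge(trash_files, ttl_days, now):
    """trash_files: mapping of path -> datetime the file was staged."""
    cutoff = now - timedelta(days=ttl_days)
    return sorted(p for p, staged in trash_files.items() if staged < cutoff)
```

The two-phase design (stage to `.trash`, then purge after a TTL) gives a recovery window before deletions become irreversible.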

Pluggability

OpenHouse was conceived from the start with the goal of eventually open sourcing it. This philosophy led us to prioritize pluggability in our design. Interfaces for the categories below are provided, allowing for custom implementations to accommodate diverse environments.

  1. Storage: OpenHouse supports a Hadoop Filesystem interface, compatible with HDFS and blob stores that support it. Storage interfaces can be augmented to plug in with native blob store APIs.
  2. Authentication: OpenHouse supports token-based authentication. Given that token validation varies depending on the environment, custom implementations can be built according to organizational needs.
  3. Authorization: OpenHouse Table Sharing APIs can be extended to suit organizational requirements, covering both metadata and data authorization. While we expect implementations to delegate data ACLs to the underlying storage (e.g., POSIX permissions for HDFS), for metadata role-based access control, we recommend the use of OPA.
  4. Database: OpenHouse utilizes a MySQL database to store metadata pointers for Iceberg table metadata on storage. The choice of database is pluggable; OpenHouse uses the Spring Data JPA framework to offer flexibility for integration with various database systems.
  5. Job Submission: OpenHouse code ships with the Apache Livy API for submission of Spark jobs. For custom managed Spark services, the jobs service can be extended to trigger Spark applications.
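To make the pluggability idea concrete, here is a Python analogue of a pluggable storage interface (item 1). OpenHouse's real interfaces are Java and follow the Hadoop FileSystem abstraction; the names below are ours:

```python
# Sketch: a pluggable storage interface with a toy in-memory implementation
# standing in for HDFS or a blob store (illustrative only).
from abc import ABC, abstractmethod

class Storage(ABC):
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

class InMemoryStorage(Storage):
    """Swap this class for an HDFS- or blob-store-backed implementation."""
    def __init__(self):
        self._blobs = {}

    def write(self, path, data):
        self._blobs[path] = data

    def read(self, path):
        return self._blobs[path]
```

The control plane codes against the interface, so an organization can plug in a native blob store client without touching catalog logic.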

Try it out

OpenHouse ships with a local Docker Compose environment, which you can use to bring up an entire data lakehouse stack built with OpenHouse in minutes. Follow the instructions in the setup guide.

If you want to deploy OpenHouse services to a Kubernetes cluster, instructions are available in the deploy guide.

Looking Ahead

Now that we've reached the open sourcing milestone, we invite you to explore OpenHouse and provide us with your valuable feedback. We're keen on collaborating with users to understand how OpenHouse performs within different environments, whether it's integrated into cloud infrastructures or adapted to preferred table formats.

In addition to broad exploration, our focus also includes tackling complex technical hurdles as we embark on our migration journey from Hive to OpenHouse. We aim to delve deeply into operationalizing OpenHouse and Iceberg at LinkedIn's scale.

Learn more at openhousedb.org

Acknowledgements

A heartfelt appreciation goes out to the development team for their steadfast execution and impactful contributions to LinkedIn over the past two years, while also remaining dedicated to the vision of open sourcing: Lei Sun, Sushant Raikar, Stanislav Pak, Abhishek Nath, Malini Venkatachari, Rohit Kumar, Levi Jiang, Ann Yang, and Sumedh Sakdeo.

I would also like to thank our management, who were willing to share this work with the community and fully supportive of doing so: Manisha Kamal, Swathi Koundinya, Sumitha Poornachandran, Renu Tewari, Kartik Paramasivam, and Raghu Hiremagalur.

Many thanks to the thought leadership of Eric Baldeschwieler, Owen O’Malley, Sriram Rao, Vasanth Rajamani, and Kapil Surlaker, who helped shape the product value proposition. We’re also grateful to our OSS code reviewers Jiangjie Qin and Erik Krogen.

Finally, OpenHouse is a product of many passionate discussions with technical leaders across LinkedIn: Walaa Eldin Moustafa, Bhupendra Jain, Ratandeep Ratti, Kip Kohn, Yash Ganti, Shardul Mahadik, Slim Bouguerra, Prasad Karkera, and Chen Qiang.