Data Architecture
May 2025

Why On-Prem Data Lakehouse Matters — Especially for BFSI Enterprises

The cloud lakehouse narrative is compelling — until you run it through the BFSI lens of data sovereignty, regulatory compliance, and true total cost of ownership. Here is the architecture case for on-prem.

Kiran Pawar
BFSI Transformation Consultant

Every major cloud vendor has spent the last three years selling BFSI firms on the dream of the cloud data lakehouse. Unified storage and compute. Infinite scale. Delta Lake on Databricks, or its equivalent on Azure, AWS, or GCP. The pitch is compelling, the demos are impressive, and the analyst endorsements are unambiguous.

And yet, when you strip back the sales narrative and run the decision through the actual constraints that govern BFSI data architecture — regulatory requirements, data sovereignty obligations, audit trail mandates, and long-run total cost of ownership — a very different picture emerges.

This is not an argument against cloud in general. It is a specific argument about why the on-premises data lakehouse deserves serious consideration in the BFSI context — and why, having built one from scratch for an enterprise client, I found it to be the right call.

What a Data Lakehouse Actually Is

Before the architecture debate, let us be precise about what we are talking about. A data lakehouse is an architecture that combines the scalability and low-cost storage of a data lake with the reliability, governance, and query performance characteristics traditionally associated with a data warehouse.

The key innovation is the table format layer — Delta Lake (Databricks), Apache Iceberg (open standard), or Apache Hudi — that sits on top of object storage and provides ACID transactions, schema enforcement, time travel, and efficient query planning. This is what allows a data lake to behave like a warehouse for BI and analytics workloads while retaining its flexibility for data science and AI use cases.

The Architecture Stack

A complete lakehouse requires: object storage (S3/MinIO/HDFS), a table format (Delta Lake/Iceberg), a compute engine (Apache Spark, Trino, or DuckDB), a metadata catalog (Apache Hive Metastore or Unity Catalog equivalent), and a governance/access control layer. The cloud vendors bundle these together. On-prem, you assemble them yourself — and that assembly process is where the real expertise lies.

The BFSI Data Sovereignty Problem

The most fundamental issue for Indian BFSI firms is data localization. The Reserve Bank of India's data localization circular (2018) requires that all payment system data be stored exclusively in India. SEBI's cybersecurity framework and circulars on data storage have progressively tightened requirements around where and how financial data may be held. The Digital Personal Data Protection Act 2023 has introduced further obligations around personal financial data.

In practice, this means that a significant portion of a BFSI firm's operational and analytical data cannot simply be moved to a hyperscaler's data centre without careful legal analysis — and even where it is technically permissible, the risk and compliance overhead of doing so may outweigh the operational benefits.

  • Customer PAN, Aadhaar linkage, and account data fall under personal data protection requirements
  • Transaction data for stock broking clients carries SEBI data handling obligations
  • Payment data for firms operating as payment aggregators or PSPs is subject to RBI localization mandates
  • Audit trail data for regulatory reporting must be available for inspection — often with specific retention and integrity requirements

For many BFSI firms, the practical conclusion is that a significant portion of their data estate belongs on-premises or in a locally-hosted private cloud environment — which makes a cloud-first lakehouse strategy genuinely complex to execute cleanly.

The Economics at Scale

The second argument is economic, and it is one that enterprise architecture teams consistently underestimate until they see the actual invoices.

Cloud lakehouse platforms price across several dimensions: storage, compute (clusters priced by DBU or equivalent), data transfer (egress charges), and premium features like Unity Catalog for governance. At small scale, these costs are manageable. At enterprise scale — petabytes of historical transaction data, dozens of analysts and data scientists running concurrent queries, daily batch processing of millions of transactions — the economics shift dramatically.

The Egress Cost Reality

Data egress — moving data out of a cloud environment — is one of the most consistently underestimated costs in cloud architecture. For a BFSI firm with significant data volumes feeding multiple downstream systems (regulatory reporting, client portals, risk systems), egress charges alone can represent a substantial ongoing cost that simply does not exist in an on-prem architecture.

The on-prem lakehouse, once the capital investment in infrastructure is made, has a very different cost profile: primarily power, cooling, hardware amortization, and people. For firms with stable or predictable workloads — which describes most BFSI analytics environments — the three-to-five-year total cost of ownership of an on-prem lakehouse is typically significantly lower than the equivalent cloud deployment.
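
To make that TCO argument concrete, here is a minimal back-of-envelope model in Python. Every figure is a placeholder, not real vendor pricing or client data; the point is the shape of the model (storage plus compute plus egress on one side, amortised capital plus operating cost on the other), which you would populate with your own quotes and internal cost data.

    # Back-of-envelope 5-year TCO sketch. All numbers below are placeholders,
    # not actual cloud pricing or real project figures.
    PB = 1024  # TB per PB

    # Hypothetical cloud-side inputs (per month)
    storage_tb         = 2 * PB     # historical transaction data held in object storage
    storage_usd_per_tb = 23         # placeholder storage rate
    egress_tb          = 150        # data fed to downstream systems each month
    egress_usd_per_tb  = 90         # placeholder egress rate
    compute_usd        = 120_000    # placeholder cluster / DBU-equivalent spend

    cloud_monthly = (storage_tb * storage_usd_per_tb
                     + egress_tb * egress_usd_per_tb
                     + compute_usd)

    # Hypothetical on-prem inputs
    capex_usd        = 1_800_000    # servers, storage, network, spread over 5 years
    opex_monthly_usd = 45_000       # power, cooling, support contracts, people share

    months = 60
    print(f"cloud 5-yr TCO:   ${cloud_monthly * months:,.0f}")
    print(f"on-prem 5-yr TCO: ${capex_usd + opex_monthly_usd * months:,.0f}")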

The Open-Source Stack: What We Actually Built

The most common objection to on-prem lakehouse architecture is operational complexity: "We don't have Databricks engineers. We can't build and maintain this ourselves." It was a valid concern — five years ago. Today, the open-source ecosystem has matured to a point where a well-architected on-prem lakehouse is genuinely operationally manageable by a competent data engineering team.

Here is the stack we assembled for a BFSI client, running entirely on-premises on commodity hardware:

On-Premises Open-Source Data Lakehouse — Architecture

  • Consuming layer: custom management UI (catalog, lineage, access management), BI tools (Power BI, Superset, Looker), and AI assistants connected via an MCP server with RBAC
  • Governance layer (cross-cutting): Apache Ranger for access control and PII masking — role-based column masking, row-level security, immutable audit log, retention enforcement
  • Compute layer: Apache Spark for batch ETL and ML pipelines, Trino for interactive SQL and ad-hoc analytics, Hive Metastore plus Apache Atlas for schema registry and lineage
  • Table format: Delta Lake (open-source, Linux Foundation) — ACID transactions, time travel, schema evolution, Parquet file format
  • Storage layer: MinIO object storage, on-premises and S3-compatible — erasure coding, distributed across nodes, zero cloud dependency, full data sovereignty

Full open-source data lakehouse stack — from MinIO object storage up to BI tools and AI assistants, with Apache Ranger enforcing governance at every layer

Storage Layer: MinIO

MinIO is a high-performance, S3-compatible object storage system that runs on-premises. It provides the same API surface as Amazon S3, which means all tools that work with S3 work with MinIO — including Delta Lake, Apache Spark, and most modern data tools. We deployed it on a distributed cluster across multiple nodes, with erasure coding for data durability.
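
Because MinIO speaks the S3 API, pointing Spark at it is a configuration exercise rather than an integration project. Here is a minimal sketch using the standard Hadoop S3A connector properties; the endpoint, credentials, and bucket names are placeholders, and it assumes the hadoop-aws dependency is on the Spark classpath.

    from pyspark.sql import SparkSession

    # Point Spark's S3A connector at an on-prem MinIO endpoint instead of AWS S3.
    # Endpoint, credentials, and bucket names are placeholders for illustration.
    spark = (
        SparkSession.builder
        .appName("lakehouse-minio")
        .config("spark.hadoop.fs.s3a.endpoint", "https://minio.internal:9000")
        .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
        .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")  # MinIO uses path-style URLs
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
        .getOrCreate()
    )

    # Any s3a:// path now resolves against the on-prem MinIO cluster
    df = spark.read.parquet("s3a://raw-zone/trades/2025-05-01/")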

Table Format: Delta Lake (Open-Source)

Delta Lake is the open-source table format originally developed by Databricks, now available under the Linux Foundation. It provides ACID transactions, time travel (query the table as it existed at any point in history), schema evolution, and efficient query planning on top of Parquet files. Critically, the open-source version runs entirely without Databricks licensing.
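
A short sketch of what that looks like with the open-source delta-spark package on plain Apache Spark — no Databricks runtime involved. The table location and column names are illustrative only.

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Open-source Delta Lake on plain Spark (delta-spark pip package).
    builder = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "s3a://curated-zone/positions"  # placeholder table location

    # ACID append: concurrent readers never see a partially written batch
    spark.createDataFrame([("ACC001", 1200.5)], ["account_id", "exposure"]) \
         .write.format("delta").mode("append").save(path)

    # Time travel: read the table exactly as it existed at an earlier version
    snapshot = spark.read.format("delta").option("versionAsOf", 0).load(path)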

Compute Engine: Apache Spark

Apache Spark handles all batch processing, transformation, and large-scale analytics. Deployed on Kubernetes for resource management and elasticity within the data centre, it provides the compute backbone for ETL pipelines, data science workloads, and heavy analytical queries.
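
The kind of job that layer runs looks roughly like the sketch below: a nightly batch read from the raw zone, cleanup, and an idempotent MERGE into a curated Delta table. It assumes the Spark session configured in the earlier snippets and an existing target table; the Kubernetes submission details (container images, service accounts, resource quotas) are deployment-specific and omitted here, and all table and column names are illustrative.

    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    # Read one day of raw landing-zone data (placeholder path and schema).
    raw = spark.read.parquet("s3a://raw-zone/transactions/2025-05-01/")

    cleaned = (
        raw.dropDuplicates(["txn_id"])
           .withColumn("txn_date", F.to_date("txn_ts"))
           .filter(F.col("amount").isNotNull())
    )

    # MERGE-style upsert keeps reruns idempotent: re-processing a day does not
    # duplicate rows in the curated table.
    target = DeltaTable.forPath(spark, "s3a://curated-zone/transactions")
    (target.alias("t")
           .merge(cleaned.alias("s"), "t.txn_id = s.txn_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())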

Query Engine: Trino (formerly PrestoSQL)

For interactive analytical queries — the kind that analysts and BI tools generate — Trino provides fast, distributed SQL execution directly against the Delta Lake tables without needing to spin up a Spark cluster for every query. This significantly reduces the latency and resource overhead for ad-hoc analytics.
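
From an analyst's side, that access path is plain SQL over Trino's Delta Lake connector. A minimal sketch using the trino Python client follows; the host, catalog, schema, and table names are placeholders for the deployment described above.

    import trino

    # Interactive query through Trino, no Spark job required.
    conn = trino.dbapi.connect(
        host="trino.internal",
        port=8080,
        user="analyst_jdoe",
        catalog="delta",    # placeholder name for the Delta Lake connector catalog
        schema="curated",
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT client_id, SUM(notional) AS gross_notional
        FROM trades
        WHERE trade_date = DATE '2025-05-01'
        GROUP BY client_id
        ORDER BY gross_notional DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)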

Custom Management UI

One genuine gap in the open-source lakehouse ecosystem is the management interface. Databricks and Snowflake have polished, integrated UIs for data cataloguing, lineage, access management, and monitoring. The open-source equivalent requires assembly.

We built a custom web frontend that provides: a data catalog with search and discovery, table-level and column-level access controls, query history and cost tracking, pipeline monitoring dashboards, and a governance workflow for data classification and PII flagging. This was a deliberate investment — and one that paid for itself by enabling non-technical stakeholders to understand and govern the data estate without needing to touch command-line tools.
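
The details of that frontend are specific to the client build, but the shape of the classification registry behind it is worth illustrating. The sketch below is purely illustrative: the class names, sensitivity levels, and masking rules are assumptions standing in for whatever your governance team defines.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    # Illustrative shape of the column-classification registry behind the UI.
    class Sensitivity(Enum):
        PUBLIC = "public"
        INTERNAL = "internal"
        PII = "pii"

    @dataclass
    class ColumnClassification:
        table: str
        column: str
        sensitivity: Sensitivity
        masking_rule: Optional[str] = None   # e.g. "hash", "last4", "redact"

    registry = [
        ColumnClassification("curated.clients", "pan_number", Sensitivity.PII, "redact"),
        ColumnClassification("curated.clients", "city", Sensitivity.INTERNAL),
    ]

    def pii_columns(table: str) -> list[str]:
        """Columns the governance layer must mask for non-privileged roles."""
        return [c.column for c in registry
                if c.table == table and c.sensitivity is Sensitivity.PII]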

Governance and PII Masking at the Architecture Level

In a BFSI context, data governance is not a nice-to-have layer that you bolt on after the platform is built. It is a core architectural requirement. The lakehouse architecture we built treats governance as a first-class concern:

  • Column-level PII classification: Every column in every table is classified at onboarding time. PAN numbers, Aadhaar references, account numbers, client contact details — all flagged and registered in the governance layer.
  • Dynamic masking by role: Analysts see masked versions of PII columns by default. Data engineers with elevated access see actual values for debugging purposes. Compliance teams have full access with audit logging on every row accessed.
  • Immutable audit trails: Every query that touches a PII-classified column is logged with user identity, timestamp, query text, and row count accessed. This log is itself stored in an append-only Delta table that cannot be modified even by administrators (a minimal sketch follows this list).
  • Automated data retention: Regulatory retention schedules are enforced at the table level — data past its retention window is automatically archived to cold storage and flagged for review before deletion.
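
The append-only audit table mentioned above can be expressed directly as a Delta table property. The sketch below assumes the Spark session from the earlier snippets; the table path, schema, and column names are illustrative, and the integration that captures each query event is deployment-specific.

    from datetime import datetime, timezone

    AUDIT_PATH = "s3a://governance/audit-log"   # placeholder location

    # Created once: delta.appendOnly = true blocks UPDATE and DELETE on existing
    # rows, so even administrators cannot rewrite the audit history.
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS audit_log (
            user_id STRING, query_text STRING, row_count BIGINT, accessed_at TIMESTAMP
        )
        USING DELTA
        LOCATION '{AUDIT_PATH}'
        TBLPROPERTIES ('delta.appendOnly' = 'true')
    """)

    def log_pii_access(user: str, query_text: str, rows_returned: int) -> None:
        """Append one audit record per PII-touching query."""
        record = [(user, query_text, rows_returned, datetime.now(timezone.utc))]
        (spark.createDataFrame(record, ["user_id", "query_text", "row_count", "accessed_at"])
              .write.format("delta").mode("append")
              .save(AUDIT_PATH))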

AI Connectivity: Linking the Lakehouse to Enterprise AI

The final piece of the architecture — and one that has become increasingly important — is how the lakehouse connects to AI tools and assistants. The emergence of Model Context Protocol (MCP) as a standard for AI tool connectivity has created a clean architecture pattern for this.

We deployed an MCP server layer that sits in front of the lakehouse and exposes data access capabilities to AI assistants with full role-based access control. An analyst using an AI assistant to query trading data sees exactly the same data they would see through direct query access — masked where appropriate, scoped to their authorised datasets, with every AI-mediated data access logged alongside human-initiated queries.

This means the AI layer inherits all the governance properties of the lakehouse, rather than creating a new governance perimeter to manage. The AI cannot access data that the user cannot access — by architecture, not by trust.
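
A minimal sketch of that pattern, using the official Python MCP SDK, is below. How the caller's identity is resolved depends on your SSO and session plumbing, so it appears here as a placeholder function; the Trino endpoint and catalog names are the same assumptions used earlier. The key point is that queries run as the human user, so existing masking and row-level policies apply to AI-mediated access unchanged.

    from mcp.server.fastmcp import FastMCP
    import trino

    mcp = FastMCP("lakehouse")

    def current_user() -> str:
        # Placeholder: in practice resolved from the SSO / session context.
        return "analyst_jdoe"

    @mcp.tool()
    def run_sql(sql: str) -> str:
        """Run a read-only query as the calling user, so the governance layer
        sees the human identity rather than a generic service account."""
        conn = trino.dbapi.connect(
            host="trino.internal", port=8080,
            user=current_user(),
            catalog="delta", schema="curated",
        )
        cur = conn.cursor()
        cur.execute(sql)
        return "\n".join(str(row) for row in cur.fetchall())

    if __name__ == "__main__":
        mcp.run()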

When Cloud Lakehouse Is the Right Answer

To be clear: I am not arguing that on-prem is always right. There are genuine cases where a cloud lakehouse is the better answer:

  • Firms without the capital for significant on-prem infrastructure investment, where the operational expenditure model of cloud is financially preferable
  • Organisations with highly variable workloads — significant seasonality or unpredictable spikes — where cloud elasticity provides genuine value
  • Teams that lack the engineering depth to operate distributed systems and where managed services provide a meaningful reduction in operational burden
  • International BFSI operations where data localization requirements vary by jurisdiction and cloud region selection provides a cleaner compliance path

The right lakehouse architecture is not the one with the best marketing. It is the one that satisfies your regulatory obligations, fits your cost structure, and can actually be operated reliably by your team.

The Decision Framework

If you are evaluating lakehouse architecture for a BFSI organisation, here are the four questions that actually drive the right answer:

  1. What are your specific data localization obligations? Start with a legal and compliance inventory before any technology decision. Know exactly which data categories must stay on-prem.
  2. What is your 5-year total cost projection for each option? Build the honest TCO model — cloud pricing at your projected data volumes including egress, vs. on-prem capital and operating costs. The crossover point is earlier than most teams expect.
  3. What engineering capability do you have or can you build? On-prem lakehouse is operationally manageable but requires genuine distributed systems capability. Cloud managed services reduce operational burden at a cost. Honestly assess what your team can sustain.
  4. What is your AI strategy? If AI-driven analytics is a near-term priority, the lakehouse architecture needs to accommodate AI tool connectivity with governance built in from the start — not retrofitted later.

Ready to Evaluate Your Lakehouse Options?

If you are a BFSI technology leader navigating this decision, the right starting point is an honest architecture assessment — not a vendor demo. I am happy to have that conversation.
