Data Warehouse vs Data Lake: The Key Differences Every Organization Should Know
For organizations introducing themselves to the world of big data, one of the first terms learned is ‘data warehouse’. It’s a concept that has been encapsulated perfectly within its simple name, and this has aided many businesses in grasping both the idea and the value of big data.But in recent times there has been a fresh face on the big data storage scene which has been confusing this previously simple issue. Talk has turned from data warehouses to data lakes, leaving big data beginners the world over scratching their heads.So what exactly are data warehouses and data lakes? How do they differ from each other? And which is best for your situation?We’ll get to all of these questions in good time. But first let’s first look at exactly how these two terms are defined.
According to Wikipedia, data warehouses are “…central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.”Data warehouses are stocked with data that has been transformed and structured in a way that makes it as easily utilized as possible. Data is loaded only when its use and purpose has been defined, and is organized by subject area. As such a data warehouse will offer a rough representation of each area of the business, albeit an abstract one.The term ‘data lake’ has generally been credited to James Dixon, Pentaho founder and CTO. His layman’s explanation of the concept goes at least some of the way to describing its function.“If you think of a data mart [a subset of a data warehouse] as a store of bottled water – cleansed, packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”These ‘data streams’ flowing in are perhaps a data lake’s defining feature. All data is allowed in, no matter what state it might be in. Data is only transformed into a more usable state when it’s chosen for analysis. As such, the lake is a big old repository of data, be it structured, semi-structured, unstructured or raw.
The Key Differences between Data Warehouses and Data Lakes
You’ll have sensed by their definitions that data warehouses and data lakes are two very different beasts, but how do those differences translate to the real world? Let’s take a look at the key differences between the two from the perspective of the end user.
The state of the data: As outlined above, the main difference between data lakes and data warehouses relates to the data within each system. A data warehouse stores data that has been carefully transformed and curated, while a data lake is simply a repository into which any old data is poured, regardless of what state it’s in.
The processing of that data: The state of the data then informs how it is processed by the end user. To understand this we need to first understand a basic concept of data warehousing: ETL – Extract, Transform and Load. First you extract all data from your sources, then you transform that data to make it conform to the expected output, and finally you load that transformed data into your data warehouse.In a data lake situation, the ‘transform’ and ‘load’ stages trade places, changing ETL into ELT. This is because data lakes utilize a schema-on-read approach, different to a data warehouse’s schema-on-write approach. Schema-on-read gives shape and structure to the data only when needed, rather than making it uniform from the outset.
The storage capabilities: Lacking the need to transform data before it is stored, data lakes have a far greater capacity for data. In fact that is exactly what they were designed for in the first place – low cost, high volume storage.
The inherent agility: By definition a data warehouse is a highly structured repository. While that structure might not be technically hard to change, it is laborious. There are a wealth of business processes that are tied to it, all of which need to be taken into account if even the slightest structural change is made. In comparison a data lake lacks structure completely, allowing reconfiguration on the fly.
The security: Security is where the uniformity and structure of data warehouses begin to pay dividends. The amount of control that you have over a data warehouse allows you to lock down the information inside far more effectively than you can with a data lake. You can also be sure that rogue data won’t corrupt from the inside. But as data lakes become more commonplace, their security will most certainly improve.
The aptitude of the end user: Data lakes are a relatively new concept. They also, by nature, demand an extensive skillset in order to be utilized efficiently and effectively. As such – at this point in their development at least – they are best left to the data scientists. Data warehouses, on the other hand, have been around for decades. UX and UI have developed to the point where data warehouses are accessible to everyone, no matter their technical aptitude.
To Dive into the Lake or to Take a Tour of the Warehouse
So which is right for your situation? As it currently stands, data warehousing represents the wisest choice for organizations looking to capitalize on data. While data lakes seem to open up unending possibilities, their development simply isn’t at the level it needs to be for the average end user.But there’s no reason why an organization couldn’t utilize both approaches. What if, for example, you tipped external sources of data into a data lake, while you kept internal data in a data warehouse? Or you could transfer archived data into a data lake, keeping your data warehouse fresh, current and uncluttered. There’s no right or wrong answer to the question of data repositories – each organization’s situation will demand a slightly different solution. But knowing your choices, and thus making a wise one, is always the first step.