Data Warehousing: Should You Store Data Internally or in the Cloud?

by Mark Peterman

The decisions don’t stop once an organization has committed to investing in BI (Business Intelligence); in fact, they just begin.

One of the least understood and perhaps most divisive decisions regarding your new BI system concerns the storage of its data. The question is not where your data will be stored, as it will most likely be in a data warehouse, but rather where your data warehouse will be stored.

For most organizations it’s a two-horse race. You can either keep your data warehouse within the four walls of your business, or you can upload it to that mysterious place we call ‘the cloud’.

So which is the best choice? To find out, we should look at both how data warehousing has developed up to this point, and the sort of considerations that your organization should be keeping in mind when making the call.

A Very Brief History of Data Warehousing

While various things that could be labeled as data warehousing have existed since the early 70s (when market research giant ACNielsen provided its clients with a sales enhancement tool called a ‘data mart’), the term ‘data warehousing’ wasn’t seen in print until the late 80s.

Perhaps unsurprisingly, the term originated from Big Blue. It was IBM that both saw the obvious benefits of data warehousing, and had the wherewithal to act. But while IBM’s efforts were kept very much in-house, a man by the name of Bill Inmon saw the commercial potential of the technology, and is credited with making data warehousing available to all businesses during the early 90s, through his company Prism Solutions.

In the intervening quarter century the technology has developed at light speed, but some changes have been more marked than others. The foundation on which data warehousing has been built is the storage of data within internal repositories, but with the recent explosion of cloud computing, it’s fair to say that this foundation has been rocked.

Such a paradigm shift can be unsettling, and may raise questions in the minds of those who are either new to the field, or are used to doing things a certain way. So is cloud computing compatible with data warehousing? Can data warehousing adapt to the vagaries of the cloud? And even if it can, is it a wise storage choice?

Considerations for Those Deciding between Internal or Cloud Storage

To answer the questions above, let’s take a look at some of the most common reservations that are voiced, and the considerations that one needs to make, when deciding between internal and cloud storage.

Data security

The most common reservation for organizations in this space concerns security. ‘If I send my data elsewhere, how do I know it’ll be kept safe?’ Some organizations feel far more comfortable with their own security safeguards than they do with those of cloud computing services. It’s an entirely understandable thought, but it’s one that has a clear logical rebuttal.

It must be understood that the full-time job of cloud service providers is to keep their clients’ data secure – their existence is entirely dependent on it. A reputable provider will employ a team of specialists tasked with keeping data safe. Now ask yourself, does your business boast a dedicated team of experts whose sole focus is the security of your data? Unless you’re one of the biggest players in the market or a government organization, the answer is likely no. In all likelihood, therefore, you’re less equipped to handle your own sensitive data than a cloud provider is.

Gartner, in fact, has forecast that through 2020, cloud infrastructure workloads will suffer at least 60% fewer security incidents than those in traditional data centers.

It may be a tough pill to swallow, but when seen through an entirely rational lens, using the cloud will likely represent a security upgrade for your data.

Efficiency and cost savings

Internal data storage can be a pricey undertaking. Internal data warehousing compels an organization to purchase hardware, to stay on top of compliance and regulatory concerns and to hire contractors, employees and consultants to oversee its operation.

Cloud computing, on the other hand, represents a classic example of economies of scale. By creating an architecture that can be utilized by large numbers of users, the amount of resources required to store and manage x amount of data is greatly reduced.

When Amazon Web Services (AWS) announced its cloud-based data warehousing tool Redshift back in 2012, the potential cost savings of utilizing such a service were writ large. AWS quoted a figure of $1,000 per terabyte per year to manage data, while the corresponding figure for internally managing data sat at around $19,000 to $25,000. And in the years since, data warehousing in the cloud has only become cheaper.
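To see what those per-terabyte figures mean at scale, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted above; the 50-terabyte warehouse size and the function names are hypothetical, chosen purely for illustration, and real pricing will differ.

```python
# Figures quoted in the article, in US dollars per terabyte per year.
CLOUD_COST_PER_TB = 1_000
INTERNAL_COST_PER_TB_LOW = 19_000
INTERNAL_COST_PER_TB_HIGH = 25_000


def annual_cost_range(terabytes):
    """Return (cloud, internal_low, internal_high) annual cost in dollars."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    cloud = terabytes * CLOUD_COST_PER_TB
    internal_low = terabytes * INTERNAL_COST_PER_TB_LOW
    internal_high = terabytes * INTERNAL_COST_PER_TB_HIGH
    return cloud, internal_low, internal_high


def annual_savings_range(terabytes):
    """Return (low, high) estimated yearly savings from choosing the cloud."""
    cloud, internal_low, internal_high = annual_cost_range(terabytes)
    return internal_low - cloud, internal_high - cloud


if __name__ == "__main__":
    warehouse_tb = 50  # hypothetical warehouse size, for illustration only
    cloud, low, high = annual_cost_range(warehouse_tb)
    save_low, save_high = annual_savings_range(warehouse_tb)
    print(f"Cloud:    ${cloud:,} per year")
    print(f"Internal: ${low:,} to ${high:,} per year")
    print(f"Savings:  ${save_low:,} to ${save_high:,} per year")
```

On the quoted figures, the internal option costs somewhere between 19 and 25 times as much per terabyte, so the gap widens in step with the size of the warehouse.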

Sharing data

One understated benefit of using the cloud is the ability to quickly and easily share data with those who need it. While you’ll obviously apply strict controls to exactly who can see what, an internal system simply can’t compete with the cloud when it comes to instantaneous data sharing with those outside your organization.

The ability to share data can in fact turn into a potential profit center for a business, particularly those with high volume, low sensitivity data. Manufacturers, for example, can purchase detailed information regarding the performance of their products from retailers, and use this data to enhance their offering. Or, in the case of iRobot and its infomercial-famous Roomba vacuum cleaner, information about the layout of your home could be sold off.

The value of microseconds

But where does the cloud fall down? One area is lag time. Sure, with internet connections getting ever faster this will be less and less of a concern for most, but the basic rules of physics still state that no matter how quick your connection, it takes longer for data to travel across the world than it does to travel to an internal data warehouse sitting just a stone’s throw away. Even if information travelled at the full speed of light through fiber-optic cables (which it can’t) and didn’t have to pass through any switches or relays (which it will), the data would still take more than 60 milliseconds to get to the other side of the planet.
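The physical floor on lag time can be estimated with a quick calculation. The sketch below is a simplification: it assumes the signal follows the shortest path over the Earth's surface, ignores switches and relays entirely, and uses an assumed refractive index of about 1.47 for optical fiber (a typical textbook value, not a figure from any particular provider). The function name is hypothetical.

```python
import math

SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
EARTH_MEAN_RADIUS_KM = 6_371.0
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value for optical fiber


def one_way_delay_ms(distance_km, refractive_index=1.0):
    """Minimum one-way travel time in milliseconds over distance_km.

    A refractive index of 1.0 models a vacuum; larger values model a
    medium, such as glass fiber, in which light travels more slowly.
    """
    if distance_km < 0 or refractive_index < 1.0:
        raise ValueError("distance must be >= 0 and refractive index >= 1")
    speed_km_per_s = SPEED_OF_LIGHT_KM_PER_S / refractive_index
    return distance_km / speed_km_per_s * 1000.0


# Shortest surface path to the opposite side of the planet: half the
# circumference, a little over 20,000 km.
ANTIPODAL_DISTANCE_KM = math.pi * EARTH_MEAN_RADIUS_KM

if __name__ == "__main__":
    vacuum = one_way_delay_ms(ANTIPODAL_DISTANCE_KM)
    fiber = one_way_delay_ms(ANTIPODAL_DISTANCE_KM, FIBER_REFRACTIVE_INDEX)
    on_site = one_way_delay_ms(0.1, FIBER_REFRACTIVE_INDEX)  # 100 m away
    print(f"Other side of the planet, vacuum: {vacuum:.1f} ms")
    print(f"Other side of the planet, fiber:  {fiber:.1f} ms")
    print(f"Warehouse 100 m away, fiber:      {on_site:.5f} ms")
```

Under these assumptions the one-way trip to the far side of the planet comes out at roughly 67 milliseconds in a vacuum and closer to 100 milliseconds in fiber, while a warehouse 100 meters away answers in well under a microsecond. A request and its response double each figure.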

It’s for this reason that in environments where milliseconds count – in high-frequency trading, for example – the trend is towards super localized data warehouses that facilitate the transaction of data in a literal instant. While not a common concern, it’s certainly one to consider.

The need to upload

Another obvious concern for those considering data storage in the cloud is the need to get it there. When you have terabytes or even petabytes of data to upload to the cloud, obvious questions regarding security and bandwidth are raised.

Many cloud providers get around this in an ever so quaint way, by accepting preloaded hard drives via secure delivery. Who’d imagine that snail mail could be so helpful in cloud computing? Others use third-party providers like Equinix to facilitate direct connections to the cloud.
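A rough estimate shows why shipping drives can beat the wire. The sketch below assumes decimal units (1 terabyte = 10^12 bytes) and that only 80% of the nominal line rate is usable for the transfer; both the utilization figure and the function name are illustrative assumptions rather than measurements.

```python
SECONDS_PER_DAY = 86_400


def upload_days(data_terabytes, link_mbps, utilization=0.8):
    """Estimate the days needed to upload data_terabytes over a network link.

    link_mbps is the nominal link speed in megabits per second, and
    utilization is the fraction of that speed actually available.
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must be non-negative")
    if link_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("link_mbps must be > 0 and utilization in (0, 1]")
    total_bits = data_terabytes * 1e12 * 8  # terabytes -> bytes -> bits
    effective_bits_per_second = link_mbps * 1e6 * utilization
    return total_bits / effective_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical scenarios, for illustration only.
    print(f"100 TB over 1 Gbps:  {upload_days(100, 1_000):.1f} days")
    print(f"1 PB over 100 Mbps:  {upload_days(1_000, 100):.0f} days")
```

On these assumptions, 100 terabytes ties up a gigabit link for more than eleven days, and a petabyte over a 100 Mbps line would take over three years, which is when a courier starts to look like the faster network.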

There are ways and means to mitigate the challenges presented by the transfer of excessive amounts of data, but some would argue that they aren’t particularly elegant. And these challenges evaporate as soon as you commit to internal data warehousing.

The choice is yours

So which side of the data warehouse storage fence does your organization sit on? The answer, perhaps annoyingly, is not black and white. It will depend on how each of these considerations affects your business, and can only be found by carefully analyzing your strengths, weaknesses, needs and wants.

The one concrete fact that you can take away from this article is that the importance of data warehousing will only become more pronounced, so whichever data warehousing path you choose, you must ensure that it will be capable of servicing your organization well into the future.