Data warehouse
A data warehouse (DW) is a database used for reporting. Data is offloaded from the operational systems and may pass through an Operational Data Store (ODS) for additional processing before it is used in the DW for reporting.
A data warehouse organises its functions into three layers: staging, integration and access. A guiding principle in data warehousing is that there is a place for each needed function in the DW, and that those functions exist to meet the users' reporting needs. The staging layer stores raw data for use by developers (analysis and support), the integration layer integrates the data and provides a level of abstraction from users, and the access layer makes the data available to users.
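As a rough illustration of these layers (hypothetical table and field names, not any particular product's API), the following Python sketch pushes a couple of raw records from staging through integration into an access-layer aggregate:

```python
# Minimal sketch of the staging -> integration -> access flow.
# Table names and fields are invented purely for illustration.

# Staging layer: raw data landed as-is from the operational system.
staging_orders = [
    {"order_id": "A-101", "amount": "19.50", "country": "us"},
    {"order_id": "A-102", "amount": "5.25",  "country": "US"},
]

# Integration layer: cleansed, conformed data with consistent types and codes.
integrated_orders = [
    {"order_id": row["order_id"],
     "amount": float(row["amount"]),
     "country": row["country"].upper()}
    for row in staging_orders
]

# Access layer: an aggregate shaped for reporting users.
revenue_by_country = {}
for row in integrated_orders:
    revenue_by_country[row["country"]] = (
        revenue_by_country.get(row["country"], 0.0) + row["amount"]
    )

print(revenue_by_country)   # {'US': 24.75}
```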
This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
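To illustrate the metadata side of that broader definition, here is a minimal, purely hypothetical sketch of a data dictionary that records the source, load time and row count for each table loaded into the warehouse:

```python
from datetime import datetime, timezone

# Hypothetical, minimal data dictionary: for each table loaded into the
# warehouse we catalogue where the data came from, when it was loaded,
# and how many rows arrived. Names are invented for illustration.
warehouse = {}
data_dictionary = {}

def load_table(name, source, rows):
    """Load rows into a (simulated) warehouse table and catalogue the load."""
    warehouse[name] = rows
    data_dictionary[name] = {
        "source": source,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
    }

load_table("dim_customer", "crm_export.csv", [{"id": 1, "name": "Acme"}])
print(data_dictionary["dim_customer"])
```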
Cloud Computing
National Institute of Standards and Technology, Information Technology Laboratory
Note 1: Cloud computing is still an evolving paradigm. Its definitions, use cases, underlying technologies, issues, risks, and benefits will be refined in a spirited debate by the public and private sectors. These definitions, attributes, and characteristics will evolve and change over time.
Note 2: The cloud computing industry represents a large ecosystem of many models, vendors, and market niches. This definition attempts to encompass all of the various cloud approaches.
Definition of Cloud Computing:
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.
Characteristics of Cloud Computing
· On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.
· Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
· Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
· Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
· Measured Service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service.
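As a toy illustration of measured service (not any vendor's metering API), the sketch below simply accumulates per-tenant resource usage so that it can be reported to both provider and consumer:

```python
from collections import defaultdict

# Toy metering ledger: usage is recorded per tenant and per resource type,
# which is what makes pay-per-use reporting and chargeback possible.
usage = defaultdict(float)

def record(tenant, resource, amount):
    usage[(tenant, resource)] += amount

record("dept_finance", "storage_gb_hours", 120.0)
record("dept_finance", "cpu_seconds", 3600.0)
record("dept_hr", "storage_gb_hours", 40.0)

for (tenant, resource), total in sorted(usage.items()):
    print(f"{tenant:>14}  {resource:<18} {total:10.1f}")
```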
Oracle’s Cloud-Based Strategy for the Data Warehouse
1. Uses high-volume hardware
2. Uses a grid processing model
3. Uses intelligent shared storage
4. Uses open standards-based operating systems
5. Provides a highly available platform
6. Delivers increased utilization
1) Use of High Volume Hardware
One of the main factors holding back many data warehouses today is their inability to quickly and easily integrate new hardware developments: new CPUs, bigger disks, faster interconnects, etc. Every new advance in hardware delivers faster performance for less money, which makes it vital that such developments can be incorporated into the data center as soon as possible.
This is especially true in data warehousing, where many customers need to store ever-increasing volumes of data as well as support an ever-growing community of users running ever more complex queries. New innovations such as Intel’s latest Nehalem chipset, interconnect technology from the super-computing industry (InfiniBand), ever-expanding SATA/SAS disk storage capacities and the introduction of SSD/flash technology are all vital in terms of delivering increased performance, improving overall efficiency and reducing total costs.
Many customers are being told that the simplest way to access new technology is to off-load some, or all, of their processing to the cloud (at this point I am not differentiating between public and private clouds). The problem is that simply moving to the cloud does not guarantee access to the latest hardware innovations. In many cases, it is simply a way of masking the use of proprietary hardware (and related software) that is probably well past its sell-by date.
2) Use of Grid Processing
Most customers are working with large numbers of dedicated servers, with storage assigned to each application. This creates a computing infrastructure where resources are tied to specific applications, resulting in an inflexible architecture that increases cost and power requirements while reducing overall performance, scalability and availability.
The way to resolve these issues is to move to a grid-based approach. Grid computing virtualizes and pools IT resources, such as compute power, storage and network capacity, into a set of shared services that can be distributed and re-distributed as needed, providing the flexibility to meet the changing needs of the business. It is much easier to support a short-term special project for a department if additional resources can be quickly and easily provisioned. Placing applications on a grid-based architecture enables multiple applications to share computing infrastructure, resulting in much greater flexibility, cost and power efficiency, performance, scalability and availability, all at the same time.
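As a loose sketch of the pooling idea (hypothetical names, no real provisioning API), the snippet below assigns servers to applications from a shared pool on demand and returns them afterwards:

```python
# Toy model of a resource grid: servers live in one shared pool and are
# assigned to applications on demand, then returned when no longer needed.
pool = {"srv1", "srv2", "srv3", "srv4"}
assignments = {}

def provision(app, count):
    """Assign `count` servers from the shared pool to an application."""
    taken = set(list(pool)[:count])
    pool.difference_update(taken)
    assignments.setdefault(app, set()).update(taken)

def release(app):
    """Return an application's servers to the shared pool."""
    pool.update(assignments.pop(app, set()))

provision("warehouse", 2)       # steady-state warehouse capacity
provision("quarter_end", 2)     # short-term project borrows the rest
release("quarter_end")          # ...and hands it back afterwards
print(len(pool), "servers free")   # 2 servers free
```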
Oracle Real Application Clusters (RAC) allows the Oracle Database to run on a grid platform. Nodes, CPUs, storage and memory can all be dynamically provisioned while the system remains online, making it easy to maintain service levels while lowering overall costs through improved utilization. In fact, you could consider the "C" in RAC as referring to "Cloud" rather than "Cluster".
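In practice, a client connects to a database service exposed by the cluster rather than to a specific node, so nodes can be added or removed behind that service without client-side changes. A minimal sketch, assuming the python-oracledb driver and with placeholder host, service name and credentials:

```python
import oracledb  # assumes the python-oracledb driver is installed

# Connect to a database *service* exposed by the cluster rather than to a
# specific node; host, service name and credentials below are placeholders
# and require a reachable database to actually run.
conn = oracledb.connect(
    user="report_user",
    password="secret",
    dsn="rac-scan.example.com:1521/dwh_service",
)

with conn.cursor() as cur:
    cur.execute("SELECT sysdate FROM dual")
    print(cur.fetchone()[0])

conn.close()
```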
Adding additional resources “on-demand” is a key requirement for delivering cloud computing, and I would argue that this can be a complicated process within a shared-nothing infrastructure, making that type of approach unsuitable for a cloud computing strategy. In reality, adding something relatively simple such as more storage has a profound impact on the whole environment: the database has to be completely rebuilt to re-distribute data evenly across all the disks. For many vendors, adding more storage space also means adding more processing nodes to ensure the hardware remains in balance with the data. This all creates additional downtime for the business, as the whole platform has to go offline while the new resources are added and configured, which impacts SLAs.
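A toy illustration of why this redistribution is so disruptive: with simple hash partitioning, changing the number of nodes changes where almost every row belongs, so most of the data has to move (the node counts and key range below are arbitrary):

```python
# Toy illustration of why adding capacity to a hash-partitioned,
# shared-nothing system forces a large redistribution of data: the
# partitioning function changes, so most rows map to a different node.
keys = range(100_000)

def node_for(key, node_count):
    return hash(key) % node_count

before = {k: node_for(k, 4) for k in keys}   # 4 nodes
after  = {k: node_for(k, 5) for k in keys}   # add a 5th node

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of rows change node")   # 80% of rows change node
```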
3) Use of Intelligent Shared Storage
Today’s data warehouse is completely different from yesterday’s. Data volumes, query complexity and numbers of users have all increased dramatically and will continue to increase. The pressure to analyze increasing amounts of data will put more strain on the storage layer, and many systems will struggle with I/O bottlenecks. With traditional storage, creating a shared storage grid is difficult to achieve because of the inability to prioritize the work of the various jobs and users consuming I/O bandwidth from the storage subsystem. The same problem occurs when multiple databases share the storage subsystem.
Exadata delivers a new kind of storage – intelligent storage – specifically built for the Oracle Database. Exadata has powerful smart scan features which reduce the time taken to find the data relevant to a specific query and begin the process of transforming the data into information. At the disk level, a large amount of intelligent processing is done to support a query. Consequently, the result returned from disk is reduced to only the information needed to satisfy the query, and is significantly smaller than with a traditional block storage approach (as used by many other data warehouse vendors).
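A very loose analogy for this offloading idea (not Exadata's actual protocol): instead of shipping whole blocks to the database tier and filtering there, the storage tier returns only the rows and columns the query actually needs:

```python
# Loose analogy only: "blocks" of rows live at the storage tier.
blocks = [
    [{"id": i, "region": "EMEA" if i % 3 else "APAC", "amount": i * 1.5}
     for i in range(b * 100, (b + 1) * 100)]
    for b in range(10)
]

def block_serving(predicate):
    """Traditional approach: ship every block, filter at the database tier."""
    shipped = [row for block in blocks for row in block]
    return [row for row in shipped if predicate(row)], len(shipped)

def smart_scan_like(predicate, columns):
    """Offloaded approach: filter and project at the storage tier."""
    shipped = [{c: row[c] for c in columns}
               for block in blocks for row in block if predicate(row)]
    return shipped, len(shipped)

pred = lambda r: r["region"] == "APAC"
_, shipped_traditional = block_serving(pred)
_, shipped_offloaded = smart_scan_like(pred, ["id", "amount"])
print(shipped_traditional, "rows shipped vs", shipped_offloaded)   # 1000 vs 334
```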
The resource management capabilities of Exadata storage can prevent one class of work, or one database, from monopolizing disk resources and bandwidth, ensuring that user-defined SLAs are met. With an Exadata system it is possible to identify various types of workloads, assign priorities to them, and ensure that the most critical workloads get priority.
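Conceptually this is a form of weighted sharing. A toy sketch with made-up workload names and shares:

```python
# Toy weighted sharing: I/O bandwidth is divided among workload classes in
# proportion to administrator-assigned shares, so no class can monopolize it.
shares = {"critical_reports": 60, "ad_hoc_queries": 30, "batch_loads": 10}
total_bandwidth_mb_s = 2000

allocation = {name: total_bandwidth_mb_s * s / sum(shares.values())
              for name, s in shares.items()}
print(allocation)
# {'critical_reports': 1200.0, 'ad_hoc_queries': 600.0, 'batch_loads': 200.0}
```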
The tight integration between the storage layer, Exadata, and the Oracle Database ensures customers get all the benefits of extreme performance with the scalability and high availability required to support a “cloud” based enterprise data warehouse.
4) Use of Open Standards Based Operating Systems
The same concept that applies to hardware also applies to operating systems. Customers need to move from proprietary operating systems to ones based on open standards, such as Linux. The use of open standards based operating systems also allows new technologies to be rapidly incorporated.
Oracle provides its own branded version of Linux – Oracle Enterprise Linux – and is committed to making Linux stronger and better. Oracle works closely with, and contributes to, the Linux community to ensure the Oracle Database runs optimally across all major flavors of Linux. This cooperation extends to the very latest technology supporting both Exadata and the Sun Oracle Database Machine, such as support for InfiniBand as a networking infrastructure. Oracle is working with the Linux community to help standardize the use of InfiniBand interconnects, and has already released the InfiniBand drivers it developed for use with the Oracle Database Machine to the open-source community.
With its support for Linux, use of commodity hardware components, intelligent shared storage and grid architecture, Oracle is able to deliver the most open approach to enterprise data warehousing in the market today and support the key elements needed to allow customers to develop a successful cloud based data warehouse strategy.
5) Use of a Highly Available Framework
In a hardware cloud it is vital that there is no single point of failure. As the number of applications sharing the hardware increases, so does the impact of a loss of service. A data warehouse, either inside or outside the cloud, can be subject to both planned and unplanned outages. There are many types of unplanned system outage, such as computer failure, storage failure, human error, data corruption, lost writes, hangs or slowdowns, and even complete site failure. Planned system outages are the result of routine and periodic maintenance operations and new deployments. The key is to minimize the amount of downtime to reduce the impact on productivity and avoid lost revenue, damaged customer relationships, bad publicity, and lawsuits.
A data warehouse built around a shared-nothing architecture is vulnerable to the loss of a node or a disk, since losing either means that a specific portion of the data set is unavailable. As a result, queries and/or processes have to be halted until the node or disk is repaired.
A shared everything architecture, such as Oracle’s, is the ideal solution for cloud computing since there is no single point of failure. If a disk or node fails, queries and/or processes are simply serviced from another disk containing a copy of the data from the failed disk or transparently moved to another node. This is achieved without interruptions in service, saving cost and ensuring business continuity.
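A toy model of that behaviour, with mirrored extents on two disks and a simulated disk failure (names and layout are invented for illustration):

```python
# Toy model of shared storage with mirroring: every extent has a copy on a
# second disk, so losing one disk does not make any data unavailable.
disks = {
    "disk_a": {"extent_1": "rows 1-1000", "extent_2": "rows 1001-2000"},
    "disk_b": {"extent_1": "rows 1-1000", "extent_2": "rows 1001-2000"},  # mirror
}
failed = {"disk_a"}   # simulate a disk failure

def read_extent(name):
    for disk, contents in disks.items():
        if disk not in failed and name in contents:
            return contents[name], disk
    raise IOError(f"no surviving copy of {name}")

data, served_from = read_extent("extent_2")
print(served_from)   # disk_b -- the query continues without interruption
```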
Exadata has been designed to incorporate the same standards of high availability (HA) customers have come to expect from Oracle products. With Exadata, all database features and tools work just as they do with traditional non-Exadata storage. With the Exadata architecture, all single points of failure are eliminated. Familiar features such as mirroring, fault isolation, and protection against drive and cell failure have been incorporated into Exadata to ensure continual availability and protection of data. Other features to ensure high availability within the Exadata Storage Server are described below.
Oracle's Hardware Assisted Resilient Data (HARD) Initiative is a comprehensive program designed to prevent data corruptions before they happen. Data corruptions are very rare, but when they happen, they can have a catastrophic effect on a database, and therefore a business. Exadata has enhanced HARD functionality embedded in it to provide even higher levels of protection and end-to-end data validation for your data.
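The underlying idea of end-to-end validation can be sketched with a simple checksum that is written alongside each block and re-checked on every read (an analogy only, not the HARD implementation):

```python
import hashlib, json

# Toy end-to-end validation: a checksum is computed when a block is written
# and re-checked when it is read, so a corrupted block is caught rather than
# silently returned to the application.
def write_block(payload):
    body = json.dumps(payload).encode()
    return {"body": body, "checksum": hashlib.sha256(body).hexdigest()}

def read_block(block):
    if hashlib.sha256(block["body"]).hexdigest() != block["checksum"]:
        raise ValueError("block failed validation: possible corruption")
    return json.loads(block["body"])

block = write_block({"order_id": "A-101", "amount": 19.99})
print(read_block(block))                                     # round-trips cleanly
block["body"] = block["body"].replace(b"19.99", b"99.99")    # simulate corruption
try:
    read_block(block)
except ValueError as exc:
    print(exc)
```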
6) Delivers Increased Utilization
One of the key aims of cloud computing is to increase the utilization of existing hardware. What is actually required is a new approach that allows applications to be consolidated onto a single, scalable platform. This allows resources (disk, processing, memory) to be shared across all applications and allocated as required. If one particular application needs additional short-term resources for a special project, the infrastructure should be flexible enough to allow those resources to be made available without significantly impacting other applications.
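A small worked example of why consolidation raises utilization: applications that peak at different times need far less shared capacity than the sum of their individually sized dedicated servers (the load figures below are invented):

```python
# Toy utilization comparison: three applications with peaks in different
# time slots. On dedicated servers each must be sized for its own peak; on
# a consolidated platform the shared capacity only needs to cover the
# combined load at any one time.
hourly_load = {
    "reporting":  [80, 20, 10, 10],   # units of capacity per time slot
    "etl_batch":  [10, 10, 70, 20],
    "dashboards": [10, 60, 10, 10],
}

dedicated_capacity = sum(max(load) for load in hourly_load.values())
combined = [sum(values) for values in zip(*hourly_load.values())]
consolidated_capacity = max(combined)

print(dedicated_capacity, "units dedicated vs", consolidated_capacity, "consolidated")
# 210 units dedicated vs 100 consolidated
```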
The use of high volume hardware, a grid architecture, a highly available framework and open standards makes it possible to create a platform suitable for consolidation and capable of supporting enterprise-wide applications.
Now we have the second part of our cloud strategy in place: a hardware platform to support the data warehouse cloud.
The last stage is to review the key software requirements needed to support and develop a successful cloud-based data warehouse strategy.