Data warehouse
A data warehouse (DW) is a database used for reporting. Data is offloaded from the operational systems and may pass through an Operational Data Store (ODS) for additional processing before it is used in the DW for reporting.
A data warehouse organises its functions into three layers: staging, integration and access. A guiding principle in data warehousing is that there is a place for each needed function in the DW, and that those functions exist to meet the users' reporting needs. The staging layer stores raw data for use by developers (analysis and support), the integration layer integrates the data and provides a level of abstraction from users, and the access layer makes the data available to users.
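As a rough illustration of these layers (hypothetical table and field names, not any particular product's API), the following Python sketch pushes a couple of raw records from staging through integration into an access-layer aggregate:

```python
# Minimal sketch of the staging -> integration -> access flow.
# Table names and fields are invented purely for illustration.

# Staging layer: raw data landed as-is from the operational system.
staging_orders = [
    {"order_id": "A-101", "amount": "19.50", "country": "us"},
    {"order_id": "A-102", "amount": "5.25",  "country": "US"},
]

# Integration layer: cleansed, conformed data with consistent types and codes.
integrated_orders = [
    {"order_id": row["order_id"],
     "amount": float(row["amount"]),
     "country": row["country"].upper()}
    for row in staging_orders
]

# Access layer: an aggregate shaped for reporting users.
revenue_by_country = {}
for row in integrated_orders:
    revenue_by_country[row["country"]] = (
        revenue_by_country.get(row["country"], 0.0) + row["amount"]
    )

print(revenue_by_country)   # {'US': 24.75}
```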
This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
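To illustrate the metadata side of that broader definition, here is a minimal, purely hypothetical sketch of a data dictionary that records the source, load time and row count for each table loaded into the warehouse:

```python
from datetime import datetime, timezone

# Hypothetical, minimal data dictionary: for each table loaded into the
# warehouse we catalogue where the data came from, when it was loaded,
# and how many rows arrived. Names are invented for illustration.
warehouse = {}
data_dictionary = {}

def load_table(name, source, rows):
    """Load rows into a (simulated) warehouse table and catalogue the load."""
    warehouse[name] = rows
    data_dictionary[name] = {
        "source": source,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
    }

load_table("dim_customer", "crm_export.csv", [{"id": 1, "name": "Acme"}])
print(data_dictionary["dim_customer"])
```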
Cloud Computing
National Institute of Standards and Technology, Information Technology Laboratory
Note 1: Cloud computing is still an evolving paradigm. Its definitions, use cases, underlying technologies, issues, risks, and benefits will be refined in a spirited debate by the public and private sectors. These definitions, attributes, and characteristics will evolve and change over time.
Note 2: The cloud computing industry represents a large ecosystem of many models, vendors, and market niches. This definition attempts to encompass all of the various cloud approaches.
Definition of Cloud Computing:
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.
Characteristics of Cloud Computing
· On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.
· Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
· Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
· Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
· Measured Service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service.
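As a toy illustration of measured service (not any vendor's metering API), the sketch below simply accumulates per-tenant resource usage so that it can be reported to both provider and consumer:

```python
from collections import defaultdict

# Toy metering ledger: usage is recorded per tenant and per resource type,
# which is what makes pay-per-use reporting and chargeback possible.
usage = defaultdict(float)

def record(tenant, resource, amount):
    usage[(tenant, resource)] += amount

record("dept_finance", "storage_gb_hours", 120.0)
record("dept_finance", "cpu_seconds", 3600.0)
record("dept_hr", "storage_gb_hours", 40.0)

for (tenant, resource), total in sorted(usage.items()):
    print(f"{tenant:>14}  {resource:<18} {total:10.1f}")
```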
Oracle’s Cloud-Based Strategy for the Data Warehouse
1. Uses high-volume hardware
2. Uses a grid processing model
3. Uses intelligent shared storage
4. Uses open standards-based operating systems
5. Provides a highly available platform
6. Delivers increased utilization
1) Use of High Volume Hardware
One of the main factors holding back many data warehouses today is their inability to quickly and easily integrate new hardware developments: new CPUs, bigger disks, faster interconnects, etc. Every new advance in hardware delivers faster performance for less money, which makes it vital that such developments can be incorporated into the data center as soon as possible.
This is especially true in data warehousing, where many customers need to store ever-increasing volumes of data as well as support an ever-growing community of users running ever more complex queries. New innovations such as Intel’s latest Nehalem chipset, interconnect technology from the super-computing industry (InfiniBand), ever-expanding SATA/SAS disk storage capacities and the introduction of SSD/flash technology are all vital in terms of delivering increased performance, improving overall efficiency and reducing total costs.
Many customers are being told that the simplest way to access new technology is to off-load some, or all, of their processing to the cloud (at this point I am not differentiating between public and private clouds). The problem is that simply moving to the cloud does not guarantee access to the latest hardware innovations. In many cases, it is simply a way of masking the use of proprietary hardware (and related software) that is probably well past its sell-by date.
2) Use of Grid Processing
Most customers are working with large numbers of dedicated servers, with storage assigned to each application. This creates a computing infrastructure where resources are tied to specific applications, resulting in an inflexible architecture that increases cost and power requirements while reducing overall performance, scalability and availability.
The way to resolve these issues is to move to a grid-based approach. Grid computing virtualizes and pools IT resources, such as compute power, storage and network capacity, into a set of shared services that can be distributed and re-distributed as needed, providing the flexibility to meet the changing needs of the business. It is much easier to support a short-term special project for a department if additional resources can be quickly and easily provisioned. Placing applications on a grid-based architecture enables multiple applications to share computing infrastructure, resulting in much greater flexibility, cost and power efficiency, performance, scalability and availability, all at the same time.
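As a loose sketch of the pooling idea (hypothetical names, no real provisioning API), the snippet below assigns servers to applications from a shared pool on demand and returns them afterwards:

```python
# Toy model of a resource grid: servers live in one shared pool and are
# assigned to applications on demand, then returned when no longer needed.
pool = {"srv1", "srv2", "srv3", "srv4"}
assignments = {}

def provision(app, count):
    """Assign `count` servers from the shared pool to an application."""
    taken = set(list(pool)[:count])
    pool.difference_update(taken)
    assignments.setdefault(app, set()).update(taken)

def release(app):
    """Return an application's servers to the shared pool."""
    pool.update(assignments.pop(app, set()))

provision("warehouse", 2)       # steady-state warehouse capacity
provision("quarter_end", 2)     # short-term project borrows the rest
release("quarter_end")          # ...and hands it back afterwards
print(len(pool), "servers free")   # 2 servers free
```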
Oracle Real Application Clusters (RAC) allows the Oracle Database to run on a grid platform. Nodes, CPUs, storage and memory can all be dynamically provisioned while the system remains online, making it easy to maintain service levels while lowering overall costs through improved utilization. In fact, you could consider the "C" in RAC as referring to "Cloud" rather than "Cluster".
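In practice, a client connects to a database service exposed by the cluster rather than to a specific node, so nodes can be added or removed behind that service without client-side changes. A minimal sketch, assuming the python-oracledb driver and with placeholder host, service name and credentials:

```python
import oracledb  # assumes the python-oracledb driver is installed

# Connect to a database *service* exposed by the cluster rather than to a
# specific node; host, service name and credentials below are placeholders
# and require a reachable database to actually run.
conn = oracledb.connect(
    user="report_user",
    password="secret",
    dsn="rac-scan.example.com:1521/dwh_service",
)

with conn.cursor() as cur:
    cur.execute("SELECT sysdate FROM dual")
    print(cur.fetchone()[0])

conn.close()
```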
Adding additional resources “on-demand” is a key requirement for delivering cloud computing, and I would argue that this can be a complicated process within a shared-nothing infrastructure, making that type of approach unsuitable for a cloud computing strategy. In reality, adding something relatively simple such as more storage has a profound impact on the whole environment: the database has to be completely rebuilt to re-distribute data evenly across all the disks. For many vendors, adding more storage space also means adding more processing nodes to ensure the hardware remains in balance with the data. This all creates additional downtime for the business, as the whole platform has to go offline while the new resources are added and configured, which impacts SLAs.
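A toy illustration of why this redistribution is so disruptive: with simple hash partitioning, changing the number of nodes changes where almost every row belongs, so most of the data has to move (the node counts and key range below are arbitrary):

```python
# Toy illustration of why adding capacity to a hash-partitioned,
# shared-nothing system forces a large redistribution of data: the
# partitioning function changes, so most rows map to a different node.
keys = range(100_000)

def node_for(key, node_count):
    return hash(key) % node_count

before = {k: node_for(k, 4) for k in keys}   # 4 nodes
after  = {k: node_for(k, 5) for k in keys}   # add a 5th node

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of rows change node")   # 80% of rows change node
```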
3) Use of Intelligent Shared Storage
Today’s data warehouse is completely different from yesterday’s. Data volumes, query complexity and numbers of users have all increased dramatically and will continue to increase. The pressure to analyze increasing amounts of data will put more strain on the storage layer, and many systems will struggle with I/O bottlenecks. With traditional storage, creating a shared storage grid is difficult to achieve because of the inability to prioritize the work of the various jobs and users consuming I/O bandwidth from the storage subsystem. The same problem occurs when multiple databases share the storage subsystem.
Exadata delivers a new kind of storage – intelligent storage – specifically built for the Oracle Database. Exadata has powerful smart scan features which reduce the time taken to find the data relevant to a specific query and begin the process of transforming the data into information. At the disk level, a large amount of intelligent processing is done to support a query. Consequently, the result returned from disk is reduced to only the information needed to satisfy the query, and is significantly smaller than with a traditional block storage approach (as used by many other data warehouse vendors).
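A very loose analogy for this offloading idea (not Exadata's actual protocol): instead of shipping whole blocks to the database tier and filtering there, the storage tier returns only the rows and columns the query actually needs:

```python
# Loose analogy only: "blocks" of rows live at the storage tier.
blocks = [
    [{"id": i, "region": "EMEA" if i % 3 else "APAC", "amount": i * 1.5}
     for i in range(b * 100, (b + 1) * 100)]
    for b in range(10)
]

def block_serving(predicate):
    """Traditional approach: ship every block, filter at the database tier."""
    shipped = [row for block in blocks for row in block]
    return [row for row in shipped if predicate(row)], len(shipped)

def smart_scan_like(predicate, columns):
    """Offloaded approach: filter and project at the storage tier."""
    shipped = [{c: row[c] for c in columns}
               for block in blocks for row in block if predicate(row)]
    return shipped, len(shipped)

pred = lambda r: r["region"] == "APAC"
_, shipped_traditional = block_serving(pred)
_, shipped_offloaded = smart_scan_like(pred, ["id", "amount"])
print(shipped_traditional, "rows shipped vs", shipped_offloaded)   # 1000 vs 334
```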
The resource management capabilities of Exadata storage can prevent one class of work, or one database, from monopolizing disk resources and bandwidth, ensuring that user-defined SLAs are met. With an Exadata system it is possible to identify various types of workloads, assign priorities to them, and ensure that the most critical workloads get priority.
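Conceptually this is a form of weighted sharing. A toy sketch with made-up workload names and shares:

```python
# Toy weighted sharing: I/O bandwidth is divided among workload classes in
# proportion to administrator-assigned shares, so no class can monopolize it.
shares = {"critical_reports": 60, "ad_hoc_queries": 30, "batch_loads": 10}
total_bandwidth_mb_s = 2000

allocation = {name: total_bandwidth_mb_s * s / sum(shares.values())
              for name, s in shares.items()}
print(allocation)
# {'critical_reports': 1200.0, 'ad_hoc_queries': 600.0, 'batch_loads': 200.0}
```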
The tight integration between the storage layer, Exadata, and the Oracle Database ensures customers get all the benefits of extreme performance with the scalability and high availability required to support a “cloud” based enterprise data warehouse.
4) Use of Open Standards Based Operating Systems
The same concept that applies to hardware also applies to operating systems. Customers need to move from proprietary operating systems to ones based on open standards, such as Linux. The use of open standards based operating systems also allows new technologies to be rapidly incorporated.
Oracle provides its own branded version of Linux – Oracle Enterprise Linux – and is committed to making Linux stronger and better. Oracle works closely with, and contributes to, the Linux community to ensure the Oracle Database runs optimally across all major flavors of Linux. This cooperation extends to the very latest technology supporting both Exadata and the Sun Oracle Database Machine, such as support for InfiniBand as a networking infrastructure. Oracle is working with the Linux community to help standardize the use of InfiniBand interconnects, and has already released the InfiniBand drivers it developed for use with the Oracle Database Machine to the open-source community.
With its support for Linux, use of commodity hardware components, intelligent shared storage and grid architecture, Oracle is able to deliver the most open approach to enterprise data warehousing in the market today and support the key elements needed to allow customers to develop a successful cloud based data warehouse strategy.
5) Use of a Highly Available Framework
In a hardware cloud it is vital that there is no single point of failure. As the number of applications sharing the hardware increases, so does the impact of a loss of service. A data warehouse, either inside or outside the cloud, can be subject to both planned and unplanned outages. There are many types of unplanned system outage, such as computer failure, storage failure, human error, data corruption, lost writes, hangs or slowdowns, and even complete site failure. Planned system outages are the result of routine and periodic maintenance operations and new deployments. The key is to minimize the amount of downtime to reduce the impact on productivity and avoid lost revenue, damaged customer relationships, bad publicity, and lawsuits.
A data warehouse built around a shared-nothing architecture is vulnerable to the loss of a node or a disk, since losing either means that a specific portion of the data set is unavailable. As a result, queries and/or processes have to be halted until the node or disk is repaired.
A shared everything architecture, such as Oracle’s, is the ideal solution for cloud computing since there is no single point of failure. If a disk or node fails, queries and/or processes are simply serviced from another disk containing a copy of the data from the failed disk or transparently moved to another node. This is achieved without interruptions in service, saving cost and ensuring business continuity.
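A toy model of that behaviour, with mirrored extents on two disks and a simulated disk failure (names and layout are invented for illustration):

```python
# Toy model of shared storage with mirroring: every extent has a copy on a
# second disk, so losing one disk does not make any data unavailable.
disks = {
    "disk_a": {"extent_1": "rows 1-1000", "extent_2": "rows 1001-2000"},
    "disk_b": {"extent_1": "rows 1-1000", "extent_2": "rows 1001-2000"},  # mirror
}
failed = {"disk_a"}   # simulate a disk failure

def read_extent(name):
    for disk, contents in disks.items():
        if disk not in failed and name in contents:
            return contents[name], disk
    raise IOError(f"no surviving copy of {name}")

data, served_from = read_extent("extent_2")
print(served_from)   # disk_b -- the query continues without interruption
```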
Exadata has been designed to incorporate the same standards of high availability (HA) customers have come to expect from Oracle products. With Exadata, all database features and tools work just as they do with traditional non-Exadata storage. With the Exadata architecture, all single points of failure are eliminated. Familiar features such as mirroring, fault isolation, and protection against drive and cell failure have been incorporated into Exadata to ensure continual availability and protection of data. Other features to ensure high availability within the Exadata Storage Server are described below.
Oracle's Hardware Assisted Resilient Data (HARD) Initiative is a comprehensive program designed to prevent data corruptions before they happen. Data corruptions are very rare, but when they happen, they can have a catastrophic effect on a database, and therefore a business. Exadata has enhanced HARD functionality embedded in it to provide even higher levels of protection and end-to-end data validation for your data.
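The underlying idea of end-to-end validation can be sketched with a simple checksum that is written alongside each block and re-checked on every read (an analogy only, not the HARD implementation):

```python
import hashlib, json

# Toy end-to-end validation: a checksum is computed when a block is written
# and re-checked when it is read, so a corrupted block is caught rather than
# silently returned to the application.
def write_block(payload):
    body = json.dumps(payload).encode()
    return {"body": body, "checksum": hashlib.sha256(body).hexdigest()}

def read_block(block):
    if hashlib.sha256(block["body"]).hexdigest() != block["checksum"]:
        raise ValueError("block failed validation: possible corruption")
    return json.loads(block["body"])

block = write_block({"order_id": "A-101", "amount": 19.99})
print(read_block(block))                                     # round-trips cleanly
block["body"] = block["body"].replace(b"19.99", b"99.99")    # simulate corruption
try:
    read_block(block)
except ValueError as exc:
    print(exc)
```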
6) Delivers Increased Utilization
One of the key aims of cloud computing is to increase the utilization of existing hardware. What is actually required is a new approach that allows applications to be consolidated onto a single, scalable platform. This allows resources (disk, processing, memory) to be shared across all applications and allocated as required. If one particular application needs additional short-term resources for a special project, the infrastructure should be flexible enough to allow those resources to be made available without significantly impacting other applications.
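A small worked example of why consolidation raises utilization: applications that peak at different times need far less shared capacity than the sum of their individually sized dedicated servers (the load figures below are invented):

```python
# Toy utilization comparison: three applications with peaks in different
# time slots. On dedicated servers each must be sized for its own peak; on
# a consolidated platform the shared capacity only needs to cover the
# combined load at any one time.
hourly_load = {
    "reporting":  [80, 20, 10, 10],   # units of capacity per time slot
    "etl_batch":  [10, 10, 70, 20],
    "dashboards": [10, 60, 10, 10],
}

dedicated_capacity = sum(max(load) for load in hourly_load.values())
combined = [sum(values) for values in zip(*hourly_load.values())]
consolidated_capacity = max(combined)

print(dedicated_capacity, "units dedicated vs", consolidated_capacity, "consolidated")
# 210 units dedicated vs 100 consolidated
```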
The use of high volume hardware, a grid architecture, a highly available framework and open standards makes it possible to create a platform suitable for consolidation and capable of supporting enterprise-wide applications.
Now we have the second part of our cloud strategy in place: a hardware platform to support the data warehouse cloud.
The last stage is to review the key software requirements needed to support and develop a successful cloud-based data warehouse strategy.