The world runs on digital data. It’s the backbone of industries as diverse as media and shipping, and it’s what makes the internet the internet. Without it, we would have little of our modern technology or modern business methods. And without data storage, we wouldn’t have data.
As someone who works or aspires to work in information technology, you undoubtedly know the importance of data storage. What you may know less about are data warehouses.
A data warehouse is essentially a high-tech database capable of both data storage and data analysis. It’s the engine behind the rise of big data analytics, allowing organizations not only to store and access data but also to compare data on a massive scale, uncovering previously unknown links that can drive innovations, improve efficiencies, and inform decision-making.
Data warehouses have become so important to the future of business that students in IT degree programs, such as MS in Data Science, MS in Information Technology, and Master of Information Systems Management (MISM) programs, routinely study them in depth. For example, in Walden University’s MS in Information Technology program, students take a course entitled Introduction to Big Data Analytics, in which they read from an important article on big data published in Communications of the Association for Information Systems.
The article, “Tutorial: Big Data Analytics: Concepts, Technologies, and Applications” by Hugh Watson, provides expert-level information on data warehouses, including these three observations every IT manager should be aware of:
Advanced analytics can be very computationally intensive and create performance problems for descriptive analytics such as reports and dashboards when competing for computing resources. Query managers (part of the RDBMS) can help by prioritizing the order in which queries are processed, but do not provide a complete solution. Another approach is to create an analytic sandbox where modelers can “play” with advanced analytics without affecting other users.
Sandboxes can be real or virtual. With a real sandbox, a separate platform such as an appliance is used. Data for the sandbox is sourced from the data warehouse and possibly augmented by other data that the modelers add. In a virtual sandbox, a partition of the data warehouse is loaded with the data the modelers need. The data warehouse and sandbox reside in the same database software but operate as separate systems. A virtual sandbox requires data warehousing software that supports its creation and use.
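The query-prioritization idea described above can be illustrated with a small priority queue. This is only a sketch of the general technique, not how any particular RDBMS query manager works; the workload names, priority levels, and the `QueryQueue` class are hypothetical and chosen for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priority levels (assumption): a lower number runs first,
# so dashboards and reports are served before heavy modeling queries.
PRIORITY = {"dashboard": 0, "report": 1, "advanced_analytics": 2}


@dataclass(order=True)
class QueuedQuery:
    priority: int
    seq: int  # submission order, used to break ties fairly
    workload: str = field(compare=False)
    sql: str = field(compare=False)


class QueryQueue:
    """Toy query manager that orders pending queries by workload priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, workload, sql):
        item = QueuedQuery(PRIORITY[workload], next(self._seq), workload, sql)
        heapq.heappush(self._heap, item)

    def drain(self):
        """Return the workload labels in the order they would be processed."""
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap).workload)
        return order
```

Even in this toy version, a long-running modeling query still occupies resources once it starts, which is consistent with the article's point that prioritization alone is not a complete solution and that a sandbox may still be needed.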
The cloud is now in the mainstream of computing. With the cloud, computing resources are virtualized and offered as a service over the internet. The potential benefits of the cloud include access to specialized resources, quick deployment, easily expanded capacity, the ability to discontinue a cloud service when it is no longer needed, cost savings, and good backup and recovery. These same benefits make the cloud attractive for big data and analytics.
Cloud-based services come in a variety of forms. Public clouds are offered by third-party providers; private clouds are implemented within a company’s firewall. Concern about data security is a primary reason that private clouds are sometimes preferred over public clouds. We will discuss public clouds—although the same approaches and technologies are used with private clouds.
Cloud services are available as software-as-a-service (SaaS), platform-as-a-service (PaaS), or infrastructure-as-a-service (IaaS), depending on what software is provided. Cloud services are all similar in that a company’s data is loaded to the cloud, stored, and analyzed, and the results are downloaded to users and applications.
With SaaS, the vendor provides the hardware, application software, operating system, and storage. The user uploads data and uses the application software to either develop an application (e.g., reports) or simply process the data using the software (e.g., credit scoring). Many BI and analytics vendors offer cloud services versions of their software, including Cognos, Business Objects, MicroStrategy, and SAS. SaaS is a particularly attractive option for firms that lack the financial or human resources to implement and maintain the software and applications in-house.
PaaS differs from SaaS in that the vendor does not provide the software for building or running specific applications; this is up to the company. Only the basic platform is provided. The benefits of this approach include not having to maintain the computing infrastructure for applications that are developed; access to a dependable, highly scalable infrastructure; greater agility in developing new applications; and possible cost savings. Examples of PaaS include Oracle Cloud Computing, Microsoft Windows Azure, and Google App Engine.
With IaaS, the vendor provides raw computing power and storage; neither operating system nor application software is included. Customers upload an image that includes the application and operating system. Because the customer provides the operating system, different ones can be used with different applications. IaaS vendors’ offerings include Amazon EC2 (part of the Amazon Web Services offerings), Rackspace, and Google Compute Engine.
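The division of responsibility across the three service models can be summarized in a small lookup table. The sketch below is a simplification based on the descriptions above; the layer names are illustrative, and treating the operating system as part of the PaaS "basic platform" is an assumption inferred from the contrast with IaaS.

```python
# Layers of the stack, from lowest to highest (illustrative names).
LAYERS = ("hardware", "storage", "operating_system", "application_software")

# What the vendor supplies under each model, following the text above.
# Assumption: the PaaS "basic platform" includes the operating system.
VENDOR_PROVIDES = {
    "SaaS": {"hardware", "storage", "operating_system", "application_software"},
    "PaaS": {"hardware", "storage", "operating_system"},
    "IaaS": {"hardware", "storage"},
}


def customer_provides(model):
    """Return the layers the customer must supply, lowest layer first."""
    vendor = VENDOR_PROVIDES[model]
    return [layer for layer in LAYERS if layer not in vendor]


if __name__ == "__main__":
    for model in VENDOR_PROVIDES:
        print(model, "customer supplies:", customer_provides(model) or "nothing")
```

Read this way, the three models differ mainly in how far up the stack the vendor's responsibility extends.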
Of all the platforms and approaches to storing and analyzing big data, none is receiving more attention than Hadoop/MapReduce. Its origins trace back to the early 2000s, when companies such as Google, Yahoo!, and Facebook needed the ability to store and analyze massive amounts of data from the internet. Because no commercial solutions were available, these and other companies had to develop their own.
Important to the development of Hadoop/MapReduce were Doug Cutting and Mike Cafarella, who were working on an open-source Web search engine project called Nutch when Google published papers on the Google File System (2003) and MapReduce (2004). Impressed with Google’s work, Cutting and Cafarella incorporated the concepts into Nutch. Wanting greater opportunities to further his work, Cutting went to work for Yahoo!, which had its own big data projects under way. With Yahoo!’s support, Cutting created Hadoop (named after Cutting’s son’s stuffed elephant) as an open-source Apache Software Foundation project.
Apache Hadoop is a software framework for processing large amounts of data across potentially massively parallel clusters of servers. To illustrate, Yahoo! has over 42,000 servers in its Hadoop installation. Hadoop is open source and can be downloaded at www.apache.org. The key component of Hadoop is the Hadoop Distributed File System (HDFS), which manages the data spread across the various servers. It is because of HDFS that so many servers can be managed in parallel. HDFS is file-based and does not need a data model to store and process data. It can store data of any structure but is not an RDBMS. HDFS can manage the storage and access of any type of data (e.g., Web logs, XML files) as long as the data can be put in a file and copied into HDFS.
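To give a feel for the MapReduce half of Hadoop/MapReduce, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a few log lines. It does not use the Hadoop API; the function names and sample data are hypothetical, and a real cluster would run the map and reduce steps in parallel on many servers reading files from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (token, 1) pair for every whitespace-separated token in a line."""
    for token in line.lower().split():
        yield token, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values for one key; here, a simple count."""
    return key, sum(values)


def run_job(splits):
    """Run map, shuffle, and reduce over input splits (one list of lines each)."""
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each inner list stands in for a file split stored on a different server.
    splits = [["GET /home", "GET /cart"], ["POST /cart"]]
    print(run_job(splits))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes the pattern suitable for very large data sets.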
Understanding the value of data warehouses is just one of the many things you need to know to be a successful IT manager, whether you’re working as a manager in data analytics, software development, systems analysis, systems administration, cybersecurity, or even IT business consulting. To gain the full breadth of knowledge you need, you’ll likely want an advanced information technology degree, such as an MS in Data Science, MS in Information Technology, or Master of Information Systems Management.
In the past, attending a master’s degree program often required taking time off from your current job or trudging to night school across town—but thanks to online education, you can earn your MS in Data Science, MS in Information Technology or MISM degree far more conveniently. In fact, when you earn a degree online from Walden University, you can complete your courses right from home. Plus, online learning lets you take classes at whatever time of day works best for your schedule, making it possible to keep working as a data analyst, business intelligence manager, computer programmer, software engineer, database administrator, systems analyst, systems administrator, or anything else while you attend school.
Hugh Watson’s “Tutorial: Big Data Analytics: Concepts, Technologies, and Applications” is just one of the many resources you can encounter when you earn an advanced IT degree. Earning your master’s can be one of the best choices you make for your career.
Walden University is an accredited institution offering MS in Information Technology, MS in Data Science, and Master of Information Systems Management degree programs online. Expand your career options and earn your degree in a convenient, flexible format that fits your busy life.
Walden University is accredited by The Higher Learning Commission, www.hlcommission.org.