Using Hadoop for Data Science

What Is Hadoop?

Hadoop is an open-source software framework for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines.

Hadoop grew out of an open-source search engine called Nutch, developed by Doug Cutting and Mike Cafarella. Back in the early days of the Internet, the pair wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be executed at the same time.

The distributed computing and processing portion of Nutch was eventually split off and named Hadoop (after a toy elephant belonging to Cutting’s son). Hadoop became a top-level Apache project in 2008, with Yahoo as a major early contributor, and today the Hadoop ecosystem is managed and maintained by the non-profit Apache Software Foundation (ASF), an international community of software developers and contributors.

Four core modules are included in the ASF’s basic framework:

Hadoop Common consists of the common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS) is a distributed file system that provides high-throughput access to application data.
Hadoop YARN is a framework for job scheduling and cluster resource management.
Hadoop MapReduce is a YARN-based system for parallel processing of large data sets.
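
To make the MapReduce idea more concrete, here is a minimal sketch of the programming model in plain Python. It does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, roughly what a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data nodes"]))
```

On a real cluster, the map and reduce steps would run in parallel on many machines, each working on its own slice of the data; the sketch only shows how the work is divided.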

Other elements of the Hadoop ecosystem of technologies include the following:

Pig is a high-level data-flow language and execution framework for parallel computation. It allows users to perform data extractions and transformations and basic analysis without having to write MapReduce programs.
Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying. It was initially developed by Facebook.
HBase is a scalable, distributed database that supports structured data storage for large tables.
Ambari is a web interface for provisioning, managing, and monitoring Hadoop services and components.
Cassandra is a scalable multi-master database system.
Oozie is a Hadoop job scheduler.
Sqoop is a connection and transfer mechanism that moves data between Hadoop and relational databases.
Spark is an open-source cluster computing framework with in-memory analytics.
ZooKeeper is a high-performance coordination service for distributed applications.

Hadoop can be downloaded for free, but commercial distributions such as Cloudera, Hortonworks, and MapR are also available. For a fee, you get the software vendor’s version of the framework along with additional software components, tools, training, and documentation.

Why Use Hadoop?

Hadoop has a lot to offer. SAS Institute identifies the following five benefits:

Computing power: Hadoop’s distributed computing model allows it to process huge amounts of data. The more nodes you use, the more processing power you have.
Flexibility: Hadoop stores data without requiring any preprocessing. Store data—even unstructured data such as text, images, and video—now; decide what to do with it later.
Fault tolerance: Hadoop automatically stores multiple copies of all data, and if one node fails during data processing, jobs are redirected to other nodes and distributed computing continues.
Low cost: The open-source framework is free, and data is stored on commodity hardware.
Scalability: You can easily grow your Hadoop system, simply by adding more nodes.
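
The fault-tolerance benefit can be illustrated with a toy simulation. The sketch below assumes a replication factor of three (the HDFS default) and a simple round-robin placement; real HDFS placement also takes racks into account, and the names place_blocks and readable_blocks are hypothetical helpers, not Hadoop APIs.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for index, block in enumerate(block_ids):
        # Start at a different node for each block and take the next few nodes.
        placement[block] = {
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        }
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_blocks(["b0", "b1", "b2", "b3"], nodes)
    # With three copies of each block, losing two nodes leaves all data readable.
    print(sorted(readable_blocks(placement, ["n1", "n2"])))
```

In this toy setup, data only becomes unavailable once every node holding a given block has failed, which is why adding replicas trades storage for resilience.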

Although the development of Hadoop was motivated by the need to search millions of webpages and return relevant results, today it serves a variety of purposes. Hadoop’s low-cost storage makes it an appealing option for storing information that is not currently critical but that might be analyzed later. Because Hadoop does not require a schema to be defined before data is written, its storage is free of the schema-related constraints commonly found in SQL-based systems. Organizations are using Hadoop to stage large amounts of raw, sometimes unstructured data for loading into enterprise data warehouses. Many of Hadoop’s largest adopters also use it for the large-scale data analysis behind web-based recommendation systems.
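
The idea of storing raw data first and deciding on structure later is often called schema-on-read. Here is a minimal sketch in plain Python, assuming the raw store is just a list of text lines; RAW_EVENTS and read_with_schema are made-up names for illustration and are not part of Hadoop.

```python
import json

# Hypothetical raw data: stored as-is, with no schema enforced at write time.
RAW_EVENTS = [
    '{"user": "ana", "action": "click", "ms": 120}',
    '{"user": "raj", "action": "view"}',
    "not json at all",
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep the requested fields, fill gaps with None."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines stay in storage but are skipped on read
        if isinstance(record, dict):
            yield {name: record.get(name) for name in fields}


if __name__ == "__main__":
    for row in read_with_schema(RAW_EVENTS, ["user", "ms"]):
        print(row)
```

A different analysis could read the same raw lines with a different list of fields, without rewriting or reloading anything that was stored.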

Who Uses Hadoop?

The Apache Software Foundation maintains a list of companies using Hadoop, and usage goes beyond powering search engines or analyzing customer behavior to better target ads. Here’s how some big names are using Hadoop:

eBay uses Hadoop for search optimization.
The University of Maryland uses it as part of the Google/IBM academic cloud computing initiative.
At Facebook, Hadoop is used to store copies of internal log and dimension data sources and as a source for not only reporting and analytics but also machine learning.
Hadoop powers LinkedIn’s People You May Know feature.
Hadoop enables Opower to suggest ways for consumers to save money on energy bills.
To determine user preferences, Orbitz uses Hadoop to analyze every aspect of visitors’ sessions on its sites.
Spotify uses Hadoop for content generation and for data aggregation, reporting, and analysis.
Twitter uses Hadoop to store and process tweets and log files.
Yahoo! has more than 40,000 computers running Hadoop to support research for Ad Systems and Web Search.

Learn Hadoop

You’ve got lots of choices for online Hadoop training. Here are some options to consider:

MapR Technologies, the provider of a leading Hadoop distribution, offers free full-length, on-demand courses on a range of Hadoop technologies. Developers, data analysts, and administrators alike can learn Hadoop through interactive labs and quizzes.
MapR’s competitor Cloudera also offers online training. Its free video training sessions are taught by industry-leading Hadoop experts.
Hadoop 101 is just one of the Hadoop courses on offer from Big Data University. It will teach you the basics, after which you can dig deeper into such Hadoop technologies as Hive, HBase, Pig, Oozie, and ZooKeeper.
Udemy offers more than 30 courses on Hadoop, with titles such as Become a Certified Hadoop Developer and Hadoop Made Very Easy. The beginner-level courses Big Data and Hadoop Essentials, Basic overview of Big Data Hadoop, and Hadoop Starter Kit are free.

Last updated: June 2020