Open Source Tools for Big Data Analysis
What Is “Big Data?”
We live in a golden age of data. We and many of the products we own (computers, smartphones, household appliances, cars and many more) generate a huge amount of data every day. In 2017, The Economist described data as “the world’s most valuable resource,” and since then data has been described as “the new oil.” Every company has an enormous amount of data on its products and their usage, customers, users, employees and more. Analyzing said data is extremely important in making data-driven decisions, solving problems, improving revenue, preventing customer churn, etc. Two questions arise: What is big data? How does it differ from “regular” data?
Sponsored Schools
SPONSORED
Oracle describes big data as data sets that “are so voluminous that traditional data processing software just can’t manage them.” Size isn’t the only unique feature, though. Big data is often characterized by the “Five V’s”:
- Volume: The amount or the quantity of the data that’s stored, generated or analyzed. The size of the data will determine if it can be considered “big data.”
- Velocity: The rate or the speed at which companies or organizations collect, generate or stream the data.
- Variety: The type and nature of the data. Data can be collected from many different sources: text data (reviews, emails, tweets, posts, etc.), image data, videos, audio and more — and all of them can have many forms.
- Veracity: The quality of the data. Is the data “clean” or “messy?” Is it missing a significant amount of information and/or variables?
- Value: The potential of the data. Can the data be used to solve a business problem and/or to be analyzed in a way that will lead to data-driven decisions?
If traditional data processing isn’t enough to manage big data, what are the other options? We will focus on some open source tools for big data analysis and analytics. Open source software is a category of software for which the original source code is made freely available and may be redistributed and modified according to the requirement of the user.
Frameworks
Hadoop
Apache Hadoop is an assortment of open source software for distributed and parallelized computing, specifically for the task of analyzing and processing large data sets. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce. It allows multiple computers to distribute file storage (also known as “Clustered File System”) and process big data by utilizing the MapReduce algorithm.
Hadoop splits files into large blocks and distributes them across nodes in a cluster. The Hadoop ecosystem contains different subprojects (tools) that are used to help Hadoop modules and offer that functionality. In particular:
- HDFS is a distributed, scalable and portable file system written in Java for the Hadoop framework.
- YARN manages computing resources in clusters and uses them to schedule users’ applications. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons (programs that run in the background). The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM).
- Hive is a platform used to develop SQL TypeScripts to do MapReduce operations. Hive enables analysis of large data sets using a language very similar to standard ANSI SQL. This means anyone who can write SQL queries can access data stored on the Hadoop cluster. Hive offers a simple interface for log processing, text mining, document indexing, customer-facing business intelligence (e.g., Google Analytics), predictive modeling and hypothesis testing.
Learn more about how to use Hadoop for data science.
Spark
No discussion of open source big data analysis tools would be complete without Apache Spark. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance. It can run on Hadoop as a replacement for MapReduce. Spark was developed in response to limitations in the MapReduce cluster computing paradigm. In MapReduce, data is read from the disk, and then a function is mapped across the data. Then, the map results are reduced and stored back to HDFS. Spark relaxes the constraints of MapReduce by doing the following:
- Generalizes computation from MapReduce-only graphs to arbitrary Directed Acyclic Graphs (DAGs)
- Removes a lot of boilerplate code present in Hadoop
- Allows tweaks to otherwise inaccessible components in Hadoop, such as the sort algorithm
- Loads data into cluster memory, rather than reading from the disk, speeding up I/O
This may make Spark preferable for certain use cases, especially ingesting streaming data, doing real-time interactive analytics and running machine learning algorithms.
Apache Spark is an analytics engine for large scale data and can be run using different languages like Python, R, Java and Scala, and it also supports different tools for SQL. Spark provides functionality for data processing and analysis, machine learning, graph processing and structured processing.
Programming Languages
Python
Python is an interpreted, general purpose and open-source programming language created by Guido van Rossum and released in 1991. Thanks to different libraries like NumPy for scientific computing, pandas for data analysis, Matplotlib for data visualization and more, Python has become the programming language of choice for many when it comes to data analysis, data science and machine learning. Python (or Python libraries) can be used to utilize big data frameworks. For example:
- PySpark is the Python API for Spark (remember, Spark is written in Scala. See below for more information on Scala). It can be used for exploratory data analysis (EDA) on a large scale, building machine learning pipelines and models, interacting with Resilient Distributed Datasets (RDDs) and more.
- Mrjob is an open source MapReduce Python package/wrapper that was developed by Yelp. This package helps writing and running Hadoop streaming jobs.
R
R is a free software programming language for statistical computing, data analysis and data visualizations. It also has its own software environment and a built-in command line interface, although many users tend to use RStudio. R is very common among statisticians (it is described as being developed by statisticians for statisticians) and is widely used in academic settings. Just like any other programming language, R’s capabilities can be expanded by installing different libraries, thus allowing big data analysis capabilities. Some of these packages are:
- pbdR (Programming with Big Data in R) is a series of R packages for data analysis and statistical programming for big data. pdbR uses distributed systems, meaning data are split across different systems/computers, and allows for processing very large data sets.
- MapReduce library in R: As the name implies, this package allows users to utilize the MapReduce algorithm in R.
- The “SparkR” can be used as a front end for Apache Spark in R.
Scala
Scala is another open source, general purpose programming language. Its name is a portmanteau of “Scalable” and “Language.” Scala is both an object-oriented and functional programming language. The Apache Spark framework is written in Scala, which made Scala gain popularity as a language for data analysis for big data. In addition, Scala is the language that a number of other big data and data analysis frameworks are written in, including Apache Kafka (a distributed streaming platform for handling real-time data feeds), Apache Samza (a stream processing framework), Apache scalding (a Scala API for Cascading, an abstraction of MapReduce) and Apache Flink (a framework for distributed stream and batch data processing), to name a few.
Julia
Julia is a general-purpose programming language, but it is very well suited for data analysis and computational science. Julia is provided under the MIT license and is free for everyone to use. All source code is publicly viewable on GitHub. Julia is a general-purpose language, but its developers say that it is aimed specifically at scientific computing, machine learning, data mining, large-scale linear algebra, and distributed and parallel computing. An important feature here for big data analysis is Julia’s excellent support of distributed and parallel computing, allowing for faster and more efficient data analysis and machine learning modeling with big data sets. Julia has support for “traditional” data frames but also supports working with much larger data sets (think terabytes of data) using the JuliaDB package. In addition, Julia can work with almost all databases using JDBC.jl and ODBC.jl drivers. It also integrates with the Hadoop ecosystem using Spark.jl, HDFS.jl and Hive.jl.
Database Languages
SQL
SQL stands for Structured Query Language and is probably the most common language for managing and querying data from databases; check out Stack Overflow’s Developer Survey, one of the biggest amateur and professional developers surveys in the world. SQL is the third most popular programming, scripting and markup language for both non-professional and professional developers. In addition, the top four most commonly used databases are a “flavor” of SQL: MySQL, PostgreSQL, Microsoft SQL Server and SQLite. In SQL databases, data is structured — organized in tables of rows and columns. SQL-style databases use tabular data structures with relations connecting tables. There are many different versions of SQL style databases: Oracle, Microsoft, PostgreSQL, MySQL, etc. Some of them are free/open source (Postgres, MySQL) while others are proprietary (Oracle, Microsoft SQL). There are some differences, but the SQL language used for each type of server is rather similar.
Presto
Presto is an open source distributed SQL query engine for big data for running queries on large-scale databases with gigabytes to petabytes of data. Presto can interact with multiple data sources, including Hive, Cassandra, relational databases or even proprietary data stores. A neat feature is that users are able to combine data from multiple sources using a single Presto query.
NoSQL
The opposite of SQL, NoSQL databases store data (for the most part) in a non-structured manner with unique ways of interacting with it. NoSQL databases are increasingly used for big data. Some examples of NoSQL-style databases are key-value, column-based and graph-based databases. There are also document-based databases, which store semi-structured data. A popular document-based database is MongoDB, which stores information as a JSON file. There are two editions of MongoDB, with one of them freely available as part of the open source community.
Visualization
Apache Superset
Apache Superset is a business intelligence web application that can be used for data exploration and data visualization. Superset was originally developed by Airbnb, and they later made it available to the open source community. It can work well with various web servers (Gunicorn, Nginx, Apache), database engines (MySQL, PostgresSQL, MariaDB), results backend (AWS S3, Redis, Memcached), caching layer (Memcached, Redis, etc.) and services like NewRelic, StatsD and DataDog. Superset also has the ability to run analytics against most popular database technologies. It is written in Python, so Python is needed to work with it.
Apache Zeppelin
Apache Zeppelin is an open source, web-based notebook (similar to Jupyter Notebook) for interactive data analysis and exploration visualization of large-scale data. Zeppelin integrates with other frameworks, like Apache Spark and Apache Flink. Apache Zeppelin allows users to create interactive visualizations with Python, Scala, SQL or R.
Last updated: August 2020