Three options to work with Spark without paying a penny!


Spark, an open-source distributed cluster-computing framework, is widely used to handle big data. In this article, we will show you how to use Spark with free services, which is useful for practice and educational purposes. Once you are comfortable, you will be ready to pay for more powerful hardware for your bigger projects.

 

Big data, as the name suggests, is characterized by volume, velocity and variety (the 3V model), and therefore requires different approaches than standard data analysis and processing. In practice, and somewhat simplified, you have a big data problem if your dataset is larger than the amount of memory on your local PC; working on such a dataset requires a different approach than your standard (small data) analysis routines. Instead of running the code on a single local machine that can only handle data smaller than its memory, big data is analyzed on a group of machines (a cluster) that together provide much more memory. Apache Spark is an open-source cluster-computing framework written in Scala that lets you process big data by programming clusters on a cloud-based system like Google Cloud, Microsoft Azure or Amazon AWS, with support for Scala, R, Python, Java and SQL queries.

 


Which operating system should you use? You can pick any of Windows, macOS and Linux, but since almost all cluster services are Linux-based, a Linux distribution is recommended for this purpose. For example, you can install Ubuntu or an Ubuntu-based distribution like Kubuntu, Lubuntu or Zorin OS if you want a fancy desktop environment.

Spark provides APIs for Scala, Python, R and Java, along with libraries for each of these languages. Since Scala is Spark's native language, it is typically faster than the others, but it can be harder to master due to its more arcane syntax. The first suggestion for data science tasks is therefore (as usual) Python, because of its simplicity and the wide variety of libraries for working with data. If you want access to new Spark features as early as possible, Java is also an option, since new functionality typically reaches the JVM APIs before Python and R.
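To give a flavour of what working with Spark from Python looks like, here is a minimal PySpark sketch that starts a SparkSession, reads a CSV file and runs a simple aggregation. The file name and column names are made up for illustration; any of the free setups described below can run something like this.

```python
# Minimal PySpark sketch: the file "sales.csv" and its columns
# ("region", "amount") are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("quick-start")
    .getOrCreate()
)

# Read a CSV file into a DataFrame, letting Spark infer column types.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Group by a column and compute a simple aggregate.
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```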

Spark on Amazon Elastic Compute Cloud (EC2)

Amazon EC2 is a web service that provides resizable cloud-based computing capacity (essentially virtual machines) and is the choice for anyone who wants to work with AWS cloud services. Amazon also offers a paid service called AWS EMR, a scalable and flexible environment for running clusters sized by the computing power and cost you require. EC2 itself is not free, but you can sign up for the AWS Free Tier to get one year of limited services at no cost. Here is what you need to do:

1. Create an AWS Free Tier account via https://aws.amazon.com/free. Note that you need to fill in billing information, but don't worry: it is free up to a point. Make sure you understand the limits of the Free Tier, as it is not free forever!

2. Create an EC2 instance on AWS. See https://docs.aws.amazon.com/efs/latest/ug/gs-step-one-create-ec2-resources.html for more details.

3. Connect to the EC2 instance over a secure shell (SSH) connection. This is done with a private key file (.pem) and the public DNS address of the instance, both of which you can obtain from your AWS console, plus a few commands in the terminal.

4. Finally, set up Spark and Jupyter on the EC2 instance by installing the required packages through the connected terminal. Once everything is installed, you can verify the setup with a quick PySpark job, as sketched after this list.
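Once Spark and Jupyter are running on the instance, a quick sanity check like the sketch below, run from a notebook cell, confirms that Spark responds. The use of the findspark package is an assumption about how Spark was installed; if PySpark is already on your Python path you can skip those two lines.

```python
# Sanity check to run in a Jupyter notebook on the EC2 instance.
# findspark is an optional helper that locates the Spark installation
# via SPARK_HOME; whether you need it depends on your setup.
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ec2-check").getOrCreate()

# Parallelize a small range and count it to confirm Spark is working.
print(spark.sparkContext.parallelize(range(1000)).count())  # expect 1000

spark.stop()
```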

Spark on Databricks

Databricks was founded by the creators of Spark. It provides clusters that run on AWS or Azure ("Azure Databricks"), uses a built-in notebook system (very similar to Jupyter notebooks), and gives you access to data in cloud storage as well as files uploaded from your local machine. Getting started is straightforward: sign up online at https://databricks.com/try-databricks. Note again that it is not a completely free service; however, the free Community Edition provides a small cluster with 6 GB of memory. Databricks is the choice for anyone who wants a fast, browser-based way to work with Spark. See https://docs.databricks.com/getting-started/quick-start.html for more details.
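As a taste of the notebook workflow, here is a short sketch of what a Databricks notebook cell might look like. In Databricks notebooks the `spark` session and the `display` helper are provided for you; the sample data below is made up for illustration.

```python
# Sketch of a Databricks notebook cell: `spark` and `display` are
# pre-defined in the notebook environment, so no session setup is needed.
from pyspark.sql import Row

rows = [Row(name="Alice", score=82), Row(name="Bob", score=75)]
df = spark.createDataFrame(rows)

display(df)                      # Databricks' built-in table/plot renderer
df.filter(df.score > 80).show()  # standard PySpark also works as usual
```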

Spark on a local machine

You can also run Spark on a local machine. All you need to do is download the Spark release of your choice from https://spark.apache.org/downloads.html and install the required dependencies, including a Java Runtime Environment (JRE) and Scala. See https://medium.com/beeranddiapers/installing-apache-spark-on-ubuntu-8796bfdd0861 for a step-by-step installation guide.
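After installation, a small script like the following sketch can confirm that Spark runs in local mode. It assumes PySpark is importable from your Python environment, for example via `pip install pyspark` or by pointing SPARK_HOME at your unpacked Spark directory.

```python
# Quick check that a local Spark installation works.
from pyspark.sql import SparkSession

# "local[*]" runs Spark on your own machine using all available CPU cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-test")
    .getOrCreate()
)

print(spark.version)
print(spark.range(1_000_000).selectExpr("sum(id)").collect())

spark.stop()
```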

Finally, it should be noted that these free services are intended for educational purposes and hands-on practice with modest datasets. They do not provide enough computing power or storage to handle truly big data, so you will eventually need a paid service to run your final code at scale.