Apache Spark

Spark definition Apache Spark is a free, open-source distributed computing system that can process large data sets across a set of machines simultaneously, providing horizontal scalability and fault tolerance. To deliver these features it provides a programming model that lets you run code in a distributed way, so that each machine…
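
A minimal sketch of that model in PySpark (the app name and the local master are illustrative assumptions, not taken from the post): the same few lines run unchanged on one machine or on a whole cluster.

from pyspark import SparkContext

sc = SparkContext("local[*]", "spark-definition-example")

# Spark splits the collection into partitions and processes them
# in parallel across the available workers.
total = sc.parallelize(range(1, 1000)).map(lambda x: x * x).sum()
print(total)

sc.stop()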

Read More »

Apache Spark Components

Components Spark Core Spark Core is the foundation that supports the whole architecture; it provides: task distribution, scheduling, and input/output operations, through Java, Python, Scala and R programming interfaces focused on the RDD abstraction. It establishes a functional model that lets you invoke operations in parallel, such as map, filter or reduce, on an RDD, for which it…
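
As a hedged sketch of that functional model (assuming a local SparkContext; names and data are illustrative), map, filter and reduce can be chained on an RDD like this:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-operations-example")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

evens = rdd.filter(lambda x: x % 2 == 0)    # keep even numbers
doubled = evens.map(lambda x: x * 2)        # transform each element
total = doubled.reduce(lambda a, b: a + b)  # combine into one value

print(total)  # (2 + 4 + 6) * 2 = 24
sc.stop()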

Read More »

Install Hortonworks in Virtual Box for Spark

Download Hortonworks Data Platform (HDP) Sandbox VirtualBox installation First install VirtualBox; once it is installed, open the Hortonworks virtual machine image and run it, which will import the machine into VirtualBox. Then configure the machine's resources; note that it needs at least 8 GB of RAM. Hortonworks Configuration Once the machine…

Read More »

Read CSV in Databricks in Spark

Load CSV in Databricks Databricks Community Edition provides a graphical interface for loading files, accessed under Data > Create New Table. Once inside, the following fields must be filled in: Upload to DBFS: the name of the file to load. Select a cluster to preview the Table: the cluster on which to perform the…
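
The same load can also be done programmatically with the DataFrame reader; this is a hedged sketch in which the DBFS path /FileStore/tables/example.csv is a hypothetical location, not one from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-csv-example").getOrCreate()

# header: take column names from the first line.
# inferSchema: guess column types instead of reading everything as strings.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/FileStore/tables/example.csv"))

df.show(5)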

Read More »

Use of pipelines in Apache Spark in Python

Example of pipeline concatenation This example shows how elements are added to a pipeline in such a way that they all finally converge at the same point, which we call "features". from pyspark.ml import Pipeline from pyspark.ml.feature import VectorAssembler # Define the Spark DF to use df = spark.createDataFrame([ ('line_1',…
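
For reference, a self-contained sketch of the same idea (the column names and sample rows are illustrative assumptions, not the data from the post): a VectorAssembler stage inside a Pipeline merging numeric columns into a single "features" vector.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("pipeline-example").getOrCreate()

df = spark.createDataFrame(
    [("line_1", 1.0, 2.0), ("line_2", 3.0, 4.0)],
    ["id", "x", "y"],
)

# The assembler is the point where the columns converge into
# one vector column called "features".
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")

pipeline = Pipeline(stages=[assembler])
model = pipeline.fit(df)
model.transform(df).show()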

Read More »

Apache Spark libraries and installation in Python

Prerequisites Java 6 or higher Python interpreter 2.6 or higher Installation Installation is very simple: just download the latest version of Spark and unpack it. wget http://apache.rediris.es/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz tar -xf spark-1.5.0-bin-hadoop2.6.tgz Interpreter execution It can be run through the PySpark interpreter or by loading a .py file. ./spark-1.5.0-bin-hadoop2.6/bin/pyspark from pyspark import SparkConf, SparkContext sc = SparkContext()…
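
A hedged completion of the truncated snippet: creating the SparkContext with an explicit configuration and running a trivial job to verify the installation (the app name and master are illustrative assumptions).

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("installation-test").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Quick smoke test: count a small distributed collection.
print(sc.parallelize([1, 2, 3]).count())  # 3

sc.stop()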

Read More »