Read and write in Parquet format in Python

Generate data to use for reading and writing in Parquet format

An example of random data to use in the following sections:

    import random

    data = []
    for x in range(5):
        data.append((random.randint(0, 9), random.randint(0, 9)))
    df = spark.createDataFrame(data, ("label", "data"))
    df.show()

    +-----+----+
    |label|data|
    +-----+----+
    |    4|   0|
    |    7|   0|
    |    1|   1|
    |    3|   8|
    |    3|   5|
    +-----+----+
…
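Judging by the title, the truncated part covers the actual Parquet I/O. A minimal sketch with the standard PySpark DataFrameWriter/DataFrameReader API, where the output path /tmp/example.parquet is an illustrative assumption:

    # Write the DataFrame to Parquet; the path is a hypothetical example
    df.write.mode("overwrite").parquet("/tmp/example.parquet")

    # Read it back into a new DataFrame and verify the contents
    df2 = spark.read.parquet("/tmp/example.parquet")
    df2.show()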

Read More »

Read & write JSON in Python

Generate data to use to read & write JSON

An example of random data to use in the following sections:

    import random

    data = []
    for x in range(5):
        data.append((random.randint(0, 9), random.randint(0, 9)))
    df = spark.createDataFrame(data, ("label", "data"))
    df.show()

    +-----+----+
    |label|data|
    +-----+----+
    |    4|   0|
    |    7|   0|
    |    1|   1|
    |    3|   8|
    |    3|   5|
    +-----+----+

Write data…
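The excerpt cuts off at "Write data". A minimal sketch of the JSON output and input with the standard PySpark API, where the path /tmp/example_json is an illustrative assumption:

    # Write the DataFrame as line-delimited JSON; the path is hypothetical
    df.write.mode("overwrite").json("/tmp/example_json")

    # Read the JSON files back into a DataFrame
    df2 = spark.read.json("/tmp/example_json")
    df2.show()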

Read More »

Dates in Python

Create a date from a string

    import pandas as pd

    startdate = "10/10/2018"
    my_date = pd.to_datetime(startdate)
    print(my_date.strftime("%Y-%m-%d"))
    2018-10-10

Create the current date

    import datetime

    my_date = datetime.datetime.now()
    print(my_date.strftime("%Y-%m-%d"))
    2018-10-10

Increase days

    enddate = my_date + pd.DateOffset(days=5)
    print(enddate.strftime("%Y-%m-%d"))
    2018-10-15

Decrease days

    enddate = my_date - pd.DateOffset(days=5)
    print(enddate.strftime("%Y-%m-%d"))
    2018-10-05

Convert a date to numeric: Unix timestamp

    import time

    print("Unix Timestamp: ", (time.mktime(my_date.timetuple())))
…
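To complete the round trip, the Unix timestamp can be turned back into a date with datetime.datetime.fromtimestamp; a minimal sketch continuing from the variables above:

    import datetime
    import time

    # Seconds since the epoch, as produced by time.mktime above
    ts = time.mktime(my_date.timetuple())

    # Convert the numeric timestamp back into a datetime object
    restored = datetime.datetime.fromtimestamp(ts)
    print(restored.strftime("%Y-%m-%d"))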

Read More »

Read CSV in Databricks with Spark

Load CSV in Databricks

Databricks Community Edition provides a graphical interface for loading files, accessed via Data > Create New Table. Once inside, the following fields must be filled in: Upload to DBFS: the name of the file to load. Select a cluster to preview the table: the cluster on which to perform the…
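The same load can also be done in code rather than through the interface; a minimal sketch with the standard PySpark CSV reader, assuming a hypothetical DBFS path for the uploaded file:

    # Hypothetical DBFS path where an uploaded file would land
    path = "/FileStore/tables/my_file.csv"

    # header and inferSchema mirror the options offered in the UI
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.show()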

Read More »

Use of pipelines in Apache Spark in Python

Example of pipeline concatenation

This example shows how elements are included in a pipeline in such a way that they all finally converge on the same point, which we call "features".

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    # Define the Spark DF to use
    df = spark.createDataFrame([
        ('line_1',…
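Since the excerpt truncates the DataFrame definition, here is a self-contained sketch of the same pattern: a VectorAssembler stage inside a Pipeline that merges numeric columns into a single "features" vector. The column names and rows are illustrative assumptions, not the post's actual data:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    # Illustrative data; the original post's rows are cut off in the excerpt
    df = spark.createDataFrame(
        [("line_1", 1.0, 2.0), ("line_2", 3.0, 4.0)],
        ("id", "x1", "x2"),
    )

    # All numeric columns converge into one "features" vector column
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    pipeline = Pipeline(stages=[assembler])

    pipeline.fit(df).transform(df).show()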

Read More »

Apache Spark libraries and installation in Python

Prerequisites

Java 6 or higher. Python interpreter 2.6 or higher.

Installation

Installation is very simple: just download the latest version of Spark and unzip it.

    wget http://apache.rediris.es/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
    tar -xf spark-1.5.0-bin-hadoop2.6.tgz

Interpreter execution

Spark can be run through the PySpark interpreter or by loading a .py file.

    ./spark-1.5.0-bin-hadoop2.6/bin/pyspark

    from pyspark import SparkConf, SparkContext
    sc = SparkContext()…
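As a quick smoke test once everything is unpacked, counting a small in-memory RDD confirms the installation works; a minimal sketch (the app name is an arbitrary choice):

    from pyspark import SparkConf, SparkContext

    # Standalone script variant; inside the pyspark shell, sc already exists
    conf = SparkConf().setAppName("smoke-test")
    sc = SparkContext(conf=conf)

    # Distribute a small range and count it to verify the install
    rdd = sc.parallelize(range(100))
    print(rdd.count())  # expected: 100

    sc.stop()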

Read More »