Read and write in parquet format in Python

Generate data to use for reading and writing in parquet format

Example of random data to use in the following sections:

data = []
for x in range(5):
    data.append((random.randint(0,9), random.randint(0,9)))
df = spark.createDataFrame(data, ("label", "data"))
df.show()

+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+…

Read More »

Read & write JSON in Python

Generate data to use to read & write JSON

Example of random data to use in the following sections:

data = []
for x in range(5):
    data.append((random.randint(0,9), random.randint(0,9)))
df = spark.createDataFrame(data, ("label", "data"))
df.show()

+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+

Write data…
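The write/read step is truncated in the excerpt. A plain-Python sketch of the same JSON round trip using only the standard library's json module (the post itself uses Spark):

```python
# Sketch only: standard-library json instead of the post's Spark API.
import json
import random

random.seed(0)  # seed added here for reproducibility
rows = [{"label": random.randint(0, 9), "data": random.randint(0, 9)}
        for _ in range(5)]

with open("example.json", "w") as f:
    json.dump(rows, f)        # write the records as a JSON array

with open("example.json") as f:
    loaded = json.load(f)     # read them back

print(loaded == rows)  # prints True
```

The file name `example.json` is an illustrative placeholder.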

Read More »

Apache Hadoop YARN

Yarn definition

Yarn (Yet Another Resource Negotiator) is a data operating system and distributed resource manager, also known as Hadoop 2, as it is the evolution of Hadoop MapReduce. The most significant change of Hadoop 2 over Hadoop 1 is the inclusion of this technology, which provides an effective allocation of resources, for…

Read More »

Group dataframe elements in Scala

Example: Grouping data in a simple way

Example where the people table is grouped by surname (last name).

df.groupBy("surname").count().show()

+-------+-----+
|surname|count|
+-------+-----+
| Martin|    1|
| Garcia|    3|
+-------+-----+

Example: grouping data combined with a filter

Example where the people table is grouped by surname and those with more than 2 appearances are selected.

df.groupBy("surname").count().filter("count > 2").show()…
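The group-and-count that the Scala snippet performs can be illustrated in plain Python with the standard library (a sketch of the semantics, not the post's code; the sample surnames are chosen to match the counts above):

```python
# Group-by-surname count, plain Python equivalent of groupBy("surname").count().
from collections import Counter

surnames = ["Garcia", "Garcia", "Martin", "Garcia"]  # illustrative sample
counts = Counter(surnames)
print(counts["Garcia"])  # 3
print(counts["Martin"])  # 1

# Equivalent of .filter("count > 2"): keep surnames appearing more than twice.
frequent = {s: c for s, c in counts.items() if c > 2}
print(frequent)  # {'Garcia': 3}
```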

Read More »

Kerberos

Kerberos definition

Kerberos is an authentication protocol that allows two computers to prove their identity to each other in a secure way. It is implemented on a client-server architecture and works on the basis of tickets that serve to prove the identity of users. Authentication between two computers is carried out using a trusted third party called…

Read More »

Generate a Kerberos authentication keytab in a Hadoop cluster

Access the cluster by SSH

ssh user_name@server_cluster_name

Authenticate in the shell

kinit user_name@REINO.COM

If authentication is successful, we will receive a ticket-granting ticket (TGT) from the KDC. This means that we have authenticated with the server, but we have not yet received permission to access any service.

Browse the ticket cache

To verify that we have…
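The authentication steps above can be sketched as a small script. The `DRY_RUN` guard is added here so the sketch runs on a machine without Kerberos; the principal is the excerpt's placeholder:

```shell
#!/bin/sh
# Sketch of the kinit/klist steps; set DRY_RUN=0 only on a real Kerberos client.
PRINCIPAL="user_name@REINO.COM"
DRY_RUN=1

if [ "$DRY_RUN" -eq 1 ]; then
  echo "would run: kinit $PRINCIPAL"
  echo "would run: klist"
else
  kinit "$PRINCIPAL"   # request a ticket-granting ticket (TGT) from the KDC
  klist                # browse the ticket cache to verify the TGT
fi
```

`klist` is the standard command for inspecting the credential cache mentioned in the "Browse the ticket cache" step.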

Read More »

Scala Filter DataFrame

Filter data with like

Filtering is done to select the people whose surname contains "Garc" and whose age is under 30.

val df = sc.parallelize(Seq(
  ("Paco","Garcia",24,24000,"2018-08-06 00:00:00"),
  ("Juan","Garcia",26,27000,"2018-08-07 00:00:00"),
  ("Ana","Martin",28,28000,"2018-08-14 00:00:00"),
  ("Lola","Martin",29,31000,"2018-08-18 00:00:00"),
  ("Sara","Garcia",35,34000,"2018-08-20 00:00:00")
)).toDF("name","surname","age","salary","reg_date")

val type_df = df.select($"name", $"surname", $"age", $"salary",
  unix_timestamp($"reg_date", "yyyy-MM-dd HH:mm:ss").cast(TimestampType).as("timestamp"))
type_df.show()

val filter_df = type_df.filter("surname like 'Garc%' AND age < 30")…
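The filter condition (`surname like 'Garc%' AND age < 30`) can be sketched in plain Python on the same five rows, to show which people it selects:

```python
# Plain-Python equivalent of: surname LIKE 'Garc%' AND age < 30.
people = [
    ("Paco", "Garcia", 24), ("Juan", "Garcia", 26),
    ("Ana", "Martin", 28), ("Lola", "Martin", 29),
    ("Sara", "Garcia", 35),
]
selected = [p for p in people if p[1].startswith("Garc") and p[2] < 30]
print(selected)  # [('Paco', 'Garcia', 24), ('Juan', 'Garcia', 26)]
```

Sara Garcia is excluded by the age condition even though her surname matches.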

Read More »

Apache Sqoop Examples

Prerequisites of Apache Sqoop Examples

The prerequisites for these examples are the same as for the previous post on Sqoop. These examples create a database "myddbb", a table "mytable" populated with values, and another empty table "mytable2".

Example of loading data from MySQL to HDFS (compression: Snappy, format: Avro)

$ sqoop import \
…
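The full command is truncated in the excerpt. One possible shape of a Snappy-compressed Avro import is sketched below; the flags are standard Sqoop options, but the connection string, credentials, and target path are illustrative placeholders, not values from the post:

```shell
#!/bin/sh
# Sketch of a Sqoop import producing Snappy-compressed Avro files in HDFS.
CODEC="org.apache.hadoop.io.compress.SnappyCodec"

if command -v sqoop >/dev/null 2>&1; then
  sqoop import \
    --connect jdbc:mysql://localhost/myddbb \
    --username myuser -P \
    --table mytable \
    --target-dir /user/myuser/mytable \
    --as-avrodatafile \
    --compression-codec "$CODEC"
else
  echo "sqoop not installed; the command above is only a sketch"
fi
```

`-P` prompts for the database password interactively; `--as-avrodatafile` selects the Avro output format.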

Read More »

Linear Regression in Scala

The following post shows the steps to recreate an example of linear regression in Scala.

Set the data set

Define the set of data to apply to the model.

import org.apache.spark.ml.linalg.Vectors
val df = spark.createDataFrame(Seq(
  (0, 60), (0, 56), (0, 54), (0, 62), (0, 61),
  (0, 53), (0, 55), (0, 62), (0, 64), (1, 73),
…
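As a pure-Python illustration of what a fitted line on this data looks like, here is ordinary least squares on the ten points visible in the excerpt (the full data set is truncated there), treating the first tuple element as x and the second as y. This is a sketch of the math, not the post's Spark ML code:

```python
# Ordinary least squares: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
points = [(0, 60), (0, 56), (0, 54), (0, 62), (0, 61),
          (0, 53), (0, 55), (0, 62), (0, 64), (1, 73)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
sxx = sum((x - mean_x) ** 2 for x, _ in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))  # prints 14.444 58.556
```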

Read More »