Group DataFrame elements in Scala

Example: grouping data in a simple way. The people table is grouped by surname:

df.groupBy("surname").count().show()

+-------+-----+
|surname|count|
+-------+-----+
| Martin|    1|
| Garcia|    3|
+-------+-----+

Example: grouping combined with a filter. The people table is grouped by surname, and only the surnames with more than 2 appearances are selected:

df.groupBy("surname").count().filter("count > 2").show()…
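The same group-and-count idea can be sketched with plain Scala collections, which is handy for checking expected results without a Spark session. This is an analog, not the DataFrame API; the sample surnames match the counts in the table above.

```scala
// Plain-Scala analog of df.groupBy("surname").count() (no Spark needed).
val surnames = List("Martin", "Garcia", "Garcia", "Garcia")

// Map each surname to its number of appearances.
val counts: Map[String, Int] =
  surnames.groupBy(identity).map { case (s, occurrences) => (s, occurrences.size) }

// Keep only surnames with more than 2 appearances,
// mirroring the filter("count > 2") step.
val frequent: Map[String, Int] = counts.filter { case (_, n) => n > 2 }
```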

Read More »

Scala Filter DataFrame

Filter data with like
Filtering selects the people whose surname starts with "Garc" and whose age is under 30.

import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

val df = sc.parallelize(Seq(
  ("Paco", "Garcia", 24, 24000, "2018-08-06 00:00:00"),
  ("Juan", "Garcia", 26, 27000, "2018-08-07 00:00:00"),
  ("Ana",  "Martin", 28, 28000, "2018-08-14 00:00:00"),
  ("Lola", "Martin", 29, 31000, "2018-08-18 00:00:00"),
  ("Sara", "Garcia", 35, 34000, "2018-08-20 00:00:00")
)).toDF("name", "surname", "age", "salary", "reg_date")

val type_df = df.select($"name", $"surname", $"age", $"salary",
  unix_timestamp($"reg_date", "yyyy-MM-dd HH:mm:ss").cast(TimestampType).as("timestamp"))
type_df.show()

val filter_df = type_df.filter("surname like 'Garc%' AND age < 30")…
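The same predicate can be sketched over plain Scala collections, which makes the filter's semantics easy to verify without Spark. The Person case class here is introduced just for this illustration.

```scala
// Plain-Scala analog of filter("surname like 'Garc%' AND age < 30").
case class Person(name: String, surname: String, age: Int)

val people = List(
  Person("Paco", "Garcia", 24),
  Person("Juan", "Garcia", 26),
  Person("Ana",  "Martin", 28),
  Person("Lola", "Martin", 29),
  Person("Sara", "Garcia", 35)
)

// like 'Garc%' means "starts with Garc"; combine it with the age bound.
val selected = people.filter(p => p.surname.startsWith("Garc") && p.age < 30)
```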

Read More »

Linear Regression in Scala

The following post shows the steps to recreate a linear regression example in Scala.

Set up the data set
Defines the data set to fit the model on.

import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (0, 60), (0, 56), (0, 54), (0, 62), (0, 61),
  (0, 53), (0, 55), (0, 62), (0, 64), (1, 73),…
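As a minimal illustration of what linear regression fits, the closed-form least-squares line y = slope·x + intercept can be computed in plain Scala (this is a sketch of the underlying math, not the Spark ML API, and the toy points below are made up for the example):

```scala
// Toy data: y is exactly 2x, so the fitted line should be slope 2, intercept 0.
val points = List((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))

val n = points.size
val xMean = points.map(_._1).sum / n
val yMean = points.map(_._2).sum / n

// slope = sum((x - xMean)(y - yMean)) / sum((x - xMean)^2)
val slope = points.map { case (x, y) => (x - xMean) * (y - yMean) }.sum /
            points.map { case (x, _) => (x - xMean) * (x - xMean) }.sum
val intercept = yMean - slope * xMean
```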

Read More »

Scala DataFrames

Create DataFrames
Example of how to create a DataFrame in Scala.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val data = List(
  Row("Peter", "Garcia", 24, 24000),
  Row("Juan",  "Garcia", 26, 27000),
  Row("Lola",  "Martin", 29, 31000),
  Row("Sara",  "Garcia", 35, 34000)
)
val rdd = sc.parallelize(data)

val schema = StructType(List(
  StructField("name", StringType, nullable = false),
  StructField("surname", StringType, nullable = false),
  StructField("age", IntegerType),
  StructField("salary", IntegerType)
))

val df = spark.createDataFrame(rdd, schema)
df.printSchema()
df.show()

root
 |-- name:…

Read More »

Scala Lists

Create lists
Examples that define the lists used in the rest of the sections of the post.

val list1 = 1 :: 2 :: 3 :: 4 :: 5 :: Nil
val list2 = List(1, 2, 3, 4, 5)
val list3 = List.range(1, 6)
val list4 = List.range(1, 6, 2)
val list5 = List.fill(5)(1)
val list6 = List("Peter", "Tommy", "Sonia", "Mary")
val list7 = List.tabulate(5)(n => n * n)

list1: List[Int] = List(1,…
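A few standard-library operations on a list like list2 above, for reference (the value names here are chosen just for this sketch):

```scala
val nums = List(1, 2, 3, 4, 5)

val doubled = nums.map(_ * 2)            // transform each element
val evens   = nums.filter(_ % 2 == 0)    // keep elements matching a predicate
val total   = nums.foldLeft(0)(_ + _)    // reduce the list to a single value
```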

Read More »

HDFS – compress & decompress in Scala

Shows a number of examples of compressing and decompressing files in different serialization formats and compression codecs.

Compress JSON files

val rdd = sc.parallelize(Array(1, 2, 3, 4, 5)) // define the RDD
val df = rdd.toDF()                            // transform to DataFrame
df.write.mode("overwrite").format("json").save("hdfs:///formats/file_no_compression_json")
df.write.mode("overwrite").format("json").option("compression", "gzip").save("hdfs:///formats/file_with_gzip_json")
df.write.mode("overwrite").format("json").option("compression", "snappy").save("hdfs:///formats/file_with_snappy_json")

Compress Parquet files

val rdd = sc.parallelize(Array(1, 2,…
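What the "gzip" option does can be sketched with the JDK's GZIP streams on an in-memory byte array (no HDFS or Spark involved; this only illustrates the codec, not Spark's file layout):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

val original = "1 2 3 4 5".getBytes("UTF-8")

// Compress the bytes with gzip.
val bos = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(bos)
gz.write(original)
gz.close()
val compressed = bos.toByteArray

// Decompress and check that the round trip restores the original bytes.
val gin = new GZIPInputStream(new ByteArrayInputStream(compressed))
val restored = Iterator.continually(gin.read()).takeWhile(_ != -1).map(_.toByte).toArray
```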

Read More »

Spark Streaming (Batch & Streaming processing )

Spark Streaming definition Apache Spark Streaming is an extension of the Spark core API that enables scalable, high-throughput, fault-tolerant processing of real-time data. Spark Streaming was originally developed at the University of California, Berkeley; Databricks is currently responsible for supporting and improving it. As a general idea it could be…

Read More »

RDD definition

RDD definition An RDD (Resilient Distributed Dataset) represents an immutable, partitioned collection of elements that can be operated on in parallel. An RDD can be created either by parallelizing a collection of data (a list, a dictionary, …) or by loading it from an external storage system, such as a shared file system, HDFS, HBase, or any data source that offers…

Read More »