Scala DataFrames

Create DataFramesScala_logo

Example of how to create a dataframe in Scala.

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val data = List(
  Row("Peter","Garcia",24,24000),
  Row("Juan","Garcia",26,27000),
  Row("Lola","Martin",29,31000),
  Row("Sara","Garcia",35,34000)
)
val rdd = sc.parallelize(data)
val schema = StructType(
  List(
    StructField("name", StringType, nullable=false),
    StructField("surname", StringType, nullable=false),
    StructField("age", IntegerType),
    StructField("salary", IntegerType)
  )
)

val df = spark.createDataFrame(rdd,schema)

df.printSchema()

df.show()
root
 |-- name: string (nullable = false)
 |-- surname: string (nullable = false)
 |-- age: integer (nullable = true)
 |-- salary: integer (nullable = true)
 +------+--------+----+-------+
 | name | surname| age| salary|
 +------+--------+----+-------+
 | Peter| Garcia | 24 | 24000 |
 | Juan | Garcia | 26 | 27000 |
 | Lola | Martin | 29 | 31000 |
 | Sara | Garcia | 35 | 34000 |
 +------+--------+----+-------+

 

Creating Dataframe with Random data

import scala.util.Random

val df = sc.parallelize(
  Seq.fill(5){(Math.abs(Random.nextLong % 100000L),Math.abs(Random.nextLong % 100L))}
).toDF("salary" , "age")

df.show()
+------+----+ 
|salary| age|
+------+----+ 
| 41772| 17 | 
| 74772| 66 | 
| 6326 | 60 |
| 72581| 70 |
| 53037|  0 |
+-------+---+

 

Transforming RDD to Dataframe

val name_cols=Array("id", "name", "values")
val df=sc.parallelize(Seq(
(1,"Mario", Seq(0,2,5)),
(2,"Sonia", Seq(1,20,5)))).toDF(name_cols: _*)
df.show()
 +---+------+----------+
 | id| name |  values  |
 +---+------+----------+
 | 1| Mario | [0, 2, 5]|
 | 2| Sonia |[1, 20, 5]|
 +---+------+----------+

 

Transforming Dataset to Dataframe

import org.apache.spark.sql.functions._

val wordsDataset = sc.parallelize(
                     Seq("Hello world hello world", 
                         "no hello no world no more", 
                         "count words"))
                    .toDS()

val result = wordsDataset
              .flatMap(_.split(" "))               // Split sentences into words
              .filter(_ != "")                     // Filter entry words
              .map(_.toLowerCase())
              .toDF()                              // Convert to DF to add and order
              .groupBy($"value")                   // Count occurrences of words
              .agg(count("*") as "ocurrences")
              .orderBy($"ocurrences" desc)        // Show occurrences per word
result.show()
 +---------+------------+
 | value   | ocurrences |
 +---------+------------+
 | count   |  1         |
 | words   |  1         |
 | more    |  1         |
 | no      |  3         |
 | hello   |  3         |
 | world   |  3         |
 +---------+------------+

 

Transforming lists to Dataframe

val A = List("Paco","Sara","Tom","Rosa")
val B = List(1,2,3,4)
val C = List(5,6,7,8)

val zip = A.zip(B).zip(C)
val tup = zip.map{case ((w,x),y)=>(w,x,y)}
val df = tup.toDF("A","B","C")
df.show()
+----+---+---+
|   A|  B|  C|
+----+---+---+
|Paco|  1|  5|
|Sara|  2|  6|
|Tom |  3|  7|
|Rosa|  4|  8|
+----+---+---+