Filter data with like
This filter selects the people whose surname starts with “Garc” and whose age is under 30.
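import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

// Sample data; reg_date starts out as a plain string
val df = sc.parallelize(Seq(
  ("Paco", "Garcia", 24, 24000, "2018-08-06 00:00:00"),
  ("Juan", "Garcia", 26, 27000, "2018-08-07 00:00:00"),
  ("Ana", "Martin", 28, 28000, "2018-08-14 00:00:00"),
  ("Lola", "Martin", 29, 31000, "2018-08-18 00:00:00"),
  ("Sara", "Garcia", 35, 34000, "2018-08-20 00:00:00")
)).toDF("name", "surname", "age", "salary", "reg_date")
// Parse reg_date into a timestamp column
val type_df = df.select($"name", $"surname", $"age", $"salary",
  unix_timestamp($"reg_date", "yyyy-MM-dd HH:mm:ss").cast(TimestampType).as("timestamp"))
type_df.show()
// Keep the rows whose surname starts with "Garc" and whose age is below 30
val filter_df = type_df.filter("surname like 'Garc%' AND age < 30")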
filter_df.show()
+------+--------+----+-------+-------------------+
|  name| surname| age| salary|          timestamp|
+------+--------+----+-------+-------------------+
|  Paco|  Garcia|  24|  24000|2018-08-06 00:00:00|
|  Juan|  Garcia|  26|  27000|2018-08-07 00:00:00|
|   Ana|  Martin|  28|  28000|2018-08-14 00:00:00|
|  Lola|  Martin|  29|  31000|2018-08-18 00:00:00|
|  Sara|  Garcia|  35|  34000|2018-08-20 00:00:00|
+------+--------+----+-------+-------------------+

+------+--------+----+-------+-------------------+
|  name| surname| age| salary|          timestamp|
+------+--------+----+-------+-------------------+
|  Paco|  Garcia|  24|  24000|2018-08-06 00:00:00|
|  Juan|  Garcia|  26|  27000|2018-08-07 00:00:00|
+------+--------+----+-------+-------------------+
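The same condition can also be expressed with the Column API instead of a SQL string. The following is a minimal sketch, assuming a local SparkSession. The names `people`, `prefixMatch` and `anywhereMatch` are illustrative and do not come from the example above. It also contrasts `like("Garc%")`, which only matches surnames that start with "Garc", with `contains("Garc")`, which matches the substring anywhere in the value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("like-filter-sketch").getOrCreate()
import spark.implicits._

// Reduced, hypothetical copy of the sample data (only the columns the filter needs)
val people = Seq(
  ("Paco", "Garcia", 24),
  ("Juan", "Garcia", 26),
  ("Ana", "Martin", 28),
  ("Lola", "Martin", 29),
  ("Sara", "Garcia", 35)
).toDF("name", "surname", "age")

// Prefix match: surname starts with "Garc" and age is below 30
val prefixMatch = people.filter(col("surname").like("Garc%") && col("age") < 30)

// Substring match: "Garc" may appear anywhere in the surname
val anywhereMatch = people.filter(col("surname").contains("Garc") && col("age") < 30)

prefixMatch.show()
```

On this sample both filters keep Paco and Juan; they would only differ for a surname in which "Garc" does not appear at the beginning.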
Filtering data by exact match
This filter selects the people whose surname is exactly “Garcia”.
df.filter("surname == 'Garcia'").show()
+------+--------+----+-------+
|  name| surname| age| salary|
+------+--------+----+-------+
|  Paco|  Garcia|  24|  24000|
|  Juan|  Garcia|  26|  27000|
|  Sara|  Garcia|  35|  34000|
+------+--------+----+-------+
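An exact match can also be written with Column expressions. The sketch below assumes a local SparkSession and uses illustrative names (`people`, `garcias`, `garciasOrMartins`) that are not part of the original example. It uses `===` for equality against a single value and `isin` for equality against any of several values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("exact-match-sketch").getOrCreate()
import spark.implicits._

// Reduced, hypothetical copy of the sample data
val people = Seq(
  ("Paco", "Garcia", 24, 24000),
  ("Juan", "Garcia", 26, 27000),
  ("Ana", "Martin", 28, 28000),
  ("Lola", "Martin", 29, 31000),
  ("Sara", "Garcia", 35, 34000)
).toDF("name", "surname", "age", "salary")

// === builds a Column equality expression; plain == would compare the Column object itself
val garcias = people.filter(col("surname") === "Garcia")

// isin keeps rows whose surname equals any of the listed values
val garciasOrMartins = people.filter(col("surname").isin("Garcia", "Martin"))

garcias.show()
```

The Column form lets the compiler catch a misspelled method name, whereas a mistake inside a SQL string only surfaces when Spark parses it at run time.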
Filtering data from the result of an aggregation
This filter selects the surnames that appear more than twice.
df.groupBy("surname").count().filter("count > 2").show()
+--------+-----+
| surname|count|
+--------+-----+
|  Garcia|    3|
+--------+-----+
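Filtering after `groupBy` plays the role of a SQL HAVING clause. The sketch below is one possible variant, assuming a local SparkSession. It names the aggregated column explicitly so the filter does not depend on the default column name `count`. The names `people`, `surnameCounts`, `repeated` and `occurrences` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit}

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("having-sketch").getOrCreate()
import spark.implicits._

// Reduced, hypothetical copy of the sample data (only name and surname are needed)
val people = Seq(
  ("Paco", "Garcia"),
  ("Juan", "Garcia"),
  ("Ana", "Martin"),
  ("Lola", "Martin"),
  ("Sara", "Garcia")
).toDF("name", "surname")

// Count the rows per surname under an explicit column name
val surnameCounts = people.groupBy("surname").agg(count(lit(1)).as("occurrences"))

// Keep only the surnames that occur more than twice
val repeated = surnameCounts.filter(col("occurrences") > 2)

repeated.show()
```

With the sample data only Garcia remains, with three occurrences.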





