Connecting to Hadoop HDFS from Scala

Write data to HDFS

Example of how to write RDD data to Hadoop HDFS.

// Delete the target directory if it exists
import scala.sys.process._
"hdfs dfs -rm -r /pruebas".!

// Save an RDD to HDFS
val rdd = sc.parallelize(List(
    (0, 60),
    (0, 56),
    (0, 54),
    (0, 62),
    (0, 61),
    (0, 53),
    (0, 55),
    (0, 62),
    (0, 64),
    (1, 73),
    (1, 78),
    (1, 67),
    (1, 68),
    (1, 78)
))
rdd.saveAsTextFile("hdfs:///pruebas/prueba1.csv")
rdd.collect()
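
The delete step matters because saveAsTextFile fails when the output directory already exists. As an alternative to shelling out to the hdfs command, the same cleanup can be sketched with the Hadoop FileSystem API. This is a minimal sketch, assuming it runs where a Spark session is available (for example spark-shell) and that the default file system in the Hadoop configuration is the target HDFS.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().getOrCreate()

// Reuse the Hadoop configuration that Spark itself is using.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Same directory as in the shell command above.
val target = new Path("/pruebas")

// Delete recursively, but only if the directory is actually there.
if (fs.exists(target)) {
  fs.delete(target, true)
}
```

Unlike the shell command, this does not depend on the hdfs binary being on the PATH of the driver machine.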

Write data to HDFS (second method)

Example of how to write plain-text data to Hadoop HDFS.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

object App {
  def main(args: Array[String]): Unit = {
    println("Writing test in HDFS...")
    // Assumes fs.defaultFS in the Hadoop configuration points to this cluster
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    val output = fs.create(new Path("hdfs://sandbox-hdp.hortonworks.com:8020/Tests/test2.txt"))
    val writer = new PrintWriter(output)
    try {
      writer.write("Hello World")
      writer.write("\n")
    } finally {
      // Closing the writer also closes the underlying HDFS stream
      writer.close()
    }
    print("Finished!")
  }
}
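
To check that the text actually reached HDFS, the file can be read back with the same Hadoop API. The following is a minimal sketch, not part of the original example. It assumes the sandbox host and path used above, and it obtains the file system from the path itself so that it does not depend on the default file system setting.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

// Same file that the writer example creates (host and path taken from above).
val path = new Path("hdfs://sandbox-hdp.hortonworks.com:8020/Tests/test2.txt")

// Resolve the file system from the path's scheme and authority.
val fs = path.getFileSystem(new Configuration())

val in = fs.open(path)
try {
  // Print every line of the file to standard output.
  Source.fromInputStream(in, "UTF-8").getLines().foreach(println)
} finally {
  in.close()
}
```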

Add data to HDFS

Example of adding DataFrame data to a dataset that already exists in HDFS.

// toDF needs the session implicits (spark-shell imports them automatically)
import spark.implicits._

val df = Seq((1, 2), (3, 4), (5, 6), (0, 0)).toDF("Col_0", "Col_1")
df.show()

// First write: replace whatever is stored at the path
df.write.mode("overwrite").format("parquet").save("hdfs:///incrementar_datos.parquet")

// Second write: add the same rows to the existing data
df.write.mode("append").format("parquet").save("hdfs:///incrementar_datos.parquet")

val df2 = spark
  .read
  .format("parquet")
  .option("inferSchema", true)
  .load("hdfs:///incrementar_datos.parquet")

df2.show()
df:
+-----+-----+
|Col_0|Col_1|
+-----+-----+
|    1|    2|
|    3|    4|
|    5|    6|
|    0|    0|
+-----+-----+

df2:
+-----+-----+
|Col_0|Col_1|
+-----+-----+
|    1|    2|
|    3|    4|
|    1|    2|
|    3|    4|
|    5|    6|
|    0|    0|
|    5|    6|
|    0|    0|
+-----+-----+
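
The second table has twice as many rows because the same four rows were written once in overwrite mode and once in append mode. A minimal sketch of that behaviour follows, using the SaveMode constants instead of strings and comparing row counts. It assumes a Spark session is available (as in spark-shell) and reuses the path from the example above.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val path = "hdfs:///incrementar_datos.parquet"
val df = Seq((1, 2), (3, 4), (5, 6), (0, 0)).toDF("Col_0", "Col_1")

df.write.mode(SaveMode.Overwrite).parquet(path) // replaces any existing data
df.write.mode(SaveMode.Append).parquet(path)    // adds the same rows again

// Read everything back and compare the number of rows.
val stored = spark.read.parquet(path)
println(s"rows per write: ${df.count()}, rows stored: ${stored.count()}")
```

Running the append step repeatedly keeps adding another copy of the rows, so append is best used for data that is genuinely new.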

 

Read RDDs from HDFS

Simple example of how to read data from HDFS.

val rdd2 = sc.textFile("hdfs:///pruebas/prueba1.csv")
rdd2.collect()
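
Because saveAsTextFile writes each element with its toString, the tuples saved earlier come back as strings such as "(0,60)". A minimal sketch of turning those lines back into pairs of integers is shown here. The helper name parsePair is hypothetical, and it assumes every line has exactly the "(a,b)" shape.

```scala
// Hypothetical helper: converts a line like "(0,60)" into the tuple (0, 60).
def parsePair(line: String): (Int, Int) = {
  // Drop the surrounding parentheses, then split on the comma.
  val parts = line.trim.stripPrefix("(").stripSuffix(")").split(",")
  (parts(0).trim.toInt, parts(1).trim.toInt)
}

// Plain Scala usage; with Spark the same function could be passed to rdd2.map.
val sample = List("(0,60)", "(1,73)").map(parsePair)
println(sample)
```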

Note: sc refers to the SparkContext. Many big data development environments already instantiate it; where that is not the case, we have to instantiate the object ourselves.
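
For environments where sc is not predefined, a minimal sketch of creating it could look like this. The application name and the local master are assumptions for illustration; on a cluster the master is normally supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hdfs-examples") // hypothetical application name
  .setMaster("local[*]")       // assumption: run locally using all cores

// Returns the active context if one exists, otherwise creates it.
val sc = SparkContext.getOrCreate(conf)

// The RDD read from above, now with an explicitly created context.
val rdd2 = sc.textFile("hdfs:///pruebas/prueba1.csv")
rdd2.take(5).foreach(println)
```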

 

Read Dataframes from HDFS

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

// Lines written by saveAsTextFile look like (0,60),
// so the parentheses end up inside the two columns
val df: DataFrame = spark
  .read
  .format("csv")
  .option("header", false)
  .option("inferSchema", true)
  .load("hdfs:///pruebas/prueba1.csv")

df.show()

Note: spark refers to the SparkSession. Many big data development environments already instantiate it; where that is not the case, we have to instantiate the object ourselves.
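
Where spark is not predefined, a minimal sketch of building the session looks like this. The application name and the local master are assumptions for illustration; on a cluster the master usually comes from spark-submit.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdfs-examples") // hypothetical application name
  .master("local[*]")       // assumption: run locally using all cores
  .getOrCreate()            // reuses an existing session if there is one

// The SparkContext used in the RDD examples is available from the session.
val sc = spark.sparkContext

val df = spark.read
  .format("csv")
  .option("header", false)
  .load("hdfs:///pruebas/prueba1.csv")
df.show()
```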