Generate data for reading and writing JSON
Random example data used in the following sections
import random
# Build five (label, data) rows of random digits
data = []
for x in range(5):
    data.append((random.randint(0, 9), random.randint(0, 9)))
# Assumes an existing SparkSession named spark
df = spark.createDataFrame(data, ("label", "data"))
df.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
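Because the rows come from `random.randint`, every run prints a different table. A minimal sketch of one way to make the data repeatable, assuming a fixed seed is acceptable; the helper name `make_rows` and the seed value 42 are illustrative and not part of the original post:

```python
import random


def make_rows(n=5, seed=42):
    """Return n (label, data) tuples of random digits (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    return [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(n)]


rows = make_rows()
print(rows)
```

The resulting list can be passed to `spark.createDataFrame(rows, ("label", "data"))` in place of `data`.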
Write data in JSON format
path_json = "/prueba.json"    # HDFS path
path_json = "D:/prueba.json"  # local file path (overrides the line above)
df.write \
    .mode("overwrite") \
    .format("json") \
    .save(path_json)
Read data in JSON format
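# Spark writes JSON Lines (one object per line), so the default reader is used;
# "multiline" is only for files whose records span several lines
df2 = spark \
    .read \
    .json(path_json)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+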
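Spark's JSON writer emits JSON Lines: each row becomes one JSON object on its own line, and `save` creates a directory of part files rather than a single file. The sketch below reproduces that per-file layout with the standard library only, so it runs without Spark. The sample rows and the file name `part-00000.json` are made up for illustration.

```python
import json
import os
import tempfile

rows = [{"label": 4, "data": 0}, {"label": 7, "data": 0}]  # illustrative sample rows

with tempfile.TemporaryDirectory() as out_dir:
    part_file = os.path.join(out_dir, "part-00000.json")  # hypothetical part-file name

    # Write one JSON object per line (JSON Lines).
    with open(part_file, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    # Read it back line by line; every line is a complete JSON document.
    with open(part_file, encoding="utf-8") as fh:
        parsed = [json.loads(line) for line in fh if line.strip()]

print(parsed)
```

Since every line parses on its own, a line-by-line reader is enough for files in this layout.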
Write gzip compressed data in JSON format
path_json_gzip = "/prueba_gzip.json"    # HDFS path
path_json_gzip = "D:/prueba_gzip.json"  # local file path (overrides the line above)
df.write \
    .mode("overwrite") \
    .format("json") \
    .option("compression", "gzip") \
    .save(path_json_gzip)
Read gzip compressed data in JSON format
# Default line-by-line reader; the codec is detected from the file extension
df2 = spark \
    .read \
    .json(path_json_gzip)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
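With `compression` set to `gzip`, the part files are ordinary gzip streams holding the same line-per-record JSON, so they can also be inspected outside Spark. A minimal sketch using Python's `gzip` module; the file name and the sample rows are illustrative assumptions, not output from the job above.

```python
import gzip
import json
import os
import tempfile

rows = [{"label": 1, "data": 1}, {"label": 3, "data": 8}]  # illustrative sample rows

with tempfile.TemporaryDirectory() as out_dir:
    gz_file = os.path.join(out_dir, "part-00000.json.gz")  # hypothetical part-file name

    # Write gzip-compressed JSON Lines in text mode.
    with gzip.open(gz_file, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    # Decompress and parse each line back into a dict.
    with gzip.open(gz_file, "rt", encoding="utf-8") as fh:
        decoded = [json.loads(line) for line in fh]

print(decoded)
```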
Write deflate compressed data in JSON format
path_json_deflate = "/prueba_deflate.json"    # HDFS path
path_json_deflate = "D:/prueba_deflate.json"  # local file path (overrides the line above)
df.write \
    .mode("overwrite") \
    .format("json") \
    .option("compression", "deflate") \
    .save(path_json_deflate)
Read deflate compressed data in JSON format
# Default line-by-line reader; the codec is detected from the file extension
df2 = spark \
    .read \
    .json(path_json_deflate)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
Write bzip2 compressed data in JSON format
path_json_bzip2 = "/prueba_bzip2.json"    # HDFS path
path_json_bzip2 = "D:/prueba_bzip2.json"  # local file path (overrides the line above)
df.write \
    .mode("overwrite") \
    .format("json") \
    .option("compression", "bzip2") \
    .save(path_json_bzip2)
Read bzip2 compressed data in JSON format
# Default line-by-line reader; the codec is detected from the file extension
df2 = spark \
    .read \
    .json(path_json_bzip2)
df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
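The gzip, deflate and bzip2 sections differ only in the value of the `compression` option. As a rough, Spark-free illustration of how the three codecs behave on line-per-record JSON, the sketch below compresses the same payload with Python's `gzip`, `zlib` and `bz2` modules. Using `zlib` as a stand-in for Spark's deflate codec is an assumption, the 1000 synthetic rows are invented for the example, and the sizes depend entirely on the data.

```python
import bz2
import gzip
import json
import zlib

# Synthetic rows (illustrative only), serialized as JSON Lines.
rows = [{"label": i % 10, "data": (i * 7) % 10} for i in range(1000)]
payload = "".join(json.dumps(r) + "\n" for r in rows).encode("utf-8")

# Each codec maps to a (compress, decompress) pair from the standard library.
codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "deflate (zlib)": (zlib.compress, zlib.decompress),  # assumed stand-in for deflate
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(payload)
    assert decompress(packed) == payload  # round trip must be lossless
    sizes[name] = len(packed)

print(f"raw: {len(payload)} bytes")
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Which codec suits a real job depends on more than size, for example on whether the compressed files need to be split across tasks when read back.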



