Read & write JSON in Python with Spark

Generate data to read & write JSON

Random example data used in the following sections:

import random

data = []
for x in range(5):
    data.append((random.randint(0, 9), random.randint(0, 9)))
df = spark.createDataFrame(data, ("label", "data"))

df.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
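The same pairs can be generated in pure Python without Spark; a minimal sketch, where the seed is illustrative and added only to make the output reproducible:

```python
import random

# Pure-Python equivalent of the data generation above (no Spark needed).
# The seed is a hypothetical addition for reproducibility.
random.seed(0)
data = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(5)]
print(data)  # five (label, data) pairs with values in 0..9
```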

Write data in JSON format

path_json = "/prueba.json" # Path on HDFS
path_json = "D:/prueba.json" # Path on the local filesystem

df.write \
    .mode("overwrite") \
    .format("json") \
    .save(path_json)
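Note that df.write with format("json") produces a directory of part files, each in JSON Lines format (one object per line), not a single .json file. A stdlib sketch of that layout, with illustrative row values and file name:

```python
import json
import os
import tempfile

# Rows mirroring the DataFrame above (values are illustrative)
rows = [{"label": 4, "data": 0}, {"label": 7, "data": 0}]

# Spark writes a directory of part files; emulate a single part file
out_dir = tempfile.mkdtemp()
part_file = os.path.join(out_dir, "part-00000.json")
with open(part_file, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one JSON object per line
```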

Read data in JSON format

# Spark wrote JSON Lines (one object per line); reading with
# multiLine=true would misparse these files, so it is omitted.
df2 = spark\
    .read\
    .json(path_json)

df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+
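Each line of a part file is an independent JSON object, which is why the default reader handles it line by line. The same parsing can be sketched with only the standard library (the sample text is illustrative):

```python
import json

# JSON Lines text as Spark writes it (sample values are illustrative)
text = '{"label":4,"data":0}\n{"label":7,"data":0}\n'

# Parse one object per non-empty line, as spark.read.json does by default
rows = [json.loads(line) for line in text.splitlines() if line.strip()]
print(rows)  # [{'label': 4, 'data': 0}, {'label': 7, 'data': 0}]
```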

Write gzip compressed data in JSON format

path_json_gzip = "/prueba_gzip.json" # Path on HDFS
path_json_gzip = "D:/prueba_gzip.json" # Path on the local filesystem

df.write\
    .mode("overwrite")\
    .format("json")\
    .option("compression", "gzip")\
    .save(path_json_gzip)
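The compression option only changes the codec applied to each part file; the content is still the same JSON Lines text. A stdlib sketch of the gzip round trip, with illustrative rows:

```python
import gzip
import json

# Illustrative rows; Spark applies the same idea per part file
rows = [{"label": 4, "data": 0}, {"label": 7, "data": 0}]
payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

compressed = gzip.compress(payload)     # what a .json.gz part file holds
restored = gzip.decompress(compressed)  # what the reader decodes
assert restored == payload
```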

Read gzip compressed data in JSON format

df2 = spark\
    .read\
    .json(path_json_gzip)

df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+

Write deflate compressed data in JSON format

path_json_deflate = "/prueba_deflate.json" # Path on HDFS
path_json_deflate = "D:/prueba_deflate.json" # Path on the local filesystem

df.write\
    .mode("overwrite")\
    .format("json")\
    .option("compression", "deflate")\
    .save(path_json_deflate)
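The deflate codec can be sketched the same way with the zlib module (zlib uses the deflate algorithm with a small wrapper; rows are illustrative):

```python
import json
import zlib

# Illustrative rows; the part file content is still JSON Lines
rows = [{"label": 4, "data": 0}, {"label": 7, "data": 0}]
payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

compressed = zlib.compress(payload)  # deflate-compressed bytes
assert zlib.decompress(compressed) == payload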

Read deflate compressed data in JSON format

df2 = spark\
    .read\
    .json(path_json_deflate)

df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+

Write bzip2 compressed data in JSON format

path_json_bzip2 = "/prueba_bzip2.json" # Path on HDFS
path_json_bzip2 = "D:/prueba_bzip2.json" # Path on the local filesystem

df.write\
    .mode("overwrite")\
    .format("json")\
    .option("compression", "bzip2")\
    .save(path_json_bzip2)
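Likewise, the bzip2 round trip can be sketched with the bz2 module (rows are illustrative):

```python
import bz2
import json

# Illustrative rows; the part file content is still JSON Lines
rows = [{"label": 4, "data": 0}, {"label": 7, "data": 0}]
payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

compressed = bz2.compress(payload)  # what a .json.bz2 part file holds
assert bz2.decompress(compressed) == payload
```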

Read bzip2 compressed data in JSON format

df2 = spark\
    .read\
    .json(path_json_bzip2)

df2.show()
+-----+----+
|label|data|
+-----+----+
|    4|   0|
|    7|   0|
|    1|   1|
|    3|   8|
|    3|   5|
+-----+----+