Massive data storage systems – Big data

The main data storage systems for big data ecosystems are:

  • HDFS: the storage system par excellence of Hadoop (Hadoop Distributed File System).
  • Apache HBase: a column-oriented database management system that runs on top of HDFS and provides random, real-time read/write access to large, distributed data sets.
  • S3: Amazon's storage system, the counterpart to HDFS in AWS-based deployments.
  • Apache Kudu: a column-oriented storage manager for the Hadoop ecosystem, originally developed at Cloudera.
  • Elasticsearch: a real-time, open-source search server that provides indexed and distributed storage.
  • Cassandra: a column-oriented (wide-column) NoSQL database.
  • MongoDB: a document-oriented NoSQL database.
  • MariaDB: a relational SQL database (a community fork of MySQL); its ColumnStore engine adds columnar, analytics-oriented storage.
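
To make the row- versus column-oriented distinction in the list above concrete, here is a minimal Python sketch. It uses plain in-memory dicts and lists (no real database), and the field names are purely illustrative:

```python
# Illustrative only: the same three records laid out row-wise (as a
# document/row store keeps them) and column-wise (as HBase, Kudu, or
# Cassandra conceptually organize data).

rows = [
    {"id": 1, "city": "Madrid", "temp": 21},
    {"id": 2, "city": "Lisbon", "temp": 19},
    {"id": 3, "city": "Porto",  "temp": 18},
]

# Column-oriented layout: one array per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytic query ("average temp") touches a single column here,
# which is why columnar stores shine for scans and aggregations.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(avg_temp)
```

The analytic query reads only the `temp` array and never deserializes the other fields; that access pattern is the core reason columnar formats dominate analytical workloads.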
Writing raw MapReduce jobs against data in HDFS is complex and cumbersome, so higher-level applications provide a layer of abstraction that simplifies querying. These include:

  • Apache Hive: distributed data warehousing infrastructure built on Hadoop that provides data summarization, querying, and analysis. It translates SQL-like (HiveQL) statements into MapReduce jobs.
  • Apache Impala: an alternative to Hive backed by Cloudera. A massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster.
  • Apache Pig: a high-level language (Pig Latin) for writing MapReduce pipelines. It translates a high-level description of how data should be processed into MapReduce jobs, sparing developers from writing long job chains by hand and improving their productivity.
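
The translation these tools perform can be sketched in miniature: a SQL-like query such as `SELECT word, COUNT(*) ... GROUP BY word` becomes a map phase, a shuffle (grouping by key), and a reduce phase. A toy pure-Python version, with no Hadoop involved and all names illustrative:

```python
from itertools import groupby

def map_phase(lines):
    # Emit (key, 1) pairs, as a MapReduce mapper would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group pairs by key; the framework does this between map and reduce.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    # Sum the values for each key, like COUNT(*) ... GROUP BY.
    return {key: sum(values) for key, values in grouped}

lines = ["big data big", "data lake"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'lake': 1}
```

Hive and Pig generate (and chain) jobs of exactly this shape automatically, which is what makes a one-line query stand in for hand-written map and reduce code.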

Storage in HDFS can use different file formats (TextFile, SequenceFile, Parquet, Avro, or ORC) and different compression codecs (Snappy, gzip, Deflate, bzip2, or zlib).
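
The trade-off between these codecs can be felt with a small, self-contained Python experiment using the standard-library codecs that also appear in the Hadoop list (Snappy has no stdlib module, so it is omitted; the sample data is made up):

```python
import bz2
import gzip
import zlib

# Highly repetitive sample data, which compresses very well.
data = b"row,value\n" * 10_000

# Compressed size under each stdlib codec, versus the raw size.
sizes = {
    "raw":   len(data),
    "gzip":  len(gzip.compress(data)),
    "bzip2": len(bz2.compress(data)),
    "zlib":  len(zlib.compress(data)),
}
print(sizes)

# Round-tripping confirms the compression is lossless.
assert zlib.decompress(zlib.compress(data)) == data
```

In a real cluster the choice is a balance between compression ratio (bzip2 tends to compress hardest but slowest) and speed (Snappy favors throughput over ratio), plus whether the format/codec combination is splittable for parallel processing.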