Apache HIVE – Converts SQL statements to a MapReduce job

by | May 22, 2016 | Big Data | 0 comments

Prerequisites

See the section “commands to manipulate HDFS files”

Databases

Before you can use it it is necessary to start the service of Hadoop

And start the service of HIVE to execute commands, to do so find the folder HIVE and once there run:

$ cd hive
$ bin/hive

Check Existing DB

hive> SHOW DATABASES;

Create db

Being inside we can execute the sentence to create a database, as an example:

hive> CREATE DATABASE diegocalvo

Select the database to work with

hive> use diegocalvo;

Tables

Before manupular tables of a databases in the same state, indicate to the system that databases to use, using the command “use NOMBRE_BBDD”

See tables

hive> show tables;

Create External Table

But the Hive service is initiated to perform all the steps indicated in the creation of the database and then:

CREATE EXTERNAL TABLE localization (localizationID INT, zone STRING, neighborhood STRING)
 COMMENT 'location table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/user/hadoop/localization';
CREATE EXTERNAL TABLE consumption (TIMESTAMP date, company STRING, consumptionID INT)
 COMMENT ' Table of consumption'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/user/hadoop/consumption';

Create internal table for joins

CREATE EXTERNAL TABLE union (date TIMESTAMP, company STRING, ID INT, zone STRING, neighborhood STRING)
 COMMENT 'table of union between consumption and localization'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS PARQUET

Insert data into internal table

INSERT INTO TABLE union
 SELECT C.date, C.company, C.ID, L.Zone, L.neighborhood
 FROM consumption C INNER JOIN location L
 ON C.consumptionID = L.localizationID

Queries: Counting repeated days of the week

SELECT count(*), date_format(date, 'EEEE')
 FROM union
 GROUP BY date_format(date, 'EEEE');

Queries: The 10 most active neighborhoods

SELECT count(*) AS cnt, neighborhood
 FROM union
 GROUP BY neighborhood
 SORT BY cnt DESC LIMIT 10;

Queries: Count how many items for each day of the month

SELECT DayOfMonth(date), count(*)
 FROM union
 GROUP BY dayofmonth(date);

Queries: More Active time slots

SELECT hour(date), COUNT(*) as cnt
 FROM union
 GROUP BY hour(date)
 ORDER BY cnt DESC LIMIT 5;

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *