Big Data and Visualization
Late Bhausaheb Hiray S.S. Trust’s Institute of Computer Application
ISBN 9788119221158
Publication Date: 2023
  Pages 118

PAPERBACK

EBOOK (EPUB)

EBOOK (PDF)

Late Bhausaheb Hiray S.S. Trust was established by Dr. Baliramji Hiray (ex-Education Minister, Government of Maharashtra) in 1977 with the sole aim of providing quality education to the people of Maharashtra. It is a charitable trust with leading social workers, philanthropists, and doctors as members. Its colleges, spread over Mumbai, Nasik, and Malegaon, offer a range of courses at a reasonable cost to multiple segments of urban and rural people, serving over 10,000 students. The Institute of Computer Application started its M.C.A. programme in 2001 with prior approval from the AICTE and affiliation with the University of Mumbai. All students are selected on the basis of their score in the Common Entrance Test (CET), covering aptitude and computer concepts, conducted by the CET Cell of the Government of Maharashtra and the Directorate of Technical Education (DTE) of Maharashtra. The M.C.A. is a two-year full-time postgraduate course open to graduates of any field with Mathematics at the 10+2 level. Divided into four semesters, the course includes a full-time six-month internship in the IT industry as the final semester.

Other titles available from the Institute of Computer Application

1. Advance Java Practical Journal: FYMCA-Year 2020-21

2. Artificial Intelligence and Machine Learning

3. Data Mining and Business Intelligence Lab Manual

4. Ethical Hacking Lab Manual

5. Internet of Things Lab Manual

6. Distributed System and Cloud Computing Lab Manual

7. Network Simulator -3 (NS-3) Practical Lab Manual

8. Blockchain & Solidity Program Lab Manual

9. Web Technology with Node js, Angular js and MySQL

FrontCover
Halftitle page
Title page
Copyright information
Contents
Chapters 1-5
Chapter 1 Introduction of Big Data
1.1. Introduction
Chapter 2 HDFS and MapReduce
2.1. Introduction
2.2. Hardware Required
2.3. Software Required
2.4. Installation and Configuration
2.5. Practice
2.6. MapReduce
2.7. Practice
Chapter 3 NoSQL
3.1. Introduction
3.2. Hardware Required
3.3. Software Required
3.4. Installation and Configuration
3.5. Practice
Chapter 4 Hadoop Ecosystem (HIVE and PIG)
4.1. Introduction
4.2. Hardware Required
4.3. Software Required
4.4. Installation and Configuration
4.5. Practice
Chapter 5 Data Visualization
5.1. Introduction
5.2. Hardware Required
5.3. Software Required
5.4. Installation and Configuration
5.5. Practice
BackCover

First Edition, 2023

Copyright © Late Bhausaheb Hiray S.S. Trust’s Institute Of Computer Application, Bandra (E), Mumbai-51, 2023

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author, except in the case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by copyright law. For permission requests, write to the publisher at the address below.

This book can be exported from India only by the publishers or by the authorized suppliers. Infringement of this condition of sale will lead to Civil and Criminal prosecution.

Paperback ISBN: 978-81-19221-26-4

eBook ISBN: 978-81-19221-15-8

WebPDF ISBN: 978-81-19221-03-5

Note: Due care and diligence have been taken while editing and printing this book; neither the author nor the publishers hold any responsibility for any mistake that may have inadvertently crept in.

The publishers shall not be liable for any direct, consequential, or incidental damages arising out of the use of the book. In case of binding mistakes, misprints, missing pages, etc., the publishers’ entire liability, and your exclusive remedy, is replacement of the book within one month of purchase by similar edition/reprint of the book.

Printed and bound in India by

16Leaves

2/579, Singaravelan Street

Chinna Neelankarai

Chennai – 600 041, India

info@16leaves.com

www.16Leaves.com

Call: 91-9940638999

Contents

1. Introduction of Big Data

1.1. Introduction

2. HDFS and MapReduce

2.1. Introduction

2.2. Hardware Required

2.3. Software Required

2.4. Installation and Configuration

2.5. Practice

2.6. MapReduce

2.7. Practice

3. NoSQL

3.1. Introduction

3.2. Hardware Required

3.3. Software Required

3.4. Installation and Configuration

3.5. Practice

4. Hadoop Ecosystem (HIVE and PIG)

4.1. Introduction

4.2. Hardware Required

4.3. Software Required

4.4. Installation and Configuration

4.5. Practice

5. Data Visualization

5.1. Introduction

5.2. Hardware Required

5.3. Software Required

5.4. Installation and Configuration

5.5. Practice

Chapter 1 Introduction of Big Data

1.1 Introduction

Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. These processes use familiar statistical analysis techniques—like clustering and regression—and apply them to more extensive datasets with the help of newer tools.

Big data is a relative term. If big data is measured by the volume of transactions and transaction history, then hundreds of terabytes (10¹² bytes) may already count as “big data” for a pharmaceutical company, while in other industries the volume of transactions may run into petabytes (10¹⁵ bytes).

Big Data Analytics, as shown in Fig. 1.1, is the result of three major trends in computing: mobile computing on hand-held devices such as smartphones and tablets; social networking, such as Facebook and Pinterest; and cloud computing, through which one can rent or lease the hardware for storage and computation.

Figure 1.1 Big Data: Result of three computing trends.

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured, and unstructured data, from different sources and in sizes ranging from terabytes to zettabytes.

Big data can be defined as data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. Characteristics of big data include high volume, high velocity, and high variety. Sources of data are becoming more complex than those for traditional data because they are driven by artificial intelligence (AI), mobile devices, social media, and the Internet of Things (IoT). For example, data now originates from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media — much of it generated in real time and at a very large scale.

Chapter 2 HDFS and MapReduce

2.1 Introduction

HDFS is a distributed file system that provides a limited interface for managing the file system to allow it to scale and provide high throughput. HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster to enable reliable and rapid access. When a file is loaded into HDFS, it is replicated and fragmented into “blocks” of data, which are stored across the cluster nodes; the cluster nodes are also called the DataNodes. The NameNode is responsible for storage and management of metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the data that is needed resides. Figure 2.1 shows the NameNode and DataNode block replication in HDFS architecture.

Figure 2.1 NameNode and DataNode block replication.
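To see this replication at first hand, HDFS’s command-line tools can report where each block of a file lives. The following is a minimal sketch, assuming a running cluster such as the Cloudera QuickStart VM set up later in this chapter; the file path and file name are illustrative.

# Copy a local file into HDFS, where it is split into blocks and replicated
hdfs dfs -put words.txt /user/cloudera/words.txt

# Report the file's blocks, their replicas, and the DataNodes holding them
hdfs fsck /user/cloudera/words.txt -files -blocks -locations

# Change the file's replication factor to 2 and wait until it takes effect
hdfs dfs -setrep -w 2 /user/cloudera/words.txt

The fsck report is assembled from the NameNode’s metadata, the same information the NameNode hands to MapReduce when it schedules tasks close to the data.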

Hadoop Ecosystem

2.2 Hardware Required

» Windows 10, 11, etc.

» 8 GB of RAM

» i3 processor

» 64-bit operating system, x64-based processor

2.3 Software Required

» Oracle VM VirtualBox

» cloudera-quickstart-vm-5.4.2-0-virtualbox

2.4 Installation and Configuration

Step 1: Download the Cloudera QuickStart VM image from the link below: https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip

Step 2: Unzip the downloaded file. After unzipping, you will get a folder named cloudera-quickstart-vm-5.4.2-0-virtualbox containing two files:

a) cloudera-quickstart-vm-5.4.2-0-virtualbox

b) cloudera-quickstart-vm-5.4.2-0-virtualbox-disk1

The next step is to download Oracle VM VirtualBox.

Step 3: Download Oracle VM VirtualBox for Windows from https://download.virtualbox.org/virtualbox/6.1.26/

Step 4: After downloading the .exe file, double-click it to start the installation.

Step 5: Click Next.

Step 6: Click Next.

Step 7: Click Next.

Step 8: Click Yes.

Step 9: Click Install.

Step 10: When the installer asks to allow changes to the device, click Yes. The installation will then begin.

Step 11: Click Finish. The screen below will appear; click OK.

Step 12: After clicking OK in Step 11, the screen below will appear; click the Import icon.

Step 13: Import the Cloudera file mentioned in Step 2 (a) and click Open.

Step 14: Click Next on the screen below.

Step 15: Click Import on the screen below.

Step 16: After clicking Import, the file will start importing.

Step 17: After the file has been imported, the screen below will appear.

Step 18: Click Settings, select System from the left pane, open the Processor tab, and change the processor count from 1 to 2.

Step 19: Click the Start button on the screen shown in Step 17. The screen below will appear.

Step 20: After closing the welcome screen, the screen below will appear.

Step 21: Click the Launch Cloudera Express icon on the screen shown in Step 20. The screen below will appear.

Step 22: Copy the text sudo /home/cloudera/cloudera-manager --force, click the terminal icon (black) at the top, and add --express beside --force so the command reads sudo /home/cloudera/cloudera-manager --force --express. Press Enter; the screen below will appear.

Step 23: The output in Step 22 contains the link http://quickstart.cloudera:7180. Right-click it and open it in a new tab.

Step 24: Enter cloudera as both the username and the password.

Step 25: Start the required services, e.g., HDFS, Hive, HBase, ZooKeeper, and Spark.

Now we can execute any HDFS or Linux command. For example, after executing the ls command, the screen below will appear.

2.5 Practice

Practical 1

Implementation of HDFS/Linux commands (mkdir, touchz, copyFromLocal/put, copyToLocal/get, moveFromLocal, cp, rmr, du, dus, stat).
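Before the commands are described individually below, here is a minimal sketch of the HDFS forms of the commands named in this practical, run from the Cloudera QuickStart terminal; the file and directory names are illustrative.

# Create a directory and an empty file in HDFS
hdfs dfs -mkdir /user/cloudera/demo
hdfs dfs -touchz /user/cloudera/demo/empty.txt

# Copy a local file into HDFS (put and copyFromLocal behave alike here)
hdfs dfs -put local1.txt /user/cloudera/demo/
hdfs dfs -copyFromLocal local2.txt /user/cloudera/demo/

# Copy a file from HDFS back to the local file system (get = copyToLocal)
hdfs dfs -get /user/cloudera/demo/local1.txt ./fromhdfs.txt

# Move a local file into HDFS; the local copy is deleted afterwards
hdfs dfs -moveFromLocal local3.txt /user/cloudera/demo/

# Copy within HDFS, then check disk usage per file and in summary
hdfs dfs -cp /user/cloudera/demo/local1.txt /user/cloudera/demo/copy.txt
hdfs dfs -du /user/cloudera/demo
hdfs dfs -dus /user/cloudera/demo

# Print a file's replication factor, modification time, and name
hdfs dfs -stat "%r %y %n" /user/cloudera/demo/copy.txt

# Remove the directory recursively (rmr is deprecated; rm -r is preferred)
hdfs dfs -rmr /user/cloudera/demo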

A) ls:

Description: Lists files and directories.

Command: ls

Output:

B) pwd:

Description: Shows the present working directory (e.g., /home/cloudera).

Command: pwd

Output:

C) whoami:

Description: Shows the current user.

Command: whoami

Output:

D) mkdir:

Description: Creates a new directory.

Command: mkdir firstdir

Output:

E) cd:

Description: Changes the current working directory.

Command: cd <directory name> (e.g., cd firstdir)

Output:
