Hadoop

From Hanlon Financial Systems Lab Web Encyclopedia



Introduction

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop can work directly with any mountable distributed file system, but the one it most commonly uses is the Hadoop Distributed File System (HDFS). HDFS uses a master/slave architecture: a single master NameNode manages the file system metadata, and one or more slave DataNodes store the actual data. A file in an HDFS namespace is split into several blocks, and those blocks are stored across a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes. The DataNodes handle read and write operations on the file system, and they create, delete, and replicate blocks as instructed by the NameNode.
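The block splitting and block-to-DataNode mapping described above can be pictured with a small toy model. This is only an illustrative sketch, not HDFS code: the 128 MB block size and replication factor of 3 are assumed values (they match the usual HDFS defaults, but the lab's settings are not stated here), the host names are hypothetical, and the round-robin placement is a simplification of the rack-aware policy a real NameNode applies.

```python
BLOCK_SIZE_MB = 128  # assumption: common HDFS default block size
REPLICATION = 3      # assumption: common HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of this size would be split into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    # Every block is full-sized except possibly the last one.
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def assign_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Toy NameNode: map each block index to the DataNodes holding its replicas.

    Round-robin placement for illustration only; real HDFS placement is rack-aware.
    """
    replication = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical host names for seven DataNodes.
datanodes = [f"datanode{i}" for i in range(1, 8)]

blocks = split_into_blocks(300)                     # a 300 MB file
block_map = assign_blocks(len(blocks), datanodes)   # the NameNode's metadata

for block, holders in block_map.items():
    print(f"block {block} ({blocks[block]} MB) -> {holders}")
```

In this model the "NameNode" holds only the `block_map` metadata, while the block contents would live on the listed DataNodes, which mirrors the division of labor described above.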

Hardware

The Hadoop system in the Hanlon Financial Systems Lab is composed of eight nodes (machines). One node serves as the NameNode, and the other seven serve as DataNodes. Each node has 16 GB of physical memory (RAM), of which 8 GB is allocated to containers (this can be increased in the YARN configuration file). In total, 8 GB × 7 DataNodes = 56 GB can be used for Hadoop tasks. The total number of vcores allocated for containers is 8.
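The memory arithmetic above can be checked with a few lines of Python. This is a rough sketch: the node counts and the 8 GB container allocation come from the text, while the 2 GB container size in the usage example is purely hypothetical. On a typical YARN installation the per-node container memory is controlled by the `yarn.nodemanager.resource.memory-mb` property in `yarn-site.xml`; whether the lab sets it there is an assumption.

```python
TOTAL_NODES = 8                    # from the text
NAMENODES = 1                      # from the text
CONTAINER_MEMORY_GB_PER_NODE = 8   # from the text; typically set (in MB) via
                                   # yarn.nodemanager.resource.memory-mb


def cluster_container_memory_gb(total_nodes, namenodes, per_node_gb):
    """Memory available to containers across all DataNodes, in GB."""
    return (total_nodes - namenodes) * per_node_gb


def max_containers(total_nodes, namenodes, per_node_gb, container_gb):
    """Upper bound on concurrent containers of a given size, limited by memory only."""
    containers_per_node = per_node_gb // container_gb
    return (total_nodes - namenodes) * containers_per_node


total_gb = cluster_container_memory_gb(
    TOTAL_NODES, NAMENODES, CONTAINER_MEMORY_GB_PER_NODE
)
print(f"Memory available to containers: {total_gb} GB")

# Hypothetical example: containers that each request 2 GB.
print("Max 2 GB containers:",
      max_containers(TOTAL_NODES, NAMENODES, CONTAINER_MEMORY_GB_PER_NODE, 2))
```

The container count is a memory-only bound; the vcore allocation may limit concurrency further.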

Applications Installed

  • Pig
  • Hive
  • Python
  • MySQL
  • RStudio
  • R packages including rJava, Rcpp, RJSONIO, bitops, digest, functional, stringr, plyr, reshape2, dplyr, R.methodsS3, caTools, Hmisc, rhdfs, rjson, memoise, rmr2, plyrmr.

Account Request

Please visit us at the Hanlon Financial Systems Lab and consult the lab staff for more information.

Useful links