Data modernization is an important topic as companies attempt to migrate onto new systems, consolidate for efficiency, and rationalize to achieve common standards. Companies are rationalizing onto standard hardware and software to reduce the complexity of their systems, consolidating multiple data centers into one, and consolidating multiple databases on disparate systems onto a single system. Companies are also migrating off old proprietary UNIX systems onto Linux, and from older systems into the cloud.

How do companies achieve these lofty goals? Projects such as data migration and data center consolidation can run into many millions of dollars. These modernization projects are not for the faint of heart, but there is one technology that vastly reduces their complexity, expense and duration: data virtualization. Data virtualization, as implemented, for example, by Delphix, allows multiple copies of data to occupy less space than the original source data. New copies can be made in minutes and take up almost no storage. Copies can be made from the source as it is now or as it was a day or a week ago, and that point in time can be chosen down to the exact second.

How Does Data Virtualization Impact Modernization?

Data virtualization eases the work in modernization by consolidating the storage used by multiple copies of data. Oracle estimates that for every production database there are 12 copies of databases in non-production environments such as development, QA, UAT, backup, reporting etc. Data such as databases is complex and difficult to move as the data is constantly changing and requires specialized procedures to copy. Across copies of a database, the majority of data blocks will be exact copies. With data virtualization, all duplicate data blocks are shared across all the copies. Thus, to move 12 databases from one data center to another is no longer moving 12 databases, but actually moving 1 virtualized data source. Delphix handles the live replication from one data center to another automatically. The amount of data required to move multiple copies of a database is often less than the size of the original database thanks to compression.
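
The block sharing described above can be illustrated with a small sketch. This is a toy model in plain Java, not how Delphix is implemented: the class SharedBlockStore and its methods are hypothetical, and blocks are represented simply by their contents as strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model: keeps one physical copy of each distinct block.
final class SharedBlockStore {
    // Block contents -> number of database copies that reference the block.
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Register one database copy, given as a list of block contents.
    void addCopy(List<String> blocks) {
        for (String block : blocks) {
            refCounts.merge(block, 1, Integer::sum);
        }
    }

    // Number of blocks physically stored; identical blocks are shared.
    int physicalBlocks() {
        return refCounts.size();
    }

    public static void main(String[] args) {
        SharedBlockStore store = new SharedBlockStore();
        store.addCopy(List.of("b1", "b2", "b3", "b4"));      // production
        store.addCopy(List.of("b1", "b2", "b3", "b4-dev"));  // copy with one changed block
        // Eight logical blocks, but only five distinct ones need storing.
        System.out.println("Physical blocks: " + store.physicalBlocks());
    }
}
```

In this model, a second copy that differs in a single block adds only that one block to storage, which is the effect the paragraph above describes.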

Data virtualization, as implemented by Delphix, can virtualize entire application stacks. Delphix can do for applications and file systems what it does for databases. The product can virtualize application binaries, config files, and other related files, as well as databases. It can also track changes via TimeFlow, and provision space-efficient virtual copies in minutes.

When data virtualization is implemented by Delphix, it comes with automatic transformation of data from Unix to Linux. Many companies have hundreds or thousands of Oracle databases running on legacy Unix platforms, often at 4-8x the operating and maintenance costs of x86 platforms. Converting these databases from Unix to x86 Linux requires extended, manual, and error-prone efforts. Delphix accelerates the process significantly with a data transformation feature that automatically converts Unix databases to Linux.

Database virtualization by Delphix also comes with live archive. One of the biggest barriers to successful modernization is regulatory and data risk. If a cutover fails, firms may have no fallback option, and can face lost revenue, failed audits, and large compliance fines. Live Archive provides a “Plan B” for modernization cutover. Customers have used Live Archive to quickly deliver retired ERP applications for SOX and other audits.

Finally, data virtualization as implemented by Delphix comes with replication, and supports automatic, transparent replication from on-premises systems to the cloud and back. This makes cloud migrations both simple and robust, as it is easy to test cloud environments and to fall back to in-house systems if issues arise.




Following on from the last post, HDFS Permissions and Security – Part 11, this blog post shows a small Java application which interacts with HDFS.

As shown in one of my previous posts, Hadoop Command Line Interface – Part 10, users can interact with HDFS using commands on the command line. Another way of interacting with HDFS is through the Hadoop API.

For example, the Java code below reads the contents of each file in a directory and then prints out its content to the screen. I assume that you are familiar with Java.

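package org.datamanagement;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintFileContents {
    public static void main(String[] args) throws Exception {
        try {
            // Connect to the file system named in the Hadoop configuration
            FileSystem fs = FileSystem.get(new Configuration());
            // List the entries of the directory; supply the full hdfs:// URI here
            FileStatus[] status = fs.listStatus(new Path("hdfs://"));

            for (int i = 0; i < status.length; i++) {
                // Open each file and wrap it in a reader so it can be read line by line
                BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(status[i].getPath())));

                String line;
                line = reader.readLine();

                while (line != null) {
                    // Print the current line, then move on to the next one
                    System.out.println(line);
                    line = reader.readLine();
                }
                reader.close();
            }
        } catch (Exception e) {
            System.out.println("File not available");
        }
    }
}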

MapReduce Introduction

This was my last post on HDFS. Make sure you read my next post, in which I will provide an introduction to MapReduce. Several subsequent posts will then go into more detail about MapReduce.


Following on from my last blog post, HDFS Command Line Interface – Part 10, this week I will explain the HDFS permissions and security.

HDFS has a simple file permissions system which is based on the POSIX model. However, this does not provide a strong mechanism to protect HDFS files from unauthorised access. The system is designed only to prevent accidental damage to files and casual misuse of information by users who have access to a cluster.

File and directory permissions can be assigned to users or groups of users, and the following permissions can be defined:

  • read
  • write
  • execute

File permissions can be defined at three different levels of granularity:

  • file/directory owner
  • users who belong to the group associated with the file/directory
  • all other users in the system

Permissions can be changed using the same commands as in Linux, i.e. the -chmod, -chown and -chgrp commands.
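
To illustrate how the three permissions and the three levels fit together, here is a minimal sketch of a POSIX-style permission check in plain Java. It is a simplified model and not the Hadoop API: the class PermissionCheck, the method isAllowed and the user anna are hypothetical names chosen for this example.

```java
import java.util.Set;

// Hypothetical model of a POSIX-style permission check; not the Hadoop API.
final class PermissionCheck {
    static final int READ = 4;
    static final int WRITE = 2;
    static final int EXECUTE = 1;

    // mode is an octal value such as 0640: owner bits, group bits, other bits.
    static boolean isAllowed(int mode, String owner, String group,
                             String user, Set<String> userGroups, int wanted) {
        int bits;
        if (user.equals(owner)) {
            bits = (mode >> 6) & 7;   // the file/directory owner
        } else if (userGroups.contains(group)) {
            bits = (mode >> 3) & 7;   // users in the file's group
        } else {
            bits = mode & 7;          // all other users
        }
        return (bits & wanted) == wanted;
    }

    public static void main(String[] args) {
        Set<String> staff = Set.of("staff");
        Set<String> noGroups = Set.of();
        // A file owned by kevin, group staff, mode 640 (rw- r-- ---)
        System.out.println(isAllowed(0640, "kevin", "staff", "kevin", staff, WRITE));    // true
        System.out.println(isAllowed(0640, "kevin", "staff", "anna", staff, READ));      // true
        System.out.println(isAllowed(0640, "kevin", "staff", "anna", staff, WRITE));     // false
        System.out.println(isAllowed(0640, "kevin", "staff", "guest", noGroups, READ));  // false
    }
}
```

Only one of the three sets of bits is consulted for a given user, which is why anna can read but not write the file in this example.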

User identification

HDFS does not perform any checks to validate the user. Instead, Hadoop uses the username under which the user is logged in. Similarly, the user’s current working group list is used as the group list in Hadoop.

Although this system of user identification is rather simple, it may be enhanced in the future as Hadoop evolves.


Superuser

As in Linux, Hadoop also has a superuser. In Hadoop, the superuser is the user who started Hadoop. To run a command with superuser privileges, the command must be executed as that user.

For the superuser, file permission checks are bypassed, so a superuser has full control over files and directories. Because of this high level of privilege and the accompanying risk of making a mistake, it is recommended to use the superuser role only when absolutely needed. It is better to set up users with restricted permissions instead.

The superuser changes if another user starts the Hadoop framework.


Supergroup

There is also a supergroup to which superusers can be assigned. This group is set in the configuration file using the parameter dfs.permissions.supergroup.

Disabling permissions

By default, file and directory permissions are enabled in HDFS. They can be disabled by setting the parameter dfs.permissions to false. Disabling permissions means that HDFS does not enforce any permissions on files and directories. The existing permission settings are not deleted but retained, so that they apply again if permissions are re-enabled in the future.

HDFS and Java Programming

In my next blog post I will give a brief example of some Java code that interacts with the HDFS. So make sure you check my blog next week.


Following from my last blog post, HDFS DataNode – Part 9, in this post I will go over a few of the HDFS commands which you can use to interact with your HDFS cluster through the command line interface.

Getting started with the commands

Most commands start with the command script bin/hadoop. This script starts Hadoop in a Java Virtual Machine and allows you to run a command. The commands are specified as follows:

username@machine:hadoop$ bin/hadoop moduleName -cmd args

The moduleName defines which subset of Hadoop functionality to use. There are two modules relevant to HDFS: dfs and dfsadmin.

-cmd names the command within this module to execute, followed by any arguments the command requires.

The following sections explain some of the common commands you can execute.

Listing files

Assuming that you have set up a user called kevin and a node node1, the following command will show the contents of the root directory inside HDFS.

kevin@node1:hadoop$ bin/hadoop dfs -ls /

The command also shows that the -ls command is part of the dfs module.

Creating a Home directory

The following command creates a new directory called test using the -mkdir command. The directory is created under the root location, as indicated by the leading forward slash.

kevin@node1:hadoop$ bin/hadoop dfs -mkdir /test

Uploading a file to HDFS

The following command uploads a file from the local host to the HDFS file system using the -put command. The first parameter of the -put command is the name of the file to upload. The second parameter is the HDFS directory to upload the file to.

kevin@node1:hadoop$ bin/hadoop dfs -put /home/somefile.txt /test/

If you run the -ls command on the test directory, you will see that it now contains a file called somefile.txt.

Display the contents of a file

If you want to see the contents of a text-based file, you can use the cat command followed by the name of the file you want to see the contents of. The following command prints the contents of the file somefile.txt to the screen:

kevin@node1:hadoop$ bin/hadoop dfs -cat /test/somefile.txt

For a full list of the existing commands, you can refer to the Hadoop Shell Commands on the Apache Hadoop website.

HDFS Permissions and Security

Make sure you check my next blog, because that is when I will be talking about the HDFS permissions and security.


Similar to my previous two blog posts, HDFS Architecture – Part 7 and HDFS NameNode – Part 8, in this post I will explain one last aspect of the DataNode in more detail: data integrity for blocks of data. If you have not read my two previous posts, I strongly suggest you read those first, as this post builds on them.

Hopefully, by the end of this post you will feel that you have now got a sufficient level of understanding about the concepts of the HDFS DataNode that you can confidently talk about it and understand it when someone mentions it.

Data Integrity for Blocks of Data

When a block of data is sent by the DataNode, the HDFS client checks its integrity using a checksum. Blocks of data may get corrupted for several reasons, such as network problems or faults in the storage device.

When an HDFS client creates a file, it also computes and stores a checksum for each block of data which belongs to the file. The checksums are stored in the form of a file, and the blocks are then stored across one or more DataNodes.

At a later point in time, the HDFS client requests the file, which is assembled from blocks held on one or more DataNodes. As each block of data is retrieved from the DataNodes, the HDFS client computes a checksum and compares it against the checksum that was created previously. If the checksum for any block of data does not match, the HDFS client requests a replica of that block from another DataNode.
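
The verification step can be sketched in a few lines of plain Java. This is only an illustration under some assumptions: it uses java.util.zip.CRC32 as a stand-in checksum, it keeps blocks in memory as byte arrays, and the class BlockChecksum with its methods is hypothetical rather than part of the Hadoop API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch of checksum verification with fallback to another replica.
final class BlockChecksum {
    // Compute a CRC32 checksum over the contents of one block.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    // Try the replicas in order and return the first one whose checksum
    // matches the stored value, or null if every replica is corrupted.
    static byte[] readVerified(List<byte[]> replicas, long expected) {
        for (byte[] replica : replicas) {
            if (checksum(replica) == expected) {
                return replica;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] original = "contents of one block".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(original);   // computed when the file was written

        byte[] corrupted = original.clone();
        corrupted[0] ^= 1;                  // flip one bit to simulate corruption

        // The first replica fails verification, so the second one is used.
        byte[] block = readVerified(List.of(corrupted, original), stored);
        System.out.println(new String(block, StandardCharsets.UTF_8));
    }
}
```

The client never has to know why a replica is bad; a mismatch is enough for it to move on to the next DataNode.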

HDFS Command Line Interface

In my next blog post I will be talking about the HDFS Command Line Interface where I will explain a few commands such as how to list files and create a directory. So hopefully you will check my blog again next week.


Following on from my last post, HDFS Architecture – Part 7, in this post I will be explaining a couple of extra things you should know about the HDFS NameNode. If you have not read my previous post, I strongly suggest you read it first, because this post builds on it. Specifically, the following two aspects will be explained in this post:

  • Heart Beat Failures
  • Corruption of the EditLog and FsImage File

At the end of this post combined with my previous post, you will hopefully feel that you now have a good level of understanding about the HDFS NameNode such that you can confidently talk about it and understand it when someone mentions it.

Heart Beat Failures

As explained in the previous post, a DataNode sends Heart Beats to the NameNode on a regular basis to indicate that it is still active. The NameNode is able to detect when a DataNode has failed to send a Heart Beat. When this happens, the NameNode stops sending any IO requests to the failed DataNode.

If a DataNode has died, the replicas of the blocks it held also become inaccessible. This may therefore trigger the NameNode to create additional replicas of the affected blocks, so that the data remains available if further DataNodes holding the same blocks fail.
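
As a rough illustration of how missed Heart Beats can be detected, here is a small sketch in plain Java. The class HeartbeatMonitor, its methods and the ten-second timeout are assumptions made for this example; they do not reflect the actual NameNode code or its default settings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks the time of the last Heart Beat per DataNode.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a Heart Beat received from a DataNode at the given time.
    void heartbeat(String dataNode, long nowMillis) {
        lastSeen.put(dataNode, nowMillis);
    }

    // DataNodes whose last Heart Beat is older than the timeout, sorted by name.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.heartbeat("dn1", 0);
        monitor.heartbeat("dn2", 0);
        monitor.heartbeat("dn1", 9_000);   // dn1 keeps reporting, dn2 goes silent
        System.out.println(monitor.deadNodes(15_000));   // prints [dn2]
    }
}
```

Once a node appears in the dead list, a component like the NameNode would stop routing IO to it and schedule new replicas for the blocks it held.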

Corruption of the EditLog and FsImage File

Both the EditLog and FsImage files are crucial for the functioning of HDFS. For this reason, the NameNode can be configured so that multiple copies of these files are maintained and kept in sync whenever the underlying file and directory metadata changes.

Since the NameNode is a single computer within the cluster, it is a single point of failure. If it fails, manual intervention is required to fix the issue. This is definitely an area which must be, and probably will be, improved in the future by means of automated restarts and fail-over of the NameNode to another computer in the cluster.

HDFS DataNode

In my next blog post, I will talk about the second most important aspect of HDFS – the DataNode. So make sure you revisit my blog in a few days to gain a more detailed understanding about the DataNode just like I have done in this post for the NameNode.


Following from my previous post, HDFS Introduction – Part 6, in this blog post I will explain in detail the HDFS Architecture.

HDFS (Hadoop Distributed File System) is a distributed file system which is able to run on commodity hardware. It is different from other distributed file systems for the following reasons:

  • Able to run on commodity hardware
  • Provides robust fault tolerance
  • Able to process, manage and access large amounts of data distributed across the cluster of computers

HDFS was originally built as infrastructure for the Apache Nutch project, but has now become an Apache Hadoop sub-project.

Since data is stored across hundreds or thousands of computers in a cluster, hardware failure is considered the norm in HDFS rather than the exception. However, HDFS has the capability to identify and quickly recover from any such failures and therefore provides continuity without interruption.

Another important point about HDFS is that it is designed for batch processing large amounts of data rather than small amounts of data retrieved through user interactivity.

HDFS’s main appeal is that it allows applications which require access to large amounts of data to run on it. A single file may be gigabytes or even terabytes in size. A single HDFS instance may store millions of files, depending on the available storage capacity. Similarly, HDFS is very flexible and scalable, and can support a cluster of hundreds or thousands of computers.

Files stored on HDFS cannot be modified once written and saved and are therefore read-only in nature. HDFS only allows one writer to a file at a time. In the future, HDFS may be extended so that at least data can be appended to files.

A benefit of HDFS is that it is very portable and therefore can be run on different hardware and software platforms.

HDFS File Organisation

HDFS supports hierarchical file organisation. The HDFS file system namespace is similar to that of most other file systems in that it allows users or applications to create directories and store files within them. Files can be renamed or moved between directories.

The NameNode is a crucial component within the HDFS architecture. It is responsible for managing the file system namespace. Applications can specify the number of replicas of a file which should be maintained by HDFS, and the NameNode records these details. The number of replicas of a file is called the file replication factor. This information is important so that, in case of any computer failures, the cluster can recover from the failure and ensure continuity of processes needing access to a file which was stored on the failed computer. The replication factor can be defined when the file is created and changed at a later point in time.

To ensure that the information held by the NameNode does not get out of sync, it receives the following from each DataNode in the cluster:

  • Heart beat: A heart beat from the DataNode to the NameNode tells the NameNode that the DataNode is still alive.
  • Block report: This provides the NameNode with a list of block names within a DataNode.

The diagram below provides a visual summary of the abovementioned HDFS functions:

HDFS Architecture


File System Metadata

HDFS uses a transaction log file called EditLog which records any changes made to the file system metadata. For instance, if a file is moved from one directory to another or the file replication factor is changed, then HDFS creates an entry in the EditLog. The EditLog is a file stored in the file system of the NameNode’s local operating system. Another file called FsImage stores information about the entire file system, including which blocks belong to which file. This file is also stored in the NameNode’s local file system together with the EditLog file.

The file system metadata is much smaller in size than the actual data files. For example, 4 GB of space is sufficient to store the metadata for a large number of files and directories.

When the NameNode is started, it reads the contents of both files, EditLog and FsImage. All transactional changes recorded in the EditLog are applied to the image read from FsImage, and this creates a new FsImage which is stored on disk. All the recorded transactions in the EditLog are then deleted, since these have been applied to the new FsImage. This process is called a CheckPoint, and it only happens when the NameNode is started.
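
The CheckPoint process can be sketched as follows in plain Java. This is a deliberately simplified model: the FsImage is reduced to a set of file paths, the EditLog to a list of CREATE, DELETE and RENAME entries, and the class Checkpoint is a hypothetical name. The real files hold far more information than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a checkpoint: replay the edit log onto the image.
final class Checkpoint {
    // Apply every logged change to a copy of the image, then clear the log.
    static Set<String> checkpoint(Set<String> fsImage, List<String[]> editLog) {
        Set<String> merged = new TreeSet<>(fsImage);
        for (String[] edit : editLog) {
            switch (edit[0]) {
                case "CREATE":
                    merged.add(edit[1]);
                    break;
                case "DELETE":
                    merged.remove(edit[1]);
                    break;
                case "RENAME":
                    merged.remove(edit[1]);
                    merged.add(edit[2]);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown edit: " + edit[0]);
            }
        }
        editLog.clear();   // the logged changes are now part of the new image
        return merged;
    }

    public static void main(String[] args) {
        Set<String> fsImage = new TreeSet<>(Set.of("/a.txt"));
        List<String[]> editLog = new ArrayList<>();
        editLog.add(new String[] {"CREATE", "/b.txt"});
        editLog.add(new String[] {"RENAME", "/a.txt", "/c.txt"});

        Set<String> newImage = checkpoint(fsImage, editLog);
        System.out.println(newImage);         // prints [/b.txt, /c.txt]
        System.out.println(editLog.size());   // prints 0
    }
}
```

Keeping the log separate from the image means each metadata change is a cheap append, and the more expensive rewrite of the image happens only at the checkpoint.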

A DataNode stores HDFS data locally in its file system, but is not aware of HDFS files. Each piece of HDFS data is a block that forms part of a file. DataNodes store a certain number of blocks in a single directory, determined by a heuristic. When there are more blocks left to store, these blocks are stored inside a sub-directory. This process continues until all blocks of data have been stored. When the DataNode is started, it goes over this directory of blocks and generates a report. This report is called the Block Report and is sent to the NameNode.

HDFS NameNode

In my next post, I will be focusing specifically on the HDFS NameNode by explaining a few more details about it so that you have a full understanding about it.


Following on from my previous post, Hadoop Streaming – Part 5, in this post I will provide a detailed introduction to HDFS.

HDFS is the Hadoop Distributed File System, but what actually is a distributed file system? A distributed file system allows large volumes of data to be stored across a cluster of computers and provides access to that data. There are different distributed file systems, but I will only explain the Network File System (NFS).

NFS – a distributed file system

NFS is one of the most popular distributed file systems. It is quite old and constrained, but its design is simple.

NFS provides remote access to a particular piece of data on a single computer. An NFS server makes part of its file system accessible to external clients. This allows clients to mount the remote file system into their own file system and work with it in a transparent manner, as if it was part of the local system. This means that clients do not need to be aware of the fact that they are working with a remote file system.

On the other hand, the limitation of this type of distributed file system is that the amount of data which can be stored is limited to the amount of data which can be stored on the single machine.

In addition, it offers no built-in reliability: if the remote file system goes down, there is no backup, e.g. through replication of the files to another server.

A further limitation of NFS is that all clients must access the same computer to access the data. If there are a lot of clients accessing the same computer then this would cause performance issues. It also means that the users must copy the data first onto their own machine before they can work with the data.

HDFS – a better distributed file system

HDFS overcomes many of the problems which other distributed file systems have, including the abovementioned limitations of NFS:

  • HDFS is able to store large volumes of data (gigabytes or terabytes of data) across a cluster of computers rather than being limited to the storage space available on a single computer.
  • HDFS offers more robust fault-tolerance capabilities, such that in case of a computer failure the data is not corrupted and is still available.
  • HDFS is able to provide fast response times to information requests since queries are distributed.
  • HDFS can easily be scaled up to serve an increasing number of clients wanting to access data, simply by adding more computers to the cluster.
  • Using Hadoop MapReduce it is possible to read and work with the data on clients’ local computers.
  • HDFS can store structured as well as unstructured data making it particularly helpful for Big Data purposes where data is often unstructured or semi-structured. Dealing with unstructured data is the responsibility of the coder compared to structured data which has an underlying data model and is stored in a database. MapReduce works with Java APIs for the coding and files are loaded into HDFS.

Based on the history of how Hadoop originated as explained in the Hadoop History – Part 2 blog post, the HDFS architecture is based on the Google File System (GFS).

HDFS is a block-structured file system in which files are split up into equally-sized blocks of data. The blocks are then stored across the computers within the cluster. The individual computers within the cluster storing these blocks of data are called DataNodes.

The blocks of data are stored on randomly chosen computers. Therefore, if you want to retrieve a complete file, multiple blocks across computers may have to be accessed and assembled. Even though this sounds like a drawback, it overcomes the problem of being limited to the storage available on a single computer.

Splitting up data into blocks also means that if a file is larger than the hard drive of a single computer, it can be split up and stored across multiple computers, which otherwise would not have been possible. This also allows HDFS to handle very large data inputs.

An obvious problem is that if a file is so large that it has to be split into multiple blocks and stored across several computers, then the file would be incomplete during assembly if any one of those computers broke down. This is not desirable. HDFS overcomes this problem by replicating each block across multiple computers. By default, each block is stored on 3 computers. This obviously leads to duplicated data, but since Hadoop is able to use commodity computers, the cost of replication should not be regarded as an issue, especially considering that this feature provides a higher level of reliability, which is important during failures.
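
To see why replication helps, consider the following sketch in plain Java. It is a toy model and not HDFS's placement policy: replicas are assigned round-robin rather than randomly so that the result is predictable, and the class ReplicaPlacement, its methods and the node names n1 to n4 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: place each block on several distinct computers.
final class ReplicaPlacement {
    // Assign each block to `replication` distinct nodes, round-robin.
    static Map<Integer, List<String>> place(int blocks, List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("Not enough nodes for the replication factor");
        }
        Map<Integer, List<String>> placement = new TreeMap<>();
        for (int block = 0; block < blocks; block++) {
            List<String> holders = new ArrayList<>();
            for (int replica = 0; replica < replication; replica++) {
                holders.add(nodes.get((block + replica) % nodes.size()));
            }
            placement.put(block, holders);
        }
        return placement;
    }

    // A file can be assembled if every block has at least one holder still alive.
    static boolean readable(Map<Integer, List<String>> placement, Set<String> failedNodes) {
        for (List<String> holders : placement.values()) {
            if (failedNodes.containsAll(holders)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("n1", "n2", "n3", "n4");
        Map<Integer, List<String>> placement = place(4, nodes, 3);
        System.out.println(readable(placement, Set.of("n1", "n2")));         // true
        System.out.println(readable(placement, Set.of("n1", "n2", "n3")));   // false
    }
}
```

With three replicas per block, any two computers can fail without the file becoming unreadable. Only when all three holders of some block are down is the file incomplete.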

Most block-structured file systems have a block size that is between 4 and 8 KB. However, the default block size in HDFS is 64 MB, which is much larger. The advantage of such a large block is that it provides faster reading of data which is stored sequentially.

Compared to other distributed file systems such as NFS, Hadoop expects a smaller number of larger files, whereas e.g. NFS typically holds a larger number of smaller files. Hadoop’s files may therefore be hundreds of megabytes or even gigabytes in size. Even a file of 100 MB does not fill two blocks completely.
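
The arithmetic behind the 100 MB example can be checked with a few lines of plain Java. The class BlockMath is a hypothetical helper written for this post, and it assumes the default block size of 64 MB mentioned above.

```java
// Hypothetical helper for block arithmetic with a 64 MB block size.
final class BlockMath {
    static final long MB = 1024L * 1024L;
    static final long BLOCK_SIZE = 64 * MB;

    // Number of blocks needed to hold a file of the given size (rounded up).
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Number of bytes that end up in the last block of the file.
    static long lastBlockBytes(long fileBytes) {
        long remainder = fileBytes % BLOCK_SIZE;
        return (fileBytes > 0 && remainder == 0) ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long fileBytes = 100 * MB;
        System.out.println(blockCount(fileBytes) + " blocks");                    // prints 2 blocks
        System.out.println(lastBlockBytes(fileBytes) / MB + " MB in last block"); // prints 36 MB in last block
    }
}
```

So a 100 MB file occupies one full 64 MB block plus a second block holding only 36 MB.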

Typically on a computer, multiple specific locations within a file may be accessed randomly to retrieve small amounts of data. HDFS, on the other hand, reads blocks from start to finish.

The files stored in HDFS are not part of the ordinary file system. If you run the Linux command ls on a computer running a DataNode daemon to view the contents of a directory, you will not see the files stored in HDFS but merely the files used to host the Hadoop services. The local computer files are therefore separate from the files which are in HDFS.

The reason behind this is that HDFS operates in a separate namespace which is isolated from the local files. The blocks that make up the files in HDFS are stored in a specific directory which is managed by the DataNode. If you want to work with the HDFS files, the usual Linux commands like cd, mv, ls etc. will not work. HDFS has its own tools for file management.

It is also important to note that once a file in HDFS has been written, it cannot be modified but only be read.

HDFS must store the metadata (names of files and directories) reliably. Since multiple clients will be accessing HDFS and also modifying this metadata, it is crucial that the metadata never becomes desynchronised. To prevent this problem from occurring, the metadata management is handled by a single machine called the NameNode, which stores and manages all the metadata for the file system.

The metadata stores the following information for each file:

  • file names
  • directory names
  • permissions
  • locations of each block which makes up a file

This metadata is frequently accessed. To allow for fast access, this information can be stored inside the NameNode’s main memory.

When a file must be accessed, the client contacts the NameNode, which provides a list of the locations of the blocks that make up the file. The locations point to the DataNodes which hold the individual blocks. Rather than having the NameNode assemble the blocks into a complete file and retrieving the file through it, clients read the blocks directly from the DataNodes in parallel. This ensures that the NameNode has only one specific purpose and avoids overloading it with several tasks.
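
The read path can be modelled in a few lines of plain Java. This sketch makes several simplifying assumptions: the NameNode and the DataNodes are just in-memory maps, blocks are read one after the other rather than in parallel, and the class ReadPath along with the node and block names is hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a client read: ask for locations, then fetch the blocks.
final class ReadPath {
    // nameNode:  file name -> ordered list of {dataNode, blockId} locations
    // dataNodes: dataNode  -> (blockId -> block contents)
    static String read(String file,
                       Map<String, List<String[]>> nameNode,
                       Map<String, Map<String, String>> dataNodes) {
        StringBuilder contents = new StringBuilder();
        // The NameNode only supplies the locations, never the data itself.
        for (String[] location : nameNode.get(file)) {
            String dataNode = location[0];
            String blockId = location[1];
            // The client fetches each block directly from the DataNode holding it.
            contents.append(dataNodes.get(dataNode).get(blockId));
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String[]>> nameNode = Map.of(
                "/test/somefile.txt",
                List.of(new String[] {"dn1", "blk_1"}, new String[] {"dn2", "blk_2"}));
        Map<String, Map<String, String>> dataNodes = Map.of(
                "dn1", Map.of("blk_1", "Hello, "),
                "dn2", Map.of("blk_2", "HDFS"));
        System.out.println(read("/test/somefile.txt", nameNode, dataNodes));   // prints Hello, HDFS
    }
}
```

Notice that the file contents never pass through the nameNode map, which mirrors the division of work between NameNode and DataNodes.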

It may happen that the NameNode fails, but this will not impact the metadata, since multiple redundant systems allow the NameNode to preserve the HDFS metadata in case of a failure.

A NameNode failure is more severe for the cluster than a DataNode failure. Any NameNode failure makes the cluster inaccessible until the NameNode is restored. On the other hand, if a DataNode crashes, the cluster still continues to operate.

However, since the NameNode is not as heavily involved in the operation of the cluster as the DataNodes, the chances of it failing are relatively low.

The diagram below summarises the above architectural components.

HDFS Architecture

HDFS Architecture

The next blog post, which I will publish some time next week, will explain the HDFS architecture.


Following on from last week’s post, Hadoop Components and Other Related Projects – Part 4, this post will introduce you to the Hadoop Streaming concepts.

In a nutshell, Hadoop streaming allows you to create and run a Map or Reduce job using a script or executable as a mapper or reducer.

How the Mapper works

When you use an executable file as a Mapper, each mapper task will run the executable file as a separate process. As the mapper task runs, it converts its input into lines and feeds these into the stdin of the process. The mapper then collects the lines which the process writes to its stdout and converts each line into a key-value pair. These key-value pairs are the output of the mapper.
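
The conversion of an output line into a key-value pair can be sketched as below. By default Hadoop Streaming treats everything up to the first tab character as the key and the rest of the line as the value. The class StreamingLine is a hypothetical helper in plain Java, and returning an empty string as the value for a line without a tab is a simplification made for this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: split one line of stdout into a key-value pair.
final class StreamingLine {
    // The key is the text before the first tab, the value is everything after it.
    static Map.Entry<String, String> toKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No tab: the whole line is the key (empty value in this sketch).
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> pair = toKeyValue("hadoop\t1");
        System.out.println(pair.getKey() + " -> " + pair.getValue());   // prints hadoop -> 1
    }
}
```

This is why a script used as a mapper typically prints lines such as a word, a tab and a count.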

How the Reducer works

The Reducer takes the output from the Mapper. When an executable file is used as a reducer, each reducer task will run the executable file as a separate process. As the reducer task runs, it converts its input, which consists of key-value pairs, into lines and feeds these to the stdin of the process. The reducer collects the line-based output from the stdout of the process and converts each line into a key-value pair. These key-value pairs are the output of the reducer.

A Reducer reduces the intermediate data which share a key to a smaller set of values. The Reducer has three phases:

Shuffle: Input to the Reducer is the sorted output of the Mapper.

Sort: Reducer inputs are grouped by keys.

The Shuffle and Sort phases occur simultaneously, i.e. the Mapper outputs are merged while they are being fetched.

Reduce: In this phase the method reduce(WritableComparable, Iterator, OutputCollector, Reporter) is called for each key and its list of values in the grouped inputs. The values passed in are those which were collected for that particular key during the Shuffle and Sort. The output of this phase is not sorted.
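
The grouping and reducing steps can be illustrated with a small word-count style sketch in plain Java. It is not the Hadoop API: the class GroupAndReduce is hypothetical, the mapper output is a simple in-memory list of key-value pairs, and summing the values is just one possible reduce function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group mapper output by key, then reduce each group.
final class GroupAndReduce {
    static Map<String, Integer> reduce(List<String[]> mapperOutput) {
        // Sort: a TreeMap keeps the keys ordered and collects the values per key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapperOutput) {
            grouped.computeIfAbsent(pair[0], key -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        // Reduce: called once per key with all values for that key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            result.put(group.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> mapperOutput = List.of(
                new String[] {"pig", "1"},
                new String[] {"hive", "1"},
                new String[] {"pig", "1"});
        System.out.println(reduce(mapperOutput));   // prints {hive=1, pig=2}
    }
}
```

The two values for the key pig arrive in the reduce step together, which is exactly what the Shuffle and Sort phases are there to guarantee.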

HDFS Introduction

Next week I will be publishing another blog post on the subject of Hadoop specifically to provide an introduction to HDFS first before further posts in the following weeks go into a lot more detail. So keep watching this blog.


Following on from my previous post, Hadoop Installation – Part 3, in this post I will explain at a high-level the two main Hadoop components, HDFS and MapReduce, and also touch on some of the most popular Hadoop related projects such as Hive, HBase, Pig and others.

HDFS (Hadoop Distributed File System)

HDFS is the distributed file system which gives Hadoop its scalability and stores data across the computers in the cluster to enhance performance. I will go into a lot more detail about this in my upcoming posts, so I am keeping it very brief in this post.


MapReduce

MapReduce is the parallel-processing engine which actually runs distributed code and analyses large volumes of data in the cluster. Since processing requests are distributed across the cluster, processing response times are significantly faster than when executing on a single computer.

Other Hadoop-related applications are usually built on top of HDFS or MapReduce. For instance, Hive and Pig both provide query languages similar to SQL which can query data on the Hadoop cluster and give the kind of performance you would see in data warehouses.

Since I will also be going into a lot more detail in future blog posts, I am keeping it very brief in this blog post.


Hive

Hive has the capability to take standard SQL code and convert it into a MapReduce job, which then allows you to run your SQL across a distributed file system, i.e. the cluster of computers. Related to Hive is Hue, which is a browser-based GUI for doing the Hive work.


Pig

Pig is a programming environment for coding MapReduce jobs. The Pig programming language has its own name and is called Pig Latin.


HBase

Another popular application you may have already heard of in relation to Hadoop is HBase. This is a NoSQL (Not Only SQL) database which relies on HDFS as its distributed storage engine. This makes it a highly scalable database, but note that it is not a relational database. This is important to remember, because Hadoop’s power and popularity is partly due to its ability to process large volumes of structured as well as unstructured data.


ZooKeeper is a centralised service for maintaining configuration information and naming, and for providing distributed synchronisation and group services. All of these are used by distributed applications, including a Hadoop cluster. It is a complex application framework which has to deal with race conditions, a typical challenge in multi-threaded programming. The good thing is that you do not need to do much with it, because it does all the work for you. It would be very unusual if you had to write a program which uses ZooKeeper.


Chukwa is another useful open source application which provides a data collection facility for monitoring large distributed systems such as Hadoop. It can be particularly useful for very large Hadoop clusters. Chukwa is built on top of HDFS and MapReduce and therefore benefits from the scalability and reliability that Hadoop has to offer. Chukwa also has a toolkit for displaying, monitoring and analysing the results, so that any necessary actions can be taken based on the collected data.


Cassandra is another open source application which provides scalability and high availability/fault tolerance. Cassandra can replicate across multiple data centers, provides low latency and gives users the assurance that it can deal with events such as outages in a very efficient manner. Cassandra's data model offers column indexes, a denormalised representation of the data, materialised views and built-in caching. Note that Cassandra uses its own storage layer rather than HDFS, but it can be integrated with Hadoop so that MapReduce jobs can work with the data held in it.

So all the above applications, and many more which I have not explained and will not explain in these tutorials, are complementary to Hadoop and enhance its usage. Scalability is one of the important features of Hadoop, and hence a lot of related applications are built on HDFS. We have also seen that there are applications, such as Pig, which are built on top of MapReduce.

Depending on your requirements, you can glue together relevant applications with Hadoop.

Hadoop Packaged Solutions

If you do not have the time or patience to download each of the applications you require, there are also packaged solutions available which bundle up various Hadoop projects.

These packaged solutions also allow you to manage the Hadoop cluster from one place rather than through multiple applications.

Popular vendors offering packaged Hadoop solutions include Cloudera, Hortonworks, MapR and EMC. Each of these vendors offers a different set of packaged Hadoop solutions, although they all bundle together a set of common Hadoop applications such as HDFS, MapReduce, Hive, HBase etc. Having a packaged solution also means that the vendor has taken care of making sure that your entire Hadoop framework works smoothly in an integrated fashion.

Hadoop Streaming

In next week’s blog post I will talk about Hadoop Streaming. So make sure you visit my blog again next week.

About the author

I am Manjeet Singh Sawhney and work as a Principal Consultant - Data Architect for Wipro Consulting Services in London (UK). Prior to this, I have worked for Direct Line Group, Accenture, Tibco Software, Initiate Systems (now IBM) and Tata Consultancy Services. My areas of expertise are Master Data Management (Customer, Product, Reference), Metadata Management, Data Governance, Data Quality, Data Integration, Data Migration, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Data Profiling and Data Analysis. I am using this blog to share my knowledge and experience in these areas and hope that you will find it useful.
