…covering everything which includes the word "data" or "information"
Following on from the last post, HDFS Permissions and Security – Part 11, this blog post will show a small Java application which interacts with HDFS.
As shown in one of my previous posts, Hadoop Command Line Interface – Part 10, users can interact with HDFS using commands on the command line. Another way of interacting with HDFS is through the Hadoop API.
For example, the Java code below reads the contents of each file in a directory and then prints out its content to the screen. I assume that you are familiar with Java.
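import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class PrintFileContents {
    public static void main(String[] args) throws Exception {
        try {
            FileSystem fs = FileSystem.get(new Configuration()); // file system from the Hadoop configuration
            FileStatus[] status = fs.listStatus(new Path("hdfs://tutorialspoint.com:9000/user/kevin/mydir"));
            for (int i = 0; i < status.length; i++) { // one entry per file in the directory
                BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(status[i].getPath())));
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // print the file line by line
                }
                reader.close();
            }
        } catch (Exception e) {
            System.out.println("File not available");
        }
    }
}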
This was my last post on HDFS. Make sure you read my next blog post, in which I will provide an introduction to MapReduce. Several subsequent posts will then go into more detail about MapReduce.
Following on from my last blog post, HDFS Command Line Interface – Part 10, this week I will explain the HDFS permissions and security.
HDFS has a simple file permissions system which is based on the POSIX model. However, this does not provide a strong mechanism to protect HDFS files from unauthorised access. The system is intended only to prevent accidental damage to files and casual misuse of information by users who have access to a cluster.
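File and directory permissions can be assigned to users or groups of users. In line with the POSIX model, the following permissions can be defined: read (r), write (w) and execute (x).
Permissions can be defined at three different levels of granularity: the owner of the file or directory, the group it belongs to, and all other users.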
Permissions can be changed using the same commands as in Linux, i.e. the -chmod, -chown and -chgrp commands.
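To make the permission model a little more concrete, here is a minimal sketch in plain Java. It is not Hadoop code: the class PermissionCheck, its method and the sample users and groups are all made up for illustration. It assumes the usual POSIX-style rule that the owner bits apply to the owner, the group bits to members of the file's group, and the remaining bits to everyone else.

```java
import java.util.Set;

// Toy model of a POSIX-style permission check; not Hadoop's implementation.
class PermissionCheck {
    static final int READ = 4, WRITE = 2, EXECUTE = 1;

    // mode is an octal value such as 0640 (owner rw-, group r--, other ---).
    static boolean isAllowed(int mode, String fileOwner, String fileGroup,
                             String user, Set<String> userGroups, int wanted) {
        int bits;
        if (user.equals(fileOwner)) {
            bits = (mode >> 6) & 7;          // owner triplet
        } else if (userGroups.contains(fileGroup)) {
            bits = (mode >> 3) & 7;          // group triplet
        } else {
            bits = mode & 7;                 // everyone else
        }
        return (bits & wanted) == wanted;
    }

    public static void main(String[] args) {
        int mode = 0640;
        // Hypothetical users: kevin owns the file, maria is in its group, guest is neither.
        System.out.println(isAllowed(mode, "kevin", "analysts", "kevin", Set.of("staff"), WRITE));
        System.out.println(isAllowed(mode, "kevin", "analysts", "maria", Set.of("analysts"), READ));
        System.out.println(isAllowed(mode, "kevin", "analysts", "guest", Set.of("visitors"), READ));
    }
}
```

With mode 640 the owner may write, a group member may read, and any other user is refused.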
HDFS does not perform any checks to validate the user. Instead Hadoop uses the login username used to log into Hadoop. Similarly, the user’s current working group list is used as the group list in Hadoop.
This system of user identification is rather simple, but it may be enhanced in the future as Hadoop evolves.
As in Linux, Hadoop also has a superuser. In Hadoop, the superuser is the user who started Hadoop. To run a command as the superuser, it must be executed under that user account.
For a superuser, file permissions are overridden, and therefore a superuser has full control over files and directories. Because of this high level of privilege and the accompanying risk of making a mistake, it is recommended to use the superuser role only when absolutely needed. It is better to set up users with restricted permissions instead.
The superuser changes if another user starts the Hadoop framework.
There is also a supergroup to which superusers can be assigned. This group is set up in the configuration file using the parameter dfs.permissions.supergroup.
Permissions can be disabled by setting the parameter dfs.permissions to false. By default, file and directory permissions are enabled in HDFS. If permissions are disabled, these are not deleted but simply retained so that these can be enabled again in the future. Disabling permissions means that the HDFS system does not enforce any permissions on files and directories.
HDFS and Java Programming
In my next blog post I will give a brief example of some Java code that interacts with the HDFS. So make sure you check my blog next week.
Following from my last blog post, HDFS DataNode – Part 9, in this post I will go over a few of the HDFS commands which you can use to interact with your HDFS cluster through the command line interface.
Getting started with the commands
Most commands start with the command script bin/hadoop. This starts Hadoop with the Java Virtual Machine and allows you to run a command. The commands are specified as follows:
username@machine:hadoop$ bin/hadoop moduleName -cmd args
The moduleName defines which subset of Hadoop functionality to use. There are two modules relevant to HDFS: dfs and dfsadmin.
-cmd defines the command in this module to execute followed by arguments required by the command.
The following sections explain some of the common commands you can execute.
Assuming that you have set up a user called kevin and a node node1, the following command will show the contents of the root directory inside HDFS.
kevin@node1:hadoop$ bin/hadoop dfs -ls /
The command also shows that the -ls command is part of the dfs module.
Creating a Home directory
The following command creates a new directory test using the mkdir command. This directory will be under the root location as indicated by the forward slash.
kevin@node1:hadoop$ bin/hadoop dfs -mkdir /test
Uploading a file to HDFS
The following command uploads a file from the local host to the HDFS file system using the put command. The first parameter of the put command is the name of the file to upload. The second parameter is the directory to which the file should be uploaded.
kevin@node1:hadoop$ bin/hadoop dfs -put /home/somefile.txt /test/
If you run the previous -ls command on the test directory to view a list of files, you will see that it now contains a file called somefile.txt.
Display the contents of a file
If you want to see the contents of a text-based file, you can use the cat command followed by the name of the file you want to see the contents of. The following command prints the contents of the file somefile.txt to the screen:
kevin@node1:hadoop$ bin/hadoop dfs -cat /test/somefile.txt
For a full list of existing commands, you can refer to the Hadoop Shell Commands on the Apache Hadoop website.
HDFS Permissions and Security
Make sure you check my next blog, because that is when I will be talking about the HDFS permissions and security.
Similar to my previous two blog posts, HDFS Architecture – Part 7 and HDFS NameNode – Part 8, in this post I will explain one last aspect of the DataNode in more detail: data integrity for blocks of data. If you have not read my two previous posts, I strongly suggest you read them first, as this post builds on them.
Hopefully, by the end of this post you will feel that you have now got a sufficient level of understanding about the concepts of the HDFS DataNode that you can confidently talk about it and understand it when someone mentions it.
Data Integrity for Blocks of Data
When a block of data is sent by the DataNode, the HDFS client checks the integrity of it using a checksum. Blocks of data may get corrupted for several reasons such as network problems or faults in the storage device.
When an HDFS client creates a file, it also creates and stores a checksum for each block of data which belongs to the file. The checksum is stored in the form of a file, and the blocks are then stored across one or more DataNodes.
At a later point of time, the HDFS client requests the file which will be aggregated from one or more DataNodes. As each block of data is retrieved from the DataNodes, the HDFS client computes a checksum and compares it against the checksum that it created previously. If the checksum for any block of data does not match then the HDFS client requests a replicated block of data from another DataNode.
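The checksum comparison described above can be sketched in a few lines of plain Java. This is only an illustration and not how Hadoop implements it: the class BlockChecksum and its methods are made up, and I am using CRC32 from the JDK simply because it is readily available.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

// Toy model of client-side block verification; not Hadoop's implementation.
class BlockChecksum {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    // Returns the first replica whose checksum matches the stored one,
    // or null if every replica is corrupt.
    static byte[] readVerified(List<byte[]> replicas, long expected) {
        for (byte[] replica : replicas) {
            if (checksum(replica) == expected) {
                return replica;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] original = "block-0 contents".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(original);          // computed when the file was written
        byte[] corrupted = original.clone();
        corrupted[0] ^= 0x01;                      // simulate a flipped bit on one DataNode
        byte[] result = readVerified(List.of(corrupted, original), stored);
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

The corrupted copy fails the comparison, so the sketch falls back to the second replica, just as the client would request the block from another DataNode.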
HDFS Command Line Interface
In my next blog post I will be talking about the HDFS Command Line Interface where I will explain a few commands such as how to list files and create a directory. So hopefully you will check my blog again next week.
Following on from my last post, HDFS Architecture – Part 7, in this post I will be explaining a couple of extra things you should know about the HDFS NameNode. If you have not read my previous post, I strongly suggest you read it first, because this post builds on it, specifically on the following two aspects which will be explained in this post: Heart Beat failures, and corruption of the EditLog and FsImage file.
At the end of this post combined with my previous post, you will hopefully feel that you now have a good level of understanding about the HDFS NameNode such that you can confidently talk about it and understand it when someone mentions it.
Heart Beat Failures
As explained in the previous post, a DataNode sends Heart Beats to the NameNode on a regular basis to indicate that it is still active. The NameNode is able to detect when a DataNode has failed to send a Heart Beat. When this happens, the NameNode stops sending any IO requests to the failed DataNode.
If a DataNode has died, it also means that replicated blocks become inaccessible. This therefore may trigger the NameNode to increase the replications of a particular set of blocks to ensure continuity in case further DataNodes fail with the same blocks of data.
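As a rough illustration of how missing Heart Beats could be detected, here is a small plain-Java sketch. The class HeartbeatMonitor, the node names and the 10 second timeout are all my own assumptions for the example and do not reflect Hadoop's actual code or default settings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of heartbeat tracking; names and the timeout value are illustrative only.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void recordHeartbeat(String dataNode, long nowMillis) {
        lastSeen.put(dataNode, nowMillis);
    }

    // DataNodes whose last heartbeat is older than the timeout are treated as failed.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000);
        monitor.recordHeartbeat("node1", 0);
        monitor.recordHeartbeat("node2", 0);
        monitor.recordHeartbeat("node1", 9_000);   // node2 stays silent
        System.out.println(monitor.deadNodes(12_000));
    }
}
```

In this example only node2 is reported as failed, because node1 sent another Heart Beat within the timeout.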
Corruption of the EditLog and FsImage File
Both the EditLog and FsImage file are crucial for the functioning of HDFS. For this reason, the NameNode can be configured such that these files are replicated and kept in synch when any changes are made to the underlying files and directory metadata.
Since the NameNode is a single computer within the cluster, it creates a single point of failure. Manual intervention is required to fix the issue. This is definitely an area which must be and probably will be improved in the future by means of automated restarts and fail-over of the NameNode to another computer in the cluster.
In my next blog post, I will talk about the second most important aspect of HDFS – the DataNode. So make sure you revisit my blog in a few days to gain a more detailed understanding about the DataNode just like I have done in this post for the NameNode.
Following from my previous post, HDFS Introduction – Part 6, in this blog post I will explain in detail the HDFS Architecture.
HDFS (Hadoop Distributed File System) is a distributed file system which is able to run on commodity hardware. It is different from other distributed file systems for the following reasons:
HDFS was originally built as infrastructure for the Apache Nutch project, but has now become an Apache Hadoop sub-project.
Since data is stored across hundreds or thousands of computers in a cluster, hardware failure is considered the norm in HDFS rather than the exception. However, HDFS has the capability to identify and quickly recover from any such failures and therefore provides continuity without interruption.
Another important point about HDFS is that it is designed for batch processing large amounts of data rather than small amounts of data retrieved through user interactivity.
HDFS’s main appeal is that it allows applications which require access to large amounts of data to run on it. A single file may be gigabytes or even terabytes in size. A single HDFS instance may store millions of files, depending on the storage capacity of the computers in the cluster. Similarly, HDFS is flexible and scalable enough to support a cluster of hundreds or thousands of computers.
Files stored on HDFS cannot be modified once written and saved and therefore are read only in nature. HDFS only allows one write task at a time. In the future, HDFS may be extended so that at least data can be appended to files.
A benefit of HDFS is that it is very portable and therefore can be run on different hardware and software platforms.
HDFS File Organisation
HDFS supports hierarchical file organisation. The HDFS file system namespace is similar to that of most other file systems in that it allows users or applications to create directories and store files within them. Files can be renamed or moved between directories.
The NameNode is a crucial component within the HDFS architecture. It is responsible for managing the file system namespace. Applications can specify the number of replications of a file which should be maintained by HDFS and the NameNode records these details. The number of replications of a file is called the file replication factor. This information is important so that in case of any computer failures the cluster can recover from the failure and ensure continuity of processes needing the access to a file which was stored on the failed computer. The replication factor can be defined either when the file is created or changed at a later point of time.
To ensure that no information gets out of synch in the NameNode, it receives the following from each DataNode in the cluster: a regular Heart Beat and a Block Report.
The diagram below provides a visual summary of the abovementioned HDFS functions:
File System Metadata
HDFS uses a transaction log file called EditLog which records any changes made to the file system metadata. For instance, if a file is moved from one directory to another or the file replication factor is changed, then HDFS creates an entry in the EditLog. The EditLog is a file stored in the file system of the NameNode’s local host operating system. Another file called FsImage stores information about the entire file system, including which blocks belong to which file. This file is also stored in the NameNode’s local file system together with the EditLog file.
The file system metadata is much smaller in size than the actual data files. For example, 4 GB of space is sufficient to store metadata on a large number of files and directories.
When the NameNode is started, it reads the contents of both files; EditLog and FsImage. All transactional changes recorded in the EditLog are applied to an image of FsImage and this creates a new image of FsImage which is stored on disk. All the recorded transactions in EditLog are then deleted, since these have been applied to the new image of FsImage. This process is called CheckPoint which only happens when NameNode is started.
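The CheckPoint process can be pictured with a short plain-Java sketch. Everything in it is a simplification of my own: the class Checkpoint is made up, the FsImage is reduced to a map from path to replication factor, and only two hypothetical operations (SET_REPLICATION and RENAME) are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a checkpoint: replay logged changes onto the image, then clear the log.
class Checkpoint {
    // Each edit is {operation, path, value}; only two illustrative operations are modelled.
    static Map<String, Integer> apply(Map<String, Integer> fsImage, List<String[]> editLog) {
        Map<String, Integer> newImage = new TreeMap<>(fsImage);
        for (String[] edit : editLog) {
            switch (edit[0]) {
                case "SET_REPLICATION":
                    newImage.put(edit[1], Integer.parseInt(edit[2]));
                    break;
                case "RENAME":
                    newImage.put(edit[2], newImage.remove(edit[1]));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown operation: " + edit[0]);
            }
        }
        editLog.clear();   // the applied transactions are no longer needed
        return newImage;
    }

    public static void main(String[] args) {
        Map<String, Integer> fsImage = new TreeMap<>();
        fsImage.put("/test/somefile.txt", 3);   // path -> replication factor
        List<String[]> editLog = new ArrayList<>();
        editLog.add(new String[] {"SET_REPLICATION", "/test/somefile.txt", "2"});
        editLog.add(new String[] {"RENAME", "/test/somefile.txt", "/test/renamed.txt"});
        System.out.println(apply(fsImage, editLog) + " " + editLog.size());
    }
}
```

After the run, the new image contains the renamed file with its updated replication factor and the log is empty, which mirrors the idea of applying the EditLog to the FsImage and then discarding the applied transactions.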
A DataNode stores HDFS data locally in its file system, but it has no knowledge of HDFS files. Each piece of HDFS data is a block that makes up part of a file. DataNodes store a certain number of blocks in a single directory using some heuristics. When there are more blocks left to store, these blocks are stored inside a sub-directory. This process continues until all blocks of data have been stored. When the DataNode is started, it goes over this directory of blocks and generates a report. This report is called the Block Report and is sent to the NameNode.
In my next post, I will be focusing specifically on the HDFS NameNode by explaining a few more details about it so that you have a full understanding about it.
Following on from my previous post, Hadoop Streaming – Part 5, in this post I will provide a detailed introduction to HDFS.
HDFS is the Hadoop Distributed File System, but what actually is a distributed file system? A distributed file system stores large volumes of data across a cluster of computers and provides access to it. There are different distributed file systems, but I will only explain the Network File System (NFS).
NFS – a distributed file system
NFS is one of the most popular distributed file systems. It is quite old and constrained, but its design is simple.
NFS provides remote access to a particular piece of data on a single computer. An NFS server makes some of its file system accessible to external clients. This allows clients to mount the remote file system into their own file system and communicate with it in a transparent manner, as if it were part of the local system. This means that clients do not need to be aware of the fact that they are working with a remote file system.
On the other hand, the limitation of this type of distributed file system is that the amount of data which can be stored is limited to the amount which fits on the single machine.
In addition, it lacks reliability: if the remote file system goes down, there is no back-up, for example in the form of files replicated to another server.
A further limitation of NFS is that all clients must access the same computer to access the data. If there are a lot of clients accessing the same computer then this would cause performance issues. It also means that the users must copy the data first onto their own machine before they can work with the data.
HDFS – a better distributed file system
HDFS overcomes many of the problems which other distributed file systems have, including the abovementioned limitations of NFS:
Based on the history of how Hadoop originated as explained in the Hadoop History – Part 2 blog post, the HDFS architecture is based on the Google File System (GFS).
HDFS is a block-structured file system in which files are split up into equally-sized blocks of data. The blocks are then stored across the computers within the cluster. The individual computers within the cluster storing these blocks of data are called DataNodes.
The blocks of data are randomly stored on computers. Therefore, if you want to retrieve a complete file, multiple blocks across computers may have to be accessed and assembled. Even though this sounds like a drawback, it overcomes the problem of being limited to the amount of storage available on a single computer.
Splitting up data into blocks also means that if a file is larger than the hard drive of a single computer, it can be split up and stored across multiple computers, which would otherwise not have been possible. This allows HDFS to handle very large data inputs.
An obvious problem is that if a file is so large that it has to be split into multiple blocks and stored across several computers, the file cannot be fully assembled if any one of those computers breaks down. HDFS overcomes this problem by replicating each block across multiple computers. By default, each block is stored on 3 computers. This obviously duplicates data, but because Hadoop is able to use commodity computers, the cost of replication should not be regarded as an issue, especially considering that the purpose of this feature is to provide a higher level of reliability, which is important during failures.
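To show why replication keeps a file readable when computers fail, here is a deliberately simplified plain-Java sketch. The class ReplicaPlacement and the round-robin placement rule are my own assumptions for illustration; the way HDFS actually chooses DataNodes for replicas is more sophisticated than this.

```java
import java.util.ArrayList;
import java.util.List;

// Toy round-robin replica placement; real HDFS placement is more sophisticated.
class ReplicaPlacement {
    // Returns, for each block, the DataNodes that hold a copy of it.
    static List<List<String>> place(int blockCount, List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("Not enough DataNodes for the replication factor");
        }
        List<List<String>> placement = new ArrayList<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                holders.add(dataNodes.get((block + r) % dataNodes.size()));
            }
            placement.add(holders);
        }
        return placement;
    }

    // A file is readable as long as every block has at least one replica on a live node.
    static boolean readable(List<List<String>> placement, List<String> failedNodes) {
        for (List<String> holders : placement) {
            if (failedNodes.containsAll(holders)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        List<List<String>> placement = place(2, nodes, 3);
        System.out.println(placement);
        System.out.println(readable(placement, List.of("node2", "node3")));
    }
}
```

With a replication factor of 3, the two-block file in this example stays readable even when two of the four hypothetical nodes fail, because each block still has one copy on a surviving node.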
Most block-structured file systems have a block size of between 4 and 8 KB. However, the default block size in HDFS is 64 MB, which is much larger. The advantage of such a large block is that data stored sequentially can be read faster.
Compared to other distributed file systems such as NFS, Hadoop is designed for a smaller number of larger files, whereas NFS, for example, typically holds a larger number of smaller files. Hadoop’s files may therefore be hundreds of megabytes or even gigabytes in size. Even a file of 100 MB does not fill two blocks completely.
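The arithmetic behind the 64 MB block size and the 100 MB file mentioned above can be checked with a few lines of plain Java. The class BlockSplit and its method names are made up for this example.

```java
// Toy calculation of how a file is divided into fixed-size blocks.
class BlockSplit {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file, rounding up for the final partial block.
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Size of the last block, which may be smaller than blockSize.
    static long lastBlockSize(long fileSize, long blockSize) {
        long remainder = fileSize % blockSize;
        return (remainder == 0 && fileSize > 0) ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;   // default block size mentioned in the text
        long fileSize = 100 * MB;
        System.out.println(blockCount(fileSize, blockSize) + " blocks, last block "
                + lastBlockSize(fileSize, blockSize) / MB + " MB");
    }
}
```

A 100 MB file therefore needs two blocks: one full 64 MB block and a second block holding the remaining 36 MB.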
Typically, on a single computer, multiple specific locations within a file may be accessed randomly to retrieve small amounts of data. HDFS, on the other hand, reads blocks from start to finish.
The files stored in HDFS are not part of the ordinary file system. If you run the Linux command ls on a computer running a DataNode daemon to view the contents of a directory, you will not see the files stored in HDFS but merely the files used to host the Hadoop services. The local computer files are therefore separate from the files which are in HDFS.
The reason behind this is that HDFS operates in a separate namespace which is isolated from the local files. The blocks that make up the files in HDFS are stored in a specific directory which is managed by the DataNode. If you want to work with the HDFS files, the usual Linux commands like cd, mv, ls etc. will not work. HDFS has its own tools for file management.
It is also important to note that once a file in HDFS has been written, it cannot be modified but only be read.
HDFS must store the metadata (names of files and directories) reliably. Since multiple clients will access HDFS and also modify this metadata, it is crucial that the metadata never becomes desynchronised. To prevent this problem from occurring, the metadata management is handled by a single machine called the NameNode, which stores and manages all the metadata for the file system.
The metadata stores the following information for each file:
This metadata is frequently accessed. To allow for fast access, this information can be stored inside the NameNode’s main memory.
When a file must be accessed, the client communicates with the NameNode, which provides a list of the locations of the blocks that make up the file. The locations point to DataNodes which hold individual blocks. Rather than having the NameNode assemble the blocks into a complete file and hand it over, clients read blocks directly from DataNodes in parallel. This ensures that the NameNode has only one specific purpose and avoids overloading it with several tasks.
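The two-step read described above can be sketched in plain Java as follows. This is a toy model with made-up names (ReadPath, the node and block identifiers), it keeps everything in memory, and for simplicity it reads the blocks one after another rather than in parallel.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the read path: metadata from the NameNode, data from the DataNodes.
class ReadPath {
    // NameNode metadata: file path -> ordered list of "dataNode:blockId" locations.
    static final Map<String, List<String>> NAME_NODE = new LinkedHashMap<>();
    // DataNode storage: "dataNode:blockId" -> block contents.
    static final Map<String, String> DATA_NODES = new LinkedHashMap<>();

    static String readFile(String path) {
        List<String> locations = NAME_NODE.get(path);   // step 1: ask the NameNode
        if (locations == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        StringBuilder contents = new StringBuilder();
        for (String location : locations) {              // step 2: read each block directly
            contents.append(DATA_NODES.get(location));
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        NAME_NODE.put("/test/somefile.txt", List.of("node1:blk_1", "node3:blk_2"));
        DATA_NODES.put("node1:blk_1", "Hello, ");
        DATA_NODES.put("node3:blk_2", "HDFS");
        System.out.println(readFile("/test/somefile.txt"));
    }
}
```

Note that the NameNode map only ever hands out locations; the block contents themselves never pass through it.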
It may happen that the NameNode fails, but this will not impact the metadata, since multiple redundant systems allow the NameNode to preserve the HDFS metadata in case of a failure.
A NameNode failure is more severe for the cluster than a DataNode failure. Any NameNode failure makes the cluster inaccessible until the NameNode is restored. On the other hand, if a DataNode crashes, the cluster still continues to operate.
However, since the NameNode is not as heavily involved in the operation of the cluster as the DataNodes are, the chances of it failing are relatively low.
The diagram below summarises the above architectural components.
The next blog post, which I will publish some time next week, will explain the HDFS architecture.
Following on from last week’s post, Hadoop Components and Other Related Projects – Part 4, this post will introduce you to the Hadoop Streaming concepts.
In a nutshell, Hadoop streaming allows you to create and run a Map or Reduce job using a script or executable as a mapper or reducer.
How the Mapper works
When you use an executable file as a Mapper, each mapper task runs the executable as a separate process. As the mapper task runs, it converts its input into lines and feeds these to the stdin of the process. The mapper task then collects the lines which the process writes to stdout and converts each line into a key-value pair. These key-value pairs are the output of the mapper.
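The conversion of an output line into a key-value pair can be illustrated with a small plain-Java sketch. I am assuming the common convention that the text up to the first tab character is the key and the rest of the line is the value; the class StreamingLine is made up and is not part of Hadoop.

```java
// Toy version of splitting a line of mapper output into a key and a value.
class StreamingLine {
    // Returns {key, value}; the value is null when the line contains no tab.
    static String[] toKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, null};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("hadoop\t1");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

Because the format is just lines of text, the mapper executable itself can be written in any language that can read stdin and write stdout.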
How the Reducer works
The Reducer takes the output from the Mapper. When an executable file is used as a reducer, each reducer task runs the executable as a separate process. As the reducer task runs, it converts its input, which consists of key-value pairs, into lines and feeds these to the stdin of the process. The reducer task collects the line-based output from the stdout of the process and converts each line into a key-value pair. These key-value pairs are the output of the reducer.
A Reducer reduces the intermediate data which share a key to a smaller set of values. The Reducer has three phases:
Shuffle: Input to the Reducer is the sorted output of the Mapper.
Sort: Reducer inputs are grouped by keys.
While the Mapper outputs are fetched, both the Shuffle and Sort occur simultaneously.
Reduce: In this phase the method reduce(WritableComparable, Iterator, OutputCollector, Reporter) is called for each key and its list of values in the grouped inputs. The values for a particular key have already been collected together by the Shuffle and Sort phases before they are passed into the Reduce phase. The output of this phase is not sorted.
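The grouping and reducing steps can be imitated in plain Java without Hadoop. In the sketch below, the class GroupAndReduce, the sample words and the choice of a summing reduce are all my own assumptions; it only shows the idea of grouping values by key and then calling reduce once per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the sort/group step followed by a summing reduce.
class GroupAndReduce {
    // Groups values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> group(List<String[]> mapperOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapperOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(Integer.parseInt(pair[1]));
        }
        return grouped;
    }

    // Called once per key with all of that key's values.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String[]> mapperOutput = List.of(
                new String[] {"hdfs", "1"}, new String[] {"hadoop", "1"}, new String[] {"hdfs", "1"});
        group(mapperOutput).forEach((key, values) -> System.out.println(key + "\t" + reduce(key, values)));
    }
}
```

Here the two values for the key hdfs are grouped together and summed to 2, while hadoop keeps its single value of 1.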
Next week I will be publishing another blog post on the subject of Hadoop specifically to provide an introduction to HDFS first before further posts in the following weeks go into a lot more detail. So keep watching this blog.
In: Big Data, 23 Dec 2013
Following on from my previous post, Hadoop Installation – Part 3, in this post I will explain at a high-level the two main Hadoop components, HDFS and MapReduce, and also touch on some of the most popular Hadoop related projects such as Hive, HBase, Pig and others.
HDFS (Hadoop Distributed File System)
HDFS is the distributed file system which gives Hadoop its scalability and stores data across the computers in the cluster to enhance performance. I will go into a lot more detail about this in my upcoming posts and so will keep it very brief in this post.
MapReduce is the parallel-processing engine which actually runs distributed code and analyses large volumes of data in the cluster. Since processing requests are distributed across the cluster, processing response times are significantly faster compared to execution on a single computer.
Other related applications to Hadoop are usually built on top of HDFS or MapReduce. For instance, Hive and Pig are both query languages similar to SQL which can query data on the Hadoop cluster and give the kind of performance you would see in data warehouses.
Since I will also be going into a lot more detail in future blog posts, I am keeping it very brief in this blog post.
Hive has the capability to take standard SQL code and convert it into a MapReduce job, which then allows you to run your SQL across a distributed file system, i.e. the cluster of computers. Related to Hive is Hue, which is a browser-based GUI for doing the Hive work.
Pig is a programming environment for coding MapReduce jobs. The Pig programming language has its own name and is called Pig Latin.
Another popular application you may have already heard of in relation to Hadoop is HBase. This is a NoSQL (Not Only SQL) database which relies on HDFS as its distributed storage engine. This makes it a highly scalable database, but note that it is not a relational database. This is important to remember, because Hadoop’s power and popularity are partly due to its ability to process large volumes of structured as well as unstructured data.
Zookeeper is a centralised service for maintaining configurations, naming and distributed synchronisation, and for providing group services. All of these are used by distributed applications, including a Hadoop cluster. It is a complex application framework which has to deal with race conditions, a typical challenge in multi-threaded programming. The good thing is that you do not need to do much with it, because it does all the work for you. It would be very unusual if you had to write a program which uses Zookeeper.
Chukwa is another useful open source application which provides a data collection facility for monitoring large distributed systems such as Hadoop. This application can be particularly useful for very large Hadoop clusters. Chukwa is built on top of HDFS and MapReduce and therefore benefits from scalability and reliability that Hadoop has to offer. Chukwa also has a toolkit for displaying, monitoring and analysing the results so that any necessary actions can be taken from the collected data.
Cassandra is another open source application which provides scalability and high-availability/fault tolerance. Cassandra can replicate across multiple data centers, provides low latency and gives users the assurance that it can deal with e.g. outages in a very efficient manner. Cassandra’s data model offers column indexes, denormalised representation of the data in the model and materialised views plus built-in caching. Note that Cassandra uses its own storage layer rather than HDFS, but it can be integrated with Hadoop so that MapReduce jobs run over the data it holds.
All the above applications, and many more which I have not explained and will not explain in this tutorial, are complementary to Hadoop and enhance its usage. Scalability is one of the important features of Hadoop and hence a lot of related applications are built on HDFS, but we have also seen that there are applications which are built on top of MapReduce, such as Pig.
Depending on your requirements, you can glue together relevant applications with Hadoop.
Hadoop Packaged Solutions
If you do not have the time or patience to download each of the applications you require, there are also packaged solutions available which bundle up various Hadoop projects.
Similarly, these packaged solutions allow you to manage the Hadoop cluster from one solution rather than multiple applications.
Popular vendors offering packaged Hadoop solutions include Cloudera, Hortonworks, MapR and EMC. All of these vendors offer a different set of packaged Hadoop solutions, although they all bundle together a set of common Hadoop applications such as HDFS, MapReduce, Hive, HBase etc. Having a packaged solution also means that these vendors have taken care of making sure that your entire Hadoop framework works smoothly in an integrated fashion.
In next week’s blog post I will talk about Hadoop Streaming. So make sure you visit my blog again next week.
Following on from last week’s post, Hadoop History – Part 2, this post will take you through the installation process of Hadoop including pre-requisites.
Hadoop can either be set up to run in a single node (computer) cluster or in a cluster consisting of more than one computer. Even though you may first want to try installing Hadoop on a single node cluster to become confident with the installation process, I will assume that you intend to install Hadoop on a multi-node cluster and will therefore explain the installation process for this use case. After all, Hadoop’s power lies in being able to execute distributed code across a cluster of computers.
For further detailed instructions on how to install and configure Hadoop you can refer to the currently maintained documentation on the Apache Hadoop website at http://hadoop.apache.org/docs/current/.
The following is a list of applications you need to install prior to starting with the Hadoop installation.
Java 1.6 – Java Development Kit http://www.java.com
Linux – Linux Operating System with Bash shell
Hadoop can be downloaded from the Apache Hadoop website at http://hadoop.apache.org.
When you download the Hadoop installation file hadoop-2.0.3-alpha-src.tar.gz and unzip the file you will see the following directory structure.
After you have made sure that you have the necessary pre-requisites ready and unzipped the Hadoop installation file you are ready to start installing and configuring Hadoop.
These installation instructions explain how to set up a Hadoop cluster where one computer runs as the NameNode, one as the JobTracker and several computers as TaskTrackers. It is assumed that each computer has a user called “hadoop” and that the Hadoop home directory is “/home/hadoop/”.
You need to make sure that each computer in the cluster has an installation of the Java JDK.
To start installing Hadoop follow these instructions:
1. Unpack the Hadoop installation file on each computer in the cluster such as into the following directory:
2. Next set up the Hadoop environment variables for the user using the Bash shell in Linux. To the bottom of the file ~/.bashrc add the following two lines to set up environment variables for Hadoop:
The HADOOP_COMMON_HOME variable is used by Hadoop’s utility scripts, and PATH includes Hadoop’s bin directory so that you can run Hadoop commands directly without entering the full path. Setting the HADOOP_COMMON_HOME environment variable is therefore mandatory, whereas setting the PATH environment variable is optional.
3. To the Hadoop environment configuration file conf/hadoop-env.sh add the following:
export HADOOP_LOG_DIR=/home/hadoop/data/logs
This specifies the JAVA_HOME directory and also the directory in which to store Hadoop logs.
4. Assuming that the NameNode runs on 10.1.1.10 enter the following XML configuration details into the file conf/core-site.xml:
5. In the XML file conf/hdfs-site.xml add the following configuration details:
The property dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local file system where the NameNode stores the namespace and logs. dfs.data.dir specifies the directory on the local file system of a DataNode where it stores its blocks.
6. In the XML file conf/mapred-site.xml configure the JobTracker on host/IP address 10.1.1.2 port 8500:
mapreduce.jobtracker.address specifies the host/IP address and port of the JobTracker. mapreduce.jobtracker.system.dir is the HDFS path where MapReduce stores the system files. mapreduce.cluster.local.dir is the list of paths on the local file system where transient MapReduce data is saved.
7. In the file conf/slaves delete localhost and enter all the TaskTracker names one per line such as the following:
8. Next you need to duplicate all the configuration files in the conf directory across all computers in the cluster.
Up to the above point, you have completed the installation of Hadoop. Now if you want to start running HDFS and MapReduce i.e. both the core components which make up Hadoop then follow the steps explained in the next section.
In order to start Hadoop, you need to start both HDFS and MapReduce.
1. Format a new HDFS file system on the NameNode (10.1.1.10) using the following command:
$ hadoop namenode -format
2. Start the HDFS NameNode with the following command in the Linux Bash shell:
3. Start the MapReduce JobTracker (10.1.1.2:8500) with the following command:
Stopping HDFS and MapReduce
In order to stop Hadoop, you need to stop both HDFS and MapReduce.
To stop both HDFS NameNode and MapReduce JobTracker simply replace the command line words start with stop:
to stop the HDFS NameNode
to stop the MapReduce JobTracker
Hadoop Components and Other Related Projects
In my next blog post next week, I will provide a high-level overview of the two main Hadoop components, HDFS and MapReduce, and also describe at a high-level a few of the Hadoop related projects such as Hive, HBase, Pig and others. So make sure you check my blog again next week.
The purpose of this blog is for the author to share his knowledge and experience in various Data Management domains with the aim of helping readers to learn/expand on their existing knowledge in the area. This blog shares knowledge on Master Data Management, Data Quality, Data Governance, Data Integration, Data Analysis, Data Profiling, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Metadata Management etc.
I am Manjeet Singh Sawhney and work as a Principal Consultant - Data Architect for Wipro Consulting Services in London (UK). Prior to this, I have worked for Direct Line Group, Accenture, Tibco Software, Initiate Systems (now IBM) and Tata Consultancy Services. My areas of expertise are Master Data Management (Customer, Product, Reference), Metadata Management, Data Governance, Data Quality, Data Integration, Data Migration, Data Warehouses, Data Marts, Data Modeling, Data Architecture, Data Profiling and Data Analysis. I am using this blog to share my knowledge and experience in these areas and hope that you will find it useful.