Hadoop is an open-source framework developed by the Apache Software Foundation. It enables distributed processing of large datasets across clusters of commodity servers, which are far cheaper than traditional enterprise-class machines. A cluster consists of a single master and multiple slave nodes: the master coordinates the slaves and all data transactions, while each node stores a portion of the data. The cluster stores data redundantly to ensure reliability, and this redundancy happens at two levels – node-level replication (copies of each block on other nodes) and rack-level replication (copies placed on nodes in different racks). Hadoop can scale from a single server to thousands of machines, each offering local computation and storage.
Components of Hadoop
Hadoop is a framework for storing, processing, and analyzing large amounts of data. It is composed of three main components: HDFS (Hadoop Distributed File System), MapReduce, and YARN (Yet Another Resource Negotiator).
- HDFS is a distributed filesystem that stores a single file as blocks spread across multiple machines. Users can access this data over HTTP or by writing programs that interface with it directly.
- MapReduce is a programming model for large-scale processing of datasets. It provides high-level abstractions over complicated distributed algorithms, letting users focus on their application logic rather than the underlying technology.
- YARN is a cluster resource manager. It allocates resources such as CPU, memory, and disk space to applications running on Hadoop, and achieves high throughput by distributing tasks across multiple machines using a queue-based scheduler.
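The MapReduce model itself can be illustrated outside Hadoop with a few lines of plain Python. The sketch below runs the canonical word-count example: a map step emits (word, 1) pairs, a shuffle groups the pairs by key (as Hadoop does between the two phases), and a reduce step sums each group. The document strings are made up for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 2
```

In a real Hadoop job the map and reduce functions run on different machines and the shuffle moves data over the network, but the logical flow is the same.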
ls: This command is used to list the files in a directory. Use -ls -R (formerly lsr, now deprecated) for a recursive listing; it is useful when you want the full hierarchy of a folder.
bin/hdfs dfs -ls <path>
mkdir: Creates a directory. HDFS has no home directory by default, so let's first create one.
bin/hdfs dfs -mkdir <folder name>
Creating a home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> replace username with the username of your computer
touchz: It creates an empty file.
bin/hdfs dfs -touchz <file_path>
copyFromLocal (or) put: Copies files/folders from the local filesystem to HDFS. This is one of the most commonly used commands; "local filesystem" means the files present on your OS.
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
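Putting the commands so far together, a typical upload session might look like the following. This assumes a running HDFS cluster, and the file and directory names (sample.txt, /user/username/data) are hypothetical.

```shell
# Create a directory on HDFS (-p also creates any missing parent directories)
bin/hdfs dfs -mkdir -p /user/username/data

# Upload a local file into it (put is a synonym for copyFromLocal)
bin/hdfs dfs -put sample.txt /user/username/data

# Verify the upload
bin/hdfs dfs -ls /user/username/data
```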
cat: To print file contents.
bin/hdfs dfs -cat <path>
copyToLocal (or) get: To copy files/folders from hdfs store to the local file system.
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
moveFromLocal: This command moves files from the local filesystem to HDFS; the local copy is deleted after the transfer.
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
cp: This command is used to copy files within HDFS. Let's copy the folder infraveo to infraveo_copied.
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
mv: This command is used to move files within HDFS. Let's cut-paste the file myfile.txt from the infraveo folder to infraveo_copied.
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
rmr: This command deletes a file or directory from HDFS recursively, which makes it very useful for removing a non-empty directory. Note that rmr is deprecated in newer releases in favour of rm -r.
bin/hdfs dfs -rmr <filename/directoryName>
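For example, using the modern form on a hypothetical directory (assuming a running cluster):

```shell
# Recursively delete a directory and everything inside it
bin/hdfs dfs -rm -r /user/username/old_data
```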
du: It will give the size of each file in the directory.
bin/hdfs dfs -du <dirName>
dus: This command will give the total size of the directory/file. (Deprecated in newer releases in favour of -du -s.)
bin/hdfs dfs -dus <dirName>
stat: Prints statistics about a file or directory, such as its last modification time.
bin/hdfs dfs -stat <hdfs file>
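stat also accepts a format string to select which statistics to print, for example %y (modification time), %n (name), and %b (size in bytes). The path below is hypothetical:

```shell
# Print the modification time, name, and size of a file
bin/hdfs dfs -stat "%y %n %b" /user/username/sample.txt
```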
Versions of Hadoop
Hadoop 1: The original version, comprising Hadoop Common, the Hadoop Distributed File System (HDFS), and MapReduce.
Hadoop 2: The main difference between Hadoop 1 and Hadoop 2 is that Hadoop 2 adds YARN (Yet Another Resource Negotiator). YARN handles resource management and task scheduling through its two daemons: the ResourceManager, which allocates cluster resources, and the NodeManager, which launches and monitors tasks on each node.
Hadoop 3: This is the most recent major version of Hadoop. Besides the merits of the first two versions, Hadoop 3 solves the single point of failure by supporting multiple NameNodes. Many other additions, such as erasure coding, GPU hardware support, and Docker container support, make it superior to previous versions of Hadoop.
There are many more commands in HDFS, but we have covered the commands that are commonly used when working with Hadoop. As you can see, Hadoop is a powerful tool for analyzing big data. Its ability to scale up and down based on the size of your problems makes it ideal for analytics applications in any industry. If you are interested in learning more about Hadoop or becoming an expert yourself, check out our other blog posts!