Tuesday, December 8, 2015

Choosing Hadoop Data Organisation for your Application

If you or your organisation has decided to move to Hadoop, you should think through a few points, described below, about overall application design before making the move.

Data Storage : Data storage is one of the most important parts of overall application design in Hadoop. To get it right, you must understand which applications are going to access the data and what their access patterns are. If the data is mostly consumed by MapReduce implementations that read files sequentially, HDFS is the best option. Data locality also matters for overall application performance, and HDFS supports all of these requirements.
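For illustration, here is a minimal Java sketch of that sequential access pattern using the standard Hadoop FileSystem API (the file path is a hypothetical example, and the cluster address is assumed to come from the usual core-site.xml configuration):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialHdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path -- replace with your own data set
        Path file = new Path("/data/logs/access.log");

        // HDFS is optimised for exactly this pattern: one open,
        // then a long sequential scan through the whole file
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                process(line);
            }
        }
    }

    private static void process(String line) {
        // application-specific handling goes here
        System.out.println(line);
    }
}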

File Format : File format is another important factor when designing a Hadoop-based application. If your application mostly does MapReduce processing, SequenceFiles are the best option because their processing semantics align well with MapReduce. SequenceFiles offer compression at different levels (record or block), are more compact than regular text files, and carry a header containing metadata about the file, its type, and its version. You can choose other file formats, especially when integrating with other applications, but keep in mind that a custom format adds complexity to reading, splitting and writing the data.
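As a rough sketch, writing a block-compressed SequenceFile with the standard Hadoop API might look like this (the output path and the key/value types are illustrative choices, not requirements):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/data/events.seq"); // hypothetical output path

        // BLOCK compression compresses batches of records together,
        // which usually gives the best ratio for MapReduce inputs
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(file),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            writer.append(new Text("page-views"), new IntWritable(42));
            writer.append(new Text("sign-ups"), new IntWritable(7));
        }
    }
}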

Types of Calculation : You also need to think about the type of calculation you will be doing on the data. If every calculation uses all of the data, there are no additional considerations. But if your calculations operate on a subset of the data, you need to think about data partitioning to avoid unnecessary reads. The right partitioning depends on the application's data-usage patterns.
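One common approach is to lay the files out in date-based directories and point the job only at the partitions it needs. The sketch below assumes a hypothetical /data/events/date=YYYY-MM-DD layout:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PartitionedInputExample {
    public static void configureInput(Job job) throws Exception {
        // With data laid out by day, e.g. /data/events/date=2015-12-01/part-*,
        // a job over one week touches only seven directories
        // instead of scanning the full history.
        String[] week = {
            "2015-12-01", "2015-12-02", "2015-12-03", "2015-12-04",
            "2015-12-05", "2015-12-06", "2015-12-07"
        };
        for (String day : week) {
            FileInputFormat.addInputPath(job, new Path("/data/events/date=" + day));
        }
    }
}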

Data Conversion : As you know, Hadoop/HBase internally stores data as byte streams. So you need to think about how to convert your application's data to and from byte streams. Several options exist for marshalling/unmarshalling application-specific data, including a couple of standard Java marshalling approaches, but Apache Avro provides a generic approach that simplifies data marshalling. Avro delivers both good performance and compact data sizes, stores the data definition (schema) along with the data, and supports data versioning.
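A minimal sketch of marshalling a record to a byte stream and back with Avro's GenericRecord API (the User schema here is a made-up example):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroConversionExample {
    // A hypothetical record definition; with Avro the schema travels
    // with the data, so readers never have to guess the layout.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    public static void main(String[] args) throws Exception {
        GenericRecord user = new GenericData.Record(SCHEMA);
        user.put("name", "alice");
        user.put("age", 30);

        // Marshal the record to a compact byte stream
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(user, encoder);
        encoder.flush();
        byte[] bytes = out.toByteArray();

        // Unmarshal it back using the same schema
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord roundTrip =
            new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        System.out.println(roundTrip); // {"name": "alice", "age": 30}
    }
}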

Security : Securing data in HDFS or HBase is another important factor, as both have quite a few security risks out of the box. Implementing overall security requires application/enterprise-specific solutions to ensure data security.



Monday, December 7, 2015

Distributed File Systems for Big Data

As you are all aware, data today plays a vital role in every kind of system, and not only systems that belong to IT (Information Technology). To justify this, consider a food product company: it needs to regularly survey the demand for its products in the market, the future growth of each product, and the advertising of new products. Assume further that this company is a multinational. To start the survey it would have to hire people to conduct it on the company's behalf, and based on the results, management could decide its next course of action. But one person cannot do such a survey alone, and doing it manually would not be effective in terms of cost and manpower.

So the company decides instead to advertise its products and run the survey on social media websites, which is cost-effective, needs less manpower, and delivers a high ROI (Return on Investment). Now you can see how important data is, and how important it is that the data exists and is organised so that it is available whenever it is needed. As you can imagine, this data can grow without limit; it just keeps growing. The question, then, is how social media platforms handle that amount of data. All the software giants started thinking about this and discovered ways to manage and process large data, and many algorithms have been developed and implemented to handle it, both in terms of size and of processing. And so Big Data evolved.

It is also important how Big Data is distributed, so that the data can survive failures and remain easily available for the processing that drives decisions. This is where Distributed File Systems come into the picture. DFSs existed before Big Data, but since the Big Data evolution they have gained notable importance, and existing DFSs are being redesigned.

A distributed file system is a file system that allows access to files over a network. The advantage of a distributed file system is that you can share data among multiple hosts or nodes. There are several distributed file systems, such as NFS, CIFS, HDFS and OwFS.

HDFS is widely known and is an open source file system that was influenced by Google's GFS.

For NFS or CIFS, Network Attached Storage (NAS), which is an expensive storage system, is typically used. Increasing its capacity therefore comes with a high infrastructure cost.

On the other hand, as OwFS and HDFS allow you to use relatively economical hardware (commodity servers), you can build high-capacity storage at a much lower cost.
However, OwFS and HDFS are not better than NAS-based NFS or CIFS in all cases; for some purposes you should still use NAS. The same holds between OwFS and HDFS themselves: as they are built for different purposes, you need to select a file system according to the purpose of the internet service you are implementing.


Hadoop Distributed File System (HDFS)
Google developed the Google File System (GFS), its own distributed file system, to store information about the web pages crawled by Google, and published a paper on GFS in 2003. HDFS (Hadoop) is an open source system developed using GFS as a model.
For this reason, HDFS has the same characteristics as GFS. HDFS splits a large file into chunks and stores the chunks across the data nodes; in other words, one file is stored in multiple distributed data nodes, and each chunk has three replicas. The typical size of a chunk is 64 MB.
The metadata recording which data nodes store each chunk is kept in the namenode. This allows you to read data from the distributed files and perform operations using MapReduce.


The namenode of HDFS manages the namespace and metadata of all files, along with the information on file chunks. The chunks themselves are stored in data nodes, and the data nodes process file operation requests from clients.
As explained above, HDFS can distribute and store large files effectively. Moreover, you can perform distributed processing of operations using the MapReduce framework, based on the chunk location information.
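You can see this chunk-location metadata yourself through the FileSystem API. The sketch below asks the namenode for the block locations of a (hypothetical) file, which is essentially the information MapReduce uses to schedule tasks close to the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file -- replace with a real path on your cluster
        FileStatus status = fs.getFileStatus(new Path("/data/big-file.bin"));

        // Ask the namenode which data nodes hold each chunk of the file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
    }
}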
Compared to OwFS, the weakness of HDFS is that it is not suitable for handling a very large number of files, because the namenode becomes a bottleneck. As the number of files grows, the namenode's service daemon can run out of memory (OOM), and consequently the daemon process is terminated.
The features of HDFS are as follows:
  • A large file is divided into chunks, which are distributed and stored across multiple data nodes.
  • The size of a chunk is usually 64 MB, each chunk has three replicas, and chunks are stored in different data nodes.
  • The information on these chunks is stored in the namenode.
  • It is advantageous for storing large files, though if the number of files is large, the burden of the namenode increases.
  • The namenode is a SPOF (single point of failure): if a failure occurs at the namenode, HDFS stops and must be restored manually.
As HDFS is written in Java, its interface (API) is also a Java API. However, you can also use a C API (libhdfs) through JNI.

There are some other distributed file systems as well: GFS2, Swift, Ceph and pNFS.

GFS2
Google's GFS is to distributed file systems what the Beatles were to the music industry, in that many distributed file systems, including HDFS, were inspired by GFS.
However, GFS also has a huge structural weakness: it is vulnerable to master (namenode) failure. Unlike HDFS, GFS has a slave namenode, which makes GFS less susceptible to failures than HDFS; even so, when a failure occurs at the master namenode, the failover time is not short.
If the number of files increases, the amount of metadata also increases, processing speed deteriorates, and the total number of files that can be stored is limited by the memory size of the master server.
Usually the size of a chunk is 64 MB, and GFS is inefficient at storing data smaller than this. Of course, you can reduce the chunk size, but doing so increases the amount of metadata. For this reason, even when there are many files smaller than 64 MB, it is difficult to reduce the chunk size.
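To get a feel for the trade-off, a rough back-of-envelope calculation helps (the per-chunk metadata figure is an assumption, in line with commonly cited numbers for GFS/HDFS-style masters): 1 PB stored as 64 MB chunks is about 16 million chunks, while the same petabyte as 1 MB chunks is about one billion. At roughly 150 bytes of in-memory metadata per chunk, that is about 2.4 GB of master memory versus about 150 GB, which is why a single-master design cannot simply shrink its chunk size.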
However, GFS2 overcomes this weakness of GFS. GFS2 uses a much more advanced metadata management method: its namenode has a distributed structure rather than a single master, and it stores metadata in an updatable database such as BigTable. Through this, GFS2 addresses both the limit on the number of files and the vulnerability to namenode failure.
As the amount of metadata that can be handled now scales easily, the chunk size can be reduced to 1 MB. The structure of GFS2 is expected to have a huge influence on how most other distributed file systems improve their own structures.


Swift
Swift is the object storage system used in OpenStack, which is used by Rackspace Cloud and others. Like Amazon S3, Swift uses a structure with no separate master server. It manages files through a 3-level object structure: Account, Container and Object. The Account object is a kind of account used to manage containers. The Container object manages Objects, like a directory; it is analogous to a bucket in Amazon S3. The Object corresponds to a file. To access an Object, you go through the Account object and the Container object, in that order. Swift provides a REST API, served by a proxy server. It uses a static table of predefined object locations, called the Ring; all servers in Swift share the Ring information and use it to find the location of a desired object.
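As a sketch of that Account/Container/Object addressing, a raw HTTP GET against Swift's REST API looks roughly like this (the host, account, container, object name and token are all placeholders; the token would normally come from a prior call to the auth service):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SwiftObjectGet {
    public static void main(String[] args) throws Exception {
        // Swift addresses every object through the 3-level hierarchy:
        // /v1/<account>/<container>/<object>
        URL url = new URL(
            "http://swift.example.com:8080/v1/AUTH_demo/photos/cat.jpg");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // A token previously obtained from the auth service (placeholder)
        conn.setRequestProperty("X-Auth-Token", "<token>");

        System.out.println("HTTP " + conn.getResponseCode());
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // consume the object body
            }
            System.out.println("read " + total + " bytes");
        }
    }
}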
As the use of OpenStack has been growing rapidly with the participation of more and more large companies, Swift has recently been getting more attention. In Korea, KT is participating in OpenStack, and provides its KT uCloud server by using Swift.


Ceph
Ceph is a distributed file system with a unique metadata management method. Like other distributed file systems, it manages the namespace and metadata of the entire file system using metadata servers. Its distinctive features are that the metadata servers operate as a cluster and that the namespace is dynamically repartitioned among them according to load. This makes it easy to respond when load concentrates on some part of the namespace, and easy to add metadata servers. Moreover, unlike many other distributed file systems, it is POSIX-compatible, which means you can access a file stored in the distributed file system just as you would in a local file system.
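That POSIX compatibility means ordinary file APIs work unchanged once the file system is mounted. A trivial Java sketch, assuming a hypothetical kernel mount at /mnt/ceph:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CephPosixExample {
    public static void main(String[] args) throws IOException {
        // With Ceph kernel-mounted at /mnt/ceph (placeholder mount point),
        // the standard local-file APIs work as-is:
        Path file = Paths.get("/mnt/ceph/reports/summary.txt");
        for (String line : Files.readAllLines(file)) {
            System.out.println(line);
        }
    }
}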
It also supports a REST API, which is compatible with the REST APIs of Swift and Amazon S3.
The most noticeable thing about Ceph is that it is included in the Linux kernel source. Its released version is not that high yet, but as Linux develops, Ceph may eventually become the main file system of Linux. It is also attractive because it is POSIX-compatible and supports kernel mounting.


Parallel Network File System (pNFS)
NFS has multiple versions, and to resolve the scalability issues of the versions up to NFSv4, NFSv4.1 introduced pNFS. pNFS enables the content of a file and its metadata to be handled separately, and a single file to be stored in multiple distributed places. Once a client fetches the metadata of a file and learns where the file is located, it connects directly to the servers that hold the file's content when it accesses that file later. The client can then read or write the file's content on multiple servers in parallel. You can also easily scale out the metadata servers to prevent a bottleneck.
pNFS is an advanced version of NFS that reflects the recent trends in distributed file systems, so it combines the advantages of NFS with those of the latest distributed file systems. Currently, there are some, if not many, products that support pNFS.

One of the reasons we should pay attention to pNFS is that NFS is now managed not by Oracle (Sun) but by the Internet Engineering Task Force (IETF). NFS is used as a standard in many Linux/Unix environments, so pNFS may become popular if many vendors release products that support it.