Tuesday, December 8, 2015

Choosing Hadoop Data Organisation for your Application

If you or your organisation have decided to move to Hadoop, there are a few points about overall application design, described below, that you should think through before making the move.

Data Storage : Data storage is one of the most important parts of overall application design in Hadoop. To get it right, you must understand which applications are going to access the data and what their access patterns are. If the data is mostly used by MapReduce jobs and files are read sequentially, then HDFS is the best option. Data locality also matters for overall application performance, and HDFS supports it: computation can be scheduled on the nodes that already hold the data.

File Format : File format is another important factor when designing a Hadoop-based application. If your application mostly does MapReduce processing, then SequenceFiles are usually the best option because their processing semantics are well aligned with MapReduce. SequenceFiles let you apply compression at different levels (record or block), they are more compact than regular text files, and they carry a header that contains the file's metadata, its type and its version information. You can choose another file format, especially when integrating with other applications, but keep in mind that a custom format brings additional complexity in reading, splitting and writing data.
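To make the difference between record-level and block-level compression concrete, here is a minimal Python sketch. It does not use or reproduce the real SequenceFile format; the length-prefixed block layout, the function names and the sample records are all assumptions made up for this illustration. It only shows why compressing many similar records together tends to give a smaller result than compressing each one separately.

```python
import zlib


def compress_per_record(records):
    """Record-level idea: compress every record on its own."""
    return [zlib.compress(r) for r in records]


def compress_as_block(records):
    """Block-level idea: pack records with a 4-byte length prefix,
    then compress the whole block in one go."""
    block = b"".join(len(r).to_bytes(4, "big") + r for r in records)
    return zlib.compress(block)


def decompress_block(data):
    """Reverse of compress_as_block: recover the original records."""
    block = zlib.decompress(data)
    records, pos = [], 0
    while pos < len(block):
        size = int.from_bytes(block[pos:pos + 4], "big")
        pos += 4
        records.append(block[pos:pos + size])
        pos += size
    return records


# Hypothetical sample data: many short, similar log-style records.
records = [f"user={i % 10},action=click,page=/home".encode() for i in range(1000)]

record_level_size = sum(len(c) for c in compress_per_record(records))
block_level_size = len(compress_as_block(records))

print("record-level bytes:", record_level_size)
print("block-level bytes: ", block_level_size)
```

With short, repetitive records like these, the block-level total should come out much smaller, because the compressor can exploit repetition across records. The price is that a reader has to decompress a whole block to get at a single record.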

Types of Calculation : You need to think about the type of calculation you will be doing on the data. If every calculation uses all of the data, there are no additional considerations. But if your calculations work on a subset of the data, you need to think about data partitioning to avoid unnecessary reads. The right partitioning scheme depends on the data usage pattern of the application.
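As a rough sketch of what partitioning can look like, suppose the application mostly queries data by date. The base path /data/events and the year=/month=/day= directory naming below are assumptions chosen for this example (the naming follows a convention commonly used with Hive), not something Hadoop requires. The point is that a job for a date range only needs to be given the matching directories as input.

```python
from datetime import date, timedelta


def partition_path(base, day):
    """Build the directory for one day's data (hypothetical layout)."""
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}"


def partitions_for_range(base, start, end):
    """Return the partition directories covering start..end inclusive."""
    days = (end - start).days
    return [partition_path(base, start + timedelta(days=i)) for i in range(days + 1)]


# Only these three directories need to be read for a three-day calculation.
paths = partitions_for_range("/data/events", date(2015, 12, 6), date(2015, 12, 8))
for p in paths:
    print(p)
```

If the application instead queried mostly by customer or region, the same idea would apply with a different partition key.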

Data Conversion : As you know, Hadoop and HBase internally store the data you give them as byte streams. So you need to think about how your application-specific data will be converted to and from bytes. Several options exist for this marshalling/unmarshalling, including standard Java approaches such as Java serialization. Apache Avro provides a more generic approach that simplifies data marshalling. Avro offers both good performance and a compact data size. It also stores the data definition (schema) along with the data and supports data versioning.
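To show what marshalling to a byte stream involves, here is a small hand-rolled sketch in Python. It is not Avro and does not use Avro's encoding; the record layout (a version byte, a user id, a score and a name), the field names and the function names are all invented for this illustration. Avro takes care of this kind of work for you from a schema, including the versioning that is only hinted at here by the leading version byte.

```python
import struct

# Hypothetical fixed header, big-endian with no padding:
# version (1 byte), user id (8-byte signed int),
# score (8-byte double), name length (4-byte unsigned int).
HEADER = struct.Struct(">BqdI")
SCHEMA_VERSION = 1


def marshal(user_id, score, name):
    """Convert an application record to bytes."""
    name_bytes = name.encode("utf-8")
    return HEADER.pack(SCHEMA_VERSION, user_id, score, len(name_bytes)) + name_bytes


def unmarshal(data):
    """Convert bytes back to an application record."""
    version, user_id, score, name_len = HEADER.unpack_from(data, 0)
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version {version}")
    start = HEADER.size
    name = data[start:start + name_len].decode("utf-8")
    return user_id, score, name


payload = marshal(42, 3.5, "alice")
print(len(payload), "bytes")
print(unmarshal(payload))
```

Even this tiny example has to deal with byte order, variable-length fields and versioning by hand, which is the main argument for using a generic framework instead of a custom encoding.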

Security : Securing the data stored in HDFS or HBase is another important factor. HDFS and HBase have quite a few security risks, so implementing overall security requires application-specific or enterprise-specific solutions to ensure the data is protected.


