What is Big Data | If you liked Linux, you’ll love Hadoop

ERP systems with Business Intelligence and Business Warehouse tools have been the mainstay of data analytics for enterprise organizations for the last 20 years. These environments grow large and expensive, and they represent data only in structured databases. Most business databases live in a data silo, which works well for ERP reporting but does not support deep relational analysis across all business and product functions.

The Library of Congress, like at least 10 business sectors with 1,000 or more employees, maintains roughly 255 TB of data, with each data set housed in its own silo. Had that same data been stored on a Big Data platform, analysis could have spanned multiple business departments, something the siloed storage model cannot do. By using the tools and capabilities of the Open Source Big Data framework, typically based on Hadoop, we can integrate unstructured data from any source: emails, customer feedback, Twitter feeds, Facebook posts, GPS data, and more.
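As a rough illustration of what that integration can look like at the file level, the sketch below uses the third-party Python hdfs client (HdfsCLI) to drop raw, unstructured exports (email archives, tweet dumps, GPS logs) into a single HDFS landing area. The host name, port, user, and directory paths are placeholders, not part of any specific product.

```python
# Sketch: landing raw, unstructured exports in one HDFS directory.
# Assumes a reachable NameNode with WebHDFS enabled and the third-party
# "hdfs" package (HdfsCLI) installed: pip install hdfs
from hdfs import InsecureClient

# Placeholder NameNode URL and user; adjust for your cluster.
client = InsecureClient('http://namenode.example.com:9870', user='analytics')

# Each local export, whatever its format, goes into a common landing zone.
local_exports = {
    'exports/emails_2024.mbox':    '/landing/emails/',
    'exports/tweets_2024.json':    '/landing/twitter/',
    'exports/gps_tracks_2024.csv': '/landing/gps/',
}

for local_path, hdfs_dir in local_exports.items():
    client.makedirs(hdfs_dir)                             # create the target directory if needed
    client.upload(hdfs_dir, local_path, overwrite=True)   # copy the raw file as-is
    print(f'Uploaded {local_path} -> {hdfs_dir}')
```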

Hadoop and its Open Source tools address the challenge of bringing structured and unstructured data together, adding analytical processing power to enterprise environments. Hadoop, first developed at Yahoo, has triggered an expansive ecosystem of Open Source tools designed to reformat data into blocks prepped for parallel processing. It is this conversion of data into parallel blocks that makes it possible for a Big Data environment to process such large and complex data sets.
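To make that pattern concrete, here is a minimal, self-contained sketch of the split / map / reduce idea that Hadoop applies at cluster scale. It runs on a single machine using Python's multiprocessing module, and the sample text and block size are purely illustrative; on a real cluster the blocks live in HDFS and the map tasks run on many nodes.

```python
# A local illustration of the split / map / reduce pattern Hadoop applies
# across a cluster: the input is cut into blocks, each block is mapped in
# parallel, and the partial results are merged in a reduce step.
from collections import Counter
from multiprocessing import Pool

def map_block(block: str) -> Counter:
    """Map step: count words within a single block of text."""
    return Counter(block.split())

def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge the per-block counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Illustrative input: a pile of short, repetitive text records.
    lines = ["customer feedback email", "tweet gps location", "customer email"] * 2000

    # Simulate HDFS splitting the data into fixed-size blocks.
    block_size = 500  # lines per simulated block
    blocks = ["\n".join(lines[i:i + block_size]) for i in range(0, len(lines), block_size)]

    with Pool() as pool:                          # one worker per CPU core
        partial_counts = pool.map(map_block, blocks)

    word_counts = reduce_counts(partial_counts)
    print(word_counts.most_common(5))
```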

With structured and unstructured data hosted in the Big Data platform's common repository, analysis of the combined data sets can surface new correlations, spot business trends, identify service-related issues, and support better business decisions based on relational analysis across departments, products, and processes.
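As a sketch of what such cross-silo analysis looks like in practice, the PySpark snippet below joins two formerly separate data sets, hypothetical order records from the ERP and support tickets from the service desk, to see which products generate the most service issues. The file paths and column names are illustrative assumptions, not a prescribed schema.

```python
# Sketch: joining two formerly siloed data sets in one repository to look
# for a relationship between product sales and service issues.
# Assumes a Spark installation; the HDFS paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cross-silo-analysis").getOrCreate()

# Structured data extracted from the ERP system.
orders = spark.read.option("header", True).csv("hdfs:///data/erp/orders.csv")

# Semi-structured data from the customer-service system.
tickets = spark.read.json("hdfs:///data/support/tickets.json")

# Relate the two silos on a shared product identifier and compare
# units sold with tickets raised, per product.
summary = (
    orders.groupBy("product_id").agg(F.sum("quantity").alias("units_sold"))
    .join(
        tickets.groupBy("product_id").count().withColumnRenamed("count", "tickets"),
        on="product_id", how="left",
    )
    .orderBy(F.desc("tickets"))
)

summary.show(10)
```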

The Hadoop platform works by carving enormous amounts of data into blocks that are processed in parallel. A typical Hadoop cluster is built from inexpensive servers, each with its own processors and storage array, that together form a massive parallel processing facility. A common node configuration is 8 cores, 48 GB of RAM, and 48 TB of disk striped across a SATA array. A Hadoop cluster can be as small as 10 nodes and can continue to grow as business demand requires. Because data is replicated across the cluster, a single machine failure simply has its workload redistributed across the other nodes, providing the substantial benefit of high availability; much of the processing also happens in memory as part of the parallel configuration. Security can be addressed to meet enterprise requirements using Kerberos, data encryption, and SSL support.
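To put those node specifications in perspective, the short calculation below estimates the raw and usable capacity of a minimal 10-node cluster. It assumes HDFS's default replication factor of three, which is not stated above and can be tuned per cluster.

```python
# Back-of-the-envelope capacity for the minimal cluster described above.
# The replication factor of 3 is the HDFS default; it is an assumption here,
# since it is configurable per cluster.
nodes = 10
disk_per_node_tb = 48
ram_per_node_gb = 48
cores_per_node = 8
replication_factor = 3

raw_capacity_tb = nodes * disk_per_node_tb                  # 480 TB of raw disk
usable_capacity_tb = raw_capacity_tb / replication_factor   # ~160 TB of unique data
total_cores = nodes * cores_per_node                        # 80 parallel cores
total_ram_gb = nodes * ram_per_node_gb                      # 480 GB of memory

print(f"Raw disk:    {raw_capacity_tb} TB")
print(f"Usable data: {usable_capacity_tb:.0f} TB at replication factor {replication_factor}")
print(f"Compute:     {total_cores} cores, {total_ram_gb} GB RAM")
```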

A wide array of Open Source tools can be used to process ingested data, including structured data extracted from traditional ERP or CRM systems. The operating system supporting Hadoop is generally a derivative of Red Hat, a leading Linux distribution, and many tools take advantage of the Apache HTTP Server, a mainstay of the Open Source world, as a front-end. The Open Source community has produced a wide collection of tools: Spark, a cluster computing framework for large-scale data processing; HDFS, the Hadoop Distributed File System; HCatalog, a table and storage management layer that makes it easier to work with tools like MapReduce, which processes large data sets with a parallel, distributed algorithm on a cluster; and Hive, a data warehouse infrastructure. Together these tools make it possible to stage massive amounts of data for parallel processing on a Hadoop cluster, providing analytical power on a level not previously available.
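For a feel of how these pieces fit together, the sketch below uses Spark with Hive support enabled to query a Hive-managed table (whose definition lives in the Hive metastore that HCatalog exposes) directly from Python. The database, table, and column names are hypothetical.

```python
# Sketch: querying a Hive-managed table through Spark SQL.
# Assumes Spark is configured to reach the Hive metastore; the database,
# table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-warehouse-query")
    .enableHiveSupport()        # read table definitions from the Hive metastore
    .getOrCreate()
)

# HiveQL runs as-is; Spark distributes the work across the cluster.
monthly_revenue = spark.sql("""
    SELECT product_line,
           SUM(order_total) AS revenue
    FROM   warehouse.sales_orders
    WHERE  order_month = '2024-06'
    GROUP  BY product_line
    ORDER  BY revenue DESC
""")

monthly_revenue.show()
```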

While the concept of pizza-box servers configured into a Hadoop cluster, coupled with a wide array of Open Source tools, is an attractive alternative to expensive enterprise applications, the task of installing and configuring Hadoop requires a level of expertise not common in the enterprise community. In addition, once Hadoop is operational, extracting data and preparing it for ingestion into the Hadoop Distributed File System (HDFS) for meaningful analytics can be challenging and confusing. The HDPlus BRData (Business Relevant Data) module simplifies and streamlines the analysis process and reduces the amount of Hadoop-specific skill required.

HDPlus BRData provides a mechanism for extracting meaningful data from ERP systems such as SAP as well as Oracle and MS SQL repositories. With the structured data collected, HDPlus BRData streamlines the process of preparing the data for ingestion into HDFS. Recurring jobs can be set up to keep HDFS refreshed with current data from the various sources. Building the Data Lake can be completely managed via the HDPlus BRData module.
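BRData handles this extract-and-load work itself; for readers curious what the underlying pattern looks like with generic Open Source tools, the sketch below pulls a table from a relational source over JDBC and lands it in HDFS as Parquet. The connection string, credentials, table, and paths are placeholders, and this is not the BRData API, only the general pattern such a module automates.

```python
# Generic extract-and-ingest pattern that a module like BRData automates:
# pull a table from a relational source over JDBC and land it in HDFS.
# The JDBC URL, credentials, table, and target path are all placeholders;
# the source system's JDBC driver must be on Spark's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("erp-extract-to-hdfs").getOrCreate()

# Read a structured table from the source system (a SQL Server example).
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://erp-db.example.com:1433;databaseName=ERP")
    .option("dbtable", "dbo.Customers")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the extract in the data lake in a columnar format ready for analysis.
# Re-running this job on a schedule keeps the snapshot refreshed.
customers.write.mode("overwrite").parquet("hdfs:///datalake/erp/customers/")
```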

HDPlus provides Big Data consulting services working with HDPlus BRData and/or Big Data platforms in general. Contact us today to discuss your Big Data consulting needs.