D’WuhBeeAye – Bringin’ in the Hadoop!

Are we Hadooping now? How about now?

How Hadoop Can Help In Healthcare!

Meant to help solve problems dealing with lots of data, a combination of complex and structured data with simple data sets of simple numbers. Unfortunately, tables and databases don’t really have a mechanic that can help with this complex variety of data, nor is there really a tool that could parse this variety of simple and complex data sets for analytics. A set of procedures and tools meant to develop the backbone for big data operations, Hadoop has been widely used in analytics. Within Hadoop, there is a model called MapReduce that helps to pull data from the storage and map it into a usable format (cleaning up the data). After mapping the data and associating them into a smaller usable format, it will run through a series of algorithms and operations to reduce the data set. Primarily used in conjunction with data warehouses, it helps to transform unstructured data into a usable format and then loads it into the necessary architecture structures. If anything, Hadoop is what makes data and analytics so appealing and understandable to the C-Men executives.

Hadoop Source, Source2

So let’s look at the two main benefits and functions that Hadoop performs, the mapping mechanism, and the processing phase. (See what I did there? Alliteration!) Hadoop - Too Smart

If anything, Hadoop is made up of modules, each of which will carry out a a particular task designed for big data analytics. In a sense, Hadoop removes file size limitation and storage capacity allowing the organization to store big files and to think “big” again. However, with these large files, you would assume that utilizing these data sets and files could take a really long time, similar to opening an empty word doc vs a highly detailed AutoCAD file right? However, the second part of Hadoop is how to reduce these files that allow it to be used. It processes the data in a way to allow quick access, let’s continue to look into that.

How Is Hadoop Used?

As I mentioned, it removes the so called “soft” file size limitation and size restrictions, as a result it is used as a low-cost storage and data archive  that essentially acts as a pseudo-data warehouse. This forms a data repository that stores data in its original or exact format before it is extracted and transformed for its different purposes. Hadoop implements Google’s MapReduce algorithm by divvying up a large query into many parts, sending those respective parts to many different processing nodes, and then combining the results from each node.

Now – How it works?

As I mentioned, it is a series of programs and modules that carry out a particular task.

  1. HDFS (Hadoop Distributed File System) – Allows data to be stored in an easily accessible format, across a large number of linked storage devices. Afterwards, MapReduce allows the user to poke around in the data despite being stored in different sectors/nodes. Typically stored across multiple machines, but the Hadoop framework provides tools that are able to access the data, best thing is that it is “as-is”, meaning no prior organization.
  2. MapReduce – Maps out the data from a database into a format suitable for analysis, and then performs a series of operations that helps to reduce the vastness of data, essentially pre-processes some of the information for easier utilization of the data.
  3. Hadoop Common – Provides various tools for the user’s computer systems and any supported OS’. It allows the user to read the data stored under Hadoop.
  4. YARN (Yet Another Resource Negotiator) – Manages resources of the systems storing the data and running the analysis, resource management.

These tools, can then help the user/companies/organizations to expand and adjust their data analysis operations as it expands. Since it is open source, these organizations can simply build off of the source code to better fit their own expanding business needs. Unlike many data management tools, Hadoop was designed from the beginning as a distributed processing and storage platform. This moves primary responsibility for dealing with hardware failure into the software, optimizing Hadoop for use on large clusters of commodity hardware. A team in Colorado is correlating air quality data with asthma admissions. Life sciences companies use genomic and proteomic data to speed drug development. The Hadoop data processing and storage platform opens up entire new research domains for discovery. Computers are great at finding correlations in data sets with many variables, a task for which humans are ill-suited.

*** Hadoop Disclaimer ***

Hadoop is not a substitute for a database, databases are such a powerful tool in data management and although Hadoop sounds like it can do that, it simply cannot. It stores data, but ti does not index them. Although it is possible, it’s simply more resource consuming because it requires a MapReduce function and that will parse through all the time, which takes much more time.

Article by Sir. Lappleton III

I'm a happy-go-lucky recent graduate that started a blog as a way to not only document my education and my experiences, but also to share it with whoever stumbles upon my site! Hopefully I can keep you guys entertained as well as learn about a few things from IT as well as from my time and experiences as I plunge deeper and deeper into healthcare! A couple of my areas of focus is data management, system security (cyber security), as well as information technology policy.