Big Data, Distributed Processing and Cloud Services – A Perspective

Case 1.

1. Say a small organization goes into business. It uses data stored in Excel to manage its day-to-day business.

2. It grows rapidly and invests in an RDBMS (Relational Database Management System) to store and analyze information.

3. It then adopts a Data Warehousing system to further analyze its information, running business-critical analysis to better target and grow the business.

4. It launches a website, which is a huge success. Information is pouring in from all ends: the relational database manages its day-to-day business, while the data warehouse handles a lot of the analysis.

5. But there is more data it sees as critical – its website logs every web access, and web users leave behind a lot of feedback and other information.

6. Suddenly, the organization has a wealth of information, both structured and unstructured.

Case 2.

1. A medical device company manufactures life-critical machines. It wants to do more, and its devices are now capable of sending large amounts of status and critical information many times a second.

2. Large volumes of data are now streaming in and being stored on disk.

3. This data is critical to the patients as well as to the performance of the devices.

4. This data, when analyzed, can drive future improvements in the devices as well as open new horizons in patient care.

5. Data is flowing from various aspects of the devices: faults, functional status, the patient’s status – more than one can imagine.

6. This data is further complemented by other structured data, such as sales figures and device test results.

Now say this data is collected over a period of time. The structured and unstructured data has grown from gigabytes to terabytes to petabytes.

These businesses need to analyze this data to keep a pulse on their business and its performance.

This is Big Data, and it takes more than traditional RDBMS or Data Warehousing techniques to analyze it.

A single computer cannot process this constantly growing data; even over a short period of time, the results can vary as new data keeps arriving.

A supercomputer is expensive to manage and maintain for analyzing this data, and it will soon be outgrown by the amount of data pouring in.

Enter Big Data and Distributed processing.

What if we could split this data into equal chunks and send each chunk to one of thousands of computers, along with the program to analyze it? Once the results come back, I can put them together and report on the analysis.
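Here is a minimal, local sketch of that split-process-combine idea in Python. A multiprocessing pool on one machine stands in for the thousands of computers, and a toy hit count over web-log-style lines plays the role of the "program" shipped with each chunk; the log format and the analysis itself are assumptions purely for illustration.

```python
from multiprocessing import Pool
from collections import Counter


def analyze_chunk(lines):
    """The 'program' sent along with each chunk: count hits per page."""
    counts = Counter()
    for line in lines:
        page = line.split()[0]   # assume the first field of a log line is the page
        counts[page] += 1
    return counts


def split_into_chunks(lines, n_chunks):
    """Cut the data into roughly equal chunks, one per worker."""
    size = max(1, len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]


if __name__ == "__main__":
    # Toy stand-in for the website logs described above.
    logs = ["/home  user1", "/shop  user2", "/home  user3", "/cart  user1"]

    chunks = split_into_chunks(logs, n_chunks=4)

    # Send each chunk to a worker process together with the analysis function.
    with Pool(processes=4) as pool:
        partial_results = pool.map(analyze_chunk, chunks)

    # Put the pieces back together and report on the analysis.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    print(total.most_common())
```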

Now, suppose I only need to process this data once every two days. I don’t need to invest in 1,000 computers of my own.

Let’s go to the cloud: launch 1,000 cloud instances to analyze the data, and then terminate those instances once the job is done.
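As a rough sketch of what that might look like on AWS using Python’s boto3 library (the AMI ID, instance type, and fleet size below are placeholder assumptions, and the actual distribution of data and collection of results are left out):

```python
import boto3

ec2 = boto3.resource("ec2")

# Launch a fleet of worker instances; the image is assumed to already
# contain the analysis program and access to the data chunks.
instances = ec2.create_instances(
    ImageId="ami-00000000",   # placeholder AMI ID
    InstanceType="m5.large",  # placeholder instance type
    MinCount=1000,
    MaxCount=1000,
)

# ... run the analysis on each instance and gather the results ...

# Terminate the fleet once the results are in, so the instances stop billing.
for instance in instances:
    instance.terminate()
```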