
The Grill: Doug Cutting

Hadoop's creator discusses how the technology is making its presence felt industrywide.

December 19, 2011 06:00 AM ET

Computerworld - Doug Cutting, creator of the open-source Hadoop framework that allows enterprises to store and analyze petabytes of unstructured data, led the team that built one of the world's largest Hadoop clusters while he was at Yahoo. Formerly an engineer at Excite, Apple and Xerox PARC, Cutting also developed Lucene and Nutch, two open-source search engine technologies now managed by the Apache Software Foundation. Cutting is now an architect at Cloudera, which sells and supports a commercial version of Hadoop. Here he talks about the reasons for the surging enterprise interest in Hadoop.

Doug Cutting

The most interesting thing people don't know about you: One summer I worked in a salmon cannery 14 hours a day while camping in a swamp.

Favorite technology: The bicycle derailleur.

Favorite nonwork pastime: Walking, cycling, skiing or swimming with friends.

Favorite vice: It's a tie between an espresso at 9 a.m. and a beer at 5 p.m.

Four people you'd invite to dinner together: Thomas Pynchon (author), Bootsy Collins (musician), John Muir (naturalist) and my wife.

Best movie ever: Once Upon a Time in the West (Sergio Leone, 1968).

How would you describe Hadoop to a CIO or a CFO? Why should enterprises care about it? At a really simple level, it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren't practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, "How should we price things?" and "What should we be selling now?" and "How should we advertise?" It is not only about having data for longer periods, but also richer data about any given period.

What are Hive and Pig? Hive gives you [a way] to query data that is stored in Hadoop. A lot of people are used to using SQL and so, for some applications, it's a very useful tool. Pig is a different language. It is not SQL. It is an imperative data flow language. It is an alternate way to do higher-level programming of Hadoop clusters. There is also HBase, if you want to have real-time [analysis] as opposed to batch. There is a whole ecosystem of projects that have grown up around Hadoop and that are continuing to grow. Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on the stage.
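
The contrast Cutting draws between a SQL-style query and an imperative data flow can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module as a stand-in for a SQL engine, not Hive or Pig themselves, and the SALES records and both function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) sales records.
SALES = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 3)]


def totals_declarative(rows):
    """SQL style: state the result wanted and let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE amount > 2 GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


def totals_dataflow(rows):
    """Data-flow style: spell out each step in order (filter, group, sum)."""
    # Step 1: filter out small sales.
    filtered = [(region, amount) for region, amount in rows if amount > 2]
    # Step 2: group the remaining amounts by region.
    grouped = defaultdict(list)
    for region, amount in filtered:
        grouped[region].append(amount)
    # Step 3: sum each group and order the output by region.
    return sorted((region, sum(amounts)) for region, amounts in grouped.items())
```

Both functions return the same per-region totals; the difference is whether the programmer describes the answer or the sequence of steps that produces it.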

Why do you think there's so much interest in Hadoop right now? It is a relatively new technology. People are discovering just how useful it is. I think it is still in a period of growth where people are finding more and more uses for it. To some degree, software has lagged hardware for some years, and now we are starting to catch up. We've got software that lets companies really exploit the hardware they can afford.

What is it about relational database technologies that makes them unsuitable for some of the tasks that Hadoop is used for? Some of it is technological challenges. If you want to write a SQL query that has a "join over tables" that are petabytes [in size] -- nobody knows how to do that. The standard way you do things in a database tops out at a certain level. [Relational databases] weren't designed to support distributed parallelism, to the degree that people now find affordable. You can buy a Hadoop-based solution for a 10th of the price [of conventional relational database technology]. So there is the affordability. Hadoop is a fairly crude tool, but it does let you really use thousands of processors at once running over all of your data in a very direct way.
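
One common way to spread a join across many processors is to partition both tables by the join key, so that each partition can be joined independently of the others. The sketch below is a single-process illustration of that general idea, not Hadoop's API or the method Cutting describes; the function names, the sample rows and the partition count are all assumptions made for the example.

```python
from collections import defaultdict


def partition(rows, num_partitions):
    """Route each (key, value) row to a partition chosen by its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def join_partition(left_rows, right_rows):
    """Join one pair of partitions; needs no data from any other partition."""
    index = defaultdict(list)
    for key, value in left_rows:
        index[key].append(value)
    return [
        (key, left_value, right_value)
        for key, right_value in right_rows
        for left_value in index.get(key, ())
    ]


def partitioned_join(left, right, num_partitions=4):
    """Partition both inputs the same way, then join partition by partition.

    On a real cluster each pair of partitions could be handled by a
    different machine; here they are simply processed in a loop.
    """
    left_parts = partition(left, num_partitions)
    right_parts = partition(right, num_partitions)
    joined = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        joined.extend(join_partition(left_rows, right_rows))
    return sorted(joined)


# Hypothetical sample tables keyed by customer id.
CUSTOMERS = [(1, "ann"), (2, "bob"), (3, "cy")]
ORDERS = [(1, "book"), (2, "pen"), (1, "lamp"), (4, "cup")]
```

Because rows with the same key always land in the same partition, the result does not depend on how many partitions are used; only the amount of work that can proceed at once changes.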

What are enterprises using Hadoop for? Well, we see a lot of different things, industry by industry. In the financial industry, people are looking at fraud detection: credit card companies are looking to see which transactions are fraudulent, and banks are looking at creditworthiness -- deciding if they should give someone a loan or not. Retailers are looking at long-term trends, analyzing promotions, analyzing inventory. The intelligence community uses this a lot for analyzing intelligence.

Are those users replacing relational databases, or just supplementing them? They are augmenting and not replacing. There are a lot of things I don't think Hadoop is ever going to replace, things like doing payroll, the real nuts-and-bolts things that people have been using relational databases for forever. It's not really a sweet spot for Hadoop.

Microsoft, Oracle, IBM and other big vendors have all begun doing things with Hadoop these days. What do you think about that trend? It's a validation that this is real, that this is a real need that people have. I think this is good news.

What advice would you give to enterprises considering Hadoop? I think they should identify a problem that, for whatever reason, they are not able to address currently, and do a sort of pilot. Build a cluster, evaluate that particular application and then see how much more affordable, how much better it is [on Hadoop]. I think you can do bakeoffs, at least for some initial applications. There is a real synergy when you get more data into a Hadoop cluster. Hadoop lets you get all of your data in one place so you can do an analysis of it together and combine it.

Where do you see Hadoop five years from now? It is going to start to be a real established part of IT infrastructure. Right now, these things from Oracle and Microsoft are experiments. I think they are trying to tinker with it. I think in five years those won't be experiments. [Hadoop] will be the incumbent.

My hope is to build something that is loosely coupled enough that it can evolve and change and we can replace component by component [so] there doesn't need to be a revolution again anytime soon.

This version of this story was originally published in Computerworld's print edition. It was adapted from an article that appeared earlier on Computerworld.com.



