Really big data: The challenges of managing mountains of information
Shops that shepherd petascale amounts of data have figured out some interesting methods for getting the job done.
Computerworld - If you think the storage systems in your data centers are out of control, imagine having 449 billion objects in your database, or having to add 40 terabytes of new data each week.
The challenges of managing data at this scale involve storing huge files, creating long-term archives and, of course, making the data accessible.
While data management has always been a key function in corporate IT, "the current frenzy has taken market activity to a whole new level," says Richard Winter, an analyst with Wintercorp Consulting Services, a firm that studies big data trends.
New products appear regularly from established companies and startups alike. Whether it's Hadoop, MapReduce, NoSQL or one of several dozen data warehousing appliances, file systems and new architectures, the data analytics segment is booming, he says.
"We have products to move data, to replicate data and to analyze data on the fly," says Winter. "Scale-out architectures are appearing everywhere as vendors work to address the enormous volumes of data pouring in from social networks, sensors, medical devices and hundreds of other new or greatly expanded data sources."
Some shops know about the challenges inherent in managing really big data all too well. At Amazon.com, Nielsen, Mazda and the Library of Congress, this task has required adopting some innovative approaches to handling billions of objects and petascale storage media, tagging data for quick retrieval and rooting out errors.
Taking a metadata approach
The Library of Congress processes 2.5 petabytes of data each year, which amounts to around 40TB a week. Thomas Youkel, group chief of enterprise systems engineering at the library, estimates the data load will quadruple in the next few years as the library continues to carry out its dual mandates to serve up data for historians and preserve information in all its forms.
The library stores information on 15,000 to 18,000 spinning disks attached to 600 servers in two data centers. Over 90% of the data, or more than 3PB, is stored on a fiber-attached SAN, and the rest is stored on network-attached storage drives.
"The Library of Congress has an interesting model" in that part of the information stored is metadata -- or data about what is stored -- while the rest is the actual content, says Greg Schulz, an analyst at consultancy StorageIO. Although plenty of organizations use metadata, Schulz explains that what makes the Library of Congress unique is the sheer size of its data store and the fact that it tags absolutely everything in its collection, including vintage audio recordings, videos, photos and files on other types of media.
The actual content -- which is seldom accessed -- is ideally kept offline and on tape, with perhaps a thumbnail or low-resolution copy kept on disk, Schulz explains. The metadata can reside in a different repository for searching.
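The split Schulz describes can be pictured with a small sketch. This is an illustration only, not the library's actual system: the `MetadataRecord` and `Catalog` names, the field layout, the location strings and the sample records are all hypothetical. The point is that a search touches only the small metadata records, while the full content stays offline until someone asks for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataRecord:
    """Searchable description of one object (hypothetical schema)."""
    object_id: str
    title: str
    media_type: str
    tags: frozenset
    content_location: str                # where the full content lives, e.g. on tape
    preview_path: Optional[str] = None   # optional low-resolution copy kept on disk


class Catalog:
    """In-memory metadata repository with a simple tag index."""

    def __init__(self):
        self._records = {}
        self._by_tag = {}

    def add(self, record):
        self._records[record.object_id] = record
        for tag in record.tags:
            self._by_tag.setdefault(tag, set()).add(record.object_id)

    def search(self, *tags):
        """Return records carrying every given tag, without reading any content."""
        if not tags:
            return []
        ids = set.intersection(*(self._by_tag.get(t, set()) for t in tags))
        return sorted((self._records[i] for i in ids), key=lambda r: r.object_id)


# Made-up sample records, for illustration only.
catalog = Catalog()
catalog.add(MetadataRecord("rec-001", "Sample audio recording", "audio",
                           frozenset({"audio", "folk"}), "tape:A/0412",
                           "previews/rec-001.mp3"))
catalog.add(MetadataRecord("img-002", "Sample photograph", "photo",
                           frozenset({"photo", "folk"}), "tape:B/1187"))

for hit in catalog.search("folk", "audio"):
    print(hit.object_id, "->", hit.content_location)
```

Only the record's `content_location` is needed to request the full object afterward, which is what lets the bulky content sit on slower, cheaper media.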
The library uses two separate systems as a best practice for preserving data. One is a massive tape library that has 6,000 tape drive slots and uses the IBM General Parallel File System (GPFS). This file system uses a concept similar to metatagging photos at Flickr.com: files are encoded with algorithms that make the data easier to process and retrieve quickly.
A second archive, with about 9,500 tape drive slots, consists of Oracle/Sun SL8500 tape libraries that use the Sun Quick File System (QFS).
Another best practice: Every archive is sent to long-term storage, then immediately retrieved to validate the data, then stored again.
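The store, retrieve and validate cycle can be sketched in a few lines. The article does not say how the library validates what it reads back, so the SHA-256 comparison here is an assumption, as are the `sha256_of` and `archive_with_validation` names and the use of ordinary directories to stand in for tape storage and a staging area.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_with_validation(source: Path, archive_dir: Path, staging_dir: Path) -> str:
    """Send a file to the archive, read it back, and keep it only if it matches.

    Returns the checksum of the validated object.
    """
    expected = sha256_of(source)
    archive_dir.mkdir(parents=True, exist_ok=True)
    staging_dir.mkdir(parents=True, exist_ok=True)

    archived = archive_dir / source.name
    shutil.copy2(source, archived)        # 1. send to long-term storage

    retrieved = staging_dir / source.name
    shutil.copy2(archived, retrieved)     # 2. immediately retrieve a copy
    actual = sha256_of(retrieved)         # 3. validate what came back
    retrieved.unlink()

    if actual != expected:
        archived.unlink()                 # do not keep a copy that failed validation
        raise IOError(f"validation failed for {source.name}")

    # 4. "stored again": in this sketch the validated copy simply stays in place.
    return expected
```

Recording the returned checksum alongside the object's metadata would allow the same comparison to be repeated on later retrievals.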
Today the library holds around 500 million objects per database, but Youkel expects this number to grow to as many as 5 billion objects. To prepare for this growth, Youkel's team has started rethinking the namespace system. "We are looking at new file systems that can handle that many objects," he says.
Gene Ruth, a storage analyst at Gartner, says that scaling up and out correctly is critical. When a data store grows beyond 10PB, the time and expense of backing up and otherwise handling all of the files climb quickly. One approach: Have one infrastructure in a primary location that handles the ingestion of most of the data, and then have a secondary facility for long-term archival storage.

