Integration of Hadoop and MongoDB Gets Significant Upgrade

Date 2013/8/23 8:56:23 | Topic: Product News

10gen, the MongoDB company, has announced significant updates to its MongoDB Connector for Hadoop, making it easier for Hadoop users to integrate with data in MongoDB, the most popular database for big data systems across a variety of measures. The Connector exposes the analytical power of Hadoop's MapReduce to live application data from MongoDB, helping users drive value from big data more quickly.
Key Enhancements:
* Support for Apache Hive with SQL-like queries across live MongoDB data sets
* Support for incremental MapReduce jobs, enabling simple and efficient ad-hoc analytics
* Support for MongoDB BSON files on Hadoop Distributed File System (HDFS) to reduce data movement

The Connector presents MongoDB as a Hadoop-compatible file system. Real-time data from MongoDB can be read and processed by Hadoop MapReduce jobs, such as when aggregating data from multiple input sources or as part of Hadoop-based data warehousing or ETL workflows. The results of Hadoop jobs can also be written back to MongoDB to support real-time operational processes and ad-hoc querying.
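The read-process-write loop described above can be sketched as a Hadoop Streaming-style map and reduce pair in Python. This is an illustrative simulation only: the `status` field and the in-process shuffle are hypothetical stand-ins, and a real job would be wired to MongoDB collections through the Connector's input and output formats rather than plain Python dicts.

```python
from collections import defaultdict

def map_doc(doc):
    """Map phase: emit a (key, 1) pair per document, keyed on a
    hypothetical 'status' field."""
    yield doc.get("status", "unknown"), 1

def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)

def run_job(docs):
    """Simulate the shuffle/sort step Hadoop performs between the
    map and reduce phases."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

docs = [{"status": "ok"}, {"status": "error"}, {"status": "ok"}]
print(run_job(docs))  # {'ok': 2, 'error': 1}
```

In a real deployment the same map and reduce logic would run across many Hadoop task nodes, with the Connector splitting the MongoDB collection into input shards.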

The new Connector adds support for MongoDB's native BSON (Binary JSON) backup files, which can be stored in HDFS or on local or cloud-based file systems such as Amazon S3, reducing data movement between MongoDB and Hadoop. Reading from backup files can also reduce load on busy operational MongoDB clusters.
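Part of what makes BSON backup files convenient for Hadoop is their layout: a dump is simply length-prefixed binary documents laid end to end, so a reader can skip from document to document without a live mongod. As a rough illustration (not the Connector's actual reader, and covering only UTF-8 string and int32 fields out of BSON's many types), a flat document can be encoded and scanned like this:

```python
import struct

def encode_doc(doc):
    """Encode a flat dict of str/int values as a BSON document:
    int32 total length + elements + trailing 0x00 byte."""
    body = b""
    for key, value in doc.items():
        ckey = key.encode() + b"\x00"
        if isinstance(value, str):
            data = value.encode() + b"\x00"
            body += b"\x02" + ckey + struct.pack("<i", len(data)) + data
        else:  # int32
            body += b"\x10" + ckey + struct.pack("<i", value)
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

def decode_doc(raw):
    """Decode one BSON document (string and int32 elements only)."""
    doc, pos = {}, 4
    while raw[pos] != 0:
        etype = raw[pos]
        end = raw.index(b"\x00", pos + 1)
        key = raw[pos + 1:end].decode()
        pos = end + 1
        if etype == 0x02:  # string: int32 length (incl. NUL) + bytes
            (slen,) = struct.unpack_from("<i", raw, pos)
            doc[key] = raw[pos + 4:pos + 4 + slen - 1].decode()
            pos += 4 + slen
        elif etype == 0x10:  # int32
            (doc[key],) = struct.unpack_from("<i", raw, pos)
            pos += 4
    return doc

def decode_docs(buf):
    """Scan a byte stream of concatenated BSON documents, as found
    in a mongodump-style .bson file."""
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<i", buf, pos)
        yield decode_doc(buf[pos:pos + length])
        pos += length  # the length prefix lets us jump to the next doc

dump = encode_doc({"user": "ada", "n": 3}) + encode_doc({"user": "bob", "n": 5})
print(list(decode_docs(dump)))  # [{'user': 'ada', 'n': 3}, {'user': 'bob', 'n': 5}]
```

The length prefix on each document is also what makes BSON files splittable into chunks for parallel Hadoop tasks.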

In addition to existing MapReduce, Pig, Hadoop Streaming (with node.js, Python or Ruby) and Flume support, the new MongoDB Connector for Hadoop enables SQL-like queries from Apache Hive to be run across MongoDB data. The latest version of the Connector enables Hive to access BSON files, with full support for MongoDB collections scheduled for the next release later this year.
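Once a Hive table has been declared over MongoDB-derived data, the "SQL-like queries" the article mentions look like ordinary HiveQL. The table and column names below are hypothetical, and the DDL declaring the table over BSON files (via the Connector's SerDe) is assumed to have been run already:

```sql
-- Hypothetical: assumes a Hive table `orders` has been declared over
-- BSON files in HDFS using the Connector's SerDe.
SELECT customer_id, SUM(total) AS spend
FROM orders
GROUP BY customer_id
ORDER BY spend DESC
LIMIT 10;
```

Hive compiles such a query into MapReduce jobs behind the scenes, so analysts can work in SQL without writing Java.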

MongoUpdateWritable is another new feature of the Connector that allows Hadoop to modify an existing collection in MongoDB rather than only writing to new collections. As a result, users can run incremental MapReduce jobs, for example aggregating trends or matching patterns on a daily basis, with the results accumulated in a single MongoDB collection that can be queried efficiently.
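The incremental-update behavior described above can be sketched as an upsert-style merge: each day's job emits partial counts, which are folded into a running collection instead of replacing it. This is a toy simulation, with a plain dict standing in for the MongoDB collection and hypothetical event names:

```python
def apply_increment(collection, day_results):
    """Fold one day's partial aggregates into a running collection,
    mimicking an upsert/$inc-style update rather than an overwrite."""
    for key, count in day_results.items():
        record = collection.setdefault(key, {"count": 0})
        record["count"] += count
    return collection

running = {}
apply_increment(running, {"signup": 10, "login": 40})  # day 1's job output
apply_increment(running, {"signup": 3, "login": 25})   # day 2's job output
print(running)  # {'signup': {'count': 13}, 'login': {'count': 65}}
```

Because each run only touches the keys it produced, the daily job stays cheap while the collection always holds up-to-date totals for ad-hoc queries.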

"We are seeing strong market adoption of MongoDB for real-time operational big data and Hadoop for deep, offline analytics. The community has been asking us to make these tools interoperate seamlessly, so they can focus on building value in their applications," said Max Schireson, CEO of 10gen. "The latest upgrades to the MongoDB Connector for Hadoop provide this interoperability."

The MongoDB Connector for Hadoop adds to the broadest set of query and data analysis capabilities of any NoSQL database, enabling companies to reduce the number of tools they use to get value from their data. Options for users also include:
* The MongoDB API, which was recently adopted by IBM as the new standard for building mobile applications
* The Aggregation Framework, which provides functionality similar to SQL's GROUP BY operator
* Integrations with leading BI vendors like QlikTech, Informatica, Pentaho and Talend to perform BI on their live data
* Native MapReduce within MongoDB when integration with Hadoop isn't needed
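The GROUP BY-like behavior of the Aggregation Framework mentioned above can be illustrated by evaluating a minimal `$group` pipeline stage in plain Python. This is a toy evaluator, not MongoDB's implementation: it handles only a single grouping key and a `$sum` accumulator, and the `sales` data is invented for the example:

```python
def group_stage(docs, stage):
    """Evaluate a minimal {"$group": {"_id": "$key", out: {"$sum": "$field"}}}
    stage against a list of documents."""
    spec = stage["$group"]
    key_field = spec["_id"].lstrip("$")
    out_name, acc = next((k, v) for k, v in spec.items() if k != "_id")
    sum_field = acc["$sum"].lstrip("$")
    buckets = {}
    for doc in docs:
        bucket = buckets.setdefault(
            doc[key_field], {"_id": doc[key_field], out_name: 0})
        bucket[out_name] += doc[sum_field]  # the $sum accumulator
    return list(buckets.values())

sales = [
    {"region": "eu", "amount": 5},
    {"region": "us", "amount": 7},
    {"region": "eu", "amount": 2},
]
pipeline_stage = {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}}
print(group_stage(sales, pipeline_stage))
# [{'_id': 'eu', 'total': 7}, {'_id': 'us', 'total': 7}]
```

In MongoDB itself the equivalent stage would run inside the database via the aggregation pipeline, avoiding any round trip to an external processing system.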

MongoDB is the open-source document database popular among developers and IT professionals due to its agile and scalable approach. MongoDB provides a JSON data model with dynamic schemas, extensive driver support, auto-sharding, built-in replication and high availability, full and flexible index support, rich queries, aggregation, in-place updates and GridFS for large file storage. Common use cases include operational and analytical big data, content management and delivery, mobile and social infrastructure, user data management and data hubs.

This article comes from Software Development Tools
