The IT team at Webtrends Inc. has been working with Apache Spark since the processing engine was still an academic research project. But when Webtrends deployed a Hadoop-based big data environment to power new analytics applications in mid-2014, the Spark architecture got a limited role: aggregating details about data indexes to help users find relevant information. Now, things have changed, and the company is leaning much more heavily on Spark as part of an updated version of its big data platform.
Webtrends, which collects user activity data from websites, mobile devices and the internet of things for analysis by corporate clients, put Spark at the center of an application called Infinity Analytics that it began beta testing at the start of 2016 and is now marketing to customers. The organization, based in Portland, Ore., set up a 160-node Spark system to support real-time optimization of online marketing campaigns based on rapid analysis of the activity data streaming into the Hadoop cluster. “We’ve basically unleashed Spark against the data lake to do all the calculating,” CTO Peter Crossley said.
More and more organizations are similarly turning to Spark to help speed up big data processing jobs. Independent statistics are hard to come by on adoption of the processing engine, which didn’t become available in a 1.0 version through the Apache Software Foundation until May 2014. But Databricks Inc., a startup vendor that’s the driving force behind Spark’s development, says that more than 500 organizations have deployed the technology in production applications. And Spark has clearly elbowed its way into the big data spotlight alongside Hadoop, with which it’s often — but not always — paired.
Spark’s initial calling card was its ability to run batch processing applications faster than MapReduce, the programming environment and execution engine embedded in the original version of Hadoop. Spark proponents claim that its core in-memory engine can process data up to 100 times faster than MapReduce, at least in laboratory benchmarks. Batch jobs are still a big use for Spark, both in analyzing large volumes of data and in prepping them upfront via extract, transform and load (ETL) routines.
There’s more to the Spark architecture than that, though. The technology can also handle more interactive and real-time workloads through a set of add-on components, including a machine learning library, a stream-processing module and a graph-processing interface. In addition to its processing speed and application versatility, Spark lets users avoid programming in MapReduce; instead, they can tap higher-level and more familiar languages, such as Java, Python, Scala, SQL and R.
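To make the programming-model point concrete, here is a minimal sketch in plain Python. It is not Spark or Hadoop code, and the page names are invented for illustration. It counts page views twice: once with explicit map, shuffle and reduce phases in the style a MapReduce job imposes, and once as a single higher-level aggregation of the kind Spark's language APIs encourage.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream records; the page names are made up for illustration.
events = ["home", "pricing", "home", "blog", "home", "pricing"]


def map_phase(records):
    # Emit one (key, 1) pair per record, as a MapReduce mapper would.
    return [(page, 1) for page in records]


def shuffle_phase(pairs):
    # Group values by key, the step a framework performs between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the grouped values for each key.
    return {key: sum(values) for key, values in grouped.items()}


# MapReduce-style: three explicit phases chained together.
mapreduce_style = reduce_phase(shuffle_phase(map_phase(events)))

# Higher-level style: the same aggregation expressed as one step.
high_level = dict(Counter(events))
```

Both variables hold the same per-page counts. The difference is only in how much of the plumbing the programmer has to write by hand.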
No lack of Spark deployment options
Hadoop doesn’t have to be part of the picture at all: Spark can run in stand-alone mode against data stores other than the Hadoop Distributed File System — for example, NoSQL databases and the Amazon Simple Storage Service. Also, it isn’t just Databricks pushing the processing engine. Spark is also offered and supported by IBM, Microsoft, Amazon Web Services and Hadoop distribution vendors Cloudera, Hortonworks and MapR Technologies, as well as other big data vendors.
On the other hand, Spark is still an emerging technology that has further maturing to do, according to some early adopters. For example, they cited lingering memory management issues, missing features compared with MapReduce, incomplete data encryption support, and a lack of tools for monitoring and managing Spark systems. To fill those gaps and others, the pace of development on the Spark architecture has been fast and furious: There were nine releases of the Apache open source software during 2015, plus five more thus far this year, including a Spark 2.0 version that became available in July.
But the maturity issues aren’t holding back users such as Webtrends. From Crossley’s standpoint, Spark is eminently production-ready. “It’s a stable [technology], and I have no hesitation at all about deploying it,” he said.
MapReduce wouldn’t have cut it for the fast performance that Webtrends is eyeing on the Infinity Analytics application, but Crossley described Spark and its Spark Streaming module as a perfect fit. “The idea is that this data moves seamlessly through our system, and it’s happening in real time. To look at the data and interrogate it in a fast way really required us to go with something like Spark.”
Each day, Webtrends funnels data on more than 13 billion online events — internet clickstreams, for example — into its Hortonworks-based Hadoop cluster. It took 12 hours to make the incoming data available for use in the company’s first big data analytics application, called Explore.
With the Spark platform processing the data in streams and running automated machine learning algorithms against it, marketing managers and data scientists at beta user organizations of Infinity Analytics were initially able to get information within a few minutes, Crossley said. His goal is to get the delay down to a matter of seconds so clients can dynamically tailor webpages and marketing offers to site visitors.
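A rough back-of-the-envelope calculation shows what those latency targets mean in volume terms. The sketch below assumes events arrive at a uniform rate, which real traffic will not do, and treats "more than 13 billion" as exactly 13 billion. The five-minute and five-second windows are assumed stand-ins for "a few minutes" and "a matter of seconds," not figures from Webtrends.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# "More than 13 billion" events per day, per the article; used here as a lower bound.
events_per_day = 13_000_000_000

# Average arrival rate, assuming events are spread evenly across the day.
avg_events_per_second = events_per_day / SECONDS_PER_DAY


def events_in_window(window_seconds):
    # Number of events that arrive while a window of this length elapses.
    return avg_events_per_second * window_seconds


# Events that arrive during each processing delay.
backlog_12_hours = events_in_window(12 * 60 * 60)  # the original 12-hour delay
backlog_5_minutes = events_in_window(5 * 60)       # assumed "few minutes"
backlog_5_seconds = events_in_window(5)            # assumed "matter of seconds"

print(f"average rate: {avg_events_per_second:,.0f} events/second")
print(f"12-hour window: {backlog_12_hours:,.0f} events")
print(f"5-minute window: {backlog_5_minutes:,.0f} events")
print(f"5-second window: {backlog_5_seconds:,.0f} events")
```

Under these assumptions the average rate works out to roughly 150,000 events per second, so a 12-hour delay corresponds to about 6.5 billion events waiting to become usable, while a five-second window still covers several hundred thousand.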
Upward mobility for Spark architecture
Synchronoss Technologies Inc. has also centered its big data environment on Spark. “We’ve pretty much standardized on Spark as our data processing engine,” said Suren Nathan, senior director of big data analytics at the Bridgewater, N.J., company, which sells mobility management applications and related analytics services to mobile network operators and corporate enterprises.
The big data implementation also includes a Hadoop cluster running MapR’s distribution; it originated at Razorsight Corp., an analytics provider that Synchronoss acquired in August 2015. Nathan, who led the deployment at Razorsight, said Spark initially is being used as a faster alternative to MapReduce for several “workhorse” batch applications. That includes ETL data integration jobs, as well as data profiling programs that give the analytics teams at Synchronoss a view into the device, network and operations data the company collects from its clients.
But Synchronoss is looking to expand its use of the Spark architecture into more real-time processing realms. By year’s end, it plans to add Spark Streaming for applications such as tracking use of mobile devices so marketing offers can be sent to consumers “at the point of the event,” Nathan said. He then expects in 2017 to turn to MLlib, Spark’s machine learning library, to fuel automated analytics applications — for example, detecting fraudulent activities and violations of mobile-device security policies on corporate networks.
Synchronoss also does SQL programming via the software’s Spark SQL module, in addition to writing application code in Python and Java. Between the core engine and the components surrounding it, the Spark platform “is kind of a one-stop shop” for the company’s big data processing needs, Nathan said. “If we didn’t use Spark, we would have had to use a different piece of technology for all of those things.”