Apache Spark is opening the world of big data to possibilities previously unheard of

mizhu2 · 07-06-2016

By adding real-time capabilities to Hadoop, Apache Spark is opening the world of big data to possibilities previously unheard of. Spark and Hadoop will empower companies of all sizes across all industries to convert streaming big data and sensor information into immediately actionable insights, enabling use cases such as personalized recommendations, predictive pricing, proactive patient care, and more.

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark.

Download our complimentary book excerpt to read about:

An introduction to Apache Spark on Hadoop
An introduction to Data Analysis with Scala and Spark
The Spark Programming Model
The steps for getting started
And a lot more!

micahwilliams12 · 10-09-2017

Before starting with any big data analytics tool such as Spark or Flink, you need to be familiar with the Map, Reduce, and Filter operations [there are many more, but these are the basics]. I'm just going to describe briefly what they are, using the word-count example below [Source: Examples | Apache Spark].

First, you specify the input path from which the data will be read when the job is executed. Next, the flatMap operation splits each line on spaces and returns the results as a single collection of words. This is analogous to mapping functions you have probably encountered elsewhere [for example, Python's built-in map]. The only difference is that the function passed to flatMap can return several elements for each input, whereas with map it returns exactly one.
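
If you want to see the difference between map and flatMap before touching a cluster, here is a minimal sketch on plain Scala collections. It does not use Spark at all; the object name FlatMapSketch and the two sample lines are made up for illustration, and I'm assuming the local collection methods behave closely enough to their RDD counterparts for this purpose.

```scala
object FlatMapSketch {
  // Two hypothetical input lines standing in for the contents of a text file.
  val lines: List[String] = List("to be or", "not to be")

  // map yields exactly one output element per input element:
  // here, one array of words per line.
  val mapped: List[Array[String]] = lines.map(line => line.split(" "))

  // flatMap applies the same function, then flattens the per-line
  // arrays into one list of words.
  val words: List[String] = lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    println(mapped.length)         // number of lines
    println(words.mkString(", "))  // all words, in order
  }
}
```

With map you still have one entry per line; with flatMap you have one entry per word, which is what the counting step needs.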

Next, the map operation pairs every word arriving from the previous stage with an initial count of 1, and the reduceByKey call groups identical words and sums their counts using the function _ + _.
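
To make the reduceByKey step less magical, here is a rough local equivalent on plain Scala collections. This is only a sketch: reduceByKey is a Spark RDD method, and the groupBy-then-reduce combination below is my stand-in for it, not how Spark implements it. The object name ReduceByKeySketch and the word list are hypothetical.

```scala
object ReduceByKeySketch {
  // Hypothetical output of the flatMap stage.
  val words: List[String] = List("to", "be", "or", "not", "to", "be")

  // Pair every word with an initial count of 1, as the map step does.
  val pairs: List[(String, Int)] = words.map(word => (word, 1))

  // Local stand-in for reduceByKey(_ + _): group the pairs by word,
  // then combine the counts in each group with the same _ + _ function.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) =>
      (word, group.map(_._2).reduce(_ + _))
    }

  def main(args: Array[String]): Unit =
    counts.foreach { case (word, n) => println(s"$word: $n") }
}
```

The important part is that _ + _ only ever combines two counts for the same key, which is what lets Spark apply it in parallel across partitions.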

I think that clarifies the most fundamental operations you would need to start doing anything on your data.

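// In spark-shell, sc is the predefined SparkContext.
val textFile = sc.textFile("foo.txt")

val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// saveAsTextFile writes a directory of part files at this path, not a single file.
counts.saveAsTextFile("result.txt")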

Now that we've got that out of the way, the best way to start is to play around with Spark. Spark provides a very nice interactive shell, and every step I described in the previous example can be typed into it one by one and executed. Try out some examples there, using the examples page cited above as a guide.

Okay. You've made it this far. Now, to the real stuff. You will need to figure out the kind of analysis you want to run on your data. If you're going to use the data to predict future operations and activity, or, say, to build a recommendation system, Spark has an extensive machine learning library. Consider, for example, the task of training an SVM classifier on your data [Source: Spark MLlib]. This is all you need to do:
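import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
val data = MLUtils.loadLibSVMFile(sc, path) // path: placeholder for the location of your data in the libsvm format
val model = SVMWithSGD.train(data, numIterations) // numIterations: placeholder for the number of iterations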

Next, if you want to perform SQL-like queries on your data, Spark SQL comes to your rescue. It supports operations such as select, filter, grouping, and count. Take a look at the API to find the ones you require.
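
If you have never written these query operations in code, the sketch below shows what each one means, using plain Scala collections as a stand-in. To be clear, this is not the Spark SQL API; the Person record, the sample rows, and the object name QuerySketch are all invented for illustration, and the comments give the SQL clause each line roughly corresponds to.

```scala
object QuerySketch {
  // Hypothetical record type and rows, invented for this example.
  final case class Person(name: String, city: String, age: Int)

  val people: List[Person] = List(
    Person("Ann", "Oslo", 34),
    Person("Bob", "Oslo", 19),
    Person("Cid", "Rome", 41)
  )

  // Filter: keep only rows matching a predicate (WHERE age >= 21).
  val adults: List[Person] = people.filter(_.age >= 21)

  // Select: project a single column from the filtered rows (SELECT name).
  val names: List[String] = adults.map(_.name)

  // Grouping and count: number of rows per city (GROUP BY city, COUNT(*)).
  val perCity: Map[String, Int] =
    people.groupBy(_.city).map { case (city, rows) => (city, rows.size) }

  def main(args: Array[String]): Unit = {
    println(names.mkString(", "))
    perCity.foreach { case (city, n) => println(s"$city: $n") }
  }
}
```

Spark SQL expresses the same ideas over distributed tables, so check its API documentation for the actual method names and signatures.
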

All of this extends easily to data distributed across several machines. Of course, you would need a distributed storage system such as HDFS or HBase. Spark integrates with these storage systems, and for files on HDFS, simply providing a path such as 'hdfs://a.txt' (in practice, a full URI usually also names the namenode host) will get it working. You can focus on further optimizations later on.