
Apache Spark is opening the world of big data to possibilities previously unheard of

mizhu2
Cisco Employee

By adding real-time capabilities to Hadoop, Apache Spark is opening the world of big data to possibilities previously unheard of. Spark and Hadoop will empower companies of all sizes across all industries to convert streaming big data and sensor information into immediately actionable insights, enabling use cases such as personalized recommendations, predictive pricing, proactive patient care, and more.

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark.

Download our complimentary book excerpt to read about:

  • An introduction to Apache Spark on Hadoop
  • An introduction to Data Analysis with Scala and Spark
  • The Spark Programming Model
  • The steps for getting started
  • And a lot more!
1 Reply

micahwilliams12
Level 1

First of all, before starting with any big data analytics tool like Spark, Flink, etc., you need to be familiar with the concept of map, reduce, and filter operations [there are a lot more, but these are the basics]. I'm just going to describe briefly what they are, using the following word-count example [Source: Examples | Apache Spark].

First of all, you specify the input path from which the data will be read when the job is executed. Next, the flatMap operation splits each line on spaces and returns the results as a collection of words. This is analogous to the mapping functions you must have encountered before [for example, Python's map over a list]. The only difference is that flatMap can return several elements for each input, whereas map returns exactly one.


Next, every word arriving from the previous stage is initially assigned a count of 1 using the map operation, and the reduceByKey call then groups identical words and sums their counts using the _ + _ operator.


I think that clarifies the most fundamental operations you would need to start doing anything on your data.

    // in the Spark shell, sc is the predefined SparkContext
    val textFile = sc.textFile("foo.txt")
    // split lines into words, pair each word with 1, and sum the counts per word
    val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("result.txt")  // write the word counts back out as text
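Filter was mentioned above but does not appear in the word count, so here is a one-line sketch of it, reusing the textFile value from the snippet above; the length threshold is arbitrary:

    // filter keeps only the elements that satisfy a predicate
    val longWords = textFile.flatMap(line => line.split(" ")).filter(word => word.length > 5)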

  1. Now that we've got that out of the way, the best way to start is to play around with Spark first. Spark provides a very nice shell interface (spark-shell). Every step I described in the previous example can be typed in one by one in the shell and executed. Try out some examples on it, and take help from the examples page linked above [a short shell session is sketched after this list].
  2. Okay, you've made it this far. Now, to the real stuff: you will need to figure out the kind of analysis you want to run on your data. If you're going to use the data to predict future operations and activity, or, say, build a recommendation system, Spark has an extensive machine learning library. Consider, for example, the task of training an SVM classifier on your data [Source: Spark MLlib]. This is all you need to do:
       import org.apache.spark.mllib.classification.SVMWithSGD
       import org.apache.spark.mllib.util.MLUtils
       val data = MLUtils.loadLibSVMFile(sc, "data.txt")  // load data in LIBSVM format
       val model = SVMWithSGD.train(data, 100)            // train with 100 iterations of SGD

  3. Next, if you want to perform SQL-like queries on your data, Spark SQL comes to your rescue. Spark supports operations like select, filter, groupBy, count, etc. Take a look at the API to find the ones you require [a small sketch follows after this list].
  4. Now, all of this can easily be extended to data distributed across several machines. Of course, you would need a distributed storage system like HBase, HDFS, etc. Spark integrates seamlessly with these file systems, and simply providing the path to a file as, say, 'hdfs://a.txt' will get it working [see the last sketch after this list]. You can focus on further optimizations later on.
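To make step 1 concrete, here is a minimal spark-shell session; it just retypes the word count from above interactively, and foo.txt is a placeholder file name:

    // start the shell with: ./bin/spark-shell
    val lines  = sc.textFile("foo.txt")                // sc is provided by the shell
    val words  = lines.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
    counts.take(10).foreach(println)                   // inspect a few results interactively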
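For step 3, here is a minimal Spark SQL sketch; the people.json file and its name and age columns are assumptions made up for illustration, not something from the post:

    import org.apache.spark.sql.SparkSession

    // recent Spark shells already provide a SparkSession named spark;
    // building one here keeps the sketch self-contained
    val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
    val df = spark.read.json("people.json")
    df.select("name", "age").filter(df("age") > 21).show()  // select and filter
    df.groupBy("age").count().show()                        // grouping and count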
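And for step 4, moving to distributed storage really is just a matter of changing the paths; the namenode host, port, and file names below are placeholders:

    // the same word count, with input and output on HDFS instead of the local disk
    val textFile = sc.textFile("hdfs://namenode:8020/user/demo/a.txt")
    val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://namenode:8020/user/demo/result")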