Lecture 16 – Analyzing Big Data with Twitter: Spark by Matei Zaharia

Okay, so our last guest lecture for the class, but certainly not least, is Matei Zaharia, who is actually quite famous these days for being the inventor of the Spark system, which is really the hot competitor to Hadoop these days for processing large data. Matei is currently a PhD student here at UC Berkeley, but he's worked on Twitter data as well as a lot of other kinds of data, so it's definitely relevant to the class. So let's give him a warm welcome.

Thank you very much. Okay, well, thanks a lot for having me talk in the course. I know you guys have seen a bunch of things on parallel programming models for big data already, so I hope this will be interesting. Spark is a research project we started right here at Berkeley a couple of years ago, and it's also an open source project that's actually starting to get some real users in industry, which is pretty exciting to us. We're part of the AMPLab, which is a computer science lab focusing on systems for big data.
In a nutshell, Spark is a parallel computing framework that provides a few things. One of the main things it provides, which previous systems didn't, is efficient primitives for doing computation in memory across a cluster, which can be a lot faster when you have computations that go over the same data many times, and I'll explain why that is. It also provides some simple, clean APIs in Scala, Java, and SQL (in fact we're working on a Python API as well) that make it faster to write programs even if you're not using the in-memory stuff. And finally, it tries to be highly general, so it's applicable to a lot of emerging applications that so far people have built specialized systems for, and that's one of the things we're excited about from a research perspective.
this talk I’ll talk a little bit about how it works and you know kind of what
it lets you do and I’ll also talk about how people are using it and we have a
bunch of applications both in industry and at Berkeley you know people doing
their research using it and finally I’ll talk a bit about some of the current
research that we’re doing and since you guys have seen you know GraphLab and
Scalding and Pig and things like that I made this talk slightly more “researchy”
than you know if I were just giving brand-new audience a talk on on Spark
but hopefully it will still make sense and certainly ask questions because I’m
glad to answer them and even to talk about things that aren’t in the slides
So let me talk first about why we wanted to do Spark. Our lab has actually worked with MapReduce and Hadoop for at least five years, and we saw that there was a huge amount of success of MapReduce in terms of use. It's not just web companies like Twitter and Google that are using it; today you have Bank of America, Visa, and science labs like the group at the Large Hadron Collider, all of which use Hadoop MapReduce. So that was very successful.
But we also saw that, as soon as people started putting these big data sets together, having file systems to manage them and the infrastructure to power the system, they wanted to do a lot more than MapReduce could offer. In particular, people wanted to do three things. The first is to run more complex applications, especially multi-pass algorithms that go over the data several times; these are things like machine learning algorithms and graph algorithms, which you guys saw with GraphLab. The second is to run more interactive queries. Imagine you're Google: you collected the crawl of the web overnight and you had a job that took four hours to build a web index. That's great, but now, if you have a question that isn't answered by your index, how quickly can you ask that question about however many terabytes of data that is? Instead of that question taking hours, can you get an answer in seconds? And the last thing people wanted is more real-time computation. Say you're Twitter, for example, and every night you build a model for identifying certain users. That's great, you can now identify these users, but can you run this model every minute or every second to find those users in real time?
I just want to point out that these are really natural directions to go after doing batch processing, and these three areas, complex apps, interactive queries, and real-time, are actually where a lot of the value in big data lies. If you get the data but you have to wait multiple hours to ask a question about it, it's not as useful as if you could answer that question in a second. So, unsurprisingly, there's been a lot of attention to these things, but what has happened is that people have designed a lot of specialized tools and specialized programming models for these applications; some examples are Bagel for graph processing (GraphLab is another example) or Storm for stream processing. In Spark we started with the observation that these models actually have a lot in common, and we wanted to design one system that can handle all of them, which has a lot of benefits that I'll talk about. In particular, what we saw is that the real reason you need something other than MapReduce for these applications is that all of them require more efficient data sharing than you have in MapReduce. These multi-pass algorithms, streaming, and interactive queries are all things that go over the same data multiple times; they reuse it across time steps or across iterations of an algorithm, and that's the problem we wanted to address.
So let's look at what happens if you try to implement these in MapReduce. I have two examples here: one is an iterative algorithm, so imagine something like PageRank, and the other is interactive queries. If you look at PageRank at the top, PageRank is basically a bunch of MapReduce steps, where each time you update the rank of each page. On each step you start with your data sitting in, say, the Hadoop file system, HDFS; you load it in, you do a MapReduce to compute something, and then you have to write your new state out, and the only abstraction Hadoop gives you for writing it out is to write it back as a file in the file system. So you did all this crunching, now you write it out, and when you write it to HDFS you incur a few costs: the cost of writing out to disk, and also the cost of replicating it across machines, because that's how HDFS provides fault tolerance. So you did all this work of writing it out, you launch the next iteration, and the next thing you do immediately is read all this data back in, and you keep doing this on each step. If you actually profile these applications and look at what they're doing, they can easily spend 90% of their time just writing data out and loading it back in; that can easily be more expensive than the actual PageRank computation. The same thing happens with the interactive queries: if your data is on a disk-based file system like HDFS, each time you ask a question you have to read it from disk, and that's slow. So you have the same kind of problem: you're going over this data multiple times, and you're doing it at the speed of disk.
Now, I want to point out that these things are slow due to data replication and disk I/O, but those two features of HDFS are also necessary for fault tolerance: if we want to run this on a large cluster, we need it to be fault tolerant. So it's not that these systems did something completely misguided; it made sense to provide fault tolerance this way. What we wanted to do in Spark is provide this sharing at the speed of memory instead of disk, and basically see what we can do to make this go at memory speed. The picture is pretty simple: we replace all the places where we were writing to disk with writing to memory. The reason to use memory is also really simple: memory is easily ten to a hundred times faster than the network or disk. In a modern computer, even the fastest networks you can get today, say 10 Gigabit Ethernet, will be more than ten times slower than the memory bandwidth you have within a machine, and certainly disks are about a hundred times slower. So it made sense to do this at memory speed.
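(As a minimal sketch of what that sharing looks like in Spark's Scala API, again assuming a spark-shell session where `sc` is the SparkContext; the path and the per-pass computation are placeholders, not from the talk. The data is read from HDFS once, kept in cluster memory with cache(), and then reused across passes.)

```scala
// Load the data once and mark it to be kept in memory across operations.
val points = sc.textFile("hdfs://namenode/data/points.csv")   // placeholder path
  .map(line => line.split(",").map(_.toDouble))
  .cache()

// Each pass reuses the in-memory copy instead of re-reading (and re-writing) HDFS.
for (i <- 1 to 10) {
  val total = points.map(_.sum).reduce(_ + _)   // stand-in for one iteration's work
  println(s"pass $i: $total")
}
```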
The challenge then was how to get the fault tolerance back, and that's the problem we looked at from a systems research point of view: how do we design an abstraction that lets you store data in distributed memory and that is both fault tolerant and efficient? By efficient I mean it actually goes at memory speed, and even though there have been a bunch of systems that store data in memory, none of them actually has this efficiency property.

[Audience question] Yeah, I will talk about that. What do you mean? You mean, what kinds of faults are we trying to deal with? If any node in the cluster goes down, or slows down and is running much slower than the rest, we want to be able to recover the computation and basically finish it and get the same answer we would have had if that didn't happen, and we also tolerate multiple nodes failing, and all of that.
that’s yeah so this is what you want it to do so if so one option we could have
done is to take an existing in memory storage system and there are a bunch of
these but it turned out that that none of them actually have the right level of
efficiency so if you take so all these existing system things like key value
stores or even databases or distributed shared memory which is a bunch of
research systems are based on fine grained updates the state so basically
you have a big table sitting in the memory and you can read and write to any
cell in the table and this is it’s a really natural interface it’s kind of
how memory works on your local machine but it’s expensive to provide fault
tolerance because essentially you have to either put each piece of data on two
machines or write a log of the updates and send that log to multiple machines
each time you do an update and so basically each time you write data it
needs to go over the network and as I just said a couple of slides ago the
network is it’s ten to a hundred times slower than your memory so you’ll be
writing that at the network speed instead of the memory speed here so so
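(To make that cost concrete, here is a purely hypothetical sketch, not any real system's API, of a fine-grained, fault-tolerant table: to survive a failure, every single put has to be shipped over the network to replicas, so writes end up running at network speed even though the local update is at memory speed.)

```scala
// Hypothetical fine-grained interface: every cell update is replicated remotely.
case class Put[K, V](key: K, value: V)

trait Peer { def send(msg: Any): Unit }   // stand-in for a remote machine

class ReplicatedTable[K, V](replicas: Seq[Peer]) {
  private val local = scala.collection.mutable.Map.empty[K, V]

  def put(key: K, value: V): Unit = {
    local(key) = value                          // memory-speed write on this node
    replicas.foreach(_.send(Put(key, value)))   // network-speed cost paid on every write
  }

  def get(key: K): Option[V] = local.get(key)
}
```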
So how do we deal with this? How do we actually provide fault tolerance without having to keep two copies of the data on two machines? The way we do it is by changing the interface. We came up with this idea called resilient distributed datasets, or RDDs. These are parallel datasets you can manipulate on the cluster, but the interface, instead of being fine-grained reads and writes, is just coarse-grained operations: you apply one operation to the whole dataset. It turns out you can provide a lot of operations in this model, so we can provide map and reduce, things like group-by, joins, SQL joins, and so on. What's cool about this model is that, since we have this higher-level understanding of the operation, we can provide fault recovery using what's called lineage: instead of replicating the data, we remember the operation we applied to a previous dataset. So we just log one operation that we applied to the whole dataset, like a map function, and if something fails we can recompute the lost data, and there's also no cost if nothing fails, because we're not replicating anything.
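(As a rough sketch of how that looks from the user's side, again in a spark-shell session with `sc`, with a placeholder path and a made-up log format: the chain of coarse-grained operations below is the lineage Spark records, so if part of the cached `errors` dataset is lost, it can be recomputed by re-running the map and filter on the corresponding piece of the input file, rather than by keeping a second copy in memory.)

```scala
// Spark logs the operations (textFile -> map -> filter), not the data itself.
val lines  = sc.textFile("hdfs://namenode/logs/app.log")    // placeholder path
val fields = lines.map(_.split("\t"))                       // assumes tab-separated logs
val errors = fields.filter(f => f(0) == "ERROR").cache()    // keep the result in memory

errors.count()                                       // first action materializes it in memory
errors.filter(f => f(1).contains("timeout")).count() // later queries reuse the cached data;
// if a node holding part of `errors` fails, that part is rebuilt from lineage
```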
Let me just show what that looks like. If we have our datasets here from before, in the first one, for example, we run these two MapReduce steps, iteration 1 and iteration 2, and say that we lose this dataset over here: what RDDs will do is remember the map operation that was used to build it and recompute the lost data.
