[MUSIC] Welcome to Cloud Applications 2. This is our second course, and we're really excited about bringing what has been new material to you. We have organized it so that we're going to cover quite a lot of the new activity in clouds being applied to data systems. I'm Roy Campbell. >> I'm Reza Farivar. >> And we've worked hard at trying to get this together. So where do I come from? I've been a professor here for 40 years. I've been teaching systems. And all of a sudden, the whole systems world got very exciting when cloud computing came along, when cluster computing came along. All of a sudden, there were just lots of innovations and inventions going on. The whole big data area is exploding, applications are increasing, jobs are increasing. This is a great time to be in computer science. >> Absolutely. Absolutely. Well, I'm hoping to bring some industry expertise into the class. I'm a manager of data engineering pipelines at Capital One. I'm also adjunct faculty at the university, and together with Professor Campbell, I'm hoping to bring you lots of interesting information about this segment of the market, which is, my God, exploding. There's just so much interest in big data and clouds these days. >> We've got a number of different modules for you. We'll take you briefly through those modules right now, and then each module will introduce its particular topic and get you into depth on each of those aspects. The first topic is really, well, what about clouds now? What's happening in the big data world, and so on. We're going to do it by example. What we're going to do is choose Spark, because it's so pervasive, and because, if you want to get started, Spark is a great way to start: you can do it on your laptop, but then you can move those same elements onto cloud computing without having to change too much to make it all happen. So we'll go through that motivation for Spark. Actually, Reza, why don't you tell them what you're going to tell them? >> Yeah, yeah, absolutely. So before talking about the details, I want to mention that Spark has seen a tremendous amount of support in the open source community in the past couple of years. In previous versions of this course, when we started teaching it at the university, Spark wasn't there; the whole course was based around Hadoop and that sort of model. Of course, we still mention Hadoop and talk about Hadoop, but Spark has gained so much traction in the community that we've reworked the course, and a lot of the examples and hands-on work in the course we now do in Spark. So in the first module, we start by talking about Spark. We provide some examples of how you- >> You've got to do some mining. >> Yeah, we're going to do some log mining. We do [CROSSTALK] >> Some logistics. >> Logistic regression. >> Yep, and some fault tolerance. One of the things is that, as you write this, a lot of the software that you would have to put into the system is already there. So we have these resilient distributed datasets, RDDs, that provide you fault tolerance when you write to them. And then we're going to talk about how you interact with it, and what you'll find is that it supports both interactive and batch styles of processing; you're going to mention that. >> Exactly.
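To make the log mining idea concrete before we get there, here is a minimal sketch in Scala of the kind of thing we mean. It assumes a spark-shell session (so the SparkContext `sc` already exists) and a hypothetical log file name. Because the filtered RDD is cached and its lineage is tracked, Spark can recompute lost partitions if a node fails, which is the fault tolerance we just mentioned.

```scala
// Inside spark-shell, where `sc` (the SparkContext) is predefined.
// "access.log" is a hypothetical path; point it at any text log.
val lines = sc.textFile("access.log")

// Build an RDD of just the error lines and cache it in memory,
// so repeated interactive queries don't re-read the file.
val errors = lines.filter(_.contains("ERROR")).cache()

// Two interactive queries over the same cached RDD.
println(s"total errors: ${errors.count()}")
errors.filter(_.contains("timeout")).take(5).foreach(println)
```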
>> And then, because we're in academia, I guess, well, we're fairly academic enough, we'll lift the covers a bit and just talk about what's underneath Spark that makes it all work. >> Yeah, in this course we definitely want to give you knowledge of how to use a lot of these systems and write applications from a hands-on point of view. But we also want to give you an overview of how these distributed systems work and what sort of algorithms are underneath. So we definitely will go a little bit into Spark's implementation in this course, and into the other systems that we'll talk about in a couple of minutes. >> So once we've done that motivation and introduction, I'm going to introduce you to the platforms that are around. They come in distributions, distros as they're called, and there are three that I really want you to get familiar with. One is Hortonworks. Hortonworks was one of the founding Apache members when Apache moved to look at cloud computing. Hortonworks has two great suites, one for streaming data, and one for building MapReduce and database, Bigtable-type operations. So we'll talk about Hortonworks. Then I'm going to talk about Cloudera, which is often what people actually get their hands on, whether at a company or at a university; if you take a course, you might well end up using Cloudera. And what that is, is a package you can use with pretty much all the things you'll find in Hortonworks, just organized and presented in a different way. And then we'll talk about the MapR distro, again covering much the same ground, but in a different distribution framework with a slightly different emphasis. So I'll lay those out, and you can pick them up, put them on your laptops, and play with them. Those systems are really the basis for what you see running out there on the cloud systems. And then, of course, we've already introduced Spark, so by then you'll have a good feeling for what sorts of systems you're going to meet. >> Exactly. So just to put a wrap on the distros: for you old-timers, it's kind of like how Linux used to be 20 years ago, when you would download source code from different places and build this driver or that driver and build a kernel. You still can do that; the same mentality applies here. You can download Hadoop and install it on your cluster, and then you can download Storm and install it on your cluster. Or, like a Linux distribution, you can have them all packaged together: you install the Hortonworks Sandbox, and everything comes with it, Storm, Hadoop, Spark, whatever. >> It just makes life too easy, right? >> Right. >> Okay, then, what we're going to do: we've talked about RDDs earlier on, but we're going to emphasize a little bit more how you get reliability and how you get really massive data sets. We're going to talk about HDFS, the file system for Hadoop and for a lot of other types of big data operations. HDFS is kind of unique; it came along in the early days, and it's kind of like file systems you already know, but boy, can you move data with HDFS. >> My God, it can store massive amounts of data, and it's really a building block for a whole bunch of the systems we talk about; all of them can work with HDFS. You put files into HDFS, petabytes, multiple petabytes of information, and it can easily handle them. And it all comes down to how they architected the system, right? It's all about how they designed the system. >> So if there's somebody you want to impress, then HDFS is there.
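As a taste of how naturally HDFS plugs into all of this, here is a hedged sketch of Spark reading from and writing to HDFS inside spark-shell. The namenode address and paths are hypothetical placeholders, not something from the course.

```scala
// The host, port (8020 is the usual HDFS RPC port), and paths are placeholders.
val clicks = sc.textFile("hdfs://namenode:8020/data/clicks/2016/*.log")

// HDFS splits each file into large replicated blocks across datanodes,
// and Spark schedules tasks near the blocks they read.
println(s"records: ${clicks.count()}")

// Write results back; Spark produces one part-file per partition.
clicks.filter(_.contains("checkout"))
      .saveAsTextFile("hdfs://namenode:8020/out/checkout-clicks")
```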
Then we're going to move on to talking about infrastructures for doing distributed computations, like Hadoop and some of the other systems that we're going to discuss, Storm and so on. They're all based now on a framework; things have kind of moved along since our first course. All those implementations are now being put on top of these very, very smart infrastructures for resource management and scheduling, called YARN and Mesos, that organize all of this. Basically, what such a system looks at is all the interconnectivity of the parts, as a sort of graph, data dependencies, whatever, and then it builds schedulers on top of that to actually make all the parts work together. And that part of it, well, you'll find it in the really big 10,000-node systems. >> Exactly. I mean, when you go to industry applications and you look at the real clusters, clusters with 20,000, 40,000 nodes, that's where you really need something like YARN or Mesos to handle resources. And these days a lot of the frameworks sit on top of YARN or Mesos, which handle everything else. You have MapReduce on YARN, Storm can use YARN, Spark can use YARN. But YARN is kind of the operating system that sits down there. >> So now you get the idea of exactly how we're going to present these modules. As we go forward, we're going to be doing the same thing. We're going to move on now to looking at MapReduce. We start off with a Spark implementation and describe how it works and what the motivation for it is. We're going to describe the programming model and then some examples. Reza, you're going to handle that one, we- >> [CROSSTALK] Yes, absolutely. So MapReduce is really a foundational way of thinking about distributed algorithms. Before MapReduce came along, about ten years ago, more or less, people would write MPI programs for distributed systems, and it was messy and it was hard to write. The whole idea of MapReduce came along and changed things. We'll talk about how you can think in terms of MapReduce, we'll provide some examples, and then we'll show you how to write a MapReduce example in Spark interactively, and you'll see how easy it is to write a program that does lots of work in Spark and MapReduce.
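Here is roughly what that interactive MapReduce example will look like: the classic word count, written in a few lines at the spark-shell prompt. The input file name is a placeholder.

```scala
// Classic MapReduce expressed as a few lines in spark-shell.
// "book.txt" is a placeholder; any text file works.
val counts = sc.textFile("book.txt")
  .flatMap(_.split("\\s+"))        // map phase: emit each word
  .map(word => (word, 1))          // key every word with a count of 1
  .reduceByKey(_ + _)              // reduce phase: sum counts per word

// Peek at the ten most frequent words.
counts.sortBy(_._2, ascending = false).take(10).foreach(println)
```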
>> So as you will have realized, it's not just about the algorithms. It's also about the data structures. So then, in module three, we're going to launch into the data structures you need to manipulate big data, and the support you have for doing this. And I'm going to introduce large-scale data storage and talk about the problems you get when you take data and spread it across 10,000 nodes. If you change anything, how do you get it all consistent? How do you look at all that data in a consistent way, even when nodes are failing, even while some of the data in the system is being updated? So we'll look at eventual consistency, and we'll look at the tradeoffs among different types of approaches. We'll talk about ACID, BASE, ZooKeeper, Paxos, and their techniques for keeping things consistent. And that will lead up to a good background for tackling the next chapter, which- >> [CROSSTALK] Yeah, that stuff is actually very interesting, very exciting, to be honest. When you want to scale into large distributed systems, you might just think, okay, I'll just copy-paste my machines: instead of one machine, I'll have ten machines. But that whole idea introduces so many different challenges, and Professor Campbell is going to cover a lot of those- >> [CROSSTALK] That reminds me of a good example. When you take an algorithm, like one of the graph algorithms we'll be talking about later on, and you want to scale it from 100 nodes to 1,000 nodes, how exactly do you do this and keep the computation running? >> Yes. >> You know, that's- >> [CROSSTALK] Proving something like the liveness of an algorithm, well, when you have 1,000 machines, it's guaranteed that at any time one, two, three, a couple of them are in a failed state, right? And how can you make sure that your algorithm can actually even move forward? So that's a lot of interesting stuff, and Professor Campbell is going to cover it. >> Then we'll get into details about how you program it, and like all the other chapters, we're going to start off with, how do we use Spark to actually do things? So we'll ask, why not just use a database, why not use SQL? Then we'll describe a database that actually is scalable, HBase, and then we'll use Spark to write some SQL, to play with those sorts of databases and get a feeling for what's happening. Now, there are some really trendy things happening in databases. >> Yeah, that space is also seeing a lot of attention. We have picked a couple of very popular ones. We will talk about HBase, and we will talk about Spark, which now has a SQL engine. And we'll also talk about some in-memory key-value stores. >> Tell me about Redis, Reza. >> Right, so Redis is actually an interesting system that solves a very interesting challenge. It has a niche, and it does very well in it. So distributed databases, we'll talk about them, of course, in detail: all the issues with the CAP theorem, and consistency, and everything. Redis solves an interesting problem: how you can store key-value pairs in memory across a distributed cluster, and how you can make sure that it's extremely fast and very usable. >> You keep it in memory, right? >> You can keep it in memory, but of course it's not that easy, and Redis is the system that handles that; we'll talk about it. It's a very interesting building block. It's used in batch systems and databases, as we discuss in this module, and it's also used very extensively in streaming, which we cover in the next module.
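To make the key-value idea concrete, here is a minimal sketch against a single Redis server, written in Scala using the Jedis client (our choice for illustration; the course material may use a different client). The server address and key names are hypothetical.

```scala
import redis.clients.jedis.Jedis

object RedisSketch {
  def main(args: Array[String]): Unit = {
    // Assumes a Redis server on localhost:6379; keys are hypothetical.
    val jedis = new Jedis("localhost", 6379)

    // Plain key-value writes and reads, all served from memory.
    jedis.set("user:42:name", "ada")
    println(jedis.get("user:42:name"))

    // Atomic counter, the kind of thing you'd use for click counts.
    jedis.incr("clicks:2016-08-21")
    jedis.expire("clicks:2016-08-21", 86400) // keep the counter for a day

    jedis.close()
  }
}
```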
>> But before we talk about streaming, let's talk about Kafka. >> Kafka now. So everybody wants to write systems that pass messages back and forth. You're picking up mouse clicks from all over the world; you need to put them into your databases, you need to count them, you need to tell your advertisers, you need to build statistics and all sorts of other things. How are you going to do this? Kafka really underlies how you do this. What it is, is a big messaging system built on the publish-subscribe model. There are publishers, like the mouse-click feeds that give you all sorts of counts. I was just thinking about the Olympics: how many mouse clicks were there during the Olympics? Just incredible numbers, coming in, being processed; the advertisers get paid, and that's one stream of processed data. The company providing the video streams gets its revenue; that's another set of streams. So this is all organized, and there will be consumers of the data that Kafka is absorbing. So Kafka will be taking the mouse clicks in as they're produced, processing them, and then producing counts and so on as streams to the consumers, in this case the people actually putting the show on or charging for the advertising. So we'll be talking about that. And Kafka leads to this notion of how you actually build streaming systems. One of the things that's interesting about Kafka is that it handles the redundancy: if you've got streams coming in and you want redundant copies of a stream, Kafka will take control of that and do it for you, without you having to worry about how that operates. And the other aspect is that if you get failures, it's going to account for that. It actually uses a log-structured file system to keep the messages in, which is really cool. >> Yeah. >> So we're going to have some excitement. >> I really like Kafka. I think of Kafka as the duct tape of big data. Whenever you really need to connect one thing to another thing, and we have a video on that later in the course, whenever you want to connect two things and you need something between them, you use Kafka. It's kind of like duct tape: it's great, it works, it's reliable, it's fast, and it can handle hundreds of thousands of events per second on a couple of machines. So that actually now gets us into streaming. Right? Streaming is important. >> I've got a lot of PhD students, and they're all off over the summer working with streaming companies. So you've got, for example, all the LinkedIn data flowing: everybody looking for jobs, adding their profiles, adding pictures, adding all sorts of facts about their jobs, and on the other side it all gets distributed. Well, it's all streaming systems, and they're just fabulous. I wonder how many of the audience actually use Facebook; there's another big streaming application. So we don't really need to motivate it, but if we looked back ten years, well, streaming would be ho-hum. But now- >> Now, yeah. So up until maybe 2012, 2013 if you stretch it, everything in the world was batch: companies were happy with batch, consumers were happy with batch. Now, going into 2016, 2017, everybody wants millisecond-level responses. Say I want to find a car: I really want to see the inventory of all the car dealers right now, right? If a car is added to a dealer's inventory, the dealer wants to be able to show me that car right now, not tomorrow, right? >> So what better name for a streaming system than Storm, just because of the way- >> Streaming is taking everything by storm, effectively. >> By storm, yeah. >> So we're going to discuss some of that. We'll give you some examples of using Storm, doing word count, simple stuff, and then we'll go on from there. We're going to look at how Storm gets used in some companies as well. So we're hoping to plug in, you know, how Yahoo uses Storm to look at incoming mouse clicks. >> Right, right. So we take Storm as one of the common stream processing engines these days, and we go into detail on how you can write a program for it and how the engine, the platform itself, runs as a distributed system. So we cover Storm in detail, and then we move on to some of the alternatives to Storm. Right? So again, like we did a couple of times before, now we go back to Spark. >> Right. >> Spark is great. Spark is covering everything these days. >> Yes!
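As a small preview of that, here is a hedged sketch of word count again, this time in Spark Streaming over live data. The socket source is a stand-in for something like a Kafka topic, and a streaming receiver needs at least two local cores.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the spark-shell SparkContext; micro-batches every 5 seconds.
val ssc = new StreamingContext(sc, Seconds(5))

// Hypothetical source: text lines arriving on a local socket
// (e.g. fed by `nc -lk 9999` in another terminal).
val lines = ssc.socketTextStream("localhost", 9999)

// The same word count as the batch version, now over each micro-batch.
lines.flatMap(_.split(" "))
     .map((_, 1))
     .reduceByKey(_ + _)
     .print()

ssc.start()
ssc.awaitTermination()
```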
So, sort of anchoring, and building spouts, and- >> All those good things. We'll discuss all that, and how you'll actually build exactly-once processing. We'll try. >> All right. >> That's going to be an interesting aspect of that whole thing. We're going to look, then, inside some of these systems. So we'll go inside Apache Storm. >> All right. >> I'll open the covers. >> And then we'll look down in there. We're going to look at how you build Storm classes, what they look like. You want to talk about Thrift a bit? >> Yeah, I will definitely show how Thrift works. In Storm, Thrift is kind of a foundational software package, and not only in Storm but in a lot of these other systems too, not just streaming or big data: pretty much anything that is network-related and a distributed system these days might actually be using Thrift. We'll talk about Thrift and how you can use it, and then after that we'll move on to how schedulers work in Storm. >> Tell us about Spark- >> And the ecosystem that surrounds it. >> Yeah, of course. >> Right. >> Of course, after Storm, I talk about Spark Streaming and how Spark Streaming can also be an alternative to Storm. After that, I have an interesting talk about the ecosystem. We've introduced a whole bunch of different tools: we introduced Kafka, we introduced Storm and Spark Streaming. >> Now we're looking to- [CROSSTALK] >> And there's a whole bunch of other things. It might be confusing these days. You might think, okay, should I use Storm? Should I use Spark? Should I use Kafka? And there are some other things that we don't talk about much, at least in this course, but I'll mention them. Should I use NiFi? You might have heard of NiFi. Should I use Druid, right? So where do you need which of these platforms? We have an interesting talk that tells you the strengths of each of these platforms, and the weaknesses. >> And how to plug them together. >> And how to plug them together. How to use the duct tape. [LAUGH] >> Right. That's great. So we're going to move on from there to another great topic that's emerging. >> Okay. This topic sat on your desktop for a long while, and it was very, I want to say, academic; it didn't really grab everyone. But nowadays, with social networking, the complexity in some of these systems has just- >> Become enormous. >> And that's graph processing. So we're going to go through the early days of graph processing, looking at Pregel, which Google introduced to do that. We're going to move on to looking at Giraph, which is the Apache release, and the beauty of that particular system is that I can talk to you about how it works. So then we'll go on to Spark again, Spark GraphX, where you can actually play around with graph processing. >> Yes, so I'll introduce Spark GraphX. It tries to reuse the whole RDD model to process graphs. It's a slightly different model than Pregel and Giraph, but both of them are gaining popularity, so I decided to talk about both.
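To give a flavor of GraphX before that module, here is a minimal sketch: a tiny, made-up follower graph built from RDDs, with PageRank run over it. It assumes a spark-shell session, where `sc` exists and GraphX is on the classpath.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// A tiny hypothetical follower graph: (id, name) vertices, "follows" edges.
val users: RDD[(Long, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"),
                     Edge(3L, 2L, "follows"),
                     Edge(2L, 1L, "follows")))

val graph = Graph(users, follows)

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices

// Join the ranks back to user names and print them.
users.join(ranks).collect().foreach { case (_, (name, rank)) =>
  println(f"$name%-6s $rank%.3f")
}
```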
And then we move on to machine learning. >> Yeah, that's a really cool topic. >> All right. >> Especially when you've got lots of data. >> Yeah. >> So we're going to look at some of the packages that allow you to handle big data. >> Mahout is one of those. It comes out of the Apache Foundation, and you can get to play with it. It's a separate install all by itself, but that doesn't really stop anyone from using it. Now, if you really want to use machine learning in your programming, with lots of other bits and pieces, then what I would suggest is, back to Spark. >> Back to Spark. And Spark, again, has been seeing a lot of activity in its machine learning part. We'll talk about Spark ML itself, and then we have an example of how to use Spark's K-means, for instance, and then probably some other examples that you'll see in the course. >> You're going to cover Naive Bayes and a little bit of FPM, frequent pattern mining, too, right? Just to show how universal it is. After all that, we can wrap it up by looking at deep learning, because I think no course on big data, or at least on cloud applications, is really complete without a little touch of deep learning. >> Right, so deep learning. The deep learning stuff is actually quite interesting, right? It's really exploded in the last two years. I mean, the whole idea was there for 20 years, right? We're talking about artificial neural networks, and you might wonder why we're talking about them here, but this is actually the reason: big data is really the reason that neural networks are now working, compared to a couple of decades ago- >> The idea with deep learning is that you can actually draw parallels between all sorts of different things. You can ask, does this particular event have any relationship to this other type of event over here? You can really do a lot of comparisons. Oftentimes you do need to do training to be able to do this; sometimes you can do it without that. It just depends. We'll discuss some of that topic. The nice thing is that the packages now allow you to do this very easily, and some of the packages we're going to be talking about can run on your laptop, or they can run on really powerful machines. >> Right. >> And they also let you do things like searching images, if you want to do recognition. I was looking at that: you take photographs of yourself, stick them up on Facebook, and it knows who you are, and it knows who you've taken a picture of. And that's all the deep learning- >> That's all the deep learning stuff. >> So we'll be covering that. >> And on the back end it uses- >> Large clusters of GPU-enabled machines that churn through massive amounts of data. >> So I think we've been talking long enough, so what I think we should do is start the course. >> Yeah, yeah. >> So good luck with the course- >> We are excited. >> And we'll see you as you get into the introduction. >> Yes. [MUSIC]