[SOUND] So, we've been looking at big data and how data centers process those big data requests, with examples of the processing engines and the storage involved. Another form of big data that's really important in clouds is how they process streams of data. And that has become an enormous business, driving how organizations like Facebook actually manage their cloud computing services. Stream processing has lots of important features which we will deal with. Imagine you're browsing a web page. What you will probably end up doing is click on an advert, or whatever. You click on a page that has an advert that pops up, or is otherwise displayed. You can imagine the advantage to the service providers if they can count how many adverts you actually get to look at. The companies providing the advert will want to know just how many people were interested in that advert, whether they clicked through, and whether they actually bought the product. So counting all these things has become a big industry. If you see an advert on a page, just seeing it, the systems nowadays will actually try to record that you saw it, and when you saw it. So the user, the advert and its AdId, and the time stamp of when you actually saw it would all be recorded and sent back. So let's say you're clicking through LinkedIn; the advert you see as you browse somebody's page would all be recorded. And that is useful because now the advertisers know that you're actually observing the adverts they've paid for LinkedIn to put on their website. If, however, you got interested in the ad for that new Tesla car, then there will be another ad click recorded as you click on it and go to the website. And that really is a response showing it was a good advert, and that the ad agency did well in placing it there.
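As a rough sketch of the kind of event record just described (the class and field names here are illustrative, not taken from any particular ad system), an impression or click event carrying the user, the AdId, and a timestamp might look like:

```python
from dataclasses import dataclass
import time

# Hypothetical record for one ad event. The field set (user_id, ad_id,
# event_type, timestamp) mirrors the lecture's description; the names
# themselves are assumptions for illustration.
@dataclass(frozen=True)
class AdEvent:
    user_id: str
    ad_id: str
    event_type: str   # "impression" when the ad is shown, "click" when followed
    timestamp: float  # seconds since the epoch, stamped when the event occurred

# One record when the ad is displayed, a second if the user clicks through.
impression = AdEvent("user42", "tesla-model-3", "impression", time.time())
click = AdEvent("user42", "tesla-model-3", "click", time.time())
```

Both records refer to the same ad, so downstream systems can later join impressions to clicks to measure click-through rates.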
And you are an interested potential customer. So again a record will be taken of that and stored away. So you get these mouse clicks from everybody's browsing, and pretty much whenever you visit servers that carry advertising, this is going to happen. And it's going to produce a huge amount of data streaming back to a variety of different companies. So you need to process this to actually find out how effective an ad is, how effective a campaign is, or what the interest is in this particular type of article. And the mechanism for doing that is the data traffic going back about those mouse clicks. How do you process it? Well, it's continuous, and it's mixed up between lots and lots of different users looking through the web and clicking on pages that might have adverts on them. And there's just a huge number of people: there are six or seven billion people on the planet, and a large number of them may be clicking through. And when you have sports events like the Olympics, you can potentially have huge fractions of those folks observing an advert. So all this data comes back. What you would like to do is process it and, in near real time, decide: do you want to place more effort here? Do you want to spend more money on advertising, or do you want to change the adverts? Do you want to identify the places to place adverts? All sorts of decisions can be made off what's occurring. How are the advertising companies going to do that? The answer is they would like to take that continuous stream, look at it, see who the people are that are looking at these adverts, look at the time stamps, look at the adverts themselves, sort them out, and then decide whatever they're going to decide. So they need to process all this data. Typically, the output is going to be a sort of infinite sequence of changes to a derived dataset. It's going to change: how many people are interested in this product?
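That idea of an infinite sequence of changes to a derived dataset can be sketched very simply: consume click events one at a time and emit an updated count per ad after each one. This is a minimal single-machine sketch, not a real distributed engine, and the tuple layout is an assumption:

```python
from collections import Counter

# Minimal sketch: consume a (conceptually unbounded) stream of click events
# and maintain a continuously updated count of clicks per ad. Each yielded
# snapshot is one "change" to the derived dataset the lecture describes.
def count_clicks(events):
    clicks_per_ad = Counter()
    for user_id, ad_id, timestamp in events:  # user_id/timestamp unused here
        clicks_per_ad[ad_id] += 1
        yield dict(clicks_per_ad)

stream = [("u1", "adA", 1.0), ("u2", "adB", 1.1), ("u3", "adA", 1.2)]
snapshots = list(count_clicks(stream))
# final snapshot: {"adA": 2, "adB": 1}
```

A real system would shard this counter across many machines and feed each update to downstream consumers, but the derived-dataset shape is the same.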
How many people are interested in that product? It's going to generate a fan of information, in a sort of producer-consumer way, to a whole bunch of engines in the cloud. And the question for us is just how do you build such systems? You have a high input rate, lots of different events, and they're sequenced; but when you get them they've gone over the networks, so they may not arrive in their original sequence. You've got to sort them out, work out what they relate to, and process them very quickly. And then you want to output data from them that can change the behavior of the advertising company, or whatever, in that near real-time way. So you get diagrams of these stream processing systems that look a bit like this. You have multiple streams coming in. You want to sort out between them what is coming, what is different, and what the times are; sort by time, join the streams together, and work out what counts as a hit on a particular ad. And then send out data based on that, and it all has to keep up. If you have queues inside the system that grow because your processing isn't fast enough, then essentially you're going to increase the latency and slow everything down. It's not going to keep up with real time, so that's a problem. What you want to do is keep it running as fast as you can, so you may want to add extra stream processing units, in which case you may need to do extra joins and all sorts of other things. You need to think about the architecture of stream processing. So that's the topic of the next set of lectures. We're going to discuss different stream processing engines, different ways to construct these streams, different frameworks, and some of the practicalities and problems of doing stream processing. Essentially what you're looking for is low latency. You want to be able to tolerate things that arrive out of order, even though they've got dates on them.
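The out-of-order arrival problem just mentioned can be illustrated with a small sketch: buffer incoming events in a min-heap keyed by timestamp, and only release an event once it is older than some lateness bound. The fixed `max_delay` bound here is an assumption for illustration; real engines use more sophisticated watermark mechanisms:

```python
import heapq

# Sketch: re-order events that arrive out of timestamp order by buffering
# them in a min-heap, releasing an event only once newer arrivals show it
# can no longer be overtaken (within the assumed max_delay bound).
def reorder(events, max_delay):
    heap = []
    for ts, payload in events:
        heapq.heappush(heap, (ts, payload))
        # Release everything old enough that a later event can't precede it.
        while heap and heap[0][0] <= ts - max_delay:
            yield heapq.heappop(heap)
    while heap:  # end of stream: drain whatever remains, in timestamp order
        yield heapq.heappop(heap)

arrived = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
ordered = list(reorder(arrived, max_delay=2))
# ordered timestamps: [1, 2, 3, 4, 5]
```

The trade-off is visible in the code: a larger `max_delay` tolerates more disorder but holds events longer, which is exactly the queueing-versus-latency tension described above.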
You may even get a late arrival; you still want to be able to process it and fit it back in somehow. It should be user friendly: in particular, the person writing the stream query may not be the person processing the actual ad requests, so it has to be easy to operate. And typically people think about SQL for data queries; you would like SQL on streams that gives you results about what's happening out there. You would like it to be scalable, so if you hit a big Olympics event you can accommodate all those clicks. You would like your data to arrive safely, so it's not duplicated or lost. You'd like it to be highly available, so you can instantly see what's going on. And probably there's a whole suite of other requirements too. So these systems that we're talking about are generally built to allow you to do this, and to build systems that maintain those requirements throughout, for processing all of those streams. As you hone in on particular ideas in stream processing, what you'll find is there are some slight changes depending upon the applications and how you're going to process them. There are some issues about scalability: are you looking at millions, or billions, or trillions of events a day, and how do you actually store all those? How do you get a state representation of what is occurring? In particular, you may have multiple products, they may all be getting clicks, and you may want aggregates across them, so you may need to accommodate all of that aggregation as well. So you have multiple parts at the back end. Why is it expensive? Well, you may need to keep intermediate datasets to allow some further processing, perhaps at a lower latency. You might need to be moving large amounts of data across the network from one part of the country to another. There are lots of implications of the way that you organize your system.
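The point about keeping state and aggregating across multiple products can be sketched as a tumbling-window count: bucket events into fixed time windows and keep both per-ad counts and a cross-product total in each window. This is a single-machine sketch under assumed names; the `"_all"` cross-product key is my own convention for illustration:

```python
from collections import defaultdict

# Sketch: tumbling-window aggregation over (timestamp, ad_id) click events.
# Each window holds per-ad counts plus an "_all" total across every product,
# since campaign decisions often need aggregates across many ads at once.
def windowed_counts(events, window_size):
    windows = defaultdict(lambda: defaultdict(int))
    for ts, ad_id in events:
        bucket = int(ts // window_size)  # which fixed window this event falls in
        windows[bucket][ad_id] += 1
        windows[bucket]["_all"] += 1
    return {b: dict(counts) for b, counts in windows.items()}

events = [(0.5, "adA"), (0.9, "adB"), (1.4, "adA"), (1.6, "adA")]
result = windowed_counts(events, window_size=1.0)
# window 0: {"adA": 1, "adB": 1, "_all": 2}; window 1: {"adA": 2, "_all": 2}
```

The per-window state here is exactly the kind of intermediate dataset the lecture says can get expensive: it has to be kept around, and possibly shipped across the network, if later stages want to re-aggregate it.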
All of that has to fit into the framework you're going to build for the stream processing. So next time, what we'll be looking at is the individual components of how we do stream processing in data centers. Thank you. [MUSIC]