About big data


overview

source [MIT Tackling the Challenges of Big Data, 2014]

big data issues

  • volume - too much data
  • velocity - speed at which data presents itself
  • variety - different data sets requiring integration
  • veracity - accuracy and trustworthiness of the data being integrated
  • non-scalable analysis - data that is hard to process for some reason



big data collection

source professor m stonebraker, MIT 'Tackling the Challenges of Big Data' MOOC, 2014 [profile]

overview

… by "big data collection" we mean a collection of techniques for ingesting and integrating data big data …

… as well as using cloud infrastructure to store it …

… big data's messy and vast nature makes it inherently different from working with smaller datasets …

… this requires tackling the problem of how to ingest and integrate disparate data sources together into a single unified data set that can be accessed and queried …

… and how cloud infrastructure can be used to store and process big data …

challenges

… creating a unified schema from multiple data sets, where uniform names and data types are used to describe the concept or entity …

… performing entity resolution, where the same object or entity (i.e., a particular product, customer, or company) referenced in multiple data sets is identified using the same name or identifier in the integrated data set …

… understanding how these problems are different in the era of big data; in particular, how to handle these challenges with hundreds or thousands of data sources and billions of records …

… understanding how big data requires a new, statistical approach to these problems, instead of the traditional, manual intensive approach …

… understanding the key concepts of cloud computing, including elasticity and pay-as-you-go, and to understand the economics of cloud computing …

… understanding why these concepts are particularly attractive for big data, including the ability to quickly harness the power of many computers, and the ability to deal with bursty workloads …

… articulating the benefits of cloud computing, in a big data era …

… understanding the different features of popular cloud platforms, and the level at which they operate (e.g., hosted compute, hosted data, hosted service) …

… understanding the features of popular hosted cloud and big data platforms, including amazon aws (ec2 and s3), as well as hosted services like google bigquery …

data cleaning and integration

data curation

… i'm here today to talk to you about data integration …

… and this is a really complicated topic …

… i will call it data curation, because what you really need to do is ingest data from a data source, usually not written by your team, validate it, make sure it's correct …

… often, you need to make transformations …

… data is invariably dirty, and so you need to clean it ie correct it …

… then you need to consolidate it with any other data sources that you have on-site …

… and often, you want to look at the data when you get done …

… so this whole process is called data curation …

… where did this come from? …

… the roots of data integration, data curation go back to the data warehouse roots, which is the retail sector …

… meaning people like kmart, walmart, those guys-- pioneered data warehouses in the early 1990s …

… and the whole idea was they wanted to get consolidated sales data into a data warehouse …

… and what you wanted to do was be able to let your business intelligence guys, namely your buyers, make better buying decisions …

… so the whole idea was to figure out that pet rocks are out, and barbie dolls are in …

… so you want to send the pet rocks back to the factory, or move them up front and put them on sale, tie up the manufacturer barbie dolls so that nobody else can get any …

… so buying decisions were the focus of early data warehouses …

… the average system that got built in the '90s was 2x over budget and two times late, and it was all because of data integration issues …

… so these were big headaches in the '90s …

… however, data warehouses were a huge success …

… the average data warehouse paid for itself within six months with smarter buying decisions …

… so then, of course, what happened was it's a small world, and essentially everybody else piled on and built data warehouses for their customer facing data, which is products, customers, sales, all that stuff …

… this generated what's come to be called the extract, transform, and load business …

… so etl tools serviced the data warehouse market …

… so the traditional wisdom that dates from the 1990s is that if you want to construct one of these consolidated data warehouses, you send a very smart human off to think about things, and you define the schema, which is the way the data is going to look in the data warehouse …

… then, you assign a programmer to go out and take a close look at each data source, and he has to understand it, figure out what the various fields are, what they mean, how it's formatted, and write the transformations from the local schema, whatever the local data looks like, to whatever you want the global data to look like in the warehouse …

… he has to write cleaning routines, and then you have to run this etl workflow …

… so there's a human with a few tools to help him out, and this scales to maybe 10 data sources or maybe 20, because it's very human intensive …

… so this is traditional etl and where it came from …

… so the architecture is pretty straightforward …

… you have a collection of data sources …

… you have this etl system, which is downstream from all your data sources and upstream from your data warehouse, and it's doing data curation, run by a human, and it's a bunch of tools …

… so let's just take a little look at what some of the issues are …

… why is this stuff hard? why was this 2x over budget? well, suppose you're somebody who sells stuff …

… so you sell widgets, and you have a european and a us subsidiary …

… so one of your sales records is your us subsidiary sold 100k of widgets to ibm incorporated, and your european subsidiary sold 800k euros of m-widgets to ibm sa …

… so of course the issues are you've got to translate currencies, because euros aren't directly comparable to dollars …

… and then you have the thornier questions which is, is ibm sa the same thing as ibm incorporated? yes or no …

… and then, are m-widgets the same thing as widgets? …

… so these are thorny semantic difficulties and transformations, and then, of course, this data may not be clean, and if it's dirty, you'd have to clean it …

… so this stuff is just hard …

… and so what do the traditional etl tools do to help you? …

… well, first of all, they often give you a visual picture of the source schema and the schema you're aiming for, and they give you a line drawing tool so that you can say, here's the thing in the local schema …

… here's an attribute in the local schema, map it to this other attribute in the global schema …

… so you basically get line drawing tools that allow you to line up the attributes, so that helps you a bunch …

… but then you have to do transformations, so all the etl tools give you a scripting language …

… think of it as python, if you want …

… and what you do is that you write python scripts that convert ibm sa to ibm incorporated, if that's what you need to do …

… and as often as not, the etl vendors give you a workflow so that you can define modules, line them up with boxes and arrows, and every box is some code written by a programmer …
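
to make the scripting step concrete, here is a minimal python sketch of the kind of transformation a programmer would write in an etl tool's scripting language; the field names, synonym tables, and exchange rate are illustrative assumptions, not part of the lecture:

# Minimal sketch of an ETL-style transformation script (hypothetical field
# names and rates; real ETL tools wrap this in their own scripting language).

EUR_TO_USD = 1.10          # assumed exchange rate, for illustration only
COMPANY_SYNONYMS = {       # hand-built mapping, as a programmer would write it
    "ibm sa": "ibm incorporated",
    "ibm corp": "ibm incorporated",
}
PRODUCT_SYNONYMS = {"m-widgets": "widgets"}

def transform(record):
    """Convert one local sales record into the warehouse (global) schema."""
    out = dict(record)
    if out.get("currency") == "EUR":                       # currency translation
        out["amount_usd"] = round(out["amount"] * EUR_TO_USD, 2)
    else:
        out["amount_usd"] = out["amount"]
    # entity and product name cleanup
    out["customer"] = COMPANY_SYNONYMS.get(out["customer"].lower(), out["customer"])
    out["product"] = PRODUCT_SYNONYMS.get(out["product"].lower(), out["product"])
    return out

if __name__ == "__main__":
    eu_sale = {"customer": "IBM SA", "product": "m-widgets",
               "amount": 800_000, "currency": "EUR"}
    print(transform(eu_sale))   # -> ibm incorporated / widgets / amount_usd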

current situation

… the current situation, though, i think is very discouraging …

… it's very discouraging, because this technology works great for 10 or 20 or i'll even give you 30 data sources …

… push me hard, twist my arm, i'll give you 50 …

… here's some of the problems …

… enterprises want to integrate more and more and more data sources …

… the business intelligence guys have an insatiable appetite for more data sources …

… so let me give you an example …

… a while ago, i got to make a sales call at miller beer, the guys in milwaukee …

… and they have a traditional data warehouse of sales of beer by brand, by zip code, by distributor, all that kind of stuff …

… so when i got to visit, it was in november …

… and the weather forecasters were predicting an el nino event …

… so el nino, now we know, is a mid-equatorial pacific upwelling of warm water …

… and it screws up the us weather whenever it appears …

… it makes it wetter than normal in the pacific coast and warmer than normal in new england …

… so this november, the weather forecasters were saying el nino is coming …

… so i asked the miller beer guys, do you expect to sell more or less beer if it's wetter? because it's going to be wetter on the west coast …

… do you expect to sell more or less beer if it's warmer? because that's what's going to happen in new england …

… so are beer sales correlated with precipitation or temperature? …

… and the miller guys said, boy, i'd really like to know the answer to that question …

… but of course, weather data wasn't in the data warehouse …

… so this is what causes people to want to get more and more and more data sources …

… also, there is emerging some very non-traditional, non-customer facing examples …

… so novartis, which is down the street here in cambridge, has 8,000 chemists and biologists doing wet experimentation in the lab and writing down their results in lab notebooks …

… so think of these as electronic lab notebooks, 8,000 of them …

… and every scientist gets to do his own thing …

… there is no global schema whatsoever …

… there's no ontology …

… there's no common vocabulary …

… there isn't even any common language, since some of the scientists are in switzerland …

… and they write their results in german …

… so novartis wants to integrate 8,000 spreadsheets …

… why do they want to do this? …

… well, they've got various scientists who are doing experiments that either produce the same gook or start with the same gook and produce something …

… and they want to be able to tie these guys together so that they can collaborate …

… so they want to do social networking at scale 8,000 by putting the data together …

… you cannot do traditional etl at scale 8,000 …

… it just has no chance in hell of ever working …

… also, since the internet took off, there are a whole bunch of web aggregators who are scraping the web for content from a whole bunch of websites and organizing it into a common data warehouse …

… so we'll take a look at a company called goby in a few minutes …

… they are integrating 80,000 urls into a global schema …

… you cannot do etl at scale 80,000 …

… you can't do it at 8,000, let alone 80,000 …

… moreover, fedex has about 5,000 operational data stores …

… probably 20 of them are in their data warehouse …

… what are you going to do with the other 4,980? …

… so what are you going to do with a long tail of sources inside the enterprise? …

… so enterprises are dying to integrate more and more data …

… and their traditional technology simply won't scale …

… it is too human intensive …

… the second thing that's happening fairly recently is the rise of what are called data scientists …

… so every company i know of is trying to hire data scientists …

… there will be a bunch of discussion at various times in this course about data science …

… but they are invariably assigned to point projects …

… suppose you're the brand manager …

… you're in charge of m&ms for mars candy …

… so you want to decide how to do your ad spend for m&ms …

… so maybe you want to buy some keywords on google …

… well, what keywords do you want to buy? …

… so you want to correlate candy eating with various key words …

… so suppose you have a database that does that …

… you then want to integrate that data source with a few others and decide how to do your marketing spend …

… this has nothing to do with a central data warehouse of customer and sales data …

… these are point projects, often in marketing …

… and you have a specific thing you want to do, where you want to integrate three or four data sources and then build some sort of predictive model or correlation …

… so this has come to be called data science …

… and to support data scientists, etl is just way, way too heavy …

… you want something that's lightweight, that doesn't require a programmer …

… you don't want to assign a programmer to every single data scientist …

… so etl, traditionally, won't scale …

… or it's way too heavyweight …

… so this is creating a new problem space, a new sandbox in which humans need tools …

… so that's mostly what we're going to talk about …

… the rest of this module, we're going to go over a curation example that is at scale 80,000 …

… we'll take a look at goby …

… and then we'll look at low-end tools to support individual data scientists …

… and then we'll look at high-end tools that try and support scalable data integration inside the enterprise …

why is data integration hard

… why is data integration so darn hard …

… so at the risk of failing completely, we're going to do a live demo of a system called goby.com …

… which is a web aggregator that i mentioned in the introduction that's aggregating a lot of websites …

… so we'll take a look at how dirty the data that they are stuck with actually is and how daunting the problem is to fix it up …

… so let's dive in to goby.com, and take a look at a real world data integration problem …

… so we're going to take a look at goby.com, which got sold recently …

… so it's now been renamed scout …

… so as i mentioned in the introduction, it aggregates about 80,000 urls in the area of things to do …

… you know, downhill skiing, hot air ballooning, and events …

… things like rock concerts, speeches by the mayor, and that kind of stuff …

… so in goby you can say what are you interested in doing …

… so i'm a fanatic downhill skier …

… so i can go to things to do …

… and i get a category hierarchy …

… i go down here to outdoor recreation …

… and i go down here to skiing and snowboarding …

… and i'm interested in downhill skiing …

… so you're going to a category hierarchy to say what you want to do …

… no one wants to go downhill skiing in boston …

… so we'll try it in vermont …

… so what do i want to do, and where do i want to do it? …

… so i search and i get back an aggregation from 80,000 websites …

… so i can look down here …

… so suicide six is a ski area …

… killington is a ski area …

… so i got a whole bunch of ski areas …

… everything is looking cool …

… and i want to focus you in on mount snow …

… so notice that here's mount snow …

… i can click on mount snow …

… and notice that it has an address of 39 mount snow road in dover, vermont and a phone number of 464-3333 …

… and it's called mount snow …

… and it's presumably a ski area …

… so that's aggregated from one of their sites …

… i continue down here …

… and whoops, i get mount snow again …

… so i click on this one just to see what's going on here …

… so here i get route 100 west dover, vermont …

… so the address is completely different …

… phone number is the same …

… and it's still called mount snow …

… but because the address is different, they don't figure out that these are the same thing …

… so the idea is goby is trying to deduplicate this data …

… they are doing it based on the entity name and the entity address …

… and since the addresses are different, they can't figure out that that's the same thing …

… so when you have the same ski area with two different addresses, it really is a problem …

… now, presumably, one of those addresses is wrong …

… which one, who knows? …

… it's entirely possible that one of them is the maintenance department, and one of them is the snow line …

… who knows? so this is why it's really hard to deduplicate your data because it's really dirty …

… and it's really dirty in unpredictable ways …

… so goby has a huge challenge trying to clean up their data …

… they have an ad hoc system …

… so the whole idea is you'd like to be able to generate tools that would allow them to do a much better job …

… so as we've seen in the goby data looking at the mount snow ski area, data is really, really dirty …

… and sometimes it's not at all obvious how to clean it …

… so let me just give you one other quick example …

… suppose i have some data that shows that two restaurants are at the same address …

… did one go out of business and get replaced by the other? or is this a food court? so cleaning dirty data is really hard …

… and sometimes it's not obvious how to clean it …

… moreover, data transformations may well be a huge problem in addition, although the goby data doesn't illustrate that …

new approaches to data integration – data wrangler

… in the remainder of this presentation, we are going to look at a couple of new ideas on how to help out with data integration challenges …

… you can always use traditional etl, but what we're going to do is look at ideas that you may want to check out that are not traditional etl …

… so, the nice thing is that there's a whole bunch of start-ups in this space, probably a whole bunch more than i know about …

… there's a company called paxata …

… trifacta is a commercial version of data wrangler, which we'll take a look at in a minute …

… there's cambridge semantics; data tamer is a commercial system we'll take a look at in the next segment; and there are a bunch of others …

… so there's a lot of activity in this space …

… and so if you are somebody who has to do data curation, check out some of the start-ups for ideas that may help you out …

… so the focus of these start-ups, and a bunch of others, is at least to support two kinds of activities …

… support for individual data scientists, who have point problems, as i mentioned in the introduction …

… and then support for scalable enterprise data integration …

… so in the rest of this segment, we'll take a look at data wrangler …

… which is oriented toward individual data scientists, and in the next segment we'll take a look at data tamer, which is oriented toward enterprise support …

… so now we'll turn to checking out wrangler …

… wrangler is an interactive system for data cleaning and transformation …

… by combining simple interactions, automatic transform suggestion, and visual transform previews, wrangler helps analysts quickly sculpt the data set into a usable format …

… imagine we are investigating property crime, and download data from the us bureau of justice statistics …

… the data is not in a format our analytic tools expect …

… you copy text from the csv file containing the data …

… paste the data into wrangler, and click wrangle to get started …

… the interface now shows a table containing the data, an interactive transform history, and a transform editor …

… first, let's delete empty rows in the table …

… we select an empty row in the table …

… in response, wrangler generates suggested transforms with natural language descriptions …

… when we mouse over the descriptions, wrangler previews the transform's effect in the table …

… red highlights indicate which rows will be deleted …

… we execute the suggested transform …

… wrangler adds the transform to the transform history …

… we'll now use wrangler's text selection to extract state names from the year column …

… we select the text "alaska" in row 6 …

… wrangler guesses we are extracting text between positions 18 and 24 …

… it highlights matching text in each row, and previews the derived column …

… we can provide more examples that help wrangler generalize our selection …

… we update the suggestions by selecting "arizona" in row 12 …

… we execute the highlighted suggestion, and wrangler adds it to the history …

… we then rename the derived column "state" …

… the state column is sparsely populated …

… missing values are indicated by the gray bar at the top of the column …

… when we click the gray bar, wrangler suggests transforms for missing values …

… we can choose the next suggestion by pressing the down key on the keyboard …

… we can execute the suggestion by pressing enter …

… we now remove the rows containing the text "reported" …

… we select the text "reported" in row 12 …

… wrangler suggests extract, cut, and split transforms, but no delete transforms …

… we can reorder wrangler's rankings by clicking the delete command from the rows menu in the transform editor …

… after executing the command, the data is in a relational format …

… we'll now create a cross tabulation of crime rates by state and year for subsequent graphing in excel …

… we'll use an unfold operation to reshape the data …

… unfold operations are similar to pivots in excel …

… we select the year and property crime rate columns …

… we find an unfold transform in the suggestions …

… for operations that alter the layout, we preview the result with the ghosted overlay …

… we execute the suggested transform …

… we can export the transformed data to analysis tools such as excel …

… or, we can export the transformation itself …

… the output of wrangler is a declarative data cleaning script …

… from this high level script, we can generate code for a variety of runtime platforms …

… on the right, we show generated javascript code …

… to find out more, visit vis.stanford.edu/wrangler …
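
as a rough illustration of the kind of declarative cleaning script wrangler produces, here is a hedged pandas sketch of the same steps (delete empty rows, extract state, fill down, drop "reported" rows, unfold); the file name and column names are assumptions, not the actual bjs export, and this is not wrangler's generated code:

# Rough pandas equivalent of the Wrangler demo steps (assumed column names;
# the real BJS export differs). Wrangler generates this kind of script for you.
import pandas as pd

df = pd.read_csv("property_crime.csv")           # hypothetical file name

df = df.dropna(how="all")                        # 1. delete empty rows
# 2. extract the state name text that appears inside the "year" column
#    (assumed format: the state is the first run of letters in the cell)
df["state"] = df["year"].astype(str).str.extract(r"([A-Za-z][A-Za-z ]+)", expand=False)
df["state"] = df["state"].ffill()                # 3. fill down missing state values
# 4. drop the rows that just say "reported"
df = df[~df["year"].astype(str).str.contains("reported", case=False)]
# 5. unfold (pivot): crime rate by state and year
table = df.pivot(index="state", columns="year", values="property_crime_rate")
table.to_csv("crime_by_state_year.csv")          # export for Excel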

… so what you've seen from data wrangler is a visualization system that lets you look at your data, and some non-programmer tools that allow a non-programmer to massage your data, and transform it into a better representation …

… so the point project's support is mostly about visualization, and about non-programmer transformations …

… so expect a whole bunch more systems in this space …

… google refine is another example of a system that has the same kind of flavor …

… expect them at very low prices, because they are oriented toward supporting individual data scientists …

… and it's real clear to me that this market will be gated by the availability of qualified data scientists …

… so for example, i got to visit a large insurance company, who sells automobile insurance, and they are very interested in putting sensors in all their customers' cars …

… a market that was pioneered probably a decade ago by progressive …

… and so when you put a sensor in somebody's car, you get to record once-a-second data of exactly how they're driving, exactly where they're driving, and at what time of day they're driving …

… now presumably, if you drive in dorchester at 4:00 in the morning, that's way more risky than driving in lexington at 10:00 in the morning …

… so all of a sudden, you're going to be able to do risk analysis on way, way, way more variables …

… and this insurance company said, boy, we really need to move aggressively in this direction, but we are completely gated by the availability of qualified data scientists …

… meaning, people who have the necessary statistical and data management tools …

… so if you're an undergraduate and you want to make sure that you have a really good job when you graduate, train yourself as a data scientist …

enterprise data curation

… now we'll move to the other end of the market …

… supporting enterprises who want to integrate a lot of data sources …

… people like novartis, who has 8,000, or goby, who has 80,000 …

… so what i want to do is tell you about a research prototype called data tamer, which has been commercialized recently, and there's a start-up that is commercializing this software line …

… the whole idea is to do the long tail …

… so when you do the long tail, all you've got to do is do it better, cheaper, faster, than whatever the company is currently doing …

… when you think about novartis, you don't have to do everything …

… you just have to produce an roi against whatever they're doing now …

… now if you think about it for a minute, you have no chance of doing traditional etl …

… it is just too human intensive …

… the only way you stand a chance is by applying machine learning and statistics, and then ask a human if the automatic stuff can't figure it out …

… so it inverts the etl architecture …

… instead of a human with some tools, it's an automatic system that asks a human if it can't figure out what to do …

… so we'll take a look at what data tamer looks like …

… you get a console to decide what you want to do next …

… so you can, if we start on the left, you can ingest the next data source …

… so take whatever the data source currently looks like, and stick it into a local database …

… and since i was responsible for this system, and i'm a database guy, it shouldn't surprise you that all state is in a database …

… so databases are good …

… they store data …

… they don't lose it …

… so when you ingest a new data source, you ingest it into a database system …

… if you've ingested 20 data sources, you then may want to ingest the 21st and compare it against the 20 that you have already seen …

… and build up a global schema automatically …

… so there's a schema integration module that does that …

… and if it can't figure out what to do, it asks a crowd sourcing system to get human help …

… if you've integrated the 21st data source, there may well be some duplicates …

… so you want to do entity consolidation …

… like we saw in the mount snow example earlier in the segment …

… so there's a dedup module that tries to locate duplicate records that describe the same entity and need to be consolidated …

… again, if the dedup module can't figure out what to do, it gets to ask the crowd …

… and then, you want a visualization transformation system along the lines of data wrangler so that a human can do the heavy lifting easily if you have to do transformations …

… so that's the architecture …

… everything is in postgres …

… so databases, as i said, are good things to have …

… so what does ingest look like? …

… so, what you do is, a data source is a collection of comma separated values …

… it's assumed to be in csv format …

… which is kind of universal interaction speak these days …

… and if it's not in csv, then we require you to put it in csv before we can deal with it …

… and we assume that your csv is, in fact, a collection of attribute-name / attribute-value pairs …

… so-called semi-structured data …

… so if we had doug, who's sitting here behind the camera, we might have doug who lives in cambridge …

… so we'd have name = doug, city = cambridge …

… it's a whole bunch of stuff that looks like that …

… that's the only requirement …

… and that's loaded into postgres …
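
a minimal sketch of that ingest step in python with psycopg2, storing each csv row as attribute/value pairs; the table layout and connection string are assumptions, not data tamer's actual schema:

# Minimal sketch of the ingest step: read a CSV and store it as
# (source, row, attribute, value) pairs in Postgres. The table layout and
# connection details are assumptions, not Data Tamer's actual schema.
import csv
import psycopg2

conn = psycopg2.connect("dbname=curation user=postgres")   # assumed DSN
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS raw_pairs (
                   source text, row_id int, attr text, value text)""")

def ingest(source_name, path):
    with open(path, newline="") as f:
        for row_id, row in enumerate(csv.DictReader(f)):
            for attr, value in row.items():                 # attribute/value pairs
                cur.execute(
                    "INSERT INTO raw_pairs VALUES (%s, %s, %s, %s)",
                    (source_name, row_id, attr, value))
    conn.commit()

ingest("source_21", "people.csv")   # e.g. a row like name=doug, city=cambridge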

… so you can ingest a data source …

… and now we have semi-structured data in a database system …

… so now you want to integrate that with what you've already seen …

… now there are three cases that you have to distinguish …

… the first one is, i have no idea what the global schema is …

… and that's true for novartis …

… they have no idea what the global schema is for 8,000 spreadsheets …

… on the other hand, there may be a complete global schema that somebody has already defined …

… doesn't occur very often, but it does occur …

… and then sometimes, like in the case of goby, there's a partial schema …

… goby knows that all of their data has an entity name and has an entity address …

… and if their records don't have that, they figure out how to make that true …

… so you have to distinguish those three cases …

… and what data tamer simply does is it starts integrating data …

… so you give it the first two sources, and it tries to integrate the two of them …

… it won't be very smart …

… but as you integrate more and more data sources, it gets smarter and smarter …

… so the system gets better over time …

… and so you can say, there are some synonyms …

… so wages can be defined as the same thing as salary …

… you can have templates …

… which is to say usually addresses are a city name, a state name, a zip code, a number, and the street name …

… and you can have authoritative tables …

… so for example, airport codes …

… boston happens to have the airport code bos …

… so if you have authoritative tables, we can take advantage of that …

… so there'll be extra information …

… and we'll take advantage of that …
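
a tiny sketch of how that extra information (synonyms, authoritative tables) could be consulted when matching an incoming attribute; the entries are illustrative only:

# Tiny sketch of the "extra information" the matcher can exploit
# (illustrative entries only).
SYNONYMS = {"wages": "salary"}                        # attribute-name synonyms
AIRPORT_CODES = {"bos": "boston", "jfk": "new york"}  # an authoritative table

def normalize_attr(name):
    """Map an incoming attribute name through known synonyms."""
    return SYNONYMS.get(name.lower(), name.lower())

def looks_like_airport_code(values):
    """Score a column against the authoritative table of airport codes."""
    hits = sum(1 for v in values if str(v).lower() in AIRPORT_CODES)
    return hits / max(len(values), 1)

print(normalize_attr("Wages"))                        # -> salary
print(looks_like_airport_code(["BOS", "JFK", "xyz"])) # -> ~0.67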

… the inner loop of schema integration is: i'm on the 21st data source …

… i get in some attribute names …

… remember doug has a name, and doug has an address …

… i get in names of attributes …

… if i've seen them before, then they might well match …

… so i need to check the incoming attributes against everything i've seen before …

… and what happens is, there's a collection of heuristic experts that try and do that …

… so if i have a data column in my global schema called human names, and in comes this new data source that has a bunch of names in it …

… so i can say, well do the incoming data-- does it look like what i've seen before? …

… if the data is numeric i can run a t-test on the data …

… and that'll give me a score of how similar the new data is to what i have …

… i can treat the entire column that i know about as a document …

… treat the incoming data column as a document …

… see how similar those two documents are if the data is text …

… so the inner loop is, i have a global schema that i'm building up, and in comes the i+1st data source …

… and it's got a collection of attribute names and some values …

… so what i want to do is take every new name that i see and see if it matches something that i know about …

… so we have a collection of heuristics …

… and we can call them experts …

… the first thing you can do is, we can take the new attribute name, and do a cosine similarity on any attribute name we know about …

… if i have m widgets and i have widgets like we saw in the introduction they'll have a high cosine similarity …

… because one is essentially one letter less than the other …

… and so we run cosine similarity on attribute names …

… in addition, we have all the data …

… so we have a column of data that we're building up in this global schema …

… and we have a new column of data …

… so if the data is numeric, we can run a t-test to see if the data statistically looks like the data in the global column …

… and again, that will give us another score for the heuristic likelihood that these are the same thing …

… and if the data is textual, we can treat the entire column as a document …

… now we have two documents …

… we can run cosine similarity on the two documents …

… you get a bunch of scores …

… you heuristically combine them …
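
a sketch of those heuristic experts in python: cosine similarity on attribute names, cosine similarity of columns treated as documents, and a t-test (via scipy) for numeric columns; the 50/50 weighting and the scoring details are assumptions, not data tamer's actual algorithm:

# Sketch of the "expert" heuristics described above. Weights are illustrative.
from collections import Counter
import math
from scipy import stats   # used here for the two-sample t-test

def cosine(a, b):
    """Cosine similarity between two bags of tokens (characters or words)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def name_expert(attr_a, attr_b):
    return cosine(attr_a, attr_b)              # character-level name similarity

def text_column_expert(col_a, col_b):
    doc_a = " ".join(map(str, col_a)).split()  # treat each column as a document
    doc_b = " ".join(map(str, col_b)).split()
    return cosine(doc_a, doc_b)

def numeric_column_expert(col_a, col_b):
    # t-test p-value: a high p-value means we cannot distinguish the columns
    _, p = stats.ttest_ind(col_a, col_b, equal_var=False)
    return p

def combined_score(attr_a, col_a, attr_b, col_b, numeric=False):
    col_expert = numeric_column_expert if numeric else text_column_expert
    # illustrative 50/50 weighting of the two experts
    return 0.5 * name_expert(attr_a, attr_b) + 0.5 * col_expert(col_a, col_b)

print(name_expert("widgets", "m-widgets"))     # ~0.88, high, as in the example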

… and the nice thing is that after modest training, we've tried this out on novartis' 8,000 spreadsheets, and we get 90% of the matching attributes correct without any human intervention …

… we also get more than 90% of the goby attributes correct without any human intervention …

… this cuts your human labor dramatically …

… because we can get a lot of stuff done automatically …

… without having to bother a human …

… what happens if the automatic algorithms aren't very sure? well, you've got to ask the crowd …

… now if you have biology, genomics, and chemistry names, the last thing you want to do is farm out questions to somebody like mechanical turk, because that means you'll be asking grandmothers in cedar rapids, iowa …

… and they will be clueless about chemistry jargon …

… you are going to ask people inside the enterprise …

… and inside the enterprise, there is a hierarchy of experts …

… in fact, novartis' uber expert is a guy named wolfgang who is in basel …

… he knows more than anybody else about genomics data …

… so you have a hierarchy of experts with specialization …

… so some people are good at genomics …

… some people are good at chemistry …

… and so forth …

… so you start asking these people when you don't know the answer …

… and you ask every question to at least two people …

… and the way we do it is, in the hierarchy you ask the question to one person and at least one person who's allegedly smarter than that person …

… and so that means that if they get the same answer you raise the expertise of the lower guy …

… if they differ then you lower the expertise …

… so we have algorithms to adjust the expertise ratings of the experts …
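
a minimal sketch of that adjustment rule; the step size and the rating scale are assumptions:

# Sketch of the expertise-adjustment idea: every question goes to a worker and
# someone ranked above them; agreement raises the junior worker's rating,
# disagreement lowers it. The update step size is an assumption.
STEP = 0.05

def update_expertise(ratings, junior, senior, junior_answer, senior_answer):
    """ratings: dict mapping worker name -> expertise score in [0, 1]."""
    if junior_answer == senior_answer:
        ratings[junior] = min(1.0, ratings[junior] + STEP)   # raise the junior
    else:
        ratings[junior] = max(0.0, ratings[junior] - STEP)   # lower the junior
    return ratings

ratings = {"wolfgang": 0.95, "alice": 0.60}
update_expertise(ratings, "alice", "wolfgang", "same entity", "same entity")
print(round(ratings["alice"], 2))   # 0.65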

… and of course, the problem is that if you want the answer to any of these crowd questions, you just ask every question to wolfgang …

… and that gets you the best answer …

… but of course, wolfgang has other things to do …

… other than answer questions …

… so you need a marketplace to perform load balancing …

… and so the idea is questions have a cost …

… and you have a budget …

… and you try and either get the maximum certainty at a fixed budget, or you have a fixed certainty that you're aiming for …

… and you want to get the cheapest cost of getting that certainty …

… so we're currently doing a large scale evaluation on novartis' actual 8,000-spreadsheet database …

… and guess what …

… it works …

… if you're asking people randomly-- random allocation across the crowd-- that will, of course, work in some domains …

… but if you know about people's expertise, you of course can do better than the crowd …

… so enterprise crowd sourcing is really a bunch of experts …

… and it works better if you take advantage of who's an expert …

heuristics

… so if you want to look at how the system really works, what ends up happening is all of these heuristics produce a score for how certain the automatic algorithms are …

… that the thing on the left is the same as the thing on the right …

… you can set your threshold to be whatever you want …

… if you set the threshold really high, then you'll ask a human a lot …

… but you'll get really, really good answers …

… if you set it low, it'll cost you a lot less money, but you'll get less certain answers …

… so you can adjust how much money you're willing to spend, versus how good the answer is …

… if you set the threshold to be the heuristic score of one, then what happens is everything above the line is automatically mapped …

… everything below the line is sent to the crowd …

… so, in this case, five questions get sent to the crowd, and something like eight or nine of them get automatically mapped …

… so that gives you a sense for how schema integration works automatically, as much as possible, with human intervention, when necessary …
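
a small sketch of that threshold rule; the scores and the threshold value are made up for illustration:

# Sketch of the threshold rule: matches scored above the threshold are mapped
# automatically; everything below goes to the crowd. Scores are made up.
def route(candidate_matches, threshold=1.0):
    auto, crowd = [], []
    for pair, score in candidate_matches:
        (auto if score >= threshold else crowd).append(pair)
    return auto, crowd

scores = [(("wages", "salary"), 1.4), (("m-widgets", "widgets"), 1.2),
          (("dept", "division"), 0.6)]
auto, crowd = route(scores, threshold=1.0)
print(len(auto), "mapped automatically;", len(crowd), "sent to the crowd")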

… so now, if you add the 21st data source, it may have some duplicates with the 20 that you've already seen …

… as we saw a couple of segments ago, there were a whole bunch of mount snow records …

… so you want to be able to do consolidation when the entities are the same thing …

… so the schema integration module constructs a collection of tables …

… so there is a global schema being built up, and it is populated by records …

… so what you want to do is see if you have duplicate records in any of these tables …

… now goby turns out to match only on entity name and entity address …

… but, of course, you can do much better by taking all the values into account …

… so you can do weighted clustering on all the attributes that you know about …

… and the reason to weight things is, if you have an attribute called sex; well if two people are males, that gives you very little certainty that they're the same person …

… on the other hand, ski areas almost always have vertical drop as one of their attributes …

… if two ski areas have the identical vertical drop, chances are they're the same ski area regardless of what they're called …

… so basically we solve the data clustering problem in n space, meaning all n-- all the attributes that we know about …

… it's an n squared calculation, where n is the number of records …

… if you get to scale, you don't like to do n squared calculations …

… that really takes a while …

… so as a first pass to try and knock down the complexity-- and, by the way, this works very well …

… it works wildly better than goby …

… we also looked at some data from an insurance aggregator called verisk …

… and they currently have a domain-specific deduplication system …

… and we do just a bit better than their domain specific system, with a domain independent system …

… so automatic algorithms are getting pretty darn good, and they can be-- they're entirely horizontal …
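
a sketch of the weighted, all-attribute matching idea using the mount snow example; the weights and attribute set are illustrative, and a real system would also need blocking to avoid the n-squared cost:

# Sketch of weighted record matching across all attributes: rare agreements
# (e.g. identical vertical drop) count for more than common ones (e.g. sex).
# Weights and records are illustrative.
WEIGHTS = {"name": 0.3, "address": 0.2, "phone": 0.3, "vertical_drop": 0.2}

def match_score(r1, r2):
    score = 0.0
    for attr, w in WEIGHTS.items():
        if attr in r1 and attr in r2 and r1[attr] == r2[attr]:
            score += w
    return score

a = {"name": "mount snow", "address": "39 mount snow rd",
     "phone": "464-3333", "vertical_drop": 1700}
b = {"name": "mount snow", "address": "route 100",
     "phone": "464-3333", "vertical_drop": 1700}
print(round(match_score(a, b), 2))   # 0.8 -> likely the same ski area despite the address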

… so what does this stuff look like? well, we produce a collection of clusters …

… at the end of this n squared algorithm, we have a collection of clusters that we think are the same record …

… so the brown bubbles are individual records, the blue circles are clusters …

… and so, you can say, here's a particular entity that we think is the same thing, that we think has five records …

… it's a thing whose entity name is craft farm …

… there are a whole bunch of individual records from various sources …

… you can, in fact, select a specific record and blow it up …

… and you could also look at our edge to edge-- our record to record strength of clustering …

… so we give you a little visualization system to check out our automatic system …

… and, again, if our automatic system's confidence is strong enough-- above the threshold you set-- a human will never be asked …

… otherwise, this will go to the crowd to be resolved …

… so where's data tamer going? well, first of all, not all data is structured …

… so you want to be able to integrate text with structured data …

… almost everybody wants to do that …

… and we're working on that …

… second thing is, if you integrate, say, an oracle database with another oracle database …

… well, oracle knows about foreign key primary key relationships …

… so, right now, data tamer doesn't understand relationships, but we have to move our algorithms in that direction …

… some people are really interested in hierarchical data …

… which is purchase orders have line items in them …

… it's very hierarchical …

… we may extend data tamer in that direction …

… input adapters are a big problem …

… which is, right now you've got to get your data into csv format …

… and we want to make that easier by having some way to generate adapters quicker …

… and, of course, we're working hard on algorithms to make the automatic algorithms better …

… and then, if you want to integrate data wrangler, or google refine, or some other tool; we want to be able to have user defined operations, not just the ones that we invented …

… so we are hard at work making data tamer better …

… and this gives you a sense for the direction you have to go if you want to tackle the long tail of data integration …

… so this is the end of the segment on enterprise data integration …

summary

… this module has talked about data curation …

… we went through the history of where it came from, from the data warehouse market …

… we had some issues on why it was so hard …

… and we looked at some new ideas, both for individual data scientists, and for enterprise data integration …

… so i want to just give you a few concluding thoughts here …

… every enterprise that i talked to wants to integrate more and more data sources …

… and when you ask a cto or a cio what's his biggest problem, and he works for a big enterprise, chances are he's going to say data integration …

… this is the number one headache of most everybody …

… the problem is etl won't scale …

… it's too expensive to do it manually with a programmer …

… the people who are doing web aggregation are really into this in spades …

… so for instance, i was talking to the cto of groupon …

… and groupon is trying to construct a global database of small businesses …

… they want to integrate 10,000 data sources …

… so scale thousands is getting to be routine …

… and it's really hard …

… it remains to be seen what fraction of this market can be helped by automatic tools …

… data tamer-style tools …

… the initial results are encouraging that that's the way you have to go …

… there's no chance of doing it with traditional etl …

… cleaning your data is going to be a huge issue forever …

… novartis's point of view is that they have a bad data cleaning problem …

… you've got to push it back toward your data sources …

… get the originators of the data to get involved in cleaning …

… because otherwise, once it gets way downstream, it gets harder and harder and harder to do it …

… and the odds that you can get your data to be 100% clean are nearly zero …

… so you've got to be able to deal with data that is just too expensive to clean …

… one of the things, a long time ago, i got to visit the state of montana …

… the department that issued checks for people on welfare and people with disabilities, and the administrator said, well we are up to the point where we're issuing-- where 50% of the checks we cut are correct …

… the other 50% are wrong …

… that's a little bit low for data quality …

… but you're going to have to live with-- there's some data that may be too expensive to clean …

… data transformations, as we saw in the data wrangler demo, are a big issue …

… you've got to figure out how to do them cheaper …

… non-programmer tools are one way to go …

… another way to go is, i can imagine a big database of transformations …

… maybe it's on google …

… maybe it's somewhere else on the web …

… but every time i hear a programmer write a transformation he always says, somebody has written this before …

… why do i have to redo it? it's just right now too hard to find …

… so it's an interesting search problem to search for code that you may be able to reuse …

… this whole issue is going to get gated by data scientists …

… the average application that i hear about is, you want to be able to correlate x with y …

… so you go out and you find the data …

… and the average data scientist spends 80% of his time getting, extracting, and cleaning his data …

… 20% of his time doing data science …

… so data scientists are going to have to get very familiar with this kind of stuff …

… because they are being asked to do stats, data management, and data integration …

… so data scientists, get familiar with this stuff …

… there's a lot of activity in this area …

… lot of start-ups …

… it's a very active research area …

… and hold onto your seat belt …

… without a doubt, better stuff will be coming in the future …

hosted data platforms and cloud computing

source prof matei zaharia [profile]

overview

… i'm going to talk about hosted data platforms and cloud computing …

… so what is cloud computing? …

… cloud computing has been in the press a lot lately, and there are many different definitions floating around and many services that claim to be clouds …

… but at a high level, what cloud computing means is computing resources that are available on demand …

… by computing resources, i mean a wide range of resources …

… they can be just storage or computing cycles …

… but they can also be higher level software built on top …

… so for example, several providers today offer databases as a hosted service …

… so the provider will set up and manage a database for you and do all the administration …

… and as a user, you directly see and use a database …

… now the "on demand" aspect means that resources are fast to set up and fast to give away or tear down …

… and it means that pricing is usually pay as you go …

… what i mean by that is that you pay at a small granularity …

… if you only use your hosted database for a day, you only pay for a day …

… and often with these services, you pay at a granularity as small as an hour …

… for big data workloads, in particular, clouds can be attractive for several reasons …

… so the first one is that clouds provide easy access to large scale infrastructure that would otherwise be very hard to set up and operate in-house …

… for example, if you need to store 100 terabytes of data, or if you need to launch 100 servers to do a computation, this can be quite a bit of setup, quite a bit of administrative overhead to do in-house …

… but with a cloud, it's possible to pay with your credit card and have access to these resources right away and start trying out a computation and seeing if it would be useful …

… the second reason that clouds are attractive for big data is that big data workloads are often bursty …

… and so they benefit quite a bit from the pay-as-you-go model …

… what i mean by that is, when you collect a large amount of data, you're not usually just continuously doing computations on it …

… instead, maybe once in a while, you have a large computation that you want done …

… and with the cloud model, you can acquire a bunch of computing resources for just that computation, and then give them away and only pay for the time that you use them …

… so because of these properties, cloud computing has seen a major growth in the past few years, in actually almost all software domains, and certainly in big data …

… let's talk about some examples of cloud services …

… our cloud services actually exist at multiple different levels …

… at the lowest level, you have just raw storage or computing cycles and a variety of services that offer that today …

… the most well-known are amazon s3, or simple storage service, and ec2, elastic compute cloud, which lets you store bytes and launch virtual servers to do computations …

… but since amazon began these services, quite a few other providers are offering similar ones as well, including google's compute engine, windows azure from microsoft, and rackspace …
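
as a concrete flavor of "storage plus compute on demand", here is a minimal boto3 sketch against s3 and ec2; the bucket name, key, ami id, and instance type are placeholders, and credentials would need to be configured before this runs:

# Minimal boto3 sketch of on-demand storage and compute on AWS.
# Bucket name, key, AMI id, and instance type are placeholders, and the
# account needs credentials configured before this would run.
import boto3

s3 = boto3.client("s3")
s3.upload_file("sales.csv", "my-bigdata-bucket", "raw/sales.csv")   # store bytes

ec2 = boto3.resource("ec2")
instances = ec2.create_instances(                                   # launch a server
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.large",
    MinCount=1, MaxCount=1)
print("launched", instances[0].id)
# ... run the computation, then release the resource (pay-as-you-go):
instances[0].terminate()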

… at the next level up, you have hosted services that provide a higher level piece of software and just host and manage it for you …

… and a great example of that is amazon's relational database service, or rds, which offers hosted versions of standard database software, including, for example, mysql or oracle …

… now normally, managing and hosting a database isn't entirely easy …

… you need to make sure the database is up and running, highly available …

… you need to make sure that you're setting up and taking backups …

… and you may even want to replicate the data across data centers for disaster recovery in case one of the data centers goes down …

… with the database hosted on amazon, amazon manages all these properties for you …

… and you can go ahead and use a database …

… other examples of this include google's bigquery, which lets you run sql queries on google's distributed infrastructure …

… and amazon redshift, which is a hosted analytical database similar to the analytical databases you'll see at other parts of this course …

… and finally, at the highest level, you have entire hosted applications which are directly accessed by end users …

… so one example of that is salesforce, which provides a variety of enterprise software such as a customer relationship management software just hosted so that business users can directly use it …

… splunk is a software for analyzing log files from servers …

… and there's a version of splunk that runs in the cloud so that you just point logs from your servers to it …

… and then splunk will analyze them and show you interesting graphs and interesting things that are going on …

… and finally, tableau is a visualization software that lets you quickly explore and slice through data …

… and there's a hosted version of tableau that is managed by the tableau company itself …

cloud computing – benefits + challenges

… let's start talking about the benefits for users …

… first benefit for users is fast deployment …

… using a cloud, computations can start in minutes without a long set-up period …

… and this can be a great way to quickly get started with a new type of computation …

… normally, setting up hardware and then setting up software on it in-house can take months …

… and using a cloud, it's possible to just quickly get started, try something out, and see if it works …

… the second benefit is outsourcing management …

… so the cloud provider handles system administration, it handles reliability and disaster recovery, and it handles security of the hosted services …

… and these are all problems that usually require quite a bit of commitment and resources in-house …

… so for an organization that doesn't want to specialize in these, it may make sense to outsource them …

… the third benefit can be lower costs …

… there are two reasons why costs might be lower from the cloud …

… first of all, users of the cloud will benefit from the economies of scale of the provider …

… this means that the provider, by being a large purchaser of resources, can often acquire lower prices, or it can share expertise …

… for example, of administration or security staff among many customers, and be able to offer them cheaper than a customer would be able to do by themselves …

… that's one reason to get lower costs …

… the second reason is the pay-as-you-go model …

… so if you're not using the resource all the time, using the cloud, you can pay for it just in the times when you use it …

… whereas if you had to purchase it yourself, it's sitting there idle and you have, in some sense, wasted money …

… and finally, the fourth benefit to users is elasticity …

… using a cloud service, it's easy to acquire large amounts of infrastructure for a short period of time and then give them away …

… whereas in-house, you'd have to acquire probably a smaller amount to hold onto, and then do all the computations on that …

… so these are the benefits for users, but the cloud model also has some benefits for providers …

… and this is why more and more providers are going for this model …

… the first benefit for providers is economies of scale …

… so because the provider is servicing multiple customers, they can share the expertise and the resources they purchase across these customers, and they're able to attain lower costs because of that …

… so for example, a provider might hire a system administrator only once, or a security expert only once, and have them administer or secure the resources of many different customers …

… and so that cost gets spread out across them …

… then the second reason why economies of scale matter is that the provider, just by virtue of being bigger, might be able to acquire resources at a lower price …

… for example, a provider who needs to buy power might build a data center right next to the power plant and get a much lower rate on that than most organizations would be able to do …

… second reason why providers are going for the cloud model is fast deployment …

… traditional software sales cycles are very long …

… it takes months to set up an agreement, set up a long-term deal with a customer, and then wait for them to have the hardware in-house to actually run the software …

… and in contrast with the cloud, the provider can get customers right away, because they can quickly sign up for the service and get started, and can immediately get revenue or feedback on the software …

… in the same way, when a provider builds a new feature, they just need to launch it on the existing services and it immediately reaches users, without waiting for users to buy the next version of the software …

… so the final reason is optimization across users …

… because a provider sees the workloads of many users at the same time, they're able to take lessons from the workloads and apply optimizations that might be harder for a traditional software vendor to do …

… when the traditional vendor just gives the software to the user and never looks at it …

… for example, amazon, with its hosted database service, might see how many different users use databases and come up with optimizations that will make these faster or more efficient because they're able to actually see the complete workload …

… so finally, for big data workloads in particular, clouds have three benefits that are specific to the nature of the workload …

… first, cloud provides easy access to reliable distributed storage, which can often be quite difficult or expensive to set up in-house …

… often, a challenge with collecting big data sets is making sure that you can actually preserve and archive the data over time and still be able to access it later …

… usually, this requires replicating it to multiple geographic sites and making sure those sites are in sync and so on …

… so using a cloud service, the provider is already doing this for you, and you don't need to set up these different data centers and set up this infrastructure on your own …

… second benefit is elasticity …

… a lot of big data computations are highly parallel, and using the elasticity of the cloud, you can easily acquire a few hundred servers at once, do the computation, and then an hour later, turn the servers off …

… and you've only paid for that one computation and been able to get the answer back quickly …

… so the final reason is data sharing across tenants …

… large data sets are actually expensive to move across the internet, and can also take quite a bit of time …

… so using the cloud, many tenants that are sitting in the same cluster in the same data center can very quickly share data …

… one example of this is the public data sets provided by amazon …

… amazon collects a number of interesting scientific and government data sets and lets different users access them as soon as they launch machines in amazon's cloud …

… and usually, bringing these data sets in-house would take quite a bit of time for an organization running in-house …

… at the same time, clouds also have several challenges, many of them specific to the big data domain …

… the first one is security and privacy …

… so because you are outsourcing computation to a third party, it's harder to guarantee with a cloud that the data will remain secure and that no one else will get to look at it …

… and this is especially problematic in applications where legislation requires you to have strict controls over who accesses the data …

… so these can often be a challenge for the cloud …

… the second problem is data import and export …

… as i mentioned before, big data sets are expensive to move across the internet …

… but if you're going to start collecting your large data sets and processing them in a cloud, you'll need to move data into there, and this can often be time consuming or expensive …

… and finally, the third risk with a cloud model is lock-in …

… because you've put all your data in a provider, it's expensive to move out …

… and because you might be relying on services and applications that only that provider hosts, it might be hard to migrate to a different provider if you need to in the future …

… so this is certainly something to take into account when considering the cloud …

… so, i'm going to start with a more detailed look at cloud economics …

… saying, where do the cost benefits of the cloud come from, and in what cases, what kind of workloads does it make sense to use the cloud? …

… next, i'm going to go into different types of cloud services, cover some of the services that are out there today …

… and also look at different dimensions that one might compare cloud services on to make a decision …

… and finally, i'm going to go into more detail on the challenges of the cloud model …

… including security and lock-in, and steps one can take to address these challenges …

cloud economics

… when does the cloud make economic sense? …

… so i'm going to talk about three cases, or three reasons, compared to traditional on-site hosting when using a cloud may make sense …

… these are variable utilization, economies of scale, and something i'm going to call cost associativity, which is about acquiring more resources for a shorter period of time …

… so let's talk about variable utilization …

… the issue here is that with on-site hosting, you must provision resources for the peak load that you anticipate during the day, even though in practice much of the time the load might be lower …

… for example, imagine you're purchasing a cluster of machines to do computation …

… you have to think about what will be the biggest computation it will have to do when the most users will be using the cluster …

… and that might happen, say, once a day …

… and during the rest of the day, parts of the cluster will be idle …

… in fact, most resources in practice follow a cyclical usage pattern, where there's quite a bit of variation at different times of the day …

… usually in the morning, there's no one in …

… resources are at a low load …

… and then as the day goes on, they ramp up to a peak …

… and then they ramp back down again as workers go away or go home …

… so you end up with this cycle where load rises to a peak and falls off naturally as users employ the resources …

… but with static provisioning, you have to choose a provisioning level in advance, as shown by the orange bar here, that's just at the peak you anticipate during the day …

… and as a result, there are these periods in between where you have the resources, but they're just sitting idle …

… and they are essentially wasted …

… you paid for them, but you're not actually using them …

… this variability during the day is one reason that you might have a problem …

… but the second problem happens because the provisioning has to be done upfront …

… so there's a significant risk of over- or under-provisioning …

… by over-provisioning, i mean that the organization chose a resource level that's higher than what they actually need …

… for example, here, we chose this level …

… but actually the load is just down here …

… and as a result, there's quite a bit more wastage of resources compared to the one above …

… in this case, more than half of the resources are maybe sitting idle on average …

… under-provisioning can also be a problem …

… and under-provisioning just means that the level you chose is too low …

… so the issue now is that there are peak times during the day when you don't have enough resources to sustain the computations …

… and this might mean that the computations slow down or that user applications have to wait in a queue to run …

… and so user experience is worse …

… or if these are customer-facing applications, it might mean that the customers actually move away because they're not getting great, consistent performance …

… so in fact, most organizations are especially worried about this case of under-provisioning …

… so they'll err on the side of caution and over-provision, and then end up with this kind of wastage of resources shown on the left …

… so with the cloud model, it's possible to ramp resources up and down at a much finer granularity …

… because clouds usually charge for them at a much finer granularity …

… for example, the most common one today is granularity of one hour …

… so using our cloud, for example, imagine these servers here were servers from amazon instead of servers that you had to purchase on-site …

… it's possible to ramp up the usage during the day, so add on more servers as the load goes up …

… and then ramp it back down when the load goes down …

… and overall have a much closer tracking of the resources to the actual load and much less wasted resources in these little orange spaces compared to the previous one …

… and with amazon, in particular, the steps by which you ramp up and down can be as small as one hour …

… so this means that even if the cloud provider charges higher hourly rates than it would cost you to own those resources in-house …

… it might actually be worth it because a lower percent of the resources you purchased are wasted …

… so overall you might actually still be gaining something compared to buying these servers on-site …
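
to make that break-even reasoning concrete, here is a small python sketch; the load profile and both hourly prices are made-up numbers for illustration, not real aws or hardware costs:

  # illustrative comparison of static on-site provisioning vs hourly cloud pricing
  # under a cyclical daily load; all numbers are assumptions, not real prices
  hourly_load = [2, 2, 2, 4, 8, 12, 16, 18, 20, 18, 14, 10,
                 8, 12, 16, 20, 18, 14, 10, 6, 4, 3, 2, 2]   # servers needed each hour

  onsite_cost_per_server_hour = 0.30   # assumed amortized cost of an owned server
  cloud_price_per_server_hour = 0.50   # assumed (higher) cloud hourly price

  peak = max(hourly_load)
  onsite_per_day = peak * 24 * onsite_cost_per_server_hour            # provisioned for the peak all day
  cloud_per_day = sum(hourly_load) * cloud_price_per_server_hour      # pay only for hours actually used

  print(f"on-site, provisioned for peak of {peak}: ${onsite_per_day:.2f} per day")
  print(f"cloud, pay per server-hour: ${cloud_per_day:.2f} per day")

with these particular numbers the cloud comes out cheaper per day despite the higher hourly rate, because so few of the statically provisioned server-hours are actually used.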

… a question you might be asking, seeing this load going up and down and the risks of over- and under-provisioning, is: how come the provider can do this better than individual users? …

… after all, amazon, itself, or google or microsoft also has to buy servers …

… it also has to provision for a certain peak load …

… and it also has to deal with changes in utilization during the day …

… so there are two reasons why providers can actually do this better and actually pass on these benefits in this hourly pricing that i was talking about before …

… the first one is called statistical multiplexing …

… it's a fancy way of saying that if you have different workloads that are variable, and they're peaking at different times, the sum of them together might be less variable than the individual parts …

… so in this little diagram here, we have four different cloud users …

… the top one has this green wavy pattern that we saw before …

… this one here maybe has a smaller time period …

… it's a bit spikier …

… this one here may be spread out over a longer time period …

… and maybe these are also spread out geographically, so they have peaks at different times during the day in different areas of the world …

… so when you add up the load from all these users, the sum of them ends up having much lower variance relative to its mean …

… and so the provider has a much more predictable usage pattern that they can deal with …

… and this means that a large provider like amazon or google can very easily know how many resources they'll need in the future, and even during the day, and be able to offer them efficiently …
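
a minimal python simulation of that statistical multiplexing effect, with made-up tenants whose peaks are spread across the day; the load shapes and noise levels are arbitrary assumptions:

  # several bursty tenants peaking at different hours add up to a smoother aggregate
  import math
  import random

  random.seed(0)
  HOURS = 24

  def tenant_load(peak_hour, scale):
      """One tenant's hourly load: a daily cycle peaking at peak_hour, plus noise."""
      return [max(0.0, scale * (1 + math.cos(2 * math.pi * (h - peak_hour) / HOURS))
                  + random.gauss(0, 0.3 * scale))
              for h in range(HOURS)]

  tenants = [tenant_load(peak_hour=p, scale=10) for p in (3, 9, 15, 21)]
  total = [sum(loads) for loads in zip(*tenants)]

  def peak_to_mean(series):
      return max(series) / (sum(series) / len(series))

  for i, t in enumerate(tenants):
      print(f"tenant {i}: peak/mean = {peak_to_mean(t):.2f}")
  print(f"aggregate: peak/mean = {peak_to_mean(total):.2f}")   # noticeably lower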

… and finally, the second reason that some of these large providers can do it is that they already have large numbers of internal applications that need compute resources …

… and if cloud customers aren't using them, they can simply use them for these internal applications …

… so for example, both amazon and google need to run large internet services …

… and they need to also do a bunch of lower priority batch computations on the back end, things like analyzing the data or training learning models or things like that …

… and so when someone isn't using the cloud resources, unlike an on-site provider who has bought them and would just have them sitting idle, these companies can actually use them for their internal workload …

… so they're not actually wasting them in the same sense that an on-site provider would …

… and this is why it's no accident that some of these companies were the first ones to start offering cloud …

… because they saw that their own usage pattern was bursty, and they wanted to do something with the gaps in their resource usage …

economies of scale

… the second aspect i want to talk about when clouds can make economic sense is economies of scale …

… and this is essentially that if the company or user purchasing cloud resources is much smaller than the cloud provider, then the provider has some advantages in terms of the costs it can offer …

… and these advantages can then be passed on to the customer …

… just as a concrete example, a small company that's trying to manage some servers might need to hire a system administrator …

… and say that they have a hundred servers on-site …

… so if the system administrator's overall cost to the company is $100,000 a year, this means that they're paying $1,000 per year for each server to have this person administer it …

… in contrast, a provider like amazon might only hire one administrator for 10,000 servers …

… and they might have internal processes and just a larger number of servers that make it possible for one person to manage all of these at once …

… and as a result, amazon's cost per server is only $10 per year …
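
the arithmetic behind that example, using the numbers above:

  # administrator cost per server per year, numbers taken from the example
  admin_salary = 100_000            # dollars per year

  small_org_servers = 100
  provider_servers = 10_000

  print(admin_salary / small_org_servers)   # 1000.0 dollars per server per year on-site
  print(admin_salary / provider_servers)    # 10.0 dollars per server per year at the provider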

… on top of fixed costs like the administrator, which you have to pay for at least once, there are also variable costs that can be smaller for the provider …

… so because amazon is a really large customer, they have more purchasing power …

… they can buy hardware like servers or disk drives at lower cost …

… they can also buy electric power at a lower cost, and the same goes for security, for example hiring one security guard for an entire data center, and so on …

… so these variable costs will also go down …

… finally though, there is a flip side …

… so unlike resources you purchase on-site, cloud providers also have to have a margin …

… so the costs that you see from them aren't going to be exactly the costs that amazon or google internally have …

… instead, they're trying to make a profit …

… but the idea here is that if you're a small organization at least, amazon's price, even with its margin, might still be lower than the price you have in-house …

… and finally, the third aspect i want to talk about is cost associativity …

… so let's first talk about what associativity is …

… so associativity is a property of some mathematical operations, such as multiplication …

… in multiplication, it just means that it doesn't matter how you group the factors: a times b times c gives the same result however you combine them …

… when it comes to clouds, the reason this matters is that pricing of most clouds is per resource-hour …

… so this means for example, if you buy 100 servers for one hour, it's the same price as having one server 100 hours …

… and so it's possible, using the cloud model, to translate a lot of things that would have taken a long time on a single server …

… into using more servers for a short period of time, and still be able to do it at the same cost …

… the picture here shows a really simple example …

… we have five server hours of work to do …

… and instead of doing them all sequentially on one server, we're going to do them all in parallel, and it's going to cost the same …

… let's say each server hour costs $1 …

… so what this means is that for parallel workloads, using the cloud we can get an answer faster, because we can ramp up to a large number of servers for just a short period of time …

… so even though this case here with the five servers has the same cost in dollars and cpu cycles, it gives you more productivity per dollar than buying a single server would, because you get the answer five times sooner …
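
the same arithmetic as a tiny python sketch, using the $1-per-server-hour figure from the example:

  # cost associativity: with per-server-hour pricing, only the product servers x hours
  # matters, so parallelism buys speed at the same cost
  price_per_server_hour = 1.00    # dollars, as in the example above
  sequential = dict(servers=1, hours=5)
  parallel = dict(servers=5, hours=1)

  for plan in (sequential, parallel):
      cost = plan["servers"] * plan["hours"] * price_per_server_hour
      print(f"{plan['servers']} server(s) x {plan['hours']} hour(s): "
            f"cost ${cost:.2f}, wall-clock time {plan['hours']} hour(s)")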

… so that's it …

… so to summarize, clouds can provide the most advantage when you're benefiting from one of these three reasons that i showed before …

… either your on-site usage is variable during the day, and you'd like to not pay for resources when they're idle …

… or the in-house organization is small, and just the economies of scale of offloading things like security and system administration to a cloud can be worth it, can lead to a lower cost per server …

… or finally, if you have highly parallel workloads, where using the cloud you can use more parallelism to get the same workload done for a similar cost, but with more productivity …

types of cloud services

… so our next topic will be types of cloud services …

… i'm going to talk about the different dimensions one can compare cloud services on …

… and also, what are some examples of services available today, and how they fit along these dimensions …

… so this is an area where there's actually a fair bit of vocabulary starting to come up …

… things like infrastructure as a service, or public cloud and private cloud, and so on …

… and i just want to highlight the three main dimensions that you can compare cloud services on and give some examples and talk a little bit about the options among each one …

… the first dimension that people compare cloud services on is levels of abstraction …

… and there are three different levels of abstraction that people talk about today …

… software as a service, platform as a service, and infrastructure as a service …

… these are fairly established industry terms …

… and they're also defined more specifically in the nist definition of cloud computing …

… which is a report from nist about trying to clarify some of the terms around cloud computing and let users compare these different services …

… so software as a service is the highest level …

… and it means complete user-facing applications …

… examples of this application are splunk storm …

… which is a hosted version of splunk's log aggregation product and is just called storm …

… and it lets you visualize and collect data from logs …

… and tableau online, which is an online version of the tableau visualization software …

… so these are end-user applications that we expect business users or analysts or just customers to use directly …

… the same way that they'd use software installed on a local machine …

… and you just get the complete application …

… and you don't know anything necessarily about how it's being hosted or managed on the back end by the provider …

… the next level is platform as a service …

… platform means that these are developer-facing services …

… so things like, say, a web application host or a database …

… and they're higher level than just raw machines …

… you can't just go in there and start launching programs on the machine …

… but they're lower level than end-user software …

… usually, you use these as components in a bigger application …

… so two examples of this are amazon's hosted database service, rds and amazon's hosted mapreduce …

… so both rds and mapreduce offer abstractions that you might use to build an application …

… rds lets you host a relational database …

… mapreduce lets you run computations against a standard api …

… and amazon offers these to you …

… you got to put in some code that will run, say, the database queries or the mapreduce jobs …

… but you don't get full control over the resources …

… amazon is still assigning them and figuring out how they're going to share them across users …

… and finally, the lowest level is infrastructure as a service …

… and here you do get just raw computing resources, nothing special built on top …

… and you can kind of do whatever you want with them …

… so for example, most cloud providers let you just launch virtual machines …

… you can run anything you want on your server …

… and there are also a lot of cloud storage services that provide virtual disks …

… so you just attach this disk to a virtual machine …

… you can run a database on it or a file system or whatever else you want …

… so these are the three levels of abstraction …

… and essentially, the trade-off between these different levels is that the lower levels, like infrastructure, give you more control …

… but the higher levels give you more management and more complete applications …

… so you get to choose whether you want to write the application and manage it yourself …

… or go with a hosted one, or use a platform component that's somewhere in between …

… the second dimension to compare clouds on is levels of multitenancy …

… and the two common terms you'll see here are public cloud and private cloud …

… so remember, i said at the beginning that a cloud means any computing resources available on demand …

… but for the provider of the cloud, there is still room in terms of how they'll share these resources among users …

… and how much they'll benefit from putting users together versus assigning the resources to just one customer …

… so public clouds are services where the underlying computer resources are shared by multiple tenants …

… and are available for the general public to sign up on and start launching computations on …

… there are lots of examples of this …

… google compute engine, amazon, windows azure all offer public clouds …

… private clouds on the other hand means that resources are allocated to a single organization which will use them for their own internal workloads …

… so this makes a lot of sense if you have a large organization like, say, bank of america or walmart, which has a lot of different internal workloads and still wants to share resources between them in a cloud fashion …

… and private clouds can actually be hosted either on premise at the organization or off premise at a provider that just sets up and manages servers isolated for this company …

… and there are providers out there like rackspace that will do both …

… they can either help you install a cloud on premise or host one for you in an area of the data center that's just allocated to you …

… finally, the third dimension to compare clouds on is the access interfaces that they provide …

… and there can be two types of interfaces-- open or proprietary …

… so many cloud interfaces that you see are actually standard …

… they're either interfaces that you'd already use in a private data center or they're just a standard that vendors happen to all commonly use …

… so for example, the simplest one is virtual machines …

… virtual machines all look like an x86 processor …

… and that means that any program you wrote for just pretty much any machine anywhere can run on these clouds without any trouble …

… the same thing happens with the interfaces for block devices for storage …

… operating systems see them as a block device …

… they don't need to know that they're running in a special environment in the cloud; and the same thing happens with hosted standard software, especially open software with open interfaces …

… so for example, mysql database hosting just looks like a mysql database and hadoop mapreduce is a standard api for doing mapreduce …

… now, there are other interfaces that are also quite powerful but are proprietary specific to a particular vendor …

… and usually, these are interfaces for doing new types of computation or storage where maybe there's no existing open implementation that can do it …

… but the provider still wants to offer this for you …

… so for example, amazon's dynamodb is a key-value store …

… you'll see key-value store is covered in more detail in the rest of the course …

… and it provides high performance distributed storage for reading and writing small objects …

… and it's based on dynamo, which is amazon's internal technology for storing the web session data …

… but amazon decided to just offer an interface to this that's specific to dynamo db …

… and it means if your application uses this interface, it can only really talk to dynamo …

… google bigquery is another example …

… bigquery is a service for launching sql computations on google's internal data infrastructure …

… and it has a specific api that doesn't look necessarily like the apis of other databases …

… so in choosing between these interfaces, often there's a choice between, are you using something that's specific to a provider …

… maybe it provides special properties that you wouldn't get somewhere else …

… or are you using something open that makes it easier to migrate across providers but may not do the same kinds of computation that you'd really like to have? …

cloud service examples

… let's talk about some example cloud services available today and categorize them along these three dimensions …

… i've just included five services here, but these dimensions are useful to think about when you're looking at other services too …

… so the first one is amazon's elastic compute cloud, or ec2 …

… and amazon's ec2 just lets you host virtual machines; they run x86 code, and you can launch any programs you want on them …

… so if you look at the level of abstraction for this, this is infrastructure as a service …

… this is just low-level computing infrastructure; you can do anything you want with it …

… the hosting for amazon ec2 is public …

… so multiple users, users you may never know, could be acquiring virtual machines in the same data center …

… and the interface to it is a standard in the sense that it's just the x86 interface for running code …

… although an interesting aspect here is that the api for actually launching, spinning up these machines, and turning them off is actually specific to amazon …

… so there are still aspects of it that are proprietary …
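
as a concrete illustration of that provider-specific control api, here is a minimal sketch using the boto3 python library; the ami id, region, and instance type are placeholders, and running it against a real account would launch (and bill for) an actual machine:

  # launching and terminating one ec2 virtual machine via amazon's api
  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  # launch one small virtual machine from a placeholder machine image
  resp = ec2.run_instances(
      ImageId="ami-0123456789abcdef0",   # placeholder ami id
      InstanceType="t2.micro",
      MinCount=1,
      MaxCount=1,
  )
  instance_id = resp["Instances"][0]["InstanceId"]
  print("launched", instance_id)

  # whatever runs inside the vm is ordinary x86 software; only this control
  # api for starting and stopping machines is amazon-specific
  ec2.terminate_instances(InstanceIds=[instance_id])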

… the next product is rackspace's private cloud …

… so rackspace private cloud also provides virtual machine hosting …

… and so it's also an infrastructure as a service …

… you can run whatever you want in these vms, but the hosting is private …

… rackspace's cloud can either be a private slice of a data center that rackspace owns, dedicated only to you …

… or actually, rackspace can set up private clouds within an organization's own data centers as well …

… and the interface to this is still a standard in the sense that it's still just running x86 code …

… now, moving up to the next level of abstraction, we have things like amazon's relational database service, or rds …

… and this one, as i mentioned before, provides hosted versions of mysql, oracle, and other common database software …

… so this is one level up …

… this is platform as a service …

… you're not just getting machines …

… instead, amazon is managing and running this application that's a component of many end products you may want to build on top …

… the hosting is still public …

… it's still running in the same data center as other tenants …

… and the interface here is a standard in the sense that mysql, oracle, and all these databases have standard interfaces …

… and you can move to any other provider of mysql if you want and get the same thing …

… another example of platform as a service is dynamodb …

… as i mentioned before, dynamodb is a key-value store; it has a data model, consistency guarantees, and storage properties that are specific to amazon …

… amazon engineered a key-value store that would meet certain types of high-scale workloads …

… so this is still platform as a service …

… it's still public, but the interface is now proprietary …

… this is an interface you can only get in amazon, because only amazon really hosts the dynamo system …

… so there's a choice here …

… ok, do you want the performance characteristics of this interface, or do you want a more standard one that may perform less well, but be common across providers …

… and finally, we have software like tableau online …

… this is end-to-end visualization and reporting software …

… you don't need any developers at all to use this …

… this can be used directly by, say, data analysts, or business users …

… and this is the last level of abstraction, software as a service …

… the hosting, in this case, is actually still public …

… so potentially, multiple tenants could be in the same data centre all running tableau …

… technically, the hosting for this kind of software could be private as well …

… it just depends on the software in question …

… and here the interface is proprietary in the sense that it's this one product from the one company tableau …

… you can't exactly get a substitute that has the same interface …

… so hopefully this has given you an overview of the types of cloud services available today …

… and these are just some of the things to consider when choosing which cloud to use …

challenges and responses

… so in the final segment, we're going to talk about challenges and responses of cloud computing …

… and in particular, i'm going to list four of the challenges that come up when using a cloud service …

… and also some of the things that users have done, and providers have done, to respond to those challenges …

… so the first challenge, and maybe the most talked about, is security …

… and the reason for this is very simple …

… if you're outsourcing computation or storage to a third party, security and confidentiality can be a lot harder to guarantee than if you're doing that computation in-house …

… this issue of security becomes especially important when combined with legal compliance …

… so for some workloads, applications and companies that run those workloads need to follow specific laws to protect the information …

… for example, hipaa is a set of laws governing the storage and computation on health data …

… so this means that if you're working with the sensitive information and you need to do it in a cloud …

… you have to make sure that the cloud services you're using also follow all the right guidelines to maintain hipaa compliance …

… pci dss is another example …

… pci is the payment card industry, and this is their data security standard …

… so this is something you have to follow if you're doing credit card transactions online …

… and so you have to make sure that with outsourced storage or outsourced computation, the provider also follows these same guidelines …

… and security can also become a challenge, because your provider might be in a different legal jurisdiction …

… so for example, if you have a company in the united states that's using a cloud hosted in europe, the servers in that cloud have to follow the local laws in europe …

… these laws might be different …

… they might require information to be protected in a different way …

… or to be given over to authorities under different circumstances for law-enforcement reasons …

… and so the organization has to make sure that this legislation is followed …

… and same thing happens if you're a european provider with a cloud in the us …

… because of the concerns about security, there's actually been quite a bit of response to these challenges in both industry and in research …

… and providers and users are actually working to make the security of the cloud quite a bit better, quite a bit easier to offer …

… so the first example of this movement is that providers over time have given users more and more control over the security properties of their data …

… and just more controls to let them ensure that the computation happens securely …

… for example, virtually all providers today offer encryption of stored data …

… so if you place data, say, in amazon's s3 or in a hosted database, you know that even if someone breaks into amazon's data center and runs away with the machines or the disks, they won't be able to actually use that data …
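
a minimal sketch of asking for encryption at rest, here using the boto3 python library to upload an object to s3 with server-side encryption turned on; the bucket name and object contents are placeholders:

  # upload an object to s3 and have amazon encrypt it before storing it
  import boto3

  s3 = boto3.client("s3")
  s3.put_object(
      Bucket="example-bucket",            # placeholder bucket name
      Key="reports/sales-2014.csv",
      Body=b"store,week,sales\n...",
      ServerSideEncryption="AES256",      # s3 encrypts the object at rest
  )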

… the second feature related to encryption is key rotation …

… when you encrypt data, you've done so with a specific key …

… and over time, you may want to change the key, either because the first key was compromised …

… you know, someone got a copy of it, or you had a security breach, or just want to update to a newer type of encryption standard …

… so providers are now starting to let you do this remotely, without having to download the data locally and re-encrypt it and send it back …

… for example, amazon's redshift database has key rotation as just a built in feature …

… and finally, more providers are offering fine-grained access controls-- access roles and user authentication for users within a customer organization that are using the cloud …

… for example, if you have, say, a retailer company that's trying to host some computations in the cloud, there might be different people within the retailer that should have different levels of access to the services there …

… some people should not be able to see customer data, maybe managers should only be able to see data about their own division, and so on …

… so using these mechanisms, it's possible to sign up different users within the same organization that have access to different parts of the cloud service …

… and it's also possible to require richer authentication mechanisms from them …

… for example, two-factor authentication, where the user has to carry a physical device with part of their key on it, so that it's harder for someone else to login as them …

… and so most clouds are starting to offer these features as well …

… the next direction in which we've seen changes in work from providers is in compliance …

… so most providers now, for example, are pci dss compliant …

… this means that they let users build up the security properties that the payment card industry needs in order to do secure payment …

… so for example, providers might give control over firewalls, they might give control over encryption, so that users can securely process credit card transactions …

… and there's movement to increase this to other types of compliance as well …

… and there's certainly interest from legislators in being able to certify providers as offering these …

… so finally, on the research side, the cloud is motivating quite a few interesting developments in cryptography and encryption that may make it possible to more securely perform computations on encrypted data …

… one example term you'll hear there is homomorphic encryption …

… homomorphic encryption means that, at least for some operations, it's possible to do the operation on the encrypted data and get the encrypted result without the provider ever learning what the real value of the data was …

… so for example, if a homomorphic encryption scheme supports sums as an operation, you might have two numbers, a plus b …

… and if you have encrypted values of a and encrypted values of b, the provider can just add those up or perform some other operation on them to get an encrypted value of a plus b …

… and in doing this, the provider learns neither what a nor b was …

… it can just give you back this value of a plus b …

… but the customer can perform the computation using the provider's servers without having to gather all the data back locally and do it on site …

… so homomorphic encryption for just one operation like this has existed for a while …

… and lately, something called fully homomorphic encryption has also been developed, which, although it's currently highly inefficient, allows, in theory, arbitrary computations to be performed in a homomorphic way on encrypted data …

… so this is still an area of very active research …
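
to make the additive case above concrete, here is a toy python sketch; it is a one-time-pad-style illustration of the idea only, not a real homomorphic cryptosystem such as paillier:

  # toy additive homomorphism: the provider adds ciphertexts without seeing the plaintexts
  import secrets

  M = 2 ** 64                      # work modulo a large number

  def encrypt(value, key):
      return (value + key) % M

  def decrypt(ciphertext, key):
      return (ciphertext - key) % M

  # client side: encrypt a and b with independent random keys
  a, b = 1200, 345
  ka, kb = secrets.randbelow(M), secrets.randbelow(M)
  ca, cb = encrypt(a, ka), encrypt(b, kb)

  # provider side: add the ciphertexts without ever learning a or b
  c_sum = (ca + cb) % M

  # client side: decrypt the result using the sum of the keys
  print(decrypt(c_sum, (ka + kb) % M))   # prints 1545, which is a + b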

… a similar property comes from encryption schemes that let you do comparisons on the values …

… so for example, order preserving encryption …

… here, the idea is that if you have values that you're encrypting, say, a and b, and a is less than b, then the encrypted values of these numbers follow the same property …

… so here the encrypted value of a is less than the encrypted value of b …

… and what this lets you do is, it lets you run queries on the data that compare values, without the provider ever seeing the actual values …

… they only see the encrypted one …

… but you can still tell them, hey, give me the values where the encrypted value is less than this …

… and so you run a query on the underlying data without revealing it …

… now, order preserving encryption, unlike the homomorphic one above, necessarily has to reveal some information to the provider …

… so in this case, it does reveal the order of values to the provider …

… and depending on the domain, the provider might still be able to figure things out …

… for example, if these are the salaries of employees, and the provider sees the encrypted salaries, they'll be able to figure out which employee had the highest salary …

… or maybe they don't know anything about them, but they can track this employee through other fields that they have in their database, and maybe understand something about them …

… so here, there's a bit of a trade-off between being able to do these computations and leaking information to the provider …
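
a toy python sketch of the order-preserving idea over a small integer domain; again, this is only an illustration of the property, not a production ope scheme:

  # toy order-preserving encryption: a secret, strictly increasing random mapping
  import random

  random.seed(42)                      # stands in for a secret key
  DOMAIN = 1000                        # plaintexts are integers 0..999

  # secret monotone mapping: cumulative sums of random positive gaps
  gaps = [random.randint(1, 1000) for _ in range(DOMAIN)]
  table = []
  total = 0
  for g in gaps:
      total += g
      table.append(total)

  def encrypt(x):
      return table[x]

  salaries = [300, 120, 870, 455]
  encrypted = [encrypt(s) for s in salaries]

  # the provider can sort and compare without seeing the salaries themselves,
  # but, as noted above, it does learn their relative order
  print(sorted(encrypted))
  print(encrypt(120) < encrypt(300) < encrypt(455) < encrypt(870))   # True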

… and finally, the third type of computation on encrypted data that is possible is searching …

… so there have been several proposed algorithms and schemes to let you do text search on encrypted data without the provider learning what the original text was or what the query was …

… and this is another example where you might outsource computation to the cloud …

… so these are just some examples of advances in cryptography as applied to the cloud …

… and i think over the next few years, you'll see quite a bit more interesting theoretical and practical work in this area as we begin to understand to what degree computation really can be outsourced to a third party …

availability

… the second challenge with the cloud model is availability …

… when you're running computations in the cloud, you give responsibility for availability of the data and of the service to a third party, namely the cloud provider …

… so that means that the provider you choose must make sure that the data is stored reliably …

… that it recovers from disasters, from power going down or a data center going away …

… and that any service you're running is up all the time when your users need to use it …

… and a specific challenge that may come up with this is actually also business continuity …

… so if your cloud provider ever goes out of business, the question is, what happens to your data? …

… what happens to your computation? …

… can you easily move that somewhere else? …

… or have you lost it? these are both important things to worry about with the cloud …

… given these concerns about availability, there are actually a number of responses that users can take to minimize the risks …

… and there are also things that providers are doing to let users control the availability of their computations and more easily employ the infrastructure that the provider has …

… so the first thing is location diversity within a provider …

… most large providers, such as microsoft, amazon, and google, have data centers in many locations across the world …

… and they expose to users where specific resources are, so that users can make sure their computation is geographically distributed and redundant …

… most providers have support for data replication across these zones …

… and they may also expose what are called availability zones as a concept even within one data center that tell you these racks of machines, for example, are on different power supplies …

… so if you put your machines across them, it's unlikely that they'll all go down at the same time …

… the second response you see here is using multiple providers …

… there's a lot of interest in this, a lot of pressure on providers to inter-operate well, to provide similar interfaces …

… and there are also third-party services that actually will manage different providers transparently for you and let you run a computation across all of them without worrying exactly about which provider you're currently on …

… and finally, the third aspect to consider with availability is that even though you are outsourcing availability to a third party, the scale of these providers may let them do quite a bit more to ensure availability than on-site hosting would …

… so for example, having 24/7 security guards around the data center, or setting up distributed replication, or setting up local generators in case of power failure …

… these are all things that are fairly expensive for a company to do themselves unless they operate at a certain scale …

… but they are things that the cloud providers are doing …

… so in some sense, because the providers are doing this, it may also be possible to get higher availability using the cloud than by hosting your own infrastructure …

… third challenge i want to talk about is data transfer …

… this is especially relevant because we're talking about big data …

… and the problem here is that moving data over the internet is actually really slow …

… just as an example, say you want to transfer 10 terabytes of data-- that's actually not that much data-- over a t3 line, which is a relatively fast business connection …

… it's about 45 megabits per second …

… so transferring that much data over a t3 line will actually take about 20 days even though 10 terabytes today is only five disk drives …

… it's something you could stack up on your desk very easily …

… you can carry it in your backpack …

… and it costs about $400 to purchase …

… so the gap between storage density and internet bandwidth is very quickly increasing …

… and it makes it difficult to actually move these volumes around even when these volumes are quite reasonable to store and to collect …
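
the transfer-time arithmetic, for reference:

  # 10 terabytes over a 45 megabit/second t3 line
  data_bytes = 10 * 10**12              # 10 TB
  line_bits_per_s = 45 * 10**6          # 45 Mbit/s

  seconds = data_bytes * 8 / line_bits_per_s
  print(seconds / 86400)                # roughly 20.6 days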

… because of this, there have been some really interesting responses from providers …

… one response is that data transfer into most providers is actually free …

… usually when you do network communication with the cloud, the provider has to pay for the network …

… and so you have to pay for part of the networking …

… but moving data into them is usually free …

… and this is just to encourage you to load data in; the provider absorbs the bandwidth cost needed to receive it …

… of course, once you've loaded the data in, they're hoping that you actually run some computations in there or store it long term so that they can benefit from it …

… and the second interesting response is that in addition to sending data over the internet, there are now services that let you import data by just shipping physical disks …

… so for example, amazon has a service called import/export where they use amazon's existing network of trucks and delivery vehicles and warehouses that can already go around and pick up and deliver packages to import just disks, like the five disks i mentioned above, into amazon and then load the data onto servers …

… so using these services, you can actually move large volumes of data relatively quickly and relatively cheaply into the cloud …

cloud lock-in

… so the final challenge we'll talk about is lock-in …

… and there are two types of lock-in, interface lock-in and data lock-in …

… interface lock-in means that if you begin using an interface to write your applications that's specific to one provider, it may be difficult to then move these applications on site or across to different providers, which have different interfaces …

… so at the very least, you need to take care when writing your application …

… so that it could easily migrate to say a different storage service, or a different way of acquiring resources …

… data lock-in means that in the same way that data was expensive to move into a cloud, it can be expensive to move out …

… so just as uploading 10 terabytes would take 20 days, downloading it from the cloud would take an equal amount of time …

… and so just having a lot of data put into one provider may then make it difficult to switch providers, and may leave you essentially at the mercy of the provider you chose …

… and this is especially tricky because most types of computation you do need to be near the data …

… you can't just read the data across the internet and compute on it at any reasonable speed, compared to actually running computation in the same data center on the same machine …

… so to deal with lock-in, there are a number of responses and different steps that customers can take to minimize their lock-in …

… the first and most important one is probably just a preference for open or standard apis …

… if your cloud application is written against an api that you can easily replicate on site, or that you can replicate on another cloud, it becomes much easier to move that application around and to deal with the case where the provider you chose at the beginning is no longer ideal …

… if using an open api is impossible, a second response is to use wrappers over the proprietary apis, so that the same code written by the user can work over many different providers …

… and there are many software efforts out there today to build these common wrapper libraries over the provider interfaces and let applications move around …

… just as an example, jclouds is an open source library in java for accessing different cloud providers …

… it makes it possible to launch and shut down virtual machines across these providers using the same api …
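
a minimal python sketch of the wrapper-library idea; the class and method names here are hypothetical, not a real library (jclouds, mentioned above, plays this role in the java world):

  # application code depends only on a provider-neutral interface;
  # per-provider adapters hide each vendor's proprietary api
  from abc import ABC, abstractmethod

  class CloudProvider(ABC):
      """Hypothetical provider-neutral interface for launching machines."""

      @abstractmethod
      def launch_vm(self, image: str, size: str) -> str: ...

      @abstractmethod
      def terminate_vm(self, vm_id: str) -> None: ...

  class FakeProviderA(CloudProvider):
      def launch_vm(self, image, size):
          print(f"provider A: launching {size} vm from {image}")
          return "a-123"

      def terminate_vm(self, vm_id):
          print(f"provider A: terminating {vm_id}")

  def run_job(cloud: CloudProvider):
      vm = cloud.launch_vm(image="base-image", size="small")
      # ... do the work ...
      cloud.terminate_vm(vm)

  # swapping vendors means swapping the adapter, not rewriting the application
  run_job(FakeProviderA())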

… and finally, to deal with the data lock-in issue, the same physical disk import and export that we talked about before for uploading data can also be used for downloading it …

… and this can be a way to quickly move data across providers when needed …

… so to conclude, although they are still relatively new technologies, clouds are a very exciting environment in which to manage and process big data …

… they give most organizations the ability to run much larger-scale computations than they could in-house …

… manage much larger data volumes in a reliable fashion …

… and get started with this technology quickly to really understand what it can do for them …

… several challenges remain with the cloud model, including legal challenges like compliance and technological challenges like availability and security …

… but these challenges are actively being worked on, and i think we can expect to see a lot of change across all of these domains …

… finally, as a parting thought, clouds are just the latest instance of resources and business operations being outsourced …

… as an example, in the early 1900s, large companies generated their own electricity …

… for example, the ford motor company had its own power plant and decided that creating its own electricity and distributing that electric power was more efficient than trying to use some kind of public grid …

… but since then, things have changed, and essentially almost every company today just uses the public grid …

… and this essential service without which almost all business operations would no longer work is now outsourced to a third party …

… so it wouldn't be at all surprising if a few years or maybe a few decades from now, computing which is another highly necessary and ubiquitous resource, also becomes a utility, and most of these operational challenges become outsourced to clouds as well …

big data storage

source professor m stonebraker, MIT 'Tackling the Challenges of Big Data' MOOC, 2014 [profile]

overview

… in this module, you will learn about new data storage and processing technologies that have been developed to process big data …

… trends in relational database systems, including column-oriented and main-memory systems, which offer much greater scalability and performance than first-generation relational systems, which are needed in the era of big data …

… hadoop, mapreduce, spark, and other scalable computation platforms, their limitations, and recent developments aimed to make them perform better …

… so-called "nosql" and "newsql" systems, which offer different interfaces, consistency, and performance than traditional relational systems …

… this module begins with an overview of a number of these technologies by renowned database professor mike stonebraker …

… mike expresses his skepticism about many new technologies, particularly hadoop/mapreduce and nosql …

… and voices support for many new relational technologies, including column stores and main memory databases …

… after that, professors matei zaharia and samuel madden provide a more nuanced view of the tradeoffs between the various approaches, discussing hadoop and its derivatives, as well as nosql and its tradeoffs, in more detail …

… survey how data storage technology has evolved in the last decade, from conventional single-node relational databases to a plethora of storage technologies, including modern analytical databases, hadoop, and nosql …

… highlight the storage challenges that big data presents, particularly as it relates to new analytical needs on that stored data, from processing massive-scale sql aggregates to querying arrays to processing graphs …

… present the tradeoffs between relational, nosql, and other technologies, particularly as they relate to storing and accessing very large volumes of data …

… describe what a column-oriented database is and how it works …

… describe how the mapreduce (hadoop) framework works and what its limitations are …

… understand how next generation "newsql" databases, like the h-store system, can provide very fast sql queries over data sets that fit into memory …

challenges

… understand the differences between row- and column-oriented relational databases …

… understand the differences between transactional and analytical database workloads …

… describe the mapreduce abstraction and its limitations, especially when processing repetitive operations over data that fits in memory …

… articulate how the new spark framework addresses these limitations …

… understand acid semantics and related reduced-consistency concepts like eventual consistency …

… understand how sql systems can be modified to provide very high performance on main memory workloads without giving up acid properties …

modern databases + distributed processing platforms

introduction

… so the first thing i'm going to do is give you an introduction, which is going to be a history lesson of where this all came from …

… so the world pretty much started in the 1970s …

… it started with a pioneering paper by ted codd in cacm that, to a first approximation, invented the modern relational database model …

… during the '70s, there were some prototypes built …

… but the next significant thing happened in 1984, when ibm released a system called db2, which is still sold today …

… and by that one act they declared that relational database systems were a mainstream technology and that they were behind them …

… during the remainder of the '80s, relational database systems gained market traction …

… and in the '90s, they basically took over …

… all new database implementations were basically using relational systems by the 1990s …

… and a concept evolved called "one-size fits all", that relational database systems were the universal answer to all database problems …

… put differently, your relational database salesman was the guy with the hammer, and everything looked like a nail …

… so that was pretty much true …

… relational systems were the answer to whatever data management question you had …

… and that was pretty much true until the mid 2000s …

… and if you want to pick one event, i wrote a paper that appeared in the icde proceedings …

… at the time, i claimed that one-size did not fit all, that the days of universal relational database system implementations were over, and they would have to coexist with a bunch of other stuff …

… now it's seven years later, and i'm here today to tell you that one-size fits none, that the implementations that you've come to know and love are not good at anything anymore …

… so before i tell you why, i've got to tell you what it is that i'm complaining about …

… so unless you squint, all the major vendors are selling you roughly the same thing …

… they're selling you a disk-based sql-oriented database system, in which data is stored on disk blocks …

… disk blocks are heavily encoded for a variety of technical reasons …

… to improve performance, there is a main memory buffer pool of blocks that sit in main memory …

… it's up to the database system to move blocks back and forth between disk and the main memory cache …

… sql is the way you express interactions …

… there's a bunch of magic that parses and produces query plans …

… these query plans figure out the best way to execute the command you gave them by optimizing cpu and i/o …

… and the fundamental operation, the inner loop of those query plans, is to read and or update a row out of a table …

… indexing is by the omnipresent b-tree, which has taken over as the way to do indexing …

… all the systems provide concurrency control and crash recovery, so-called acid in the lingo …

… to do that, they use dynamic row-level locking and a write-ahead logging system called aries, invented by a guy named c mohan at ibm …

… most systems these days support replication …

… the way, essentially, all of the major vendors work is that they declare one node to be the primary …

… they update that node first, then they move the log over the network and roll the log forward at a backup site, and bring the backup site up to date …

… so that's what i mean by "the implementations of the current major vendors", who i will affectionately call "the elephants." you are almost certainly using "elephant" technology …

… and the thing to note is that all of the vendors' implementations date from the 1980s …

… so they are 30-year-old technology …

… you would call them legacy systems these days …

… so you're buying legacy systems from the major vendors …

… and my thesis is that they are currently not good at anything: whatever data management problem you have, there is a much better way to do it with some other, more recently developed technology …

… so the major vendors suffer from the innovators dilemma …

… that's a fabulous book written by clayton christensen who's on the faculty at the harvard business school …

… it basically says, if you're selling the old stuff, and technology changes to where the new stuff is the right way to do things, it's really difficult for you to morph from the old stuff to the new stuff successfully without losing market share …

… but in any case, the vendors suffer from the innovator's dilemma …

… that keeps them from moving quickly to adopt a new technology …

… and therefore, their current stuff is tired and old, and deserves to be sent to the home for tired software …

… so the rest of this module is going to be my explanation of why the "elephant" systems are not good at anything anymore …

… so i will claim that there are three major database markets, so i will segment the market into three pieces …

… about one-third data warehouses, about one-third transaction processing, so-called oltp, and one-third everything else …

… so i'll talk about each of these markets and how you can beat the "elephants" implementations by a couple orders of magnitude in every single one of them …

… and then i will have some conclusions at the end to leave you with …

data warehousing

… michael stonebraker: ok, in this segment, i'm going to talk about data warehouses …

… and i'm going to explain to you why the elephants' products don't work well in the data warehouse market …

… so what about the data warehouse market? …

… well, almost all of you, your enterprises, have a data warehouse somewhere in your enterprise …

… typically, it has customer-facing data in it, things like sales, products, customers …

… and typically, you record that stuff over time, and you load that in a data warehouse …

… and the fact of life is that so-called column stores are well along in replacing the row stores that you are buying from the elephants …

… and why are they taking over? …

… because they're two orders of magnitude faster …

… so i'm going to explain to you in this segment why they're two orders of magnitude faster …

… so the way to think about a data warehouse-- and i'll just use walmart as an example-- every time an item goes under a wand anywhere in a walmart store, a transaction record is ultimately loaded into a data warehouse in bentonville, arkansas that says who bought what, where, and when …

… so that's what's called a fact …

… and at the center of most data warehouses is a thing called a fact table that just records transactional data …

… so surrounding that fact table is a collection of tables that are called dimensions …

… so in the case of walmart you have a store table, with a row for every store, indicating the address of the store, who the manager is, its square footage, that kind of stuff …

… you have a time table indicating the time at which any transaction took place, namely in days, hours, minutes, seconds, that kind of stuff …

… you have a product table, which is all the products that a company like walmart sells, giving their id and their description …

… you sell nails that are number 10 nails in aluminum, that kind of stuff …

… and then you often have a customer table that you get through loyalty cards, getting whatever information you have on a customer from the loyalty information …

… so you have a fact table in the center, and you have sort of so-called dimensions surrounding them that are connected to the fact table by foreign key/primary key relationships …

… and this is what's called a star schema, meaning a thing in the center and then a bunch of edges …

… there's a concept of snowflake schemas, which is if the star extends more than one level deep …

… and if you want to read about snowflake schemas and star schemas, check out anything written by ralph kimball …

… if you have a data warehouse and you're not using a star schema, you should be …
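
as a concrete illustration, here is a minimal star schema in sql, run with python's built-in sqlite3; the table names, columns, and rows are made up:

  # one fact table of sales, joined to store, time, and product dimensions by foreign keys
  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE store    (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, sqft INTEGER);
      CREATE TABLE time_dim (time_id  INTEGER PRIMARY KEY, day TEXT, week INTEGER);
      CREATE TABLE product  (prod_id  INTEGER PRIMARY KEY, description TEXT, department TEXT);

      CREATE TABLE sales_fact (
          store_id INTEGER REFERENCES store(store_id),
          time_id  INTEGER REFERENCES time_dim(time_id),
          prod_id  INTEGER REFERENCES product(prod_id),
          quantity INTEGER,
          dollars  REAL
      );
  """)

  db.executemany("INSERT INTO store VALUES (?,?,?,?)",
                 [(1, "miami", "FL", 120000), (2, "atlanta", "GA", 90000)])
  db.executemany("INSERT INTO time_dim VALUES (?,?,?)",
                 [(1, "2007-08-10", 32), (2, "2007-08-17", 33)])
  db.executemany("INSERT INTO product VALUES (?,?,?)",
                 [(1, "flashlight", "hardware"), (2, "bottled water", "grocery")])
  db.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [(1, 1, 1, 40, 400.0), (1, 1, 2, 300, 450.0), (2, 1, 2, 80, 120.0)])

  # a typical warehouse query: sales by department in florida stores in week 32
  for row in db.execute("""
          SELECT p.department, SUM(f.dollars)
          FROM sales_fact f
          JOIN store s     ON f.store_id = s.store_id
          JOIN time_dim t  ON f.time_id  = t.time_id
          JOIN product p   ON f.prod_id  = p.prod_id
          WHERE s.state = 'FL' AND t.week = 32
          GROUP BY p.department"""):
      print(row)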

… so why are column stores so much better than row stores on data warehouse stuff? …

… well, let's look at the fact table, because that's where all the data is …

… so in a typical warehouse you might have 100 columns recording the individual transactions …

… and a typical warehouse query might be something like, well, there were four hurricanes in florida during the 2007 hurricane season, so i want to figure out how to provision my stores for the next hurricane …

… so i want to figure out what's sold by department in florida stores in the week before every hurricane contrasted with the week after every hurricane, and compare that with same-store sales in georgia …

… so such a query might read four or five attributes out of 100 …

… so if you have a row store, the problem with a row store is that you store data on disk record by record by record …

… so when you pull a block off the disk and into the buffer pool, you're reading all 100 columns of every record on that block …

… and even though you only want four or five of them, you're reading the other 95% because they come off the disk, because that's the way data is stored …

… in a column store you have to rotate your thinking 90 degrees …

… you store the data column by column by column and not row by row by row …

… at which point, if you want to read four or five attributes, you read exactly the four or five you want and never bring the other 95% off the disk …

… a quick calculation shows that i just won by a factor of 25 with a column store …
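
that calculation, spelled out:

  # reading 4 of 100 columns from disk
  total_columns = 100
  columns_needed = 4

  row_store_io = total_columns       # a row store drags in every column of each record
  column_store_io = columns_needed   # a column store reads only the columns the query touches

  print(row_store_io / column_store_io)   # 25.0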

… another thing that column stores do is encode the heck out of the data …

… because if you have a 20 terabyte fact table you really would like to encode it as much as you can …

… so in a column store you only have one kind of attribute on each block of data …

… and so you can very easily compress one kind of thing …

… in a row store, you've got 100 kinds of things on the same block …

… it's much more difficult to compress data …

… so compression is way more productive in a column store …

… that wins me some more …

… furthermore, it turns out that all the elephants' products have a big header on the front of every record …

… well, if you designed a column store from scratch you don't want any big headers, because they take up a lot of space and they don't compress well …

… so the recently written column store systems don't have any record headers …

… that saves you some more …

… and then the last thing is that an executor does not pick up data row by row by row …

… it picks up data column by column by column …

… and it turns out a column executor is way faster than a row executor because of vector processing …

… when i pick up a row, i basically look at, say, the department …

… and then i either like it or i don't like it …

… i either keep it or i don't keep it …

… then i go on and pick up the next row …

… and i do row by row by row by row …

… in a column executor you pick up a column and you blast through all the department data, keeping only the ones you need …

… and so you process the whole column rather than processing stuff a row at a time …

… that's just way faster …
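
a toy python illustration of the two execution styles; the data and column names are made up:

  # row-at-a-time versus column-at-a-time execution
  rows = [
      {"department": "grocery",  "store": "FL-1", "dollars": 450.0},
      {"department": "hardware", "store": "FL-1", "dollars": 400.0},
      {"department": "grocery",  "store": "GA-2", "dollars": 120.0},
  ]

  # row executor: pick up each whole row, look at the department, keep it or not
  print(sum(r["dollars"] for r in rows if r["department"] == "grocery"))   # 570.0

  # column executor: blast through just the department column in one tight loop,
  # remember the qualifying positions, then touch only the columns the query needs
  department = ["grocery", "hardware", "grocery"]
  dollars = [450.0, 400.0, 120.0]
  matches = [i for i, d in enumerate(department) if d == "grocery"]
  print(sum(dollars[i] for i in matches))                                  # 570.0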

… so for all of these reasons, column stores are just wildly faster than row stores in the data warehouse market …

… so let's look at the participants in this market …

… well, there are the elephants, the native row store vendors that we talked about in the introduction …

… they sell you a row store …

… then there are more recently written new technology column stores …

… vertica, which was a startup recently bought by hp …

… hana has been making a lot of press …

… it's a new system written by sap …

… there's a system called redshift that amazon sells …

… it turns out to be a product called paraccel that they're relabelling …

… and sybase iq is a column store from long ago, now owned by sap …

… so they're native column store vendors, way faster than native row store vendors …

… and then there are companies that have realized they're on the wrong side of a technological change …

… and so in transition are some companies that used to be row store vendors and are trying to convert to column stores, solving the innovator's dilemma in the process …

… so the thing to note is that the top bucket of people are 100x the second bucket …

… and the third guys are somewhere in between …

… they're in transition …

… so let me tell you why column stores don't look like row stores …

… so you can't be both a row store and a column store …

… there is no easy way to take a row store and wave a magic wand and convert it into a column store …

… why is that? well, let me just give you three quick slides on the way vertica works, which was a system i was involved in, now owned by hp …

… so column stores store data column by column by column …

… so in the case of vertica, think of a table as stored column by column by column and sorted on all the attributes in left-to-right order …

… that just happens to be the way vertica does it …

… so the leftmost column is sorted …

… the next one over is sorted only within matching values of the one to its left, and so forth …

… the way vertica stores a column is in 64k blocks that i'll call chunklets …

… so each block holds however many attribute values turn out to fit in 64k …

… the first one is stored uncompressed, and all the rest are stored compressed …

… delta encoding, which is you encode each one against its predecessor, is one way they do it …

… lempel-ziv is another way …

… huffman coding is another way …

… if you have a lot of repeated values, like male and female as an attribute, well, you can just suppress the repeated values …

… so you're encoding the heck out of 64k chunklets …

… the leftmost column is in sorted order, so it's usually delta encoded …

… the rest of them are encoded in whatever way turns out to be best …

… chunklets are only decompressed when necessary …

… you can do quite a few operations without decompressing a chunklet …

… so lazy decompression …

… and the fundamental operation is pick up a chunklet and process the attribute values in that column …

… looks nothing like the standard row store technology …
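
a toy python sketch of the per-chunklet encodings just described-- delta encoding for a sorted column and run-length suppression of repeated values; the 64k block layout and lazy decompression are omitted, and this is not vertica's actual on-disk format:

 # toy column encodings (illustrative, not vertica's storage format)
 
 def delta_encode(sorted_vals):
     """store the first value, then each value as the difference from its predecessor"""
     out = [sorted_vals[0]]
     for prev, cur in zip(sorted_vals, sorted_vals[1:]):
         out.append(cur - prev)
     return out
 
 def delta_decode(encoded):
     vals, total = [], 0
     for d in encoded:
         total += d
         vals.append(total)
     return vals
 
 def rle_encode(vals):
     """suppress repeated values, e.g. a male/female column becomes (value, run length) pairs"""
     runs = []
     for v in vals:
         if runs and runs[-1][0] == v:
             runs[-1] = (v, runs[-1][1] + 1)
         else:
             runs.append((v, 1))
     return runs
 
 sorted_column = [100, 100, 103, 107, 107, 107, 112]
 assert delta_decode(delta_encode(sorted_column)) == sorted_column
 assert rle_encode(["m", "m", "f", "f", "f"]) == [("m", 2), ("f", 3)]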

… now, you might say, well, column stores are kind of nice, but how fast can you load data? because if i load a new record, well it has 100 columns in it, and those 100 columns all go in different places …

… so, i certainly don't want to do 100 disk writes every time i want to load a record …

… so of course, vertica doesn't do it that way at all …

… there is a main memory row store in front of the stuff i just talked about …

… all the newly loaded tuples go there as main memory inserts …

… and every once in a while vertica picks up a whole bunch of these main memory tuples, rotates them 90 degrees, compresses the heck out of them, and writes out big, long runs of stuff …

… so in bulk, sort, compress, and write out …

… and every once in awhile you have these segments that you've written to disk …

… you merge them to make bigger segments, and they get bigger and bigger and bigger in an asynchronous process …

… queries to vertica are answered from both places, both the column store and from the row store …
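
a minimal python sketch of the load path described above-- a small in-memory row store in front of columnar segments, with a periodic rotate/flush and merge; all class and method names here are illustrative, not vertica's:

 # toy write path: memory row store in front of column segments (illustrative only)
 class TinyColumnStore:
     def __init__(self, columns, flush_threshold=4):
         self.columns = columns
         self.flush_threshold = flush_threshold
         self.memory_rows = []   # newly loaded tuples land here first
         self.segments = []      # columnar segments: {column: list of values}
 
     def insert(self, row):
         self.memory_rows.append(row)
         if len(self.memory_rows) >= self.flush_threshold:
             self._flush()
 
     def _flush(self):
         # "rotate 90 degrees": buffered rows become one value list per column
         segment = {c: [r[c] for r in self.memory_rows] for c in self.columns}
         self.segments.append(segment)   # a real system would sort and compress here
         self.memory_rows = []
         if len(self.segments) > 2:      # merge small segments into bigger ones
             merged = {c: sum((s[c] for s in self.segments), []) for c in self.columns}
             self.segments = [merged]
 
     def scan(self, column):
         # queries consult both the column segments and the memory row store
         for seg in self.segments:
             yield from seg[column]
         for row in self.memory_rows:
             yield row[column]
 
 store = TinyColumnStore(["dept", "sales"])
 for i in range(10):
     store.insert({"dept": i % 3, "sales": i * 1.5})
 print(list(store.scan("dept")))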

… so to a first approximation, this is what paraccel looks like …

… this is what hana looks like …

… and it has nothing to do with the way row stores are implemented …

… so the elephants' code has nothing to do with the way these systems work …

… and these systems are wildly faster than row stores …

… so over time, the only successful data warehouse products will be column stores …

… and the elephants are either going to go out of the warehouse market or they're going to have to morph their current row stores into column stores, which requires them to solve the innovator's dilemma problem …

… and that's the end of what i have to say on data warehouses, except for a couple of remarks i'll make in the conclusions …

… thank you …

online transaction processing OLTP

… the purpose of this segment is to talk about the online transaction processing world and to explain to you why the elephants' systems are no good at this market either …

… so in order to talk about transaction processing, i should first do a quick example …

… the way to think about transaction processing is to think of our favorite retailer, which is amazon …

… you don't hit a transaction processing system until you say, buy it …

… when you say, buy it, your shopping cart is turned into a transaction …

… you actually own the stuff …

… and there's a transactional database-- implemented, by the way, in oracle-- which keeps track of what you bought, figures out when to ship it, and charges your credit card for whatever the amount is, and so forth …

… so this is an update-oriented world, where you want to do updates very quickly …

… there's essentially no big block-oriented reads like in the data warehouse market …

… so this is an update-oriented world …

… and you can never lose anybody's transactions because the last thing amazon wants to do is send you stuff and not have you pay for it or have you pay for it and not send you stuff …

… either way, it's a big loss …

… so there are three big decisions to make when you're thinking about how to implement transaction processing systems …

… the first one is do i bet on main memory or do i bet on disk? …

… the second one is how to do replication …

… and the third one is how do i sort out parallel transactions to make sure that i support so-called acid? …

… so i'll talk about each of these …

… and i'll talk about the new way …

… and i'll talk about the old way and why the new way's going to clobber the old way …

… so a reality check on transaction processing databases is that a one terabyte database is really, really big by transaction processing standards …

… and, yeah, i know there's facebook, which has a much bigger one …

… but by and large, a terabyte is a really big tp database because your tp database only grows at the rate that you sell more stuff …

… so if you want to store one terabyte of data, well, you can store it on disk …

… that's always been an option …

… but you can also store it in main memory, as long as you're willing to pony up about $30,000 or less …

… and the price of a terabyte of main memory is dropping like a rock …

… so if your data doesn't fit in main memory now, then wait a couple years, and main memory prices will drop to where it'll make sense for you to put your data in main memory …

… so unlike data warehouses that are really, really big-- and your business analysts want to keep making them bigger-- transaction processing databases just aren't that big …

… so if your data fits in main memory, well, if you're using one of the elephants' products, that will mean it's in the buffer pool cache that's in main memory …

… the real data is stored on the disk …

… so if you're doing that, here's what you're up against …

… so this is some data that we actually got about five years ago …

… and we instrumented the shore dbms, which looks exactly like the elephants' products …

… but we had the source code for it …

… and so we could instrument it …

… and also dave dewitt, who wrote shore, was here at the time …

… so that made it even easier to get what i'm going to talk about to work …

… so the data from shore should look a lot like what you'd get from the elephants if you could, in fact, instrument their product …

… and this is on a benchmark called tpc-c, which is the standard transaction processing benchmark from the transaction processing council …

… so we measured elapsed time, which is to a first approximation cpu cycles, because everything's in main memory …

… and so useful work-- finding the data you want and updating it-- is 10% or less …

… in the artist's rendition of what i'm talking about, it's 4% …

… it varied depending on exactly which tpc-c command you are actually doing …

… but the big enchilada was four different pie slices that collectively took up more than 90% of the total time …

… the first one was managing the buffer pool …

… it shouldn't surprise you that if the data is highly encoded in the buffer pool, well, to update it, you've got to find the right buffer pool block, then find the right place in the buffer pool that your record exists, decode your record into main memory format, update it, and then reverse the process, put it all back …

… the elephant systems all maintain lru ordering of the blocks in the cache in case they have to throw stuff out …

… you have to pin blocks in the cache sometimes …

… all that code is about a quarter of the overhead …

… the next piece is row-level locking …

… every time you touch a record, you have to set a lock …

… every time you update a record, you've got to upgrade that row level lock to a write lock …

… and managing the lock table is another about a quarter of the overhead …

… the third thing you have to do is write an aries-style write-ahead log, like i talked about in the introduction …

… and that means every time i do an update, i have to assemble the before image and the after image of the record, write the data into the log, and then force the log to disk …

… that's about another quarter of the overhead …

… the last one is a little bit surprising …

… and you probably haven't thought about it because the elephants' products are all multi-threaded …

… and the whole idea long ago behind multi-threading was, if i had to do a disk read, well, then there was nothing else to do …

… i wanted to switch to run something else …

… to run something else, i needed some more threads …

… so all the major database systems are multi-threaded …

… and long ago, they were running on one cpu …

… so everything was cool …

… these days, you've got 8, 16, 32 cpus that are sharing main memory …

… you got a lot of threads …

… so you're running a lot of parallel transactions …

… what that means is that whenever you touch shared data inside the data manager, you have to latch it to make sure that it doesn't get corrupted …

… so when i'm setting a lock in the lock table, i got to make sure nobody else is changing the lock table while i'm changing it …

… so what happens in multi-threaded worlds is that everybody starts setting latches, and you end up having to wait until the previous guy gets out of your way …

… and the latching overhead is about a quarter of the overhead …

… so there are four big slices of the pie …

… and the useful work is under 10% …
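
to make the latching slice concrete, here is a small python sketch in which a toy lock table is the shared structure and a mutex plays the role of the latch that every thread queues on; the names and numbers are illustrative only, not any engine's internals:

 # illustrative latching sketch: every touch of a shared structure sits behind a mutex
 import threading
 
 lock_table = {}                          # record id -> lock mode (shared data structure)
 lock_table_latch = threading.Lock()      # the latch protecting it
 
 def worker(n_ops):
     for i in range(n_ops):
         with lock_table_latch:           # every lock-table touch queues on the latch
             lock_table[i % 100] = "write"
 
 threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(8)]
 for t in threads:
     t.start()
 for t in threads:
     t.join()
 print(len(lock_table))                   # 100 entries, but 80,000 latch acquisitions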

… so now if you want to go fast, it doesn't take a rocket scientist to figure out two major things …

… the first one is that if i have some fancy improvement on b-trees and i can do indexing better than the next guy, well, that only impacts the useful work piece of the pie …

… and that's way under 10% …

… so better b-trees will improve overall performance by 1% or 2% …

… so there's no sense worrying about useful work …

… you've got to get rid of overhead …

… now to get rid of overhead, if you get rid of the buffer pool, well, that gets rid of one slice of the pie …

… but there are still three more left …

… so for example timesten is a main memory database system that's been around for about 20 years …

… it stores data in main memory …

… so there is no buffer pool …

… that piece of the pie is gone …

… but the other three pieces are still there …

… they do record level locking …

… so that overhead is there …

… they do aries-style logging …

… that piece of the pie is still there …

… and they're multi-threaded, so the latching overhead is still there …

… so you think about it for a minute …

… and you say, well, that'll go marginally faster because you're only getting rid of one of the four slices of pie …

… you want to go really fast, you got to get rid of all four …

… so how do you do this? well, first of all, you got to get rid of the latch overhead …

… now, latches are everywhere in all the major vendors' products …

… so either they have to rewrite all their systems to get rid of shared data structures somehow …

… or you've got to get rid of multi-threading which, again, would be major surgery to the elephants' products …

… or you somehow got to get rid of the queuing delays that latching entails …

… one way or another, you're toast unless you can get rid of the latching overhead …

… so current systems often do this by being single-threaded …

… so a system i was involved in that was written here, the prototype h-store, which turned into the commercial product voltdb, is single-threaded …

… so it gets rid of the latching overhead by not being multi-threaded …

… so one way or another, you've got to get rid of the multi-threading overhead …

… you're toast unless you're a main memory database system …

… so that much is obvious …

… any disk-based system, you're going to die …

… so i'm often asked at this point, what happens if my data doesn't fit in main memory? well, there's a great paper that i, of course, was involved in that will appear in next year's vldb conference …

… it's called "anti-caching," and what it says is: run a main memory database system and then archive the cold tuples, in main memory format, to a slower storage device …

… and anti-caching is way, way faster than caching …

… so you can extend past main memory …

… but you've got to run a main memory-oriented system …
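
a toy python sketch of the anti-caching idea as described-- main memory is the primary store, and cold tuples are evicted in their in-memory format to slower storage and faulted back in on access; this illustrates the concept only and is not the h-store/anti-caching implementation:

 # toy anti-caching table (illustrative only)
 from collections import OrderedDict
 import os, pickle, tempfile
 
 class AntiCachingTable:
     def __init__(self, hot_capacity=1000):
         self.hot = OrderedDict()            # key -> tuple, LRU order; this is the database
         self.cold_dir = tempfile.mkdtemp()  # slower storage for evicted cold tuples
         self.hot_capacity = hot_capacity
 
     def put(self, key, tup):
         self.hot[key] = tup
         self.hot.move_to_end(key)
         if len(self.hot) > self.hot_capacity:
             cold_key, cold_tup = self.hot.popitem(last=False)   # evict the coldest tuple
             with open(os.path.join(self.cold_dir, str(cold_key)), "wb") as f:
                 pickle.dump(cold_tup, f)                        # archived in its memory format
 
     def get(self, key):
         if key in self.hot:
             self.hot.move_to_end(key)
             return self.hot[key]
         path = os.path.join(self.cold_dir, str(key))            # fault the cold tuple back in
         with open(path, "rb") as f:
             tup = pickle.load(f)
         os.remove(path)
         self.put(key, tup)
         return tup
 
 t = AntiCachingTable(hot_capacity=2)
 for k in range(5):
     t.put(k, {"id": k})
 print(t.get(0))    # faults the cold tuple back into the hot store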

concurrency control

… well, you're toast, if you do dynamic locking, which is what the elephants all do …

… so what do the new systems do? …

… well, a thing called multiversion concurrency control is popular …

… hekaton is a system coming out soon from microsoft …

… timestamp order is popular …

… that's what voltdb and h-store did …

… and there's some other interesting ideas from the research community that are combinations of very lightweight locking and timestamp order …

… and there is no system written for the oltp market in the last decade that's using traditional dynamic record level locking …

… it's just too slow …
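
a toy python sketch of timestamp-order concurrency control as a contrast to dynamic locking: every transaction gets a monotonically increasing timestamp, and an operation that arrives out of timestamp order aborts instead of waiting on a lock; the record structure and function names are illustrative, not voltdb or h-store code:

 # toy timestamp ordering (illustrative only)
 import itertools
 
 _ts_counter = itertools.count(1)
 
 class Record:
     def __init__(self, value):
         self.value = value
         self.read_ts = 0    # timestamp of the youngest transaction that read it
         self.write_ts = 0   # timestamp of the youngest transaction that wrote it
 
 def begin():
     return next(_ts_counter)   # each transaction gets a monotonically increasing timestamp
 
 def read(txn_ts, rec):
     if txn_ts < rec.write_ts:          # a younger transaction already wrote it: too late
         raise Exception("abort: read arrived out of timestamp order")
     rec.read_ts = max(rec.read_ts, txn_ts)
     return rec.value
 
 def write(txn_ts, rec, value):
     if txn_ts < rec.read_ts or txn_ts < rec.write_ts:   # out of order: abort, never block
         raise Exception("abort: write arrived out of timestamp order")
     rec.value, rec.write_ts = value, txn_ts
 
 acct = Record(100)
 t1, t2 = begin(), begin()
 write(t2, acct, 90)        # the younger transaction writes first...
 try:
     read(t1, acct)         # ...so the older one must abort rather than wait
 except Exception as e:
     print(e)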

… now, what about logging? …

… well, how do i do crash recovery if i don't log something? well, the answer is yeah, that's all true …

… but you've got to make the run time overhead go down, and down by a lot …

… so again, we wrote a paper, by nirmesh malviya, that's going to appear in icde 2014 …

… and basically, he suggests just do command logging …

… in other words, if i pull up to mcdonald's and i say, i'll have a number six with fries and a diet coke, well, you can log six fries, diet coke, rather than log all the data changes that that transaction actually turned into …

… so do logical logging at the highest level possible …

… and again, it's wildly faster than doing data logging …
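
a small python sketch contrasting the two logging styles-- per-tuple before/after images versus logging just the command and its arguments; the record formats and names are made up for illustration and are not aries or voltdb formats:

 # data logging versus command logging (illustrative record formats)
 import json
 
 data_log, command_log = [], []
 
 def data_log_update(table, key, before, after):
     # aries-style: one record per changed tuple, with before and after images
     data_log.append(json.dumps({"table": table, "key": key,
                                 "before": before, "after": after}))
 
 def command_log_txn(proc_name, args):
     # command logging: record only the transaction (stored procedure) and its arguments;
     # recovery replays the procedures in order instead of reapplying tuple images
     command_log.append(json.dumps({"proc": proc_name, "args": args}))
 
 # a "place order" transaction that touches three tables
 command_log_txn("place_order", {"customer": 42, "item": "number six",
                                 "sides": ["fries", "diet coke"]})
 data_log_update("orders", 9001, None, {"customer": 42, "item": "number six"})
 data_log_update("inventory", "number six", {"qty": 10}, {"qty": 9})
 data_log_update("billing", 42, {"balance": 0.0}, {"balance": 8.99})
 
 print(len(command_log), "command record vs", len(data_log), "data records")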

… so that's augmented by the current reality, which is no one is willing to take downtime …

… back when i actually had some non-gray hair, systems would crash, and you'd try and get back up as quickly as possible …

… that's not what people are willing to do anymore …

… they all are willing to pay for the extra hardware and failover to a backup …

… and so tandem pioneered this concept in the late '80s …

… and so failover to a replica is now absolutely required …

… so most of the time you just failover to a replica and keep going …

… so the log is rarely invoked these days …

… it's only invoked if you get disasters …

… and so you want to make the overhead of logging go way, way, way down …

… so what i've said is there's an old way and there's a new way …

… so the old way, which is the elephants code …

… the data is based on disk …

… the new way, it's in main memory …

… the elephants cache disk blocks in main memory; my claim is that you want to archive cold tuples on disk instead, and that works much better …

… you don't want to do data logging aries style, you want to do command logging …

… you fail over to a backup and keep going, and only occasionally do you recover from a log …

… you implement multiversion concurrency control or timestamp order …

… something other than dynamic locking …

… and you run a single-threaded system, or some other mechanism to avoid critical sections and shared data structures; multi-threading is a disastrous performance problem …

… so there are some new-way systems …

… hekaton that i mentioned a few minutes ago coming to you from microsoft …

… hana is an oltp system, a little further out, being promised by sap …

… and there's a bunch of startups i mentioned, voltdb …

… there's also memsql, sqlfire, clustrix, a few others …

… these are all new-way systems that implement oltp using the new-way concept …

… and they are a factor of 100 or so faster than the elephants …

… so new is way better than old …

… and the elephants aren't going to be able to easily change their implementations to implement all these totally different ideas …

… so again, they're up against the innovator's dilemma …

… so if you don't care about performance, if you want to run 50 transactions a second, run them on your wrist watch …

… anything will run 50 transactions a second …

… if you care about performance, then one of the new vendors is going to be a factor of 100 faster than what you're currently running …

… so if you don't care about performance, do whatever you're currently doing …

… otherwise, a changeover to better technology is in your future …

… and that's the end of what i have to say on transaction processing …

… thank you very much …

nosql systems

… ok, so far in this module, we've talked about data warehouses and about transaction processing, and i explained how the elephants' products are no good at either of those markets …

… so now, i'm going to talk about everything else …

… by now you get the drift, the elephants are no good at this stuff either …

… so everything else is a potpourri of interesting stuff …

… there's something called nosql systems that i'll talk about …

… something called array databases that i'll talk about …

… something called graph databases that i'll talk about …

… and then there's a thing called hadoop …

… and because everybody somehow thinks hadoop is synonymous with big data, i'm going to tell you why it's not …

… what i'm going to do is, i'm going to talk about the first three things in this segment and then the next segment we'll bash hadoop …

… so let's start with nosql …

… so there are 75 or so vendors, of which probably the ones you're most likely to have heard of are a thing called mongodb and another thing called cassandra …

… and then there are at least 73 others …

… so there are three concepts that all of these nosql guys implement …

… the first one is, as the name on the tin suggests, give up on sql …

… so don't use sql …

… now the basic message here is that sql is too slow …

… and so what you want to do is code in lower level utterances, which is what sql gets compiled into …

… so basically, get rid of the whole sql layer and code the algorithm yourself …

… now, it turns out there's been 40 years or so of database research …

… and in the '70s there was a big debate between low-level codasyl-style, record-at-a-time systems and high-level sql systems …

… and over 40 years, codasyl has completely gone away and sql systems are the answer …

… so anybody who says bet against the compiler is a lunatic …

… so the standard wisdom, back when i had mostly black hair-- the version of this argument that was prevalent then-- was, if you want to go fast, you've got to code in ibm assembler …

… if you can't allocate your own registers, you're never going to go fast …

… now you guys today would never advocate coding in assembler …

… because the compilers got good enough that they're way better than you at doing all of this low-level stuff …

… so sql is compiled at compile time into the same kinds of utterances that the nosql guys advocate …

… never bet against the compiler …

… never ever bet against the compiler …

… high-level languages are good …

… the database community has figured that out over 40 years or so …

… so why don't the nosql guys realize this? …

… well, most of them don't know much about databases, so they are reinventing the wheel, as we'll see in a minute …

… but anyway, give up on sql …

… in my opinion, high-level languages are really good …

… do not ever get forced into writing your own low-level record at a time code …
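
a small python sketch of the same point-- the same question asked declaratively in sql versus hand-coded record at a time; python's bundled sqlite3 is used here purely as a stand-in engine, and the table and data are made up:

 # declarative sql versus hand-written record-at-a-time code
 import sqlite3
 
 conn = sqlite3.connect(":memory:")
 conn.execute("create table sales (dept text, amount real)")
 conn.executemany("insert into sales values (?, ?)",
                  [("toys", 10.0), ("toys", 5.0), ("shoes", 7.5)])
 
 # high-level: say what you want, let the engine pick the plan
 declarative = conn.execute(
     "select dept, sum(amount) from sales group by dept").fetchall()
 
 # low-level: fetch a record at a time and code the aggregation yourself
 totals = {}
 for dept, amount in conn.execute("select dept, amount from sales"):
     totals[dept] = totals.get(dept, 0.0) + amount
 
 print(declarative)              # e.g. [('shoes', 7.5), ('toys', 15.0)]
 print(sorted(totals.items()))   # same answer, more code to maintain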

… second thing the nosql guys advocate is giving up on acid …

… acid is too expensive …

… well, we saw in the last module that acid is expensive when it's implemented by the elephants in their legacy systems …

… so that's true, what they're saying is true, but as we saw in the last segment, there are all kinds of new ideas on how to make acid go fast …

… and so when the nosql guys complain about acid, they're complaining about 35-year-old implementations, not about modern stuff …

… so giving up on acid because you think it's too slow is just a statement that you're not paying attention to the marketplace and modern systems these days …

… the second point is that if you need acid, do not write it yourself in user code …

… your hair will be on fire if you try and get acid straightened out with user-level code …

… it's complicated, messy stuff …

… let the database experts do it …

… so if you are guaranteed that you won't need acid-- you won't need it now, and you won't need it three years from now when your company, which used to be in the plumbing supplies business, suddenly decides to buy beauty salons …

… so if you are guaranteed that you won't need it now, and you won't need in the future, then fine give up on acid …

… otherwise, you have a fate worse than death on your hands, which is you're going to have to implement acid in user code …

… so the nosql guys advocate giving up on acid …

… number one, you don't need to if you run one of the modern db systems …

… and number two, you're making a bet that you won't need it in the future, and if you bet wrong, i don't want your job …

… the other thing, the third tenet of the nosql guys, is schema later …

… which is they complain that the relational database vendors require you to say create table …

… you've got to think about your data upfront, decide what the table structure is, and before you can load a record, you've got to decide what your data looks like …

… so schema later says just start loading data, and you'll figure it out eventually …

… i'll give both sides of the argument …

… first one is, i'm not quite sure what my data's going to look like …

… i'm just going to start loading, and i'll figure it out later …

… the second point of view is to say, if you are supporting a production enterprise application …

… and you don't think about your data upfront, you're going to dig yourself into a horrible pit and a modicum of thinking about it upfront will save you a disaster later …

… so depending on your point of view, schema later is either a good idea or a terrible idea …

… and that's kind of all i'm going to say …

… i come from the camp that says, think about your data upfront, and you'll be way better off downstream …

… but not everyone agrees with this …

… so now let's look at the nosql guys in aggregate …

… and i would say cassandra and mongo are probably the most prominent vendors, so let's look at what they're currently doing …

… well, they're both inventing, or have invented, high-level query languages, which look like sql …

… so they are moving quickly toward sql …

… secondly, let's talk about acid …

… well one of the biggest proponents of no acid is a guy named jeff dean from google …

… so he's largely responsible for mapreduce and bigtable and most of google's database offerings …

… and he just wrote a paper on a system called spanner and basically said, i was misguided …

… i thought acid wasn't necessary, but, boy, it's certainly a good idea, and google's most recent system has it in it …

… so the world is moving toward acid as people realize that acid is an awfully good idea …

… so in my opinion, nosql used to mean "no sql"; then the nosql guys said, well, what it really means is "not only sql" …

… we want to do this stuff-- maybe sql's good for some things, but not for everything …

… and in my opinion, nosql currently means not yet sql …

… so i think there's going to be a convergence …

… which is the nosql guys are going to move strongly toward the transaction processing world, which is sql, or things that look like sql, along with acid, and they're going to sort of look more and more like conventional database technology off into the future …

… so if you're running nosql, mostly what people seem to use nosql for is low-end applications that aren't particularly mission critical, sort of webby things, if you want to store some web data …

… if you've got to store everybody's password, that's one table that you always look up by row …

… so if you have some low-end simple applications, i'm not against nosql …

… most of the systems are open-source …

… they're free or very cheap …

… you can get them up quickly …

… they're easy to learn …

… so there's certainly a market here, especially at the low end for these guys …

… and, of course, nosql guys don't particularly look like sql engines …

… their implementations are moving toward the implementations from the elephants …

… but there's certainly a separate market here, so it remains to be seen how big this market will be …

complex analytics

… so let's move on to a different topic …

… so i want to talk about complex analytics …

… what that means is the world of the quants …

… any time you hear machine learning, predictive models, recommendation engines, estimators, dot dot dot, that usually goes under the rubric 'data mining.' …

… and your quants and your data scientists are the people who do this stuff …

… so what is this stuff all about? …

… well, by and large, all of this consists of computations defined on arrays as collections of linear algebra operations …

… and they don't look at all like sql …

… and often, you've got to do them on big amounts of data at scale …

… so you want to do this kind of stuff on big data …

… so let me just give you a really quick, easy to understand example …

… let's look at-- suppose you're a quant on wall street, and you're trying to write an electronic trading engine …

… so what you do is you've got 20 years' worth of data on the closing price of every stock …

… and you say, well, if i'm going to write a trading engine …

… i've got to find out what patterns there are in this sea of data …

… so let's assume that i start off by saying, well, i don't know anything yet …

… so let me pick two stocks-- say oracle and ibm-- and say, i wonder if they're correlated …

… so i can compute the covariance between the closing price of those two stocks …

… and all the orange stuff at the bottom is covariance …

… and if you don't know what that means, have a quant explain it to you …

… anyway, that's what it is …

… but that isn't really what you want to do …

… you really want to do it for all pairs of stocks …

… i don't know what patterns relate to what patterns, so i want to do some data mining …

… so i want the covariance of all pairs of stock …

… so what i have is a big array: there are on the order of 15,000 stocks-- that's how many publicly traded stocks there are in the united states …

… you multiply it by 4 if you want them everywhere in the world …

… and if i've got 20 years worth of data, that's about 4,000 trading days …

… so there's a row for each of the 15,000 stocks, with 4,000 closing prices …

… so i have a 15,000 by 4,000 matrix …

… it's a matrix …

… and what is covariance on this table? …

… well, to a first approximation, ignoring constants and ignoring subtracting off the mean, it's that stock matrix, matrix-multiplied by its transpose …

… so this is a matrix calculation on matrices …

… and it's a program that involves matrix multiplication and transpose …
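
a scaled-down numpy sketch of that all-pairs covariance computation (150 stocks by 400 days instead of 15,000 by 4,000, with random prices as stand-in data), including the mean subtraction that is being waved away above:

 # all-pairs covariance of closing prices as a matrix product (numpy sketch)
 import numpy as np
 
 n_stocks, n_days = 150, 400            # the real case is ~15,000 stocks x ~4,000 trading days
 prices = np.random.rand(n_stocks, n_days) * 100   # one row of closing prices per stock
 
 centered = prices - prices.mean(axis=1, keepdims=True)   # subtract each stock's mean
 cov = centered @ centered.T / (n_days - 1)               # stock-by-stock covariance matrix
 
 assert cov.shape == (n_stocks, n_stocks)
 assert np.allclose(cov, np.cov(prices))   # sanity check against numpy's built-in estimator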

… so you want to do this at scale …

… you've got this orange stuff to do, and do fast …

… what do you want a database system to do? …

… well, you want it to be able to do these kind of calculations …

… you want them to scale to lots of cores, many nodes, out-of-memory data-- run it on big data …

… but you want to be able to mix this with data management …

… because i want to leave out outliers …

… i might only want those securities with a market cap over $10 billion, or only those securities with a headquarters in new england …

… so i want to be able to mix and match this kind of stuff with data management operations, and make it work at scale …

… so that's the purpose of array databases …

… if you fundamentally have to do array calculations-- linear algebra …

… well, if you're running a table system, then you've got to convert back and forth between tables and arrays, because tables are not arrays …

… so the general idea is to say, well, if i've got to do a lot of this kind of stuff, let's have an array database system …

… let's not do tables …

… let's build arrays as the native storage object …

… and to do data management, i need sql …

… that's what everybody wants …

… but i want it defined on arrays so i can do joins and filters …

… so you've got systems that give you array sql with all kinds of quant-oriented linear algebra built in …

… and if you don't like what's there, they all give you user-defined functions so you can write your own fancy analytics …

… so scidb is an example of this kind of stuff …

… and this world is going to get traction as you guys move from standard sql business intelligence to this more complicated world of data scientists …

… so this will get traction if and when-- well, as-- you move to more complicated analytics …

… and the systems don't look at all like the traditional wisdom …

… they're built on arrays …

… they're not built on tables …

… they don't look anything like what the elephants sell you …

… so there may or may not be a sizable market for array database systems, depending on how fast and if people move to complex analytics …

… and this doesn't look like the traditional wisdom at all …

graph databases

… the last thing i want to talk about is graph databases …

… and there's lots and lots of interest in this topic in the research community …

… and there's starting to be some commercial implementations …

… so the whole idea is that if you look at facebook's data, you know, it's i like sam madden, and both of us happen to be followers of madonna …

… facebook is one big, gigantic graph …

… and when facebook does calculations, like the average distance between me and sam madden is such and such, it's the distance in that graph …

… so twitter is a graph …

… facebook is a graph …

… there's a lot of graph data in the science world …

… so graphs are, conceivably, a very interesting data structure …

… so the first system that i'll just mention is a system called neo4j …

… that stores native graph data and is focused on being able to update a graph quickly …

… so you want to be able to add nodes and edges to a graph quickly …

… the other thing that you often do with graphs is you compute analytics …

… you know, like the shortest path between me and barack obama is whatever it is, but what's the distance of separation between us …

… so shortest path, minimum cut set, that kind of stuff-- there are a dozen or so standard analytics that people want to run …

… and you could build a system that focused on doing analytics on graphs fast as opposed to doing updates on graphs fast …

… in this analytics market, it's interesting to note that you can recast a graph as a sparse array …

… meaning that if there's an arc from i to j, all that says is that in the ij-th cell of an array, there's something there as opposed to nothing …

… so array database systems can simulate graphs at, perhaps, very high performance …

… relational database systems can also simulate graphs …
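
a small python sketch of the graph-as-sparse-array recasting: an arc from i to j becomes a nonzero in cell (i, j), and a standard analytic like degrees of separation falls out of array operations; scipy's sparse matrices stand in here for an array database, and the edge data is made up:

 # a graph recast as a sparse array (illustrative only)
 import numpy as np
 from scipy.sparse import csr_matrix
 from scipy.sparse.csgraph import shortest_path
 
 edges = [(0, 1), (1, 2), (2, 3), (0, 4)]          # "follows" arcs between 5 people
 rows, cols = zip(*edges)
 adj = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(5, 5))
 
 # degrees of separation = shortest path lengths in the (undirected) graph
 dist = shortest_path(adj, directed=False, unweighted=True)
 print(dist[0, 3])   # 3 hops from person 0 to person 3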

… so the jury is still out whether there is an interesting system to be built focused on analytics …

… it's real clear that if you want to do oltp, things like neo4j are a pretty good idea …

… so basically, this segment has talked about everything else …

… it's talked about nosql …

… it's talked about array databases, and it's talked about graph databases …

… the point of this segment has been to say none of this stuff looks like the traditional elephant systems …

… it all looks very different …

… and how much traction that these various systems are going to get remains to be seen, but there are a lot of interesting ideas addressing various kinds of data management problems …

hadoop

… and now, hadoop …

… and hadoop deserves its own segment, because a bunch of marketing people have somehow decided it's synonymous with big data …

… so what do i mean by hadoop? …

… because in the marketing world, it means various things to various people …

… so hadoop was written originally by yahoo …

… it's an open source version of google's program called mapreduce …

… so that's what it is …

… it gives you, as you might surmise, two operations-- map and reduce …

… and map is basically things that look like filters, things that look like transformations …

… and reduce is basically what you would call a roll up …

… so hadoop is very good at what i'll call embarrassingly parallel operations …

… so let's say i have a million documents …

… and i want to find all the documents that have a strawberry followed within four words by a banana …

… well, i can divide up the million documents over as many nodes in a computer system as i want …

… and then a map operation, i can write it in any scripting language that i want …

… so you could write it in java …

… you could write it in python, whatever …

… and then send that python script as a map operation to all the various nodes …

… each of them goes through its documents in parallel, picking out the qualifying fruit records …

… and then they roll up at the end to assemble the result, which is a single reduce operation …

… and so document search is very easy to parallelize …

… it's embarrassingly parallel …

… hadoop is very, very good at this kind of stuff …
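
a toy python sketch of that fruit search as a map and a reduce, with the multiprocessing module standing in for the hadoop cluster; the document set and function names are made up for illustration:

 # the fruit-search example as a map and a reduce (illustrative only)
 from multiprocessing import Pool
 
 def map_doc(doc):
     """emit (doc_id, True) if 'strawberry' is followed within four words by 'banana'"""
     doc_id, text = doc
     words = text.lower().split()
     for i, w in enumerate(words):
         if w == "strawberry" and "banana" in words[i + 1:i + 5]:
             return (doc_id, True)
     return (doc_id, False)
 
 def reduce_matches(mapped):
     """roll up: collect the ids of the qualifying documents"""
     return [doc_id for doc_id, hit in mapped if hit]
 
 docs = [(1, "i ate a strawberry and then a banana"),
         (2, "banana first, strawberry much much much later"),
         (3, "no fruit here at all")]
 
 if __name__ == "__main__":
     with Pool(2) as pool:                  # each worker maps its share of the documents
         mapped = pool.map(map_doc, docs)
     print(reduce_matches(mapped))          # [1]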

… now, hadoop has morphed over the years into a complete stack …

… so this thing called mapreduce, which was originally called hadoop, is in the middle …

… but what people have found is that lots of times, they want to code queries in-- guess what?-- sql …

… so there's a layer on top of hadoop …

… hive is an implementation written by facebook …

… pig was an implementation written by yahoo …

… think of them as sql look-alikes …

… so that's at the top layer …

… in the middle, you process sql by compiling it into these hadoop mapreduce operations …

… and mapreduce runs on top of hdfs, which is a file system …

… so file system at the bottom, and mapreduce in the middle and sql at the top …

… and this runs across any number of nodes …

… so this stack will scale to as many parallel nodes as you want to set up …

… and in fact, facebook is running a gigantic hadoop stack over, i think, 2,500 nodes …

… so people are setting up very big hadoop stacks across many, many systems …

… so what is this stack good for? well, it's very good at embarrassingly parallel computations like document search …

… we already talked about that …

… the second thing you might say it's good at is, what about warehouse style queries? …

… we talked a couple segments ago about walmart and provisioning stores for the next hurricanes …

… now if you run that sort of query, well, you write that in hive …

… so from now on, i'm going to call hive sql …

… so you write that in sql …

… so when you run that sql on the hadoop stack and compare the performance with a column store, you're going to run a factor of 100 slower …

… so you're going to run a factor of 100 slower than current high-performance warehouse technology …

… so you can have the answer in an hour on hadoop, or you can have the answer in a minute on your favorite data warehouse …

… which one would you rather have? or put differently, you can run it on 2,500 nodes in hadoop, or you can run it on 2.5 nodes in a parallel database system …

… do you want to run 2,500 or 2.5 nodes? …

… so if you're doing sql aggregates, the hadoop stack is disastrous …

… let's suppose you want to do the covariance calculation that we talked about in the previous segment …

… so you want to do linear algebra …

… you're going to be a factor of 100 worse than an array database system …

… so again, have the answer in an hour, have the answer in a minute, your choice …

… so complex analytics, hadoop is a disaster …

… these are usually coded in a thing called mahout, which runs on top of hadoop …

… it's a performance disaster …

… another thing you might be doing is scientific kinds of codes …

… so if you want to do computational fluid dynamics simulations, well, there are lots of scientists who do that …

… they use so-called mpi-based systems that are, again, going to beat hadoop by a factor of 100 …

… so the net net is if you've got an embarrassingly parallel calculation, by all means, use the hadoop stack …

… otherwise, you can beat hadoop by a factor of 100 by something else …

hadoop usage

… so let's take a quick look at hadoop usage …

… while facebook is a very big user of hadoop, more than 95% of their traffic is in hive, meaning sql, meaning aggregates, meaning a factor of 100 slower than a parallel database system …

… i have a colleague at lincoln labs, jeremy kepner …

… jeremy has estimated that at lincoln labs, which has 4,000 or so scientists, 95% of the usage is not embarrassingly parallel …

… so around the world, it looks like 5% or less is embarrassingly parallel, and 95% would be way better served by something else …

… the hadoop proponents aren't stupid …

… so cloudera and hortonworks-- hortonworks is the yahoo guys moving out to their own company-- and facebook are huge proponents of the hadoop stack …

… they are all doing exactly the same thing …

… so cloudera has a system called impala …

… impala is an execution engine that implements hive, ie sql …

… and it has nothing to do with a mapreduce layer …

… it's a complete execution engine built on top of the file system …

… and it looks exactly like modern data warehouse systems …

… so it looks, to a first approximation, a lot like vertica or a lot like hana …

… so hortonworks and facebook are both doing exactly the same thing, which is they're defining and building an execution engine that processes sql without ever using this mapreduce layer …

… so these guys are effectively moving to an execution engine that's hive-oriented, ie data warehouse oriented, and they're going to compete in the data warehouse market …

… and guess what? all the data warehouse vendors support hive …

… so if you want to process hive, you will have a lot of options …

… you'll have column stores from the data warehouse guys …

… and you'll have stacks from these guys …

… so life will be interesting …

… but in any case, it's not going to involve mapreduce …

… it's going to involve processing sql in data warehouse-like worlds, because that's 95% plus of the market …

… so the most likely future is there is a small market, say 5%, for embarrassingly parallel applications …

… the current hadoop stack is pretty good at that, 5% of everything …

… for the other 95%, there's a much bigger market for a hive sql framework …

… and the data warehouse guys are focused on this …

… the hadoop guys like cloudera and hortonworks are focused on this …

… and there will be a great collision as they fight for market share …

… in my opinion, it's unlikely that hdfs is going to survive …

… because it is also horribly inefficient …

… so the warehouse guys currently will support hdfs if you want to run that file system …

… that's the go-slow option compared to what they're currently doing …

… so hdfs has a huge performance penalty …

… so the whole stack is rotten …

… the mapreduce layer is rotten …

… it's going away …

… hdfs is rotten …

… it may or may not go away …

… this whole market is in a considerable amount of flux …

… and i guess you have to hold onto your seat belt …

… and i'll have some conclusions for you in this world and in some other worlds in the conclusion section …

… thank you very much …

market forecast

… so i now want to summarize with some closing thoughts …

… the first one is data warehouses will be a column store market …

… there is no question in my mind that that's going to happen …

… it may take another decade for it to completely unfold, because database systems are very sticky …

… it takes a while to change …

… so if you are not running a column store now, i guarantee you, you will be sooner or later …

… so what do you have to do? …

… well, ask whoever your current vendor is what his column store plans are …

… if he doesn't have any, then you better switch, or better plan on switching …

… and that will be a costly and time-consuming operation …

… but you're going to have to do it because there's a factor of 100 to be gained …

… and you're going to see that that's worth it sooner or later …

… so find out if your vendor is moving to a column store fast enough that you can just wait and move with him …

… or you're going to have to switch …

… oltp is going to be a main memory database market, with anti-caching if your data is too big …

… if you're not running a main memory database system now, you will be in the future …

… and ask your vendor if main memory is in his plans …

… if he's going to implement a main memory system, and it's not compatible with what you currently have, then you have a costly conversion on your hands …

… at which point, you might as well look at lots of other technologies …

… so if you're not running one now, you're going to have to switch or switch with your vendor as he switches, if he switches …

… array database systems and graph database systems may or may not gain traction …

… at the very least, you should understand what they're good for, what they're not good for, and whether you ought to consider one …

… nosql is going to be popular for low-end applications, especially things that look like document management, web stuff, and especially places where you really do want schema later and you think that schema later is a good idea …

… and they're going to be acid-less for a while …

… so don't ever consider nosql in an area where acid might be required …

… otherwise, your hair will be on fire …

… the hadoop stack is going to morph into something that isn't going to look at all like its current stack …

… so hold onto your seat belt …

… at the very, very least, look at your proposed and current hadoop applications …

… see if they're embarrassingly parallel …

… if they are embarrassingly parallel, they'll continue to run on the current stack without a disaster …

… otherwise, what's going to happen as you scale up, you're going to hit a wall …

… and you're going to have to switch to something else …

… and there will be lots of products there to pick up the pieces …

… so you're in deep doo doo if you are contemplating non-embarrassingly parallel applications on the hadoop stack …

… you're guaranteed to have to switch as you try and scale …

… think about it now …

… don't think about it later …

… the current products from the relational elephants are only going to survive in low-performance applications …

… if you don't care about performance, run db2, run oracle, run sql server …

… they're only going to survive where you don't care about performance …

… in other cases, something else is going to run circles around them, and you're going to switch …

summary

… the curse is: may you live in interesting times …

… in my opinion, the relational database world was pretty dead in the '90s …

… one size fits all relational databases were the answer …

… there was one kind of system …

… everybody used it …

… in the last decade, there's been lots and lots of new database ideas and products …

… nosql has come into existence …

… new tp databases have come into existence …

… graph databases, array databases, hadoop has come into existence …

… lots of new database ideas …

… the world is very vibrant, and the database world is very vibrant …

… one thing that's going to happen for sure is your business intelligence folks will keep putting more and more and more stuff in their data warehouses …

… my favorite example is a decade ago or so, i got to visit a beer company in milwaukee …

… and they had a traditional data warehouse of sales of beer by brand, by time period, by distributor and all that sort of stuff …

… so this was in november of a year where there was an el nino event predicted by the weather guys …

… so el nino is equatorial upwelling of pacific warm water …

… and it screws up the weather in the united states in the winter …

… and so it becomes wetter than normal on the pacific coast and warmer than normal in new england …

… so i said, well, el nino's coming …

… the weather guys say so …

… are beer sales correlated with either temperature or precipitation? and the business intelligence guys scratched their head and said, boy, that's a good question …

… i wish i could answer that question …

… but of course, weather data wasn't in the warehouse …

… so this is the kind of stuff that just generates pressure to add more and more and more stuff to warehouses …

… so they will get bigger and bigger and bigger …

… there are probably two dozen petabyte size data warehouses, that i'm aware of right now, running in production …

… and they're just going to get bigger …

… there's going to be a sea change from simple analytics to complex analytics …

… why is that going to happen? …

… well, right now, you have your business intelligence guys running sql queries to say, what's my average sale by department, by store? …

… so that'll give you today's data …

… you'd much rather have your data scientist build a predictive model to predict next month's sales by store, by department …

… and that's complex analytics …

… so you get much more sophisticated reasoning about what's happening …

… but you've got to move to complex analytics …

… you've got to move to the world of the data scientists …

… you've got to move away from your business intelligence guys …

… now, i got to make a call on a very large insurance company …

… and i said exactly this, which is, you guys are going to put a sensor in everybody's car to record how and where and when they drive …

… and you're going to build complex risk models based on all these variables …

… and you might as well get going …

… and they said, yeah, we should do that …

… but our business analysts aren't up to this level of sophistication …

… and they're really hard to hire …

… so the sea change is going to happen …

… it's going to be gated on you being able to hire competent data analysts …

… and that will take however long it takes …

… but you better get going, because it's in your future …

… the internet of things is a force to be reckoned with …

… everything on the planet of material significance is going to get sensor tagged …

… i already mentioned your car by your insurance company …

… other things that are going to happen, the mit library would like to sensor tag every book …

… they don't want to do it because undergraduates might walk out without checking them out …

… that's not it at all …

… if a librarian misshelves a book, it's lost forever …

… so they want to be able to find it if a librarian misshelves it …

… parking spaces on the mit campus want to be sensor tagged, so that instead of driving around and around in the parking structures looking for an empty space, you'd just be directed to one that exists …

… marathon runners get sensor tagged …

… all kinds of animals get sensor tagged …

… your kids get sensor tagged if you want …

… so the world is going to get sensor tagged …

… and that's going to generate a data deluge like you haven't seen before …

… so if you think data is a deluge now, just wait for the internet of things to kick in …

… future times are going to be really interesting …

… there are going to be all kinds of changes that you are going to have to cope with …

… and my best advice to you is to hire a really good chief data officer to help you sort all this stuff out …

… in conclusion, the future is going to be really different …

… if you aren't able to move with the times, you're going to die …

… because your data is your most valuable asset …

… and if you can't leverage it, you're going to lose …

… thank you very much …

big data systems


security


multicore scalability


visualization and user interfaces


big data analytics


fast algorithms


data compression


machine learning tools