Mining Massive Data Sets

Long homework is due two weeks from now at midnight, Pacific time and everything. everything that I said in the last lecture applies for this homework. Another important thing is that today, basically right after the class in the scaling of editorial, we will be holding our first of the recitation sessions called Spark to Talia tutorial and Clinic. So this is basically the idea is to to go here. If you have any trouble reading, start installing spark any trouble doing the homework zero. so essentially we will go through homework zero here, show you how to do it and help you debug your installation of spark so that then you are ready to do homework, one and everything else.

So if you have any trouble, this would be a great place to go to seek for help. Yes. Yes, all the recitation sessions will be recorded. Of course, if the session is recorded, it’s kind of useless if you have problems because you cannot ask questions, so plan accordingly. OK, great so now topic for today we’ll be talking. talking about frequent item sets, mining an association rules and the way I’ll do the lecture is our first tell you kind of watch the idea, what’s the goal, and then we’ll develop algorithms that will allow us to do that and the association rule discovery. The idea is that we want to do that. We are a supermarket right and if we had a super market, we want to manage our shelf space right and we will be living in what is called the market basket model, which means that our goal will be to identify sense of items or groups of items that are a group that are bought together by sufficiently many customers. right so we can be whole foods.

In Palo Alto. we can be Amazon dotcom or anything else, where customers can select sets of items consumed by them and leave. You could be even like a music listening website Like, I know, Pandora or Spotify, where people collect items and then go right or items in that case, would be songs that get collected into the plate playlists. and our approach in this kind of market basket analysis cases that we want a process point of sales data collected by let’s say, barcode scanner, or whatever, to find the dependency between our items are products things that users bite, and our goal is to find association rules like rules, for example.

Here is one very famous, you know, if somebody buys diapers and milk, then here she is also likely to buy beer and using this type of knowledge, we can optimize shelf space. We can use this to make recommendations to people and so on and so forth, right and so don’t be surprised if you see diaper smell can be a kind of close together in the store. OK, So how are we going to think about this? So the idea is that we have a large set of items. These are, for example, products or things sold in a supermarket.

These are, let’s say, all the songs in spotting fire, And then we have a large set of baskets, and each basket contains a subset of items. so right when you go in whole foods in your cart and you put it on the in front of the cashier. That’s your that’s your basket. Those are the items in your basket or when you take the songs and put them together in a playlist playlist is your basket right? And then we want to identify association rules. Basically, we want to identify things that say people who bought X, Y and Z tends to buy also V and W right.

This is an association rule right from based on these three things. This tends to follow right. So, for example, up here, I have an example of five different baskets, each one containing a different set of items. And out of this, I could discover association rules that people who buy milk also tend to buy a Coke or people who buy diapers and milk tend to buy beer as well. Ok, so this is what we will want to compute. If this is our input, this is the output given the baskets, I want to figure out what tends to be brought together and given this set of items here, we predict what other items the user is going to buy. so in some sense, you can think of this association rules as predictors.

Given these two things, I predict that lots of people buy beer. Given that you bought Coke, I predict it will sorry, given that you bought milk, I predictable bike Coke as well. Okay, that’s the. That’s essentially what we want to do. The input the output. Now, of course, we’ll talk about. How do we? How do we do this on our data and how to do this in a scalable way? Right? So more generally, what we are doing is we have this general many to many mapping between two kinds of things right, and we are asking for connections between items that that belonged to baskets right. so now here are some examples of what items and baskets might be. For example.

They are abstract, so you can take many different domains and put them into this into this right. So, for example, you could say items and bank baskets can be products and shopping basket. That’s an easy one. For example, items could be words and baskets could be documents right all the words that belong to the same document. So item is a word basket is a collection of words. It’s a document you could think. For example, in in computational biology, bioinformatics, items could be base pairs and baskets could be genes right? I have a gene gene has a set of base pairs or I mean amino acids. If you think of proteins right so items are base pairs.

Baskets are genes right or, for example, you could say I have a patient. A patient is a basket water the items in the patient drugs that the patient is taking right. So now you can start saying, given that you are taking these these drugs, I predict you will. You will take this other drugs as well, Or you could think of again basket being the patient and diseases being the items that are in the patient right like, you know, the cold diseases in my ism is in my basket right now.

That’s why I’m coughing right? But you can start asking Donahoe, given that you have these two diseases. Here is the third disease that is going to follow. so this would be an association rule, so our algorithms that will discuss today can be applied to any abstract setting where you have items and baskets right. As I said, we’ll talk a lot about items being products and baskets being sets of products, and we’ll talk about our kind of running example. use case will be, you are a chain store. You have terabytes of shopping data. The question becomes, how do you discover this association that we’ll see in a sense, you know, People who bought X also bought Y.

This is what Amazon Amazon is famous for. When you were looking at product tax, Amazon tells you this is what other products peoples people buy. Those things are are computed using the algorithms I’ll talk to you about today, So it essentially to the simplest association rule that says, people who bought by X also tend to buy y right. We are but will be able to discover more complex association rules second, set of applications, for example, I was mentioning before that you can think of documents right? You can think of baskets as words or sentences and documents and documents as as basket.

But you can flip things around right? You can say items are the documents and sentences out of the basket right? So now you say I’m taking this sentence and I’m looking in what other documents does it appear right and what is what is interesting because this because now, if you find an association rule, it will basically say, if you have a sentence in this document, then you will find it in these other documents as well across all kinds of sentences, and this will allow you to identify plagiarism right. It will allow you to identify sets of documents that share sentences among each other, and it will allow you to identify documents that are interdependent right so for plagiarism detection.

You could apply this way and, as I was saying, Before you can think of baskets of patients and you can think of items as drugs or side effects or diseases that people have, and now right, you can. This will allow you to detect combinations of drugs that you know lead to particular side effects. What is an interesting distinction here is, for example, that is not. That is interesting is that it also requires you to model the absence of an item right, not not having these disease or not taking this drug, and there are extensions that allow us to model the absence of an item, not just the presence of the item in the in the basket. OK, so that’s kind of the motivation. What are we trying to do today? So here is now the outline of of the rest of the lecture. What I’m going to do is first define.

What do I mean by frequent item sets? What do I mean by association rules? How do I measure the quality of the association rule in terms of support, confidence and interesting nests, and after we are done with that, we’ll talk about the algorithms, how to identify and actually compute these rules and item sense. OK, so that’s the plan. first, style kind of define the concepts, and then we’ll do the algorithms and we’ll talk about. How do we find frequent pairs? And that is already interesting. Then we’ll talk about dynamic programming based a priori algorithm, and then other kind of more clever caching based techniques. But that’s the that’s the plan for today. Um, if you have questions, just raise your hand and stop me and I’ll answer. OK, So don’t.

Don’t wait too long, right first thing I want to define is the notion of item set and a notion of frequent item sets right. So the simplest question is finally set of items that appear together frequently. OK, so item set means a set of items. and when I say frequent item sets, I just means finally sets of items that co appear in baskets frequently. Right then. Why do I am infrequently I defy frequency based on what is called support right.

So I say, if I have a set of items I then the support of that item Set is the number of baskets in which these items appear. OK so and often be expressed as a fraction of total number of baskets. So if I go back to my five transaction life. Five customer data set where I had five basket, and I say what is the support of the item. set beer, comma, bread, I go through this and I say in how many baskets were bread and beer bought together. Here is one basket.

Let’s find a ha. Here’s the second basket, and that’s it right? So the support of this is true because it’s too because there are two baskets in which bread and beer were brought were bought together. OK, so now, what is our goal? Our goal with that? Given some user defined user specified support threshold s, I want to find all sorts of items that appear in, at least as baskets and those those item sets are called frequent item sets right. If you’re if you’re a frequency, your support is above this user defined threshold death, then you are called a frequent item sets okay. So right now I have one parameter s that is specified by the user by the user of the algorithm says, Find me all the item sets that appear in more than ten baskets. Then all those item sets are called frequent item sets. Alright, good. So now we know what an item set it is. we know what the supporters and what does it mean to be frequent item sets? So those are the three important things we just learned. I’ll give you an example. Imagine I have five items milk, Coke, Pepsi, beer and juice not very exciting items.

The user decide if it gives me the support threshold of three baskets. Here are my eight baskets right? So this would be coke, beer and juice. That’s milk and beef for beer. OK, so now I have this. I could say what water frequent item sets that appear in this in this in this dataset. So, for example, a trivial item set of M is frequent. Why is that? Because M appears in more than three baskets, it appears one two, three, four five. So support of M by itself is five. That’s more than three so its frequent. Now I can ask, how about item set M, B is that frequent as well? and it is right because it appears once MV together here and be one B to be five and it’ll be six.

So it means it appears in four baskets, so the support of envy is for some milk and berries. for that’s about my threshold of three. So envy is a frequent item set OK, B? C also appears in four and C J appears in in three, which is above at least the support threshold. So that’s frequent as well. Now, of course you could go and try a triplet like how about, Uh, M P B Is that a frequent item sets you know it appears here, and it doesn’t appear anywhere else. I think right, so that’s not frequent. It doesn’t appear in three baskets, so we are done right, but here are some examples of frequent item sets for this case. So now that I defined what are frequent item sets is now let’s switch gears and talk about association rules.

What is an association Rule association rule is, if then, rule of the forum like this. Which basically means if a basket contains all of the items I want to I K, then it is likely to contain item J as well. Ok, that’s the idea. So if I have all the eye, all the items on the left in the basket, then the basket ovenden the item on the right is likely to appear in the basket as well. Right. And in practice, there are many many rules. appear in the data, and we wanna find kind of significant and interesting ones. so we need to define how how significant or interesting is the rule. So one way to do this is to define a notion of confidence of an association rule, which is basically the probability of J. Given I write is just saying how often that I had. I want to I K in the basket.

Does Jay also appear in the basket and the way you compute the confidence of the rule from set of items? I follows an item J is simply the support of our union, Jay, divided by the support of fire. Right. I’m saying out of all the cases in which items I were in the basket. What fraction of those where items I plus the item Jay in the basket And that ratio is essentially the probability that this is true. So it’s a confidence of the association rule. and of course, the higher the confidence the the more often from this follows that OK. Yes, happy, happy, great question. Yes, why is that not divided by the support of J? It makes no sense to divide by the support of J, because the question is out of how often does the left hand side occurred? How often does the right hand side occur? So it’s from left to right not the other way around right? You are given I want to I K and then out of that fraction of times.

How often does also J appear next to it, but it’s I’m saying this is what I’m given. And then how often does Jay appear next to it? So that’s the right way to do. Yes. How do you choose a support threshold? You choose it once you see the data. If it’s too easy too late, you’ll get too many rules. If it chooses to high, you’ll get too few rules. So the idea is that you try out a few support thresholds and find us as high support threshold as possible. Where you’re still seeing interesting groups there is it’s a user specified parameter really depends on the use case and use case. Yes. Oh, yeah, I offered such a great question. So what happens if J appears? If J is always bought right? Imagine everyone’s by J, then then whatever is on the lefthand side, J always follows.

This is, this is a great question, and you know, here is my answer to Camilla. Okay, So right? Not all high confidence rules are interesting, right, like, as Camilla said, if milk is always bought than anything on the left will imply that milk is going to be bought so that that rule will have high confidence. But it’s a boring group. So how do you get around this? You define this notion of an interestingness of an association rule where you say this is the absolute difference between between the confidence and the fraction of baskets that contain.

Jane writes on. I’m saying this is the confidence we cared about before, and this is the prior probability of that, and if I see significant difference between the two, then that is an interesting rule. But if the two are about the same, that basically means everything is driven by the prior probability of J. Okay, great question, thank you so much. It motivated this slight quite well. OK, great, so this is what I wanted to show, and I’ll give you one example. Here is again our aid baskets from before. If I want to have an association rule from milk and beer to follow scope, this association rule has support of two. Why? Because M, B and C appear in two item sets, It appears in be six, and it appears in one more. be one. OK, so it appears here and here these three items together.

so that’s why its support of two. What’s the confidence confidence is two divided by four. Where does the food come from? and B should appear in four in four baskets. Let’s find the right, so maybe it’s one two three. Ah, four OK, so and be by itself appears in four basket and DC appears in two baskets. So the confidence of this tool is one half then we can also compute the interest this is the confidence minus the the prior probability of Coke. Coke’s should be appearing in five out of five baskets, one two, three, four, five. Okay, so that’s the interesting this. So it means it is relatively accurate association rule, but it’s not very interesting right, so we probably wouldn’t want to look at it. Yes. The trust is absolute value, so can we think of a low confidence and hyper ability giving high interest, because we’d assume that what if one hundred to properly on one of them, it’s a great point, I defined confidence as as the as the absolute value you could even think of.

It has non absolute value, and then it becomes interesting whether the difference is highly positive were highly negative. Right, and that’s interesting as well, because if the difference is highly negative, then it’s almost like a negative association with it says If this than not that right, So both are interesting. Okay, great question. So now I kind of gave you all the all. the all the all. the terms are explained. All the all the things we need to know here is now what we would like to do. so here is now the definition of the problem. The problem is that we’d like to solve in the lecture today is find all association rules with support greater than S and confidence greater than C right, so S and C are given to us by the user by the data analyst. and our goal is to design an algorithm that will click quickly.

Find all association rules that obey these two criteria right? And, as I said, support of an association rule is this is this is the support of the sack of items in the rules of its left side, plus the right side and confidence we identified now? What is the hard part when we wanna solve this? The hard part is finding the frequent item sets and you are saying, Why do you care about finding frequent item sets? The reason why I care is because once I have the item sets, I take support of one item set and divided by the support of the other to find the confidence of association rule.

So I care about finding support of item sets so that I can compute confidences right. so if we have item sets that in their supports, I can quickly compute confidences. Another thing is right we say I want association rules with support greater than s. This means that this support has to be greater than capital s, and this ratio has to be greater than then. Didn’t see. OK. So this has to be greater than S and the ratio has to be greater than C. That’s why we care about frequent item sets, Because this count, by definition, is greater than the threshold s that is given to us by the user right. So the point is the following right if association rule has high support, this means that both the left hand side and the entire thing the entire rule will be frequent. right, If you are above support threshold, then it means the left hand side as well as the right hand side will be above the support threshold right. So now.

How are we going to do this? There are really two steps to the algorithm. First step is to find all frequent item sets. OK, and we really spend most of our lecture around. Step one. Because once we have all the frequent item sets, generating association rules is quite easy. The way we would generate an association rule is the following we say we have a frequent item sets I, we will take some set of elements of it a and we will put those on the left hand side and the rest will put on the right hand side right. So basically, the idea is, if I have a frequent item, sets eye. Then I’ll take a set of items A, and I’ll put it on the left and whatever I haven’t taken, I’ll put on the right. OK, that’s. I just generated an association rule right since I is frequent.

This means that the association rule itself is also frequent. Writes is basically because this thing up here is the support of the association rule. and that’s exactly a. Oh sorry, that’s exactly right. So what do I do in India? I could do is I could do the following I could take every frequent item sets I. I could generate an association rule out of it. I already know what is the support of entire eye, and now all I need to divide is divided by the support of A and I get the confidence right, so you know, imagine I, I’m looking at the eyes item set A B C D, I can generate an association rule out of it like this, and then it’s confidences. Support of obesity divided by the support of the left hand side, which is a B. I divide this and I get the value. One thing that you should notice and will be kind of popping up in this lecture over and over again is is. The following thing is that if I give an association rule? Is below confidence, for example, if A B C follows D is below confidence than a B.

Follow CD is also below confidence. and you will say, How can you guarantee that and the way you can figure this out is if you look at our formula for confidence right on the on the on the support side. here, our union, you, I union, Jay both this rule and that rule are composed of a B, C and D, so they’ll have the same the same count count in the numerator. But what in the denominator right in the denominator, we divide by the by the by the support of the left hand side here have the support of the left hand side. Is this much, and here the support of the left hand side? Is that right. So now, If this has some count that is below the confidence than the count of the left hand side.

here will only be big, but will only be smaller than the count of the left hand side here. right because how many does a B? C appear? The A B appears in everywhere where A B C appears plus somewhere else where C is not present right, so we have this notion of monotony city where bigger item sets have lower frequency than subsets of them right so you can use this procedure this strategy to generate association routes after you have found frequent item sets. So let me let me give you an example again. Here are my aid baskets we had before I will have support threshold of three, and I will have a confidence of point seven and five. So as a first step, I will find frequent item sets here. They are. These are all the item sets that appear utterly at least three times or more, and now I will generate the rules out of this one I can generate from B follow them, and from M follows me from this one, I can generate the same way BBC CB And so on.

And then from this one, here are the rules I can generate, and now for every rule I can simply compute the confidence right. How do I compute? The confidence is the support of P M divided by Support of B is a support of BM, divided by support of M, and so on and so forth. And now that I have done all this, I cross out the rules that are below my support threshold of point, seven, five and whatever I didn’t cross out. That is what survives and I show to the user. OK, so this this is this is in some sense, easy. Once I have the frequent item sets, I exhaustively generate the rules. I compute their confidences,

Leave a Comment