Description
Sets are everywhere! If you've worked with relational databases, made a venn diagram, maybe touched some relational algebra, then you've already worked with sets. We talk about why they're so common, how well they perform (time for some Big O Notation!), and how they're actually implemented.
Show Notes
This episode of the Basecs podcast is based on Vaidehi's blog post, Set Theory: the Method To Database Madness from her basecs blog series.
Transcript
[00:00:00] SY: Okay. I’ve got a very exciting announcement. We finally have a date for Codeland, our annual conference. It’s happening July 22nd in New York City. Early bird tickets are now on sale. For just 99 bucks, you get breakfast, lunch, dinner, a coding workshop, technical talks and after-party, and of course an amazing community. Sale ends March 10th. Go to codelandconf.com for more info. Link is in your show notes.
[00:00:29] (Music) Welcome to the Base.cs Podcast where we explore the basics of computer science concepts. I’m your host Saron, Founder of CodeNewbie.
[00:00:37] VJ: And I’m Vaidehi Joshi, Author and Developer.
[00:00:39] SY: And she is the brilliant mind behind the Base.cs Blog Series. Today we’re continuing our conversation on…
[00:00:45] VJ: Set Theory.
[00:00:47] SY: This season of the Base.cs Podcast is brought to you by our wonderful friends at DigitalOcean. DigitalOcean provides the easiest cloud platform to deploy, manage, and scale applications of any size. They remove infrastructure friction and provide predictability so you can spend more time building what you love. Try DigitalOcean for free by going to DO.co/codenewbie and get $100 in infrastructure credit. Link is in your show notes.
[00:01:15] This season is also brought to you by the fine folks at Linode. If you’ve got a personal project, a small business or a big business with lots of data, Linode offers you secure hosting for all your infrastructure needs. They are a Linux Cloud Hosting provider where you can get a new server up and running in under a minute. Plans start at one gigabytes of RAM for just five bucks a month. And with the promo code CodeNewbie2019, you can get a $20 credit. So go to linode.com/codenewbie and give it a try. Also, they’re hiring. Check out their jobs at linode.com/careers. Link is in your show notes.
[00:01:53] SY: (Music) Okay, let’s get started. So before we get into some new stuff, we have a little bit of a correction from last episode from Season 4 Episode 4 where we started by talking about a Venn diagram and this was a delicious Venn diagram made of our favorite foods. We had steak and nachos and bacon for me and you have… what was in your Venn diagram again?
[00:02:19] VJ: Steak and nachos and eggplant.
[00:02:21] SY: Oh, right, eggplant.
[00:02:22] VJ: Individually, separately.
[00:02:25] SY: Eggplant was the one. Eggplant was the one. And we learned that Vaidehi is not a big fan of bacon. Okay. And so when we were describing our Venn diagram, we have the two things, the steak and the nachos that we have in common, and then the bacon is for me, which means I could eat all the bacon, and the eggplant is for you. Now when we were talking about the different ways to describe the… would you call it subset of the Venn diagram?
[00:02:50] VJ: Yeah, I mean I think that’s accurate because most of the time we’re describing parts of the Venn diagram. Last episode was us basically like naming parts of the Venn diagram.
[00:02:58] SY: Yes, and one of those parts was union and we kind of didn’t really do a good job of clarifying what that is. So what did we say and what is a better way of saying it?
[00:03:09] VJ: So I had said that the union, in the example of the Venn diagram I said that it was the things that you liked and the things that I liked but not the combination of the two and that’s actually incorrect. And the reason I got confused, which here’s a great example that the stuff can be confusing so it’s okay to be confused, I obviously get confused too. The reason I got confused is because the definition of a union is anything that exists in Set X or within Set Y, but the part that I left out is that that includes things that exist in both, so that includes the intersection.
[00:03:42] SY: Right. Right. Yeah.
[00:03:44] VJ: So I remember you had asked about the union and like why it’s called that and where it got that name but really if you think about the word union, that’s the best way to really remember what a union is.
[00:03:54] SY: Yeah. Everything comes together. Yeah.
[00:03:56] VJ: Exactly. So the union is really steak and nachos that we share and then the bacon that you like and the eggplant that I like.
[00:04:04] SY: All living peacefully under one Venn diagram. Okay. I much prefer this definition, by the way. I like this clarification. It suits my assumptions and the way I want to look at the world. So thank you for that. That’s very helpful.
[00:04:17] VJ: No problem.
[00:04:17] SY: We also said like a little bit of a minor thing about how it relates to an intersection.
[00:04:24] VJ: Right. I had said that the union is the opposite of an intersection. Now that we have clarified what a union really is, that’s not actually accurate because if you were thinking about the opposite of an intersection, you would be talking about everything but the things that are in the intersection. So that’s actually a disjunctive union, you’ll remember from last episode, but a union is anything that exists on your side of the Venn diagram, my side of the Venn diagram, and in the middle. So I wouldn’t really think about it as the opposite, I’d think of it as the combination of both of ours and including the things we share.
[00:04:59] SY: Okay. Cool. Thank you for that clarification. So now let’s do an official recap of what sets actually are. We defined them last episode as a set is nothing more than an unordered collection of elements with absolutely no duplicates and we said they had to be unordered, that they were elements, and that there were no duplicates. So those three factors, I guess, are important in our set definition. Anything else we need to recap or touch on before we move on to new stuff?
[00:05:28] VJ: No, I think those are the important bits and I guess the other thing to note is that the fact that things are unique and the fact that they are unordered is part of the reason that we can operate on sets the way that we can and it’s going to come back into play later on in this episode.
[00:05:48] SY: Cool. Okay. So when we finished up last episode, we talked about this Venn diagram and how a Venn diagram is actually a really great, easy to understand, very familiar way of talking about sets and all the things that we can do with sets and we talked a lot about some theory stuff but not a lot about application. So why do I need to know about sets and Venn diagrams?
[00:06:13] VJ: Well, the reason you want to know is because they are used in things that are all around us and they actually show up in programming languages. So even if you didn’t know where sets were used anywhere in computer science, there’s a good chance that you could be using a programming language. For example, you could be using JavaScript or Python or Ruby or even Java and you could find a data structure called a set and then you would be like, “All right, what is this?” And there you go. There’s a practical application of a set. It’s an actual data structure that you can use in some programming languages. So that’s one reason. The other reason is that sets are actually used to store data or maybe a better way of saying it is the way that we store data and the way that we interact with data as developers actually often tie back to sets and set theory.
[00:07:06] SY: Okay. So for that point, are we talking about set theory or are we talking about the set? Like is the data structure part the set? What are we defining as the data structure in this conversation?
[00:07:20] VJ: We’re talking about the set. So the structure that says that things must be unordered, the elements must be unique, they can’t be repeated, the concept of that thing that data structure is what we’re talking about when we are talking about sets on an application level.
[00:07:34] SY: Why would I use a set over something else like an Array or a Linked List? What are some of the benefits of using sets?
[00:07:42] VJ: So I think the interesting thing about sets is not so much when you would use them, but what’s unique about them. And the reason that I say this is because they are actually used under the hood in other places. And the reason that sets are unique, the reason I kind of want to highlight them is because of their time complexity and the way that their time complexity allows you to operate on them. We talked a bit about operations and like we were naming parts of the set, but now we’re going to kind of explore what it means to do that in practice and how much time that’s actually going to take.
[00:08:18] SY: Okay. And when something is time efficient, remind me again, what does that mean?
[00:08:22] VJ: Well, when we talk about efficiency when it comes to space or time, also known as Big O Notation, what we’re talking about is how much either memory or time it’s going to take to perform a function or an algorithm or an operation in the worst-case scenario. So you kind of want to think about Big O Notation as like, you know, the Doomsday situation of CS. So when we talk about it, even in this context, we’re going to think about time in the context of the worst situation we could deal with when it comes to sets.
[00:08:54] SY: Okay. So when we talk about sets being good for Big O Notation and specifically they can be time efficient, how time efficient are we talking?
[00:09:05] VJ: Let’s start with maybe some of the more obvious time complexities when it comes to sets and we’ve been talking about operations like intersections and unions and last episode we talked about differences and complements.
[00:09:18] SY: Right.
[00:09:19] VJ: In order to do any of those operations, the time complexity for performing those is going to depend on what on earth you have in those sets, right? So you and I.
[00:09:28] SY: Right. Yeah. Yeah.
[00:09:29] VJ: Yeah. We had just come up with a small example so our set just had like four elements in it. But if you had a much larger set, then obviously trying to find the intersection or trying to find, you know, the difference of the things that you like versus the things that I don’t like, that will take more time if your set is larger. So the time complexity of those types of set operations is going to be O of the length of one set plus the length of the other set. The way that we can kind of talk about this as O of length of Set X plus length of Set Y.
[00:10:03] SY: Why? Why would that be the complexity?
[00:10:07] VJ: Well, for example, say you want to find the union of Set X and Y, and for these purposes, we don’t even know how big X and Y are. We could just say like they’re each 10 elements long or however many elements.
[00:10:18] SY: Okay.
[00:10:19] VJ: In order to find the union we’re going to actually have to traverse through the entire length of Set X and Set Y. But even if you think about something simple like the intersection, you’ll have to traverse through all of X and Y to figure out what exists in both. Even though all the elements will be unique, remember that we’re comparing two sets. So you’re still going to have to search through one whole set and traverse through the whole other set and then kind of bring them together and compare what exists in both and that’s how you find the intersection. The issue with set operations is that if you’re trying to find some sort of combination of elements between two sets you’re inevitably going to have to look at all the things in the set depending on the type of operation probably. And remember worst-case scenario, it could be even that you’re looking for one item, it could be at the end of the second set you look at.
[00:11:08] SY: Right. Right. Okay. So that’s the worst-case scenario. So for that example, when we talk about comparing any two sets of things, it sounds like no matter what type of data structure we’re doing if we have to compare two… I won’t call them a set, I’ll call them two groups of things that you’d have to go through each one, right? That’s not something unique to sets?
[00:11:34] VJ: That’s true. That’s not that unique, but it’s worth mentioning but it’s almost like when you think about it, obvious, but the thing that’s unique to sets is even though you need to consider the length of two sets when you’re doing these other operations, a lot of the times what you might just want to do is add an element or remove an element or here’s a great example. You just want to find an element within a set. You’re not really concerned about whether that element exists in another set, you’re not trying to find the union, you’re just trying to find, you know, like steak within your set or you’re trying to add like cheese to my set, something pretty simple, which if you think about you probably will do that almost more often.
[00:12:16] SY: That’s true.
[00:12:17] VJ: You’re probably adding things to your set or removing them or finding them or determining the length of a set and all of those basic operations you can do in constant time when it comes to a set.
[00:12:27] SY: Oh, oh, that’s much better. That’s like the best one, right?
[00:12:30] VJ: Yeah. That means no matter how big your set gets, those operations are going to be consistent in how much time they take which is really, really nice.
[00:12:39] SY: Okay. So walk me through that. So for example if I’m adding cheese, you said, right, like if we want to add cheese to my set, okay, I guess it makes sense, that’s constant time because no matter how big my thing is, the amount of time it takes to add is just the amount of time it takes to add something.
[00:12:56] VJ: Yeah. And if you imagine a set as just kind of an object of unique elements, which is basically what it is, then you’re just appending something to the end of that list of things because remember there’s no order and you don’t need to go and put it somewhere specific. You can kind of just stick it to the end. It’s actually the case that even if you remove something or find something within a set it also takes constant time. Now the one thing I want to say is that you might be wondering how on earth that’s possible because…
[00:13:28] SY: Yeah.
[00:13:28] VJ: In Arrays or Linked List you always have to traverse through the whole structure, right?
[00:13:34] SY: Uh-hmm.
[00:13:35] VJ: But with a set, that’s not the case and that’s another reason why I think they’re unique is it’s easy to do something like finding an element or removing it. You don’t have to search through it.
[00:13:45] SY: Okay. So that makes sense to me when I compare it to Arrays because Arrays have an order. So in order to add it, you have to like put it in its right place, which means you have to traverse it and like find the right place, but we said that one of the things that makes a set a set is the fact that it’s not ordered, so it doesn’t really matter where you put it. But one of the other things that makes a set a set is the fact that there are no duplicates. So what if we have our Venn diagram and I want to add some cheese, but you already have some cheese. How would it like know that that’s a problem without it checking and comparing to see if cheese is there? How does it resolve that but still have constant time?
[00:14:32] VJ: Good question. And actually the answer to this question is the same answer to the question how on earth can you find an element in a set without traversing because I mean…
[00:14:41] SY: Yeah. I was going to ask that one next.
[00:14:43] VJ: Because I just kind of like threw that at you and then you’re just like, “Okay!” And I was like, “She’s accepting this very easily.”
[00:14:53] SY: I got my line of, you know, questions and misunderstandings that we still will walk through. That was coming up next.
[00:15:00] VJ: Well, I’m glad you brought it up because this is my favorite part. Your question actually ties together what we’re talking about today with set theory and something we talked about in the past, which is hash tables. Remember those?
[00:15:14] SY: Yeah. I do. I do.
[00:15:15] VJ: Where’s Bryan when you need him?
[00:15:17] SY: Where is Bristow? You know? Those guys. Those guys are good.
[00:15:22] VJ: So silly. Well, you’ll remember that a hash table -- looking up something or finding something or adding or removing something from a hash table took constant time.
[00:15:32] SY: Yes.
[00:15:32] VJ: And I just told you that with a set you can add or remove or find or determine the length of the set using constant time. So there are some similarities here. It’s like, “Oh, hash tables can do this too and so can sets.” And now you’ve asked me, “How is it possible?” Or rather, “How do you handle duplicate values when a set can only have unique keys?” The answer to both of these questions is sets use hash tables under the hood.
[00:16:03] SY: What? Really?
[00:16:04] VJ: I know. (Laughing) Isn’t it so great?
[00:16:07] SY: No! You don’t understand what’s happened in my mind. In my mind what happened was I pictured a Venn diagram and then someone came and like ripped the circles apart and underneath was like a little hash table just calculating and putting things in its place and it’s like, “You found me!” That’s what I just envisioned in my head.
[00:16:27] VJ: It’s like we found the back closet or like the secret door and like…
[00:16:31] SY: Yeah.
[00:16:31] VJ: Working in the dark quietly was the little hash table and you’re like, “Hash table, you’ve been here the whole time!”
[00:16:36] SY: “I thought you were a Venn diagram. You’ve been here the whole time.” Oh, wow! Okay. So is -- are sets just implementations of hash tables?
[00:16:48] VJ: Well, I would think about it more as you can use hash tables to implement a set.
[00:16:53] SY: Ah, okay.
[00:16:53] VJ: So you can use like the idea of a hash table and the hashing function to give you that, you know, quick O of one access time.
[00:17:01] SY: Yeah.
[00:17:01] VJ: And also the fact that…
[00:17:02] SY: Oh, wow!
[00:17:03] VJ: Sets have only unique values, unique keys. Well, that maps really nicely to the keys…
[00:17:08] SY: It does.
[00:17:08] VJ: In a hash table, right? Because the keys…
[00:17:10] SY: Yeah.
[00:17:10] VJ: …for each element are always unique and because of the hash function, you can quickly retrieve them, you can quickly determine where they go, or where to remove them from, and because order doesn’t matter, you’re like, “All right, well, I don’t need to care about indexes. I just need to know where to put this so I can find it later.” So again, the hash table is a great use case for this.
[00:17:28] SY: Okay. Wait. Wait. Wait. Wait. Wait.
[00:17:32] VJ: Okay. I’m waiting.
[00:17:33] SY: If that is the case… (Laughing) I just hit my nose against the microphone if you heard that. If that is the case, then what’s the difference between a set and a hash table?
[00:17:44] VJ: Well, a set is kind of like… it’s a data structure that kind of builds upon another data structure so most of us aren’t really using hash tables on a day-to-day basis. You probably are going to deal with an object, right? If you’re dealing with a hash table or a hashing function, you probably are doing something specific and you know how to work with the nitty-gritty of, “Oh, this is how I read a hashing function. This is how I determine the size of my table.” And you’ll remember from the episodes on hash tables, I talked a lot about how, “Oh, it really depends on how you build your hash function and how you determine your hash table and what size it’s going to be.” And that’s a lot of things to think about. And I mean most of us aren’t really going to think about that because we’re not really dealing with hash tables every day. A set, most of us also don’t deal with directly every day in the sense of I’m not writing code that is like set.new most of the time. It doesn’t really happen. I don’t know about you, but it doesn’t really happen to me.
[00:18:40] SY: Naw.
[00:18:40] VJ: Maybe if I work in Java it would happen more, but I don’t. But I guess what I’m trying to get at here is that you’re working with a data structure that has abstracted away the mechanics of the data structure that is used within it. So it’s kind of like… now I don’t want to say data structure within a data structure, but a data structure that leans on…
[00:19:01] SY: It’s like building on top of. Yeah.
[00:19:03] VJ: Yeah.
[00:19:04] SY: It sounds like the foundation of it is a hash table. So what is… what are sets doing? It sounds like they’re almost like expanding on the idea of a hash table? Is that a fair way to describe it?
[00:19:15] VJ: Sort of. They’re kind of leaning on what the hash table provides. But then what’s interesting about sets is they have this whole world of operations and that’s when, you know, you might want to do something specific with two sets of data and a hash table doesn’t allow you to do that. A hash table is just like for retrieving and adding and keeping track of things quickly. A set allows you to do a whole lot of things because it’s abstracted away a lot of the how of how things are happening. It’s allowing you to manipulate and retrieve data in a certain way and find data that mixes and matches in a certain way. It’s leaning on a hash table, but it really opens up a whole other set of possibilities, pun intended.
[00:19:55] SY: Yeah. (Laughing) Wait. So okay. So that makes sense, I think. So the idea of sets is that it’s implemented using a hash table. Am I saying that correctly?
[00:20:06] VJ: Yeah. You can… Yes. You can use a hash table.
[00:20:10] SY: So it’s the implementation of a hash table?
[00:20:13] VJ: Yeah. You can use a hash table to implement a set or you can say that sets are implemented by hash tables under the hood or whatever.
[00:20:20] SY: Right. Okay. But because it’s a set, it means as a programmer, I can do more stuff, I have more operations at my disposal, and I can just do a bunch of things not having to worry about literally how do I code for them because it’s taken care of by the hash table behind the curtain. Was that fair?
[00:20:39] VJ: I think that’s a good way of describing it.
[00:20:41] SY: Oh, I can live with that. That’s cool. So where does this show up in coding? What are some things that maybe I hopefully would have come across before that’s actually just sets?
[00:20:52] VJ: Well, aside from the actual set data structures I mentioned earlier that appear in some languages, probably the place than it most rears its head, so to speak, is within relational databases.
[00:21:05] SY: Oh, neat, I know those.
[00:21:07] VJ: Uh-hmm. And the interesting thing about relational databases is that they involve sets. And when you think about it in the context of, you know, those Venn diagrams, you can kind of think, “Oh, if these Venn diagrams are actually, you know, the databases, then you can start to see where a language like for example SQL kind of pulls from those set operations we talked about last episode.”
[00:21:32] SY: Yeah.
[00:21:33] VJ: So for example SQL is based on set theory and relational algebra itself is based on set theory itself.
[00:21:42] SY: That’s why it felt so familiar. Actually I’m remembering a blog post that I really liked that explained different SQL operations and I’m pretty sure it used a Venn diagram.
[00:21:52] SY: Uh-hmm.
[00:21:53] VJ: Like I’m pretty sure I’m imagining right now is like black and white and had these circles. I don’t remember what was in it, but I remember different parts would be shaded and I’ll be like, “This is an inner join, this is an outer join.” And it made sense. And yeah, I like that. Okay, so Venn diagrams are good way to visualize how relational databases work. Interesting.
[00:22:13] VJ: Yeah. And it’s pretty cool when you think about like an inner join for example. I think I’ve seen the same… the same blog post with the Venn diagrams and if you kind of look at the blog post, then… or look, if you look at any image that’s kind of visualizing SQL joins, you’ll see that like it’s usually like two Venn diagrams and the inner circle was like filled in and it makes sense because you’re joining two sets.
[00:22:35] SY: Yup.
[00:22:35] VJ: It’s the interaction of two sets, right?
[00:22:37] SY: Yeah.
[00:22:38] VJ: And in the same way, if you’re looking at something like a left join, which is pretty common with… I don’t know. I feel like I’m always writing left joins or having to deal with them and they annoy me because I don’t… I never know how to write them correctly or I… I… I know that I need to write them and then I can’t figure out how to do it because I have like four different tables I’m trying to join on. It’s terrible. (Laughing) But at the end of the day, a left join for example is really just the set difference or the relative complement of two tables and you’re just… it’s usually the left table that is the… the section that you’re looking for and that’s I guess why it’s called the left join, but it’s really just those operations that we talked about on two sets.
[00:23:18] SY: Yeah. Oh, wow! Okay! Well, I’m really excited that when we first, first talked about set theory, I told you, “Oh my God! It’s going to be so scary. It’s going to be like the hard math, not the fun math, like the hard, scary math.” And we’ve kind of ended with, “It’s basically relational databases, which you’ve been doing for years. Everything is cool.” So I appreciate that. Thank you.
[00:23:40] VJ: Yeah. And you know? We worked in some steak and nachos and cheese…
[00:23:44] SY: We did.
[00:23:44] VJ: …along the way.
[00:23:45] SY: That’s right.
[00:23:45] VJ: So I think that makes for a pretty good journey in general. We had, you know, what it was, we had snacks for the road trip.
[00:23:50] SY: We got one vegetable in. So that’s good too. And actually, actually, actually, recently I had eggplant that I enjoyed. I had it. I had it and I liked it.
[00:24:00] VJ: Oh my gosh!
[00:24:01] SY: I liked it. So…
[00:24:02] VJ: It all comes full circle.
[00:24:04] SY: You know?
[00:24:04] VJ: This is so good.
[00:24:05] SY: But I have to do redo our Venn diagrams, but I just want to let you know that, you know, I’m opening my eyes, opening up my set, you know? (Laughing)
[00:24:13] VJ: That’s great. (Laughing) I’m so excited for you.
[00:24:17] SY: (Music) Me too. And that’s the end of today’s show. If you liked what you heard, leave us a review and make sure to check out Vaidehi’s blog post. Link to that is in your show notes. Also want to give a huge thank you to Linode for sponsoring the show. Try their powerful servers built for all your infrastructure needs. Getting started with a shiny new server takes less than one minute so you can get up and running in no time. Get a $20 credit by using the promo code CodeNewbie2019. Just go to linode.com/codenewbie for more info. Also want to give a huge shout-out to DigitalOcean. They are the easiest way to deploy, manage, and scale your application. Everything about it was built with simplicity at the forefront: setting, deploying, even billing. Try DigitalOcean for free by going to DO.co/codenewbie and get $100 in infrastructure credit. Link is in your show notes. Vaidehi, you want to say goodbye?
[00:25:12] VJ: Bye everyone!
[00:25:13] SY: Thanks for listening. See you next week.
BLOOPERS
[00:25:19] VJ: I’m going to start that whole thing over.
[00:25:22] SY: Okay.
[00:25:22] VJ: I don’t know where I was going. I just started talking about words. I mean, I started using words that I was… I don’t know. Let’s start over.
Thank you for supporting the show!