EdX is an online education platform used by more than 20 million learners and institutions including MIT and Harvard. We discuss migration from Python 2 to 3, where to put business logic, and the limits of Django defaults on a website at massive scale.
Will Vincent 0:06
Hello, and welcome to another episode of Django Chat, a weekly podcast on the Django web framework. I'm Will Vincent, joined by Carlton Gibson. Hi, Carlton. Hello, Will. And this week we have two guests from edX, David Ormsbee and Nimisha Asthagiri. Welcome, both of you. Thank you.
Nimisha Asthagiri 0:21
Thank you. Happy to be here.
Will Vincent 0:23
Yeah, we're thrilled to have you. So we connected because we met at the Django Boston meetup. edX is one of the largest educational sites in the world and, I believe, has used Django from the beginning. So I wonder if we could talk about, if either of you know, why Django originally, and then how has that been? Because you've been through a lot of changes in Django, and Python 2 to 3. So hopefully we can dive into all of that.
David Ormsbee 0:48
The choice for Django, I think, primarily came out of the choice for Python. edX originally started as MITx, which came out of MIT, so Python was kind of the default choice, right? The intro CS course teaches Python, and all the UROP interns who could work on it know Python. And the fact that the first course was an electrical engineering course meant that having NumPy, SciPy, that sort of set of tools, was very convenient. By the way, these early decisions were made by Piotr Mitros, who created the original prototype for what became edx-platform and was the chief scientist for edX for, I don't know, five or six years, so for a while.
Will Vincent 1:44
What was the timing? Because I believe this goes back quite a ways, right, to before Python was standard at the undergrad level?
David Ormsbee 1:50
The first prototype work would have started in, I want to say, October or November of 2011. So sometime around Django 1.3, I guess. And Piotr also made the decision for Django. But once you've gotten it down to Python, Django was one of the obvious choices.
Carlton Gibson 2:17
And can I just ask, because maybe our listeners don't know: could you give an overview of what the edX platform is? Oh, yeah. That's a good call.
David Ormsbee 2:24
So I'm gonna,
Nimisha Asthagiri 2:26
Oh, sure. So yeah, edX is an online education platform. It was founded by Harvard and MIT, as Dave mentioned, in 2011. We specialize in higher education, with courses and content from some of the best university partners worldwide.
Will Vincent 2:47
And what's the sort of scale that y'all are at right now, in terms of, I guess, number of courses and end users? Because it's in the millions of users anyway, right?
David Ormsbee 2:57
Yes. Tens of millions of users, thousands of courses, and probably over a million lines of code if you include all the Python code across the different repos.
So yeah, it's at about that level.
Nimisha Asthagiri 3:11
And the other thing is that we are an open source platform as well. So one of the great things about edX is we have edx.org, which is our website, but then we provide our source code to the community. And the Open edX community has, you know, thousands of instances of our source code where they are, you know, providing courses for regional content as well as even national platforms.
Will Vincent 3:37
Wow. Yeah, we'll link to those in the show notes. Also, one issue, since you started on Django 1.3, is migrating from Python 2 to 3. Is that something that has happened, or is going to happen? How do you think about doing that on such a large site? Everyone's dealing with this right now, by the way.
David Ormsbee 4:00
Yeah, it's something that is still in progress. We are hopefully weeks away from getting it done. There have been a number of efforts that we've made to try to bridge it. One was what we call INCR tickets, because so much of the Python 2 to 3 conversion is something you can run the tooling for. You just have to sanity check that it's doing the right thing, and like 97% of the time it is. So that was a very conscious effort to leverage our community better and give them really small, incremental tasks that don't require you to know the full stack in order to help. Jeremy Bowman really helped push that effort through, and we did get a lot of contributions. Right now, I think Feanil and company and some other folks are driving the last bit. Once it was working well enough that the tests were running on both environments, it's been weeks of just: find the breaking thing, watch 300 more tests pass, find the next breaking thing. And it's interesting, because there are the relatively easy parts, which is, well, there's the strings, right? Then there are code snippets and things like that, and then there's the real stuff. Like, rounding behavior is a little bit different. And if we were to change how grade rounding worked anywhere in the stack, there would be consequences.
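One well-known behavior change of the kind that could affect grade rounding: Python 2's built-in round() sends ties away from zero, while Python 3's sends ties to the nearest even value. The sketch below shows one way to pin the older behavior explicitly with the standard decimal module. The helper name round_half_up is hypothetical, and nothing here is taken from edX's grading code.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=0):
    """Round ties away from zero, as Python 2's round() did."""
    exponent = Decimal(10) ** -places  # places=2 gives Decimal("0.01")
    # Go through str() so a float such as 2.675 is read as the decimal it displays as.
    return Decimal(str(value)).quantize(exponent, rounding=ROUND_HALF_UP)


# Python 3's built-in round() sends ties to the nearest even value.
builtin_result = round(2.5)           # 2 on Python 3 (3.0 on Python 2)
explicit_result = round_half_up(2.5)  # Decimal("3") on either interpreter
```

Pinning the rounding mode in one helper means a grade of 84.5 rounds the same way before and after the interpreter upgrade.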
Nimisha Asthagiri 5:52
And then even in terms of backward compatibility: if we chose to UTF-8 encode the wrong way or whatnot, it's possible that our hash values are not the same as before, and then we might run into backward compatibility issues once we actually launch this in production. Also, just to give an idea, in our codebase we do have a monolith with close to a million lines of code, depending on which code you count. But we also have a few satellite microservices that run around our monolith, and each service needed to be upgraded. As Dave mentioned, there was an effort to do some of this initially with the community, but because we're running against the deadline, we've actually taken a lot of this resourcing in house to make it happen. There are definitely some things that we could do in terms of regex replacement. For example, response.content is now returned as bytes, and there was a lot of test code that just did an assertIn, searching for a string within response.content. Replacing that with assertContains or assertNotContains in one bulk regex replacement in your IDE definitely gets us a long way. But there are a lot of intricacies that we find as we go through, where we ask: what's the right way of making this change? And are there any principles that we can follow? For instance, when doing the conversion from bytes to strings, or from strings to bytes, can that happen closest to the perimeter?
So throughout our code base, as things are passed from one method to another, pass them as strings as they were before, because that's what the business logic is. Only when you need to serialize, let's say persisting to a CSV or sending it over the wire, do you actually do the conversion. One of the concerns I've always had with the Python 3 upgrade was that I wanted to minimize the amount of code that has to deal with string conversions, so that the business logic continues to scream back at you. You want your code to be as readable as possible. You want the business logic to be the thing that's screaming back, and you don't want to get too caught up in what's happening in the details of the string conversions. So the more that we could keep that logic separated, the better. That was one of the few principles that I wanted to make sure happened during this conversion.
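A minimal sketch of the "convert at the perimeter" principle described above, using only the standard library. The function names and sample data are hypothetical, not edX code: the business logic handles str only, and encoding to bytes happens once, at the point where data is serialized or hashed.

```python
import csv
import hashlib
import io
from typing import List


def enrollment_row(username: str, course_name: str) -> List[str]:
    """Business logic: works purely with str, no encoding concerns."""
    return [username.strip().lower(), course_name.strip()]


def to_csv_bytes(rows: List[List[str]]) -> bytes:
    """Perimeter: encode exactly once, when the data leaves the process."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().encode("utf-8")


def stable_hash(value: str) -> str:
    """Hash with an explicit encoding; hashlib only accepts bytes in Python 3."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


rows = [enrollment_row("  Ana ", "Circuits 101")]
payload = to_csv_bytes(rows)
```

Keeping the encoding explicit in stable_hash is what guards against the hash mismatch mentioned above: the digest depends on the chosen encoding, so it has to be the same one the old code used.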
Will Vincent 8:40
Well, that does seem to be the hierarchy of any conversion from 2 to 3: you start off with strings, and those are pretty easy. Then you get to the asserts, which are relatively easy but take a lot of time. And then it gets real with all these other issues. And on business logic, I wanted to highlight this. We talked about it two weeks ago at Django Boston, because you had a really interesting take on where to put the business logic. This is a question that Carlton and I get, and for a lot of people, once you get to a decent-sized Django project, it's hard to cram it into the models or the managers or the views.
Nimisha Asthagiri 9:16
Yes. David and I and other architects here at edX have definitely gone back and forth on this. To step back a little bit, the reason we're driving for this is that we're trying to scale development of our platform. Over time it's definitely grown organically in terms of the code that's there. But now, in order to allow our code base to be approachable by new developers as well as existing developers, and for us to continue to maintain it, how can we do that in a way that, once again, makes it easier to understand what the business logic is? We've actually had a book club here, with David participating and others as well, on domain-driven design. Domain-driven design really talks about being very focused on the domain that you are building for, and having those concepts, the bounded contexts, and their boundaries be explicit. Using that as a mindset, we were thinking: given that we have a monolith, and monoliths don't necessarily always have to be split into separate services, can you still have great abstractions in place while you're in the monolith? If we choose to extract them as a secondary step for other reasons, that can still happen. But even within the monolith, what are those bounded contexts? How do we make our code base such that those interactions are understandable? Otherwise there's a lot of tight coupling, and it's really hard to understand. So yeah, we can go on. Dave, do you want to add anything to that?
David Ormsbee 11:02
Well, I guess one of the other things I would add is that edx-platform started as an application, right? It started as an LMS and Studio set up for a particular set of courses on edx.org. It was made into an open source project, and it was intended to be that, but the name says it's a platform. People are supposed to be able to build extensions on top of it. And so a lot of times you have a much stronger need for these kinds of internal APIs between the apps, because you might want to take edx-platform but add your own thing that accesses enrollment data, for instance, because you have your particular special feature that you're adding on to the side of it. So having a strong abstraction layer, a sort of stable API interface within the application, would be very helpful for that. We've tried to provide that in various places, but it's a hard problem in a lot of senses. If you try to move your logic outside of the models, you cut against the grain of Django in a bunch of ways. Like, where are you going to do your validation if you want the same set of business logic to power your Django REST framework views, but you also don't want to repeat yourself for internal APIs? Where does that logic go? Your validation is in the serializer here, but, you know, you can call the serializer explicitly from your in-process APIs.
And you want to pass back querysets, probably, in case people want to modify them, because it's often like: hey, get me these enrollments plus this data that I've keyed off of enrollments, because I've added my own table here. How do you do that in a performant way that doesn't encourage everyone to do N+1 queries? So there are all these parts where you try to... yeah, it's been challenging.
Nimisha Asthagiri 13:30
And so... go ahead.
Carlton Gibson 13:33
I was going to say, to scale up from a single application all the way to something like a million lines of code, there must have been sticking points and difficult areas. I guess the question would be: at a million lines of code, do you think the general structure is still comprehensible?
Nimisha Asthagiri 13:51
Right. This is why, at that scale of code base, I believe it's more important to be able to understand the historical traces. Unless we put a lot of resources into making everything consistent, you will see a heterogeneous code base: some parts that haven't yet been updated to the latest direction, and some that are still a couple of years behind in mindset, and so forth. So we do have this concept of architectural decision records. This was inspired by Michael Nygard, and ThoughtWorks had promoted it as well. Architecture decision records are very lightweight documentation. They're immutable records of a decision that you've made, and they go into the code base itself, in a separate decisions directory. You have a context, the decision, and the consequences in each ADR. So, for instance, what the directory structure of our monolith code base should be: that could be an ADR. This way people know exactly when that was decided. And if they see code that's not yet up to date, they can realize, okay, that was written earlier, and if I'm now going to be working on that part of the code base, let me bring it up to the latest code standard. So in terms of maintenance of a large code base, we need to develop processes such as this to keep everyone up to date, because you're going to see broken windows everywhere. So which is the right window to follow, and which ones are broken?
Carlton Gibson 15:34
So I really like that idea of the architectural decision record. It's like leaving breadcrumbs. Okay, here's the path.
Nimisha Asthagiri 15:41
Yep. And one thing that we also found, going back to where the business logic should be kept when we're thinking about Django: Dave was a founding architect here, so Dave knows a lot of the initial horror stories and how they happened. I joined about five and a half years ago. By then some of the architectural decisions were already made, and there were a lot of organically made changes. So it was actually a great partnership, I believe, because I'm able to have slightly fresher eyes, and Dave gives a lot more horror stories with context. But when I approached the codebase, I saw a lot of divisions by technical concerns. This is the thing about how Django works out of the box, right? You're thinking about views and models and urls.py. In my mind, those are very much technical concerns: as an engineer, how to think about constructing something. But if you want to figure out what the business concerns are, what this app is actually trying to do, you don't see that right away when you look at these files. Everything is just models and URLs and views. So having a way to make that come across much better is what we're looking towards. One great example of where this applies is right now, when we're moving towards a different technology stack and thinking about moving our front-end logic into React micro-frontends. Whatever business logic happened to be within views.py, or even in a template (it shouldn't have been in the template in the first place, but some of our code was), or in models.py and whatnot: if we had better abstractions in place, it would have been easier for us to do this replatforming effort.
So I think, from the get-go, we could start our Django projects and apps in a way that makes it clear where the business logic is, and let the web framework be a detail. I'm really echoing Uncle Bob here. The web is a detail. Whether it's exposed via REST APIs is a detail. Whether it's exposed via channels and eventing models, that's also a detail. But the business logic can stay more stable and more core to the app. That's how I've been thinking about it. And I think Dave is bringing in the perspective about the scalability and performance impact, right? If you do have these great, nice abstractions, Nimisha, well, what are we going to do about those queries? And the scalability of making sure that we're not always translating to APIs, so that if we do need to go directly to the models, we can, in order to be performant.
David Ormsbee 18:43
I guess a couple of things I do want to add to that. One, Nimisha has been at edX for five and a half, six years, and for the last couple of years she's been chief architect, so she's got plenty of context. I do have some of the old stories. But another thing, to be clear here, is that a lot of the early work on what became edx-platform was not very intentional in a lot of ways. You take a group of people; most of us had done Python work before. Some of us hadn't, and you can actually tell: parts of Studio are written with a Java accent. But we'd never done Django, or never done large Django. And while the Django docs are great for many things, and for starting out, there was not as much guidance for: here's how you build this enormous application, or the foundations of a project that you know is going to grow to this size. And even if there were, we probably wouldn't have had time to read it, because we were in a frantic scramble. The entire thing was this hazy, sleep-deprived thing. So certainly things were put together very quickly. Now we are trying to adjust some of those things, but as Nimisha said, the code represents different generations of code and philosophy and expertise and understanding, frankly, and the path of Django itself. I can look at a piece of code completely out of context and tell you approximately what year it was written, just because the idioms shift over time.
Will Vincent 20:36
Right. Well, there's also been, I mean, Nimisha, you were saying about React and moving the logic from views to the front end: there's also been that shift in the last four or five years where state in general, and logic, has moved to the front end. I don't see how you could have architected for that five years ago, because, I mean, was it 2013 when React came out? Something like that. So that's also just a case of being flexible. And I like the idea of keeping the Django piece of the architecture as simple as it is, so that you can go back and forth as the web changes in how to handle state and how to show things. I mean, who knows if that's going to switch back.
Nimisha Asthagiri 21:17
So for us, what we've started prototyping and assessing, and hopefully will solidify into a stronger best practice, is to actually have a separate module. Right now we're calling it api.py, but essentially it's domain_logic.py, where your business logic is. The views.py file within a Django app would consume the functions or classes within those api.py files. That layer, let's just call it domain_logic.py, is also an abstraction above your models.py. Other apps that may exist in your Django project cannot go directly to your models, and they won't access your views or anything else directly. They would go through this interface, right above your domain logic. So that's how we're thinking about it. This way, if we want to have Django signals or Django Channels integration or some other eventing, or if we want a Kafka layer later, these are all ways of communicating out. REST APIs are just one way of communicating out. That's why views.py then becomes much thinner: its responsibility would be more around the authentication layer and the response, converting things to proper HTTP response codes and formats and things like that. So it has a very clear separation of concerns from where the business logic is, and the responsibility of models.py would be more around the data. We're going to try implementing some of our Django apps that way, and we're hoping that will allow us to evolve as the web evolves and as our industry evolves.
So yes, you're completely right that there are going to be things that we cannot anticipate, so we'll always need to figure out how to refactor as we go. But currently, that's one way that we're thinking about it.
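The layering described above can be sketched in plain Python, with no Django imports so that it runs anywhere. Everything here is a hypothetical stand-in: a dict plays the role of models.py, the enroll function plays the role of api.py, and enroll_view plays the role of a thin views.py that only translates to and from HTTP-style status codes. It is a sketch of the idea, not edX's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# --- stand-in for models.py: data only (a dict instead of a Django model) ---
_ENROLLMENTS: Dict[Tuple[str, str], bool] = {}


# --- stand-in for api.py: the domain logic other apps are allowed to call ---
@dataclass(frozen=True)
class Enrollment:
    username: str
    course_id: str
    active: bool


class EnrollmentError(Exception):
    """Raised when a domain rule is violated."""


def enroll(username: str, course_id: str) -> Enrollment:
    if not username or not course_id:
        raise EnrollmentError("username and course_id are required")
    if _ENROLLMENTS.get((username, course_id)):
        raise EnrollmentError("already enrolled")
    _ENROLLMENTS[(username, course_id)] = True
    return Enrollment(username, course_id, active=True)


# --- stand-in for views.py: thin, only maps domain results to HTTP-style output ---
def enroll_view(request_data: dict) -> Tuple[int, dict]:
    try:
        enrollment = enroll(
            request_data.get("username", ""), request_data.get("course_id", "")
        )
    except EnrollmentError as exc:
        return 400, {"error": str(exc)}
    return 201, {"username": enrollment.username, "course_id": enrollment.course_id}
```

Because enroll returns a plain data object rather than a model instance, a signal handler, a channel consumer, or another app could call it in exactly the same way as the view does.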
Will Vincent 23:40
Carlton, how does that ring for you? Because I was so struck by this conversation, which we had right before DjangoCon, that at DjangoCon basically the conversation I had with everyone was: where do you put the logic? With folks at really large sites, it's basically always somewhere else. They call it something else, but it's basically somewhat abstracted from the traditional Django hierarchy. Carlton, what were you going to say?
Carlton Gibson 24:04
Right. So the basic, much simpler Django example is: let's say you have your business logic, your model validation, in your save method. Well, fine. But then what happens? You've got a model form on top of that, and your view validates the model form, and it says, yeah, this is valid data. So instead of returning a 400, saying this is a bad request, you've got invalid data, it says, no, this is okay, let's go to save. And then you end up raising an error at the save point, which turns into a 500. That's a server error. So you don't want your validation logic only in save. You might want to use it in save, but you also want it available to your form or your serializer or wherever, so the view can use it earlier and say to the user: hey, what you've given me isn't correct, can you try again?
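The 400-versus-500 point can be illustrated with a small framework-free sketch. The names InvalidGrade, validate_grade, save_grade, and grade_view are hypothetical and stand in for a Django validation error, a shared validator, a model's save method, and a view. The rule lives in one function that both the view and the save path call.

```python
class InvalidGrade(Exception):
    """Hypothetical stand-in for a framework validation error."""


def validate_grade(data):
    """The one place the rule lives; forms, serializers, and save() can all call it."""
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise InvalidGrade("score must be a number")
    if not 0 <= score <= 100:
        raise InvalidGrade("score must be between 0 and 100")
    return {"score": score}


def save_grade(data, store):
    """Persistence re-checks the rule as a last line of defence."""
    store.append(validate_grade(data))


def grade_view(data, store):
    """Validate before saving so bad input becomes a 400, not a 500."""
    try:
        cleaned = validate_grade(data)
    except InvalidGrade as exc:
        return 400, str(exc)
    save_grade(cleaned, store)
    return 201, "created"
```

If grade_view skipped its own check and relied on save_grade alone, the same bad input would surface as an unhandled exception, which a web framework would typically report as a server error.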
David Ormsbee 24:56
Yeah, and I think that's one thing that we also keep running into, and I am not completely happy with the trade-offs even where we're landing. It feels like you're fighting the framework in a lot of ways. The primitive pieces that you're given don't connect to each other in quite the way you want in order to make this sort of abstraction work. You can do it, but you can get kind of clunky code, or duplicated code, where you repeat that validation call somewhere else explicitly. One of the big issues that we find with passing models around, and not having that layer, is that, one, those things can change. We can add stuff to a model or whatever. Two, passing models around is this huge implicit interface that you're just throwing around everywhere, because anyone can say: hey, I've got this, now let me just grab that class too and do a query and sort by this unindexed field on the table that has billions of items in it. Great. And even scarier, maybe that's fine on your machine. That's fine for your system, because in your system you've got two classes and 50 students, and hey, it works okay. And then you bring it over and try to merge it upstream, and it's not like that.
Carlton Gibson 26:41
Yeah, no, when you scale up, you need to have stricter rules.
David Ormsbee 26:45
And so having an explicit layer where we say, these are the things you're allowed to call, and these are the things that we can make some kind of performance guarantees around, is really important for us.
Nimisha Asthagiri 27:00
And also, to clarify, there's definitely a maturity model, right? Where we are arriving is because of the scale that we are at. For a company that's starting up fresh and new with Django, there are so many great things that Django gives you out of the box, and its defaults are probably good enough for what you need at that time. But as you scale, perhaps by users, by developers, or whatnot, you'll have different concerns and different requirements. Even for edX at scale, we may not apply this design pattern that we're talking about, domain_logic.py or whatever, to the entire code base. Domain-driven design has these terms: core, supporting, and generic. Core is your core value proposition; it has more of your domain concepts. So at that layer, within your core, you might want this design pattern. But for the things that are more volatile and more on the periphery, where you want to change more quickly and experiment, you're going to just use DRF right off the bat and use serializers and models directly. Fine, that's quick. But if after some months or a couple of years you realize, oh, this is a great core concept, you might want to figure out how to stabilize it. So there's definitely a maturity model of the company, of the codebase, and even of a feature.
Carlton Gibson 28:35
Right. So it's the return on investment. There's no point doing this kind of super-engineered, high-scale thing for a proof of concept. You do the proof of concept, test the concept, does it work, and then you put the extra resources in.
Nimisha Asthagiri 28:48
And even for our monolith, the way that we're thinking about it is: are there parts of the monolith that are core to the business and core to the platform, and are there other things that are more extensions? Can we build those as plugins? One thing that edX has developed, and that we'd love to contribute back to the Django community once we extract it out of the monolith, because right now it's still within it, is something that we're calling Django app plugins. It's built on Python's stevedore technology, and it allows one to basically import their own extensions or plugins into a monolith. The monolith provides some interface, perhaps, but the plugin has its own views, its own urls.py, and even installing it, all of those things can be automatically detected via stevedore. It's a great technology, and we've found it very valuable. Our open source community really appreciates it, because if they need to make a change or want to add something, they don't need to fork the entire monolith. They can just create their own plugin and have it automatically detected by the monolith. So it very much goes with the SOLID design principles, with dependency inversion and things like that. So that's also an idea: this concept of what's core and what's not core, and making that more explicit. And for the things that are not core, how can we incorporate them with appropriate bounded contexts and boundaries?
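stevedore discovers plugins through Python packaging entry points. The sketch below shows that underlying mechanism using only the standard library's importlib.metadata, rather than stevedore or edX's Django app plugins code. The group name example_lms.plugins is a made-up assumption; a real plugin package would advertise itself under whatever group the host application documents.

```python
from importlib import metadata
from typing import Dict


def discover_plugins(group: str) -> Dict[str, object]:
    """Load every object advertised under the given entry-point group."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):
        # Python 3.10 and later expose a select() method for filtering by group.
        selected = entry_points.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = entry_points.get(group, [])
    return {entry_point.name: entry_point.load() for entry_point in selected}


# Hypothetical group name; an empty dict simply means no plugins are installed.
plugins = discover_plugins("example_lms.plugins")
```

The host application never imports a plugin by name. Installing a package that declares an entry point in the agreed group is enough for it to show up in the returned dict, which is the dependency inversion described above.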
Carlton Gibson 30:26
Yeah. And this goes back to what David was saying about giving people a set of APIs they're allowed to call that are, you know, safe.
Will Vincent 30:33
Well, one thing I want to go over is feature toggles. I believe you use django-waffle now. Again, I was having this conversation at DjangoCon with so many people: when you're at scale, rolling out new features, you don't always want to just turn something on. So what did you do before django-waffle, and how do you think about turning things on, given the size of your community?
Nimisha Asthagiri 30:58
Um, David, should I take this?
David Ormsbee 31:00
Yeah, you wrote that OEP, I think.
I'm sorry, the OEP is our sort of proposal process. We have architecture decision records that are specific to a given repo. They're local decisions by the team about their particular repo. But if you have something that has implications across all repos, like org-wide engineering, then we have the Open edX Proposal process, where we have more general guidelines. And Nimisha was the one who wrote up the one on feature toggles and such.
Will Vincent 31:35
Well, it's like how Django has its DEPs, so it's very similar.
Nimisha Asthagiri 31:39
And OEPs, so Cale, I think, came up with that term, and yes, this process was inspired by PEPs. So we have OEP-1, OEP-2, whatever. These are Open edX Proposals, similar to Python Enhancement Proposals. But yeah, there's one on feature toggles. The reason we decided we needed some good guidelines and best practices for it is that now we're talking about scaling even the deployment of features. We want to allow teams to try features in a way that is more controllable. We wanted to decouple enablement of a feature from deployment of the code. This way teams have their own autonomy over when exactly they enable a feature toggle and how it is rolled out to the user base. Perhaps they want to have it in beta testing and then eventually roll it out to a larger user base, and so forth. We can share the OEP in the podcast notes if you want. It goes through a bunch of different use cases. Like I said, beta testing is one. Another might be that, for operational reasons, we want to roll out gradually in case we're concerned there might be a performance issue, a scalability issue, or even functional issues. So it allows us a lot more control over that. The thing about feature toggles, though, is that they're great and they give you that control, but they also increase the permutations in your codebase. Which set of permutations are your tests actually going to cover? So there's also the process of how we make sure that the feature toggles that were created, and all of those code branches, are deleted once they are no longer needed.
Will Vincent 33:52
Yeah, exactly. The cleanup is always the hard part.
Nimisha Asthagiri 33:54
Yes, yes. And so, making teams accountable and reminding them of that. So the OEP covers a little bit of that process as well, and allows us to have a tool and a reporting mechanism for understanding exactly: when was the feature toggle created? When is the expiration time for it? What was the use case for it? And so forth. That allows us to then monitor those toggles. And yes, as you said, we are using django-waffle. django-waffle is great. It allows us to specify whether toggles are on or off, and also which subset of users we want to turn them on for, and things like that.
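Illustrative sketch (not from the episode): the reporting idea described above, where each toggle carries a creation date, an expiration date, and a use case, might look roughly like this. The `ToggleRecord` class, its field names, and the example toggles are all hypothetical. This is not the Open edX tooling and it does not use the django-waffle API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ToggleRecord:
    """Metadata kept alongside a feature toggle (all field names are hypothetical)."""
    name: str
    use_case: str   # e.g. "beta testing", "gradual rollout"
    created: date
    expires: date   # date by which the toggle and its code branch should be removed


def expired_toggles(records, today):
    """Return toggles past their expiration date, oldest expiration first."""
    overdue = [r for r in records if r.expires < today]
    return sorted(overdue, key=lambda r: r.expires)


# Made-up example data, for illustration only.
TOGGLES = [
    ToggleRecord("new_progress_page", "beta testing", date(2019, 1, 10), date(2019, 6, 1)),
    ToggleRecord("async_grade_writes", "gradual rollout", date(2019, 3, 5), date(2020, 3, 5)),
]

if __name__ == "__main__":
    for record in expired_toggles(TOGGLES, today=date(2019, 9, 1)):
        print(f"{record.name}: expired {record.expires} ({record.use_case})")
```

A report like this could run periodically to remind the owning team that a toggle, and the code branch behind it, is due for removal.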
Will Vincent 34:32
Yeah, it does seem to be the default, as far as I can tell anecdotally, that companies are using for this right now. Well, so another thing we've talked about that I want to highlight is Django and Celery usage. This is particularly relevant because at DjangoCon there was a lot of talk around async starting to be rolled out. So, both how you use Celery with Django and, I guess the huge question, if and when Django is fully async, do you see any use cases at edX? Or is it more of a side thing? Because that's the two-parter for a lot of folks: do I actually need it, when Celery tasks and queues work pretty well? So that's a lot of questions. Take whatever you want to answer.
Nimisha Asthagiri 35:17
Yeah, I'll start with a few principles and design concepts that I have, and then Dave can talk a little more about the details. That sounds good. OK. So, for us, because we're thinking about running at scale, we want to keep the response time back to the users within one to two seconds. And if there are going to be some tasks or some requests that take longer, either because we need to recompute your grades or because we need to do some extra work in the background and whatnot, it's very important for us to separate those into asynchronous tasks. One of the things that we found was that being intentional about separating our reads from our writes is very valuable. So basically, don't do too many side effects, don't do too many costly operations within a request, especially if it's a user-facing request. You need to be able to put some tasks into an asynchronous operation, and Celery with Django is definitely a technology that we've used. Initially we were doing it on RabbitMQ, then we converted to Redis for scalability reasons and other reasons. But that definitely has been a way for us to scale out our infrastructure. I would love, though, to move towards another model, because one of the downsides with Celery is that it's not a fully pub/sub model. If you want to be able to have the Celery tasks run on a separate service, you do need to know the API. The subscriber needs to know the publisher's API and vice versa, right? You know what I mean? You're too tightly coupled.
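Illustrative sketch (not from the episode): the principle above, keeping costly work out of the user-facing request and handing it to a background worker, can be shown with only the standard library. This uses `queue` and `threading` as a stand-in for Celery and a broker. The function names and the "202 Accepted" response are assumptions made for the example.

```python
import queue
import threading

tasks = queue.Queue()
grades = {}


def recompute_grade(user_id):
    """Pretend-expensive work that should not run inside the request."""
    grades[user_id] = "recomputed"


def worker():
    """Background worker: pull queued tasks and run them one at a time."""
    while True:
        func, args = tasks.get()
        try:
            func(*args)
        finally:
            tasks.task_done()


def submit_answer_view(user_id):
    """Request handler: enqueue the slow work and respond right away."""
    tasks.put((recompute_grade, (user_id,)))
    return "202 Accepted"


threading.Thread(target=worker, daemon=True).start()
response = submit_answer_view(42)
tasks.join()  # only for the demo: wait until the background work has finished
```

Note that the handler still has to import and name `recompute_grade` directly, which is the tight coupling between caller and task that the discussion goes on to criticize.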
Carlton Gibson 37:23
Oh, and in the standard setup you import your entire Django project, right? So you have to import the monolith into the Celery instance to run the queue. It would be nice if tasks could be sort of separate, if they didn't have to know.
Nimisha Asthagiri 37:37
Exactly, exactly. So anyway, that's what I would like to lead to eventually. I was thinking about more of an eventing architecture, but maybe there's something else, and I need to learn more about async and what it provides. There might be some things out of the box now, once we upgrade to Django 2 in a couple of months, but...
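Illustrative sketch (not from the episode): the publish/subscribe decoupling wished for above, where the publisher only names an event and does not know who consumes it, can be shown with a tiny in-process bus. The `EventBus` class, the `score_changed` event name, and the payload fields are hypothetical. A real eventing architecture would put a message broker where this dictionary is.

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a message broker (hypothetical, for illustration)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher only knows the event name and payload shape,
        # not which handlers (if any) will react.
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = EventBus()
recomputed = []
notified = []

# Two independent subscribers; neither is imported or named by the publisher.
bus.subscribe("score_changed", lambda event: recomputed.append(event["user_id"]))
bus.subscribe("score_changed", lambda event: notified.append(event["user_id"]))

bus.publish("score_changed", {"user_id": 42, "earned": 3, "possible": 5})
bus.publish("course_published", {"course_id": "demo"})  # no subscribers: a no-op
```

The contract between services then becomes the event name and payload, rather than a function the caller has to import.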
Will Vincent 37:54
Yeah, well, Django 3 is when it starts to get real, and even then, 3.0 has the ASGI server, and then the plan, I believe, Carlton, is that 3.1 will start with views. And then the ORM.
Carlton Gibson 38:08
Thereafter. More stuff thereafter.
David Ormsbee 38:11
We tend to hold to the long-term support releases.
Will Vincent 38:15
Oh, don't get me started on this.
David Ormsbee 38:17
Yeah, well, because
Will Vincent 38:20
Because people do. Most people do.
David Ormsbee 38:22
Yeah. But also because we need something, hopefully simple, for when we put out a release of Open edX for people to use. We name them after trees. So we put out Ironwood, and universities that use it are not going to upgrade until maybe the following summer or something, when school is not in session. So yeah, we're not going to see async on the big repo. The smaller repos tend to go up faster, because there's just less service inertia to move and also fewer groups run, say, the e-commerce service that we have, whereas everyone runs edX platform, because it's where the LMS and the authoring environment are, right? So, yeah, in terms of async, there are features I would love to play with.
Will Vincent 39:21
Yeah, like what?
David Ormsbee 39:22
I think Channels is great. I played around with version one of Channels before. I can't play around with version two on edX platform (because Carlton's in charge of maintaining Channels now... OK, that's not it). It's really cool stuff. And frankly, there's a lot of stuff around WebSockets and such that we'd love to use. Having async support for service-to-service calls, so that not blocking can make better use of resources, is great. Although I think edX platform is a bit unusual for a Django project, in that a lot of our performance challenges are actually CPU- and cache-related. Really?
Carlton Gibson 40:11
So, OK, what are you processing? Because while we've been talking, I've been imagining that you've got database scaling issues, you've got a lot of content in your videos, you've got course content, you've got sites loading, pages loading. But where are the calculation issues? That's kind of interesting.
David Ormsbee 40:28
OK, so this is history stuff.

So when edX started, again, imagine: it started with a single course that was being offered both to MIT students and as a MOOC version that ran two weeks behind the MIT cohort. And, you know, in those early days, Coursera or Udacity could say, "Oh, we're going to push back the start of this course by two weeks to help iron things out," and their non-paying students are like, "OK, that's cool, whatever." You don't tell MIT, "You have to delay the semester for two weeks," right? That's just not going to happen. So a lot of things were really, really quickly put together, and one of the things that was put together was the course format. In those early days, what course teams worked with was XML. It was a giant XML file. In fact, it was a giant XML file that was a Mako template. So the entire definition of what the course is, all the sequences, all the problems, was this big file that the project read on startup, and it's really easy to create a set of objects like this very quickly. And that's fine for the prototype. But obviously the prototype days are over, and we are in a world now where everything's in databases, like you would expect. But a lot of the access patterns come from those early days. When you're trying to prototype something, you're basically going for maximum power with minimum code. And it is easier to create really powerful interfaces than it is to create performant data models, right?

So, an example: if you look at the grading code, the original grading code for edX platform was basically: OK, here's my tree of content in the course. I'm going to crawl through the whole tree by checking my children, and for each node that is a gradable thing, like a problem, I'm going to ask that pluggable interface, "Hey, how many points are you worth? And how many points did the student earn, based on the current state of your problem?" Because we had this notion of XModules back then: videos are XModules, problems are XModules, everything. So you had some common set of interfaces you could ask. And that gives you a great deal of flexibility in terms of power. But at the same time, how long does it take to show the progress page when you have to ask this question of every node in the course? And the answer is, you have no idea. You created this interface, like max score and get score or whatever, and you've pushed off having to think about the data model. But in return for that, you've lost all ability to reason about the performance of the system. Right? So we had XModules that started sandboxed processes in Python and did RPC calls to get information, because you don't want untrusted code to be running. We had ones that would parse their internal XML and then, depending on how many response types they had inside, would change their score accordingly. We had ones that would make HTTP calls to another system entirely to get back the latest score for that person, because the first version of peer grading happened on a different service.

And so, if you wanted to make an XModule that returned a different max possible grade for usernames starting with C on Tuesday afternoons, there's nothing in the contract that explicitly forbids that. And so...
Will Vincent 44:39
I'm seeing a theme here with the architecture.
David Ormsbee 44:41
Right. But once you have that out there, yeah, that's fine for the prototype and for an exercise. That's not cool when you have millions of users and you're trying to run a course at scale. And so you find yourself trying to claw that back in a bunch of different ways. Eventually what happens is you see this shift away from your courseware being these smart objects that you can just make requests of, leaving it to them to figure out how to implement it. You shift that relationship to be: OK, I'm going to have a grading system. The grading system has a data model, and I know I can query it. I can ask for things like, "What is this student's grade in the course?" and get a quick reply. And then the XModules and XBlocks, the courseware, the individual leaf nodes, are going to push data into that data model so that I can query it efficiently. Right? So you have this kind of shift, and that applies to a bunch of things. We had to do that for grading. We're doing that for more scheduling-related things. But it's hard to do that in a backwards-compatible way, because a lot of course teams have expended a lot of effort on a lot of course content. And so I think one of the things that Nimisha was pointing out was that one of the ways we do this is to use async methods, like Celery tasks (not async as in the Python async), to shift the burden of processing. So for instance, right now there is a grading system, and it will hold your scores whenever they change. But there's a set of very complicated permissions calculations that came about because, hey, we had this whole thing loaded in memory already. Right?

Right. Yeah. And so now what we do is, when you change your score, all those crazy computations still happen, but they happen in a Celery task that runs and then puts your calculated score into this grading system, which has a data model that you can actually make guarantees about. And so that's been our bridge, in a lot of ways, to take things from the smart-object world to the discrete-services world. But yeah, that's the reason why we're CPU bound. There are so many things...
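Illustrative sketch (not from the episode): the shift described above, from asking every node in the course tree for its score on each page load to having leaf nodes push scores into a store that can be queried directly, might be contrasted like this. The class names, the method names, and the dictionary standing in for a database table are assumptions made for the example. None of this is edX platform code.

```python
class Problem:
    """A gradable leaf node exposing a small pluggable scoring interface."""

    def __init__(self, earned, possible):
        self.earned, self.possible = earned, possible
        self.children = []

    def get_score(self):
        return self.earned

    def max_score(self):
        return self.possible


class Section:
    """A non-gradable container node."""

    def __init__(self, children):
        self.children = children


def grade_by_crawling(node):
    """Original style: walk the whole tree and ask every gradable node for its score."""
    if isinstance(node, Problem):
        return node.get_score(), node.max_score()
    earned = possible = 0
    for child in node.children:
        child_earned, child_possible = grade_by_crawling(child)
        earned += child_earned
        possible += child_possible
    return earned, possible


class GradeStore:
    """Later style: leaf nodes push scores in, and reads go to stored rows."""

    def __init__(self):
        self._scores = {}  # (user_id, problem_id) -> (earned, possible)

    def record_score(self, user_id, problem_id, earned, possible):
        # Per the discussion, a write like this would happen in a background task.
        self._scores[(user_id, problem_id)] = (earned, possible)

    def course_grade(self, user_id):
        rows = [score for (uid, _), score in self._scores.items() if uid == user_id]
        return sum(e for e, _ in rows), sum(p for _, p in rows)


course = Section([Section([Problem(2, 5), Problem(3, 3)]), Problem(0, 2)])

store = GradeStore()
store.record_score("ana", "p1", 2, 5)
store.record_score("ana", "p2", 3, 3)
store.record_score("ana", "p3", 0, 2)
```

Both paths give the same totals for this data. The difference is that the cost of `grade_by_crawling` depends on whatever each node's scoring method happens to do, while the cost of `course_grade` depends only on the stored rows.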
Carlton Gibson 47:22
Because if I was writing a new module, a new plugin, I'd just go straight to the new data store? I wouldn't go by the old mechanism. Is that right?
David Ormsbee 47:34
Yeah, I mean, probably. Yeah, definitely. There's a grading system now, and if you want to ask questions about a student's grade, yeah, you would go to the new thing. You wouldn't want to mess with it in the old way. Nimisha, you were going to say?
Nimisha Asthagiri 47:50
Yeah, I was just going to say that one of the design principles we were thinking about when we implemented the grading system, and, you know, talking about persistent grades: before, grades were computed on the fly, and now we're able to read them from the database. But in order to make that happen, because we had so much flexibility with those XModules and XBlocks and what they could do, giving us data at runtime and all that stuff, we were inspired by something like a reverse ETL. With ETL, you're reading the data, then you're transforming it, and then you're writing it into a new form. It's a reverse in the sense that we allow these plugins and these XBlocks to write, to basically give us a lot of that content and push it into a form that they are then able to transform however they want for read optimization. So that's where that separation of reads versus writes comes in: we want to have very fast, read-optimized views of this data. So we implemented this very simple interface, basically allowing anyone to collect the data and then allowing anyone to transform the data. And then we automatically go through all of the transformers that are registered in our system and run them through sort of an ETL type of thing, except it's LTE, in that it's allowing people to write first. And then when it comes to responding to the user, we're able to do that very quickly, because it's already transformed for read optimization. And then response times are, once again, improved.

So it's tricks like this that have allowed us to figure out how to take what we had legacy-wise, where there were a lot more generic interfaces. We're going to be more intentional going forward with what our APIs and interfaces are, but we still needed to support the old ones. So how do we do that in a way that's performant, now with optimizations in mind?
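Illustrative sketch (not from the episode): the collect-then-transform interface described above, with registered transformers run ahead of time so that reads become plain lookups, could be approximated like this. The registry, the method signatures, and the block layout are hypothetical. As a simplification, both steps run at publish time here. This is not the Open edX transformer API.

```python
TRANSFORMERS = []


def register(cls):
    """Class decorator that adds a transformer instance to the registry."""
    TRANSFORMERS.append(cls())
    return cls


@register
class MaxScoreTransformer:
    def collect(self, block):
        # Expensive step, run at write/publish time: ask the plugin for its data.
        return {"max_score": block["compute_max_score"]()}

    def transform(self, collected):
        # Cheap step: reshape collected data into the read-optimized form.
        return {"possible": collected["max_score"]}


def publish_course(blocks):
    """Write path: run every registered transformer once and store the results."""
    read_model = {}
    for block_id, block in blocks.items():
        row = {}
        for transformer in TRANSFORMERS:
            row.update(transformer.transform(transformer.collect(block)))
        read_model[block_id] = row
    return read_model


calls = []


def slow_max_score():
    calls.append("called")  # track how often the expensive plugin code runs
    return 5


blocks = {"problem_1": {"compute_max_score": slow_max_score}}
read_model = publish_course(blocks)

# Read path: plain lookups, no plugin code executed.
first = read_model["problem_1"]["possible"]
second = read_model["problem_1"]["possible"]
```

The plugin's arbitrary code runs once on the write path, so the time to answer a user request no longer depends on what any individual plugin does.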
David Ormsbee 49:50
And actually, in defense of the people who made that system originally: it is really powerful, right? If you are in a situation where you don't know what the interface really is, what you should and shouldn't allow for grading, and also where there are bugs all over the place, right? We were building the platform as the first course was running on it. And, oh, you detect that this bug is mis-scoring someone, or a student's score was interpreted wrong. Then if you store it in a persistent state, OK, we have to have tasks to regrade them, and we re-store, modify their scores, and whatever. If it's always dynamically computed on the fly, then, yeah, just reload the progress page, boom, your score is fixed. So it was one of those things. And one of the painful things about moving from querying the smart object to separate systems is that you do lose some power, right? The very original version of the prototype that we had had A/B tests running, in a totally hacky, crazy way, but they were running. And it took us years to get that functionality back in a more predictable and performant way. So there were trade-offs. And I have suffered from the grading system more than most, and so has Nimisha. So I'm definitely not going to say that we should accept it. But I feel like it's important to note that there are trade-offs, especially for something as young and quick as what we were doing back then.
Carlton Gibson 51:38
When you're prototyping, to be able to just write a Python class and keep everything in memory, you know, pickle it if you have to, persist it if you want, that's a great development environment. And that's a great way of finding out what your requirements are when people can't specify them for you. You can say, "Oh, well, what does your grading system involve?" It's like, "A to F." No, there's loads more than that. What does it actually involve? "Well, we don't know." Let's build something and let you use it. And then, when we've exposed what the actual requirements are, we can build something which scales up.
Will Vincent 52:08
As we wrap up, one thing, if we can: I'd love to talk about the automated discovery of CSRF issues, because this is something, Nimisha, you and I briefly talked about at Django Boston. You'd mentioned that, just generally, there were a number of security things that edX had come up with to help with Django but that weren't part of Django core.
Nimisha Asthagiri 52:27
Yes, yes. So, I mean, security is definitely very important to us. I think Django has definitely improved, at least in the last five years that I've seen, in terms of having more secure defaults. For us, we do have a security working group within edX that triages issues that come in, and over time some issues have been more prevalent than others. The automated discovery of CSRF issues was something that I had written a while back, and I'm trying to just context switch to what that was exactly. But I think what that was about was, basically, side-effecting GET requests, right? Whenever someone makes a GET request, we want to make sure that there are no modifications to a model or to any data in your system on a GET request.
David Ormsbee 53:23
Yeah. So in courseware, this actually was the case for various student tracking things.
Nimisha Asthagiri 53:30
Yes. And courseware, yes, might have had a legitimate reason when it did it that way. But basically, I had created this middleware. It was a Django middleware, I believe, and it detected whenever a GET request was being made. And then I think it used Django signals to track any post-save model changes, and so it was able to detect if any Django model changed whenever a GET request was made, as opposed to a POST request, right? And then basically we reported this, and you would be able to realize that this is a potential CSRF issue that needs to be fixed. And I think there were other things, too, that it detected. It also may have detected if there were Celery tasks being initiated whenever a GET request was made. So, anything that could potentially have had a side effect. So that's one thing that we had created. I think what happened was we found a few CSRF issues within our system, and we have been fixing them over time. At some point we wanted to make it a public application that others could also use, but we haven't gotten around to that yet. So we'd love to be able to contribute that as well.
David Ormsbee 54:57
There's also, like, Robert's CSRF linter, like template-level CSRF stuff that runs in our Jenkins build. So...
Nimisha Asthagiri 55:04
Yeah, those are XSS. Oh, I'm sorry. Robert has implemented an XSS linter as well. So he's able to detect if there are any XSS issues in Django templates, and maybe other things as well. Yeah.
Will Vincent 55:19
So, Carlton, would Django be interested in these things?
Carlton Gibson 55:21
Yeah. I mean, quite probably, if they're extracted nice and small and easy to implement. And just going back to the CSRF example, that's a great use of signals, right? People use signals as a way of decoupling their application, but they tend to get overused, and they can lead to hard-to-understand and hard-to-follow code. So the sort of maxim I always use is: do you know, at runtime, who the receiver of this signal is going to be? Who's going to act on the signal? If you do, don't use a signal. But of course, when you're monitoring whether there are any saves across any models in this request, you don't know who's going to be sending it. So it's a perfect case for signals. It's a lovely example.
Nimisha Asthagiri 55:56
Right. Now, signals can be very powerful, and there are actually good use cases for them. So this is definitely one. Another one, I will say, is when we're thinking about the monolith and the core, and then we have these Django app plugins. Because there, too, we don't know who the recipient is, and it's not core to our platform, we'd actually prefer to use Django signals as a way of communicating that out. But then, to our earlier point, that then becomes an intentional interface.
Carlton Gibson 56:21
Yeah, and you have to document that and provide guarantees and
Nimisha Asthagiri 56:24
Audit, and all that stuff. But Django signals do allow asynchrony, and they allow decoupling of your components. But yeah, I agree with you. That's a good rule of thumb. If you don't know the recipient, then it definitely makes sense. This is a possible interface that you could...
Will Vincent 56:41
Yeah, I mean, I'm sure we could keep talking. I know we're kind of running out of time. Thank you so much for sharing the realities of a large-scale codebase, because this is how it is for everyone. There's always a trade-off between prototype and legacy. And, you know, when beginners ask me about these questions, I sometimes say it's an honor to be where you all are at. As frustrating as it is, these are the problems you want to have: that you're at scale, and you're keeping up to date with Django. So it's fascinating to hear, kind of under the hood, how it's all working and how you're thinking about it. And we'll link in the show notes to a number of these, especially the OEP. That'd be great. Yeah, really, super interesting. So again, thank you, David. Thank you, Nimisha.
Carlton Gibson 57:25
Yes, thank you both.
Will Vincent 57:26
And for those listening, we're at djangochat.com and @ChatDjango on Twitter, and we'll see you all next time. Bye-bye. Join us next time. Bye-bye.