Monday, November 9, 2009

View Sanitizing and Micro-Optimizing

Maurício Linhares posted an intriguing response to my recent post about auto-sanitizing Rails views. I was just gearing up to respond via a comment when I realized I could probably turn it into a full post, so here goes my response!

So, first of all, let me applaud Maurício for actually writing some code and sharing it, rather than keeping the discussion academic and merely flinging arguments around the interwebs. I can't say I always do it, but I have the most fun writing a blog post where I show some code to achieve a solution to a problem I am having. He even went the extra mile to create a plugin for his idea.

That said, I must respectfully disagree with his approach. The gist of how he tackles the problem is to sanitize data as it comes in rather than when you are displaying it. He even argues in his blog post that it is "more adherent to the MVC":

Now, as the data is cleanly stored in your database, you don’t have to waste CPU cycles cleaning up data in your view layer (and you can even say that you’re more adherent to the MVC, as cleaning up user input was never one of it’s jobs).

He makes a convincing argument that the view layer should not sanitize input. My big problem with this is that you have actually taken very specific view details and moved it into the controller and model layers, contrary to what he is claiming. Namely, you have introduced into the controller/model layers the idea that your data is going to be displayed via HTML. However, what if you want to expose a JSON API later on via the same controllers, but with new views? Now, you will need to unsanitize the data, and resanitize it for JavaScript output! You have inadvertently snuck view information into the database! Your data is pigeonholed as HTML data, and it now takes double effort to use the data in another matter (such as JSON data).

This last point deserves some extra attention. Consider if our transform on the data was a lossy transform. This case isn't, because you can easily unsanitize sanitized HTML, but forget that for a second. For example, let's say we wanted all data to be sanitized and censored, such that words like "ass" and "crap" got changed to "***". If we had a bug that caused "crass" to be changed to "cr***", we have just lost information that is irretrievable. If we saved the sanitizing and censoring for the view, where it belongs, we could always fix the censoring code and our "high fidelity" representation will allow us to now correctly show "crass." Let me quote a Stack Overflow podcast, where Joel explains this same position:

Spolsky: Here's my point. Uhh, in general, my design philosophy, which I have learned over many years, is to try and keep the highest fidelity and most original document in the database, and anything that can be generated from that, just regenerate it from that. Every time I've tried to build some kind of content management system or anything that has to generate HTML or anything like that. Or, for example, I try not to have any kind of encoding in the database because the database should be the most fidelitous, (fidelitous?) highest fidelity representation of the thingamajiggy, and if it needs to be encoded, so that it can be safely put in a web page then you run that encoding later, rather than earlier because if you run it before you put the thing in the database, now you've got data that is tied to HTML. Does that make sense? So for example, if you just have a field that's just their name, and you're storing it in the database, they can type HTML in the name field, right? They could put a < in there. So, the question is what do you store in the database, if they put a < as their name. It should probably just be a < character, and it's somebody else's job, whoever tries to render an HTML page, it's their job to make sure that that HTML page is safe, and so they take that string, and that's when you convert it to HTML. And the reason I say that is because, if you try to convert the name to HTML by changing the less than to &lt; before you even put it in the database. If you ever need to generate any other format with that name, other than HTML - for example you get to dump it in HTML to an Excel file, or convert it to Access, or send it to a telephone using SMS, or anything else you might have to do with that, or send them an email, for example, where you're putting their name on the "to" line, and it's not HTML - in all those cases, you'd rather have the true name. You don't want to have to unconvert it from HTML.

Yes, it is tedious and error prone to use "h" everywhere, but that is the exact same problem I was trying to address in my post. However, I feel training myself to use <%: foo %> over <%= h foo %> is a better muscle memory than marking all input as sanitizing. Let's consider the consequences if you forget to apply the new scriptlet versus if you forget to apply the sanitizing of inputs. If you forget the new scriptlet, you have a new XSS hole that needs to be closed by simply changing "=" to ":" (or alternatively adding a call to "h"). If you forget to use the sanitizing of inputs, you have 2 major problems. You have an unknown amount of XSS exploits (everywhere you display that data, which could be in many places). You also have a bunch of data that is now invalid. You now need to either add sanitizing to all the view locations you output the information (which would be tedious and contrary to the whole point of Maurício's approach), or you need to update all existing records to be sanatized, just before enabling sanitizing of input.

There is another issue with this approach that a new scriptlet tag avoids. By making the sanitize decision in the view layer, you have the option of exactly what you will sanitize. Let's consider a site like a blog or Stack Overflow. In such applications, you want some amount of HTML displayed, though not necessarily on all fields of the model. You might want to whitelist sanitize the blog post, question text, or answer text, yet fully sanitize the labels or tags. Granted, you could update the plugin to allow such complexity, but it will be just that... complexity that will bleed through how you invoke the plugin. You will now need to not only specify which actions or controllers use this sanitizing, but also which parameters are excluded from being sanitized.

All of the above pales in comparison to the biggest sin of sanitizing the parameters, and it is one of my biggest pet peeves. It is one of the key points for why Maurício chose the path he did. Premature optimization.

The argument goes that rather than waste the CPU cycles every time you load the page (which is hopefully a lot), you should waste the cycles once as the input is being passed in and saved to the database. Premature optimization usually rears its ugly head in the form of much more insane choices, like insisting on how you should concat your strings. Thankfully, Jeff Atwood has already done metrics showing us that it doesn't matter.

Is sanitizing as quick as string concatenation? Probably not. I would be willing to bet, though, that it is fast enough for a small website. Why waste extra consideration on it until you have the awesome problem of having too many users?

Let's take a step back. Is pushing View logic into Model/Controller territory even worth the possible performance benefit? If I am going to throw proper MVC separation concerns out the window, it better be for a damn good reason. Allowing us to get orders of magnitude more pageviews might be worth it (if metrics proved that it was the best possible improvement, which is doubtful, but let's consider it). The whole concept is to cache something you are doing a lot by doing it once, before it even goes to the database. Let's extrapolate that concept. Instead of sanitizing first so we don't have to sanitize on every page view, why don't we cache the result of the action invocation itself? For every action/params pair that produces a view based strictly on data in the database, we could invoke the action once and just cache the rendered view for future use. Then, simply blow the cache away and re-render when you change the related database row(s). It is typically much more common to view than to update, so I expect this approach would give significantly better performance benefits than simply avoiding a bunch of sanitization calls.

With some careful thinking, we now have a much better solution to remove all the redundant sanitize invocations... and we've even removed redundant calls to the database, and any other costly algorithms we have done within our actions! All while preserving proper MVC separation of concerns. You can bet that I will explore this space when I have the fortunate problem of having too many users (and I wouldn't be surprised if there are available solutions that match my description).

Sorry for going off so much on your very well-meaning post, Maurício! I think you brought a very interesting possible solution, and it's always great to see code brought to the table. However, I do feel we should all seriously consider all approaches, and fully consider the consequences of the path we choose... not just to this particular problem, but any problem. It's best to drill down early and think about what issues our code may cause for us in the future. Don't consider this an excuse to dwell on issues to the point of failing to release useful functionality, though.


angelbob said...

My take on it would be that auto-sanitization (sanitation?) should be entirely orthogonal to the whole MVC thing.

When you sanitize input or output, what you're doing is converting it to a different format. If you're smart, you'll convert it in a non-lossy way (i.e. compression is okay, censorship like what you mention is not). As Joel says, fidelity to your actual input is a very, very good thing.

However, format conversion isn't really view logic or controller logic per se. It might be model logic in some cases, but mostly it's framework logic -- you don't really care about it in MVC terms, it's just a conceptual change of (you hope) the exact same data.

I agree that optimization is the wrong argument here. I think the right argument is probably more about where such a format conversion should occur.

Good sanitation should be fully reversible, since it's just string escaping. So it's really okay either way, as long as it's well-defined when and where it occurs.

Existing format conversion occurs everywhere in this process. In your browser, in your web server, in Rails, going to and from SQL, often in models, frequently in views or helpers... So I think it's unreasonable to say "this is where we should do this step" as long as its just a reversible format conversion. There's no single obvious place.

Bearing all that in mind, then, the big advantage he's looking for is, as you said, the same as what you gave. You don't want to manually type all the ERB invocation and h() calls by hand. Your plugin, and many others, are fine choices for that. A variation on your plugin that did auto-sanitize on input and then provided an easy way to de-sanitize on command would also be just fine.

And as you say, the performance is just not a big deal.

Maurício Linhares said...

Hey Mike,

First of all, thanks for the input, I think this is a topic really woth discussing :)

The optimization issue is just a little joke, what I'm really looking for here is, just like you, ease of use. I don't want to have to clean up the data whenever I use a model, I want to have it already cleaned up and ready to be used wherever I want.

But about lossy and lossless I strongly disagree with you and Joel. If I have defined that I will not accept any possible XSS data in my database I don't care if the bad data is wasted, it is crap, I don't even want to think about having it in my database at all.

Obviously, this isn't a one size fits all solution, it has been working correctly so far for me as my needs were simple. For mode complex needs, features or flexibility, rolling your own solution or just using "white_list" and "h" might do the trick, but I find it easier to know that the information I have stored in my database is safe to be used whenever I want, even if I have forgotten to escape it on the view.

And the next features I'm planning to add to the plugin are a way to select which params are going to be sanitized and custom sanitizers ;D

Mike Stone said...

Thanks for the comments guys! I guess it all just comes down to preference, and whatever you personally feel is the most convenient option.

@anglebob: I still feel sanitizing most belongs in the view layer, since that is the layer that knows what format is actually being output (in the case of HTML, you could be outputting in both html and javascript). But, I get your point that since you aren't really losing fidelity, it's not really an issue... and that it doesn't necessarily belong in any particular place.

@Maurício: I'm glad to hear the performance comment was a joke... it didn't quite seem like it when reading, but sarcasm can sometimes be hard to detect in print. I guess we will have to agree to disagree on the approach. In the end, as long as the output is sanitized, the rest is just unnecessary details.