A General Defence Against Injection Attacks on Websites

By Adrian J. Beasley

The usual range of IT Security techniques is of little use against injection attacks. They can mitigate some of the effects of such attacks by, for example, setting proper permissions on resources, and ensuring that the website accesses them as a user with the least privilege it needs. But ultimately you are inviting user input, and the only reliable defence against malevolent data is to deal appropriately with what the user/attacker actually inputs.

The defining characteristic of an injection attack is that the input data is crafted in such a way as to fool the interpreter dealing with it (typically the SQL engine) into treating parts of it as commands, rather than as purely textual, and executing the commands, to yield results of varying seriousness, but all certainly unintended by the application.

  “All input is evil.”

This article suggests a general defence against such attacks, in the normal case where user input is received from a website, and the contents are written to a database. It is based on the observation that such attacks are not possible using only alphanumeric (and space) characters. (If anybody can tell me differently, please comment!) So why not simply convert all non-alphanumeric characters into the corresponding HTML Entity form, and store the data in the converted (and thus neutralised) form?

(For readers not familiar with HTML Character Entities: the HTML standard allows any character to be written in the form &#nnn; where nnn is the character's code number in decimal, written here with three digits, so &#044; is a comma and &#045; is a hyphen. When such an entity is sent to a browser, it displays the corresponding character. Certain entities also have names: &quot; &amp; &lt; and &gt; are the same as &#034; &#038; &#060; and &#062; and represent " & < and > respectively. All web programmers will be familiar with &nbsp; which is the same as &#160; and represents the non-breaking space character.)

The germ of this idea is contained in functions such as HtmlEncode, which is available in various .NET classes and is applied to character data before sending output to a browser. It converts any occurrences of " & < > and ' into &quot; &amp; &lt; &gt; and &#039; so that those characters are reproduced as such by the browser rather than being interpreted as part of HTML tags. Its purpose is to prevent HTML-Injection attacks by your website against the browsers of visiting users. (This is the situation where an attacker has caused injection code to be stored in your database, not attacking you directly but causing your website to attack unsuspecting visitors. Fun, eh?) The original contribution of this article is to ask: why stop there?

By converting all the non-alphanumeric characters in user input into the HTML Entity form, SQL- as well as HTML-Injection attacks (and probably any other form) are neutralised. The beauty of the technique is that it doesn’t require validating the input to search for particular types of attack; it simply renders any attack ineffective, welcoming the attack code and storing it in harmless form internally. When sent back to the browser, it is displayed exactly as originally input. (I cherish the image, probably over-optimistic, of a baffled hacker redisplaying his cunning attack code and wondering why it didn’t seem to be having the desired effect on its target.) The only place where the data are visible in the converted form is in the database itself. Since the alphanumeric and space characters are unaffected, the meaning is clear, and you quickly come to recognise the common HTML entities. Should it be necessary to process the data, in generating reports, for example, the reverse transformation is easily incorporated. In fact, the only downside I can think of in incorporating the technique is that if you are already using the HtmlEncode function, you must remove it; otherwise it will carry out the transformation a second time for the five characters it deals with and generate nonsense (the & of every stored entity becomes &amp;, so the browser shows the entity text instead of the character).
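The reverse transformation mentioned above might look something like the following. This is only a minimal sketch: the name restoreString is hypothetical (it does not appear in the original code), and it assumes every entity in the stored data is in the decimal &#nnn; form produced by the neutralising functions in this article.

```javascript
// Hypothetical helper, not part of the original article: turns each decimal
// &#nnn; entity (one to three digits) back into the character it stands for.
function restoreString(strValue)
{
 return strValue.replace(/&#(\d{1,3});/g, function (match, digits) {
  return String.fromCharCode(parseInt(digits, 10))
 })
}

// A neutralised value as it might sit in the database, and its restored form.
var storedValue = "O&#039;Brien&#044; Dept&#046; 7"
var restoredValue = restoreString(storedValue)
console.log(restoredValue) // O'Brien, Dept. 7
```

Because the replacement is made in a single pass, text that a user typed literally as an entity (stored as &#038;&#035;039&#059;) is restored to the six characters the user typed and is not decoded a second time. The restored value is raw user input again, so it should be used only for processing such as reports, never sent to a browser or a SQL engine unconverted.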

The following JavaScript illustrates how the technique might be implemented.

The function htmlCharacterEntity returns the entity form of the single character input parameter (or INVALID if out of range). It obtains the numeric code of the character with the built-in charCodeAt method, which is the counterpart of String.fromCharCode, and pads the result to three digits.

function htmlCharacterEntity(charStr)
{
 // Returns the (6-char &#nnn;) HTML Character Entity value corresponding to the single char supplied.
 // If the value supplied is not a single character in the range 0-255, the value INVALID is returned.

 var charIndex = charStr.charCodeAt(0) // Numeric code of the character (NaN for an empty string).

 if (charStr.length != 1 || charIndex > 255)
 {
  return "INVALID"
 }

 var charValue = String(charIndex)
 while (charValue.length < 3)
 {
  charValue = "0" + charValue // Pad with leading zeros to three digits.
 }
 return "&#" + charValue + ";"
}

The function neutraliseString does exactly what it says, taking a general input string and converting any non-alphanumeric characters to the entity form, using the htmlCharacterEntity function above.

function neutraliseString(strValue)
{
 // Returns the supplied string with all non-alphanumeric (and non-space) characters converted to the
 // corresponding HTML Character Entity form. This neutralises any SQL- or command-injection attack
 // contained in the string (typically user input to a web form), while reproducing the original
 // string when the neutralised value is displayed in a browser.

 return strValue.replace(/[^ 0-9A-Za-zÀ-ÖØ-öø-ÿ]/g, function(charStr) {return htmlCharacterEntity(charStr)} )
}
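To see the effect on a typical attack string, here is a self-contained sketch. The names toEntity and neutralise are stand-ins invented for this example (they are not the article's functions), toEntity assumes the built-in charCodeAt method, and the accented ranges of the character class are written as \u escapes so that the snippet does not depend on file encoding.

```javascript
// Compact stand-in for htmlCharacterEntity (hypothetical name, assumes charCodeAt).
function toEntity(charStr)
{
 var code = charStr.charCodeAt(0)
 if (code > 255) { return "INVALID" }          // Same out-of-range convention as the article.
 return "&#" + ("00" + code).slice(-3) + ";"  // Zero-pad to three digits.
}

// Stand-in for neutraliseString: same character class, accents as \u escapes.
function neutralise(strValue)
{
 return strValue.replace(/[^ 0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]/g, toEntity)
}

var attack = "Robert'); DROP TABLE Students;--"
var neutralised = neutralise(attack)
console.log(neutralised)
// Robert&#039;&#041;&#059; DROP TABLE Students&#059;&#045;&#045;
```

The quote, bracket, semicolons and hyphens that the attack relies on are all stored as entities, while the letters and spaces remain readable in the database.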

This article arose from work I performed for a client. The client wanted to allow users to input certain HTML tags, specifically <br>, <b>, </b>, <i> and </i>, so that more technically aware (i.e. nerdy) users could apply simple formatting to their input, effective when redisplayed at the browser. No problem. The function neutraliseInputString splits up the string on the specified exceptions, allowing these through but applying the neutraliseString function to the parts in between. This is an excellent illustration of the sound security principle of blocking everything except what is specifically allowed.

function neutraliseInputString(strValue)
{
 // Same as neutraliseString, but allows certain nominated constructs through, while neutralising the rest.

 var rgxExceptions = new RegExp("(<br>|<b>|<\/b>|<i>|<\/i>)", "i") // Case-insensitive matching!
 var strArray = strValue.split(rgxExceptions)
 var strOut = ""

 for (var strIndex = 0; strIndex < strArray.length; strIndex++)
 {
  if (strArray[strIndex].search(rgxExceptions) < 0)  // Search returns -1 if not a match,
  {
   strOut = strOut + neutraliseString(strArray[strIndex]) // in which case, neutralise it,
  }
  else
  {
   strOut = strOut + strArray[strIndex]  // but pass matches straight through unchanged.
  }
 }

 return strOut
}
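The same allow-list idea can also be sketched as a single pass over the input. Note that this is an alternative to the split-based function above, not a copy of it: the name neutraliseAllowingTags is hypothetical, the allowed tags are matched in either case by spelling out both cases in the pattern, and charCodeAt is assumed for the conversion.

```javascript
// Hypothetical single-pass variant, not the article's split-based code.
function neutraliseAllowingTags(strValue)
{
 // First alternative (captured): one of the allowed tags <b> </b> <i> </i> <br>, in either case.
 // Second alternative: any single character outside the alphanumeric-and-space class.
 var rgx = /(<\/?[bBiI]>|<[bB][rR]>)|[^ 0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]/g

 return strValue.replace(rgx, function (match, allowedTag) {
  if (allowedTag) { return allowedTag }        // Allowed tags pass straight through.
  var code = match.charCodeAt(0)
  if (code > 255) { return "INVALID" }         // Same out-of-range convention as the article.
  return "&#" + ("00" + code).slice(-3) + ";" // Everything else is neutralised.
 })
}

console.log(neutraliseAllowingTags("<B>bold</B> <script>alert(1)</script>"))
// <B>bold</B> &#060;script&#062;alert&#040;1&#041;&#060;&#047;script&#062;

console.log(neutraliseAllowingTags("<b onclick=x>"))
// &#060;b onclick&#061;x&#062;
```

The second call shows why exact matching matters: a tag that merely starts like an allowed one is not on the list, so it is neutralised like any other input.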

The above technique is offered to fellow professionals as a weapon in the fight against the common enemy. I can hardly imagine that I am the first to have thought of it, but I have not seen it documented anywhere, despite focused searching. You may choose a more or less restrictive interpretation of which characters are dangerous. I choose to allow alphanumeric characters through because I don’t believe any useful purpose is served by converting them, but I can’t tell in advance which other characters might prove susceptible to misuse so I convert all the rest. The entity form takes 6 characters per displayed character, so there could be a storage penalty if used too enthusiastically. Also, keeping alphanumeric unchanged, the database contents are still easily readable. However, should it prove possible to mount injection attacks using alphanumeric characters only, the technique could still be used, but converting all characters. At present I see no justification for that.
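To put a rough figure on the storage penalty, the sketch below compares the selective conversion with the convert-everything fallback for one short sample string. All three function names are hypothetical and exist only for this illustration, charCodeAt is assumed for the conversion, and the figures apply to this sample only.

```javascript
// Hypothetical stand-in for htmlCharacterEntity (assumes charCodeAt).
function toEntity(charStr)
{
 var code = charStr.charCodeAt(0)
 if (code > 255) { return "INVALID" }
 return "&#" + ("00" + code).slice(-3) + ";"  // Always six characters for codes 0-255.
}

// Selective form: alphanumeric and space characters pass through unchanged.
function neutraliseSelective(strValue)
{
 return strValue.replace(/[^ 0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]/g, toEntity)
}

// Fallback form: every character, alphanumeric or not, is converted.
function neutraliseEverything(strValue)
{
 return strValue.replace(/[\s\S]/g, toEntity)
}

var sample = "Hello, World!"
console.log(sample.length)                       // 13
console.log(neutraliseSelective(sample).length)  // 23
console.log(neutraliseEverything(sample).length) // 78
```

For ordinary text the selective form costs five extra characters per punctuation mark, whereas converting everything multiplies the stored length by six and leaves the database contents unreadable.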