httpd, decoding strings and all that

Mon Apr 17 23:09:34 UTC 2006

  ok, it's finally becoming clear what's happening here so i figured
i'd give a quick summary.  feel free to be bored and change the
channel.

  in the current httpd.c, we have the decodeString() routine which,
because of its name, i thought was just for decoding "%xx" substrings
in a general string:

===================================================================
static char *decodeString(char *orig, int flag_plus_to_space)
{
  /* note that decoded string is always shorter than original */
  char *string = orig;
  char *ptr = string;

  while (*ptr)
  {
    if (*ptr == '+' && flag_plus_to_space)    { *string++ = ' ';
ptr++; }
    else if (*ptr != '%') *string++ = *ptr++;
    else  {
      unsigned int value1, value2;

      ptr++;
      if(sscanf(ptr, "%1X", &value1) != 1 ||
                sscanf(ptr+1, "%1X", &value2) != 1) {
    if(!flag_plus_to_space)
        return NULL;
    *string++ = '%';
      } else {
    value1 = value1 * 16 + value2;
    if(value1 == '/' || value1 == 0)
        return orig+1;
    *string++ = value1;
    ptr += 2;
      }
    }
  }
  *string = '\0';
  return orig;
}
====================================================================

  what threw me was 1) why "/" and the null byte were being treated
specially, and 2) what was the point of returning the address of the
*second* character of the string parameter there near the bottom?
what did that represent?

  re: that return value, i can now see that that's how this routine
returns an error condition.  and that error condition is ... if it
hits a "/" or null byte.  but what's so special about those two
characters?

  well, if you look at the corresponding routine in apache-1.3.34, it
becomes clear:

===================================================================
API_EXPORT(int) ap_unescape_url(char *url)
{
    register int x, y, badesc, badpath;

    badesc = 0;
    badpath = 0;
    for (x = 0, y = 0; url[y]; ++x, ++y) {
        if (url[y] != '%')
            url[x] = url[y];
        else {
            if (!ap_isxdigit(url[y + 1]) || !ap_isxdigit(url[y + 2]))
{
                badesc = 1;
                url[x] = '%';
            }
            else {
                url[x] = x2c(&url[y + 1]);
                y += 2;
                if (url[x] == '/' || url[x] == '\0')
                    badpath = 1;
            }
        }
    }
    url[x] = '\0';
    if (badesc)
        return BAD_REQUEST;
    else if (badpath)
        return NOT_FOUND;
    else
        return OK;
}
===================================================================

  aha!  decodeString() is not meant to decode *any* old string, it's
meant to decode specifically a *URL*, so the error on a slash or a
null byte makes perfect sense now.  under the circumstances, it would
make far more sense to rename decodeString() to, well, unescape_URL()
or something to make it more obvious.

  in addition, the apache routine (IMHO, more reasonably) now returns
an int error code to represent an error, rather than messing with the
argument address.  again, i think a better idea.  but, if you're still
with me, look at what happens in httpd-2.2.0 to that routine.

==================================================================

AP_DECLARE(int) ap_unescape_url(char *url)
{
    register int badesc, badpath;
    char *x, *y;
...
                if (IS_SLASH(*x) || *x == '\0')
                    badpath = 1;
...
}

AP_DECLARE(int) ap_unescape_url_keep2f(char *url)
{
    register int badesc, badpath;
    char *x, *y;
...
               if (decoded == '\0') {
                    badpath = 1;
...
}
====================================================================

  there are now *two* versions of this routine, the latter of which
does *not* treat a slash as an error.  (lord, talk about overkill --
i'm sure all that could have been done with common code and a single
flag argument.  yeesh.)

  so, in httpd-2.2.0, why would you *not* want to treat "/" (%2f) as
an error?  AFAICT, there is only *one* place in all of httpd-2.2.0
that makes that call, in server/request.c, for the following reason:

====================================================================
...
    /* Ignore embedded %2F's in path for proxy requests */
    if (!r->proxyreq && r->parsed_uri.path) {
        core_dir_config *d;
        d = ap_get_module_config(r->per_dir_config, &core_module);
        if (d->allow_encoded_slashes) {
            access_status = ap_unescape_url_keep2f(r->parsed_uri.path);
        }
        else {
            access_status = ap_unescape_url(r->parsed_uri.path);
...
====================================================================

and that's the only apparent reason for allowing encoded %2f.

  in any event, it now makes way more sense, but i'd still like to see
that stringDecode() routine renamed to reflect the fact that it's
specifically processing a URL.  and passing back real return codes
rather than messing with that argument pointer.

rday