Since I'm doing some HTTP proxying I've been
thinking about how it should work. And I really don't know -- it's
all very vague. Much more vague than it should be. HTTP won't be a
pipe if we have to rely on ad hoc configuration everywhere.
Some of the issues...
Some information gets covered up. For instance, in many HTTP proxying
situations the Host header is lost; changed into localhost or
something that doesn't accurately represent the initial request. I
guess there's an informal standard that X-Forwarded-Host
contains the original Host value.
Also the remote IP address is lost, but there's a more widely used
convention to put that information in X-Forwarded-For. This seems
much less interesting than Host to me, but much more widely
supported. Go figure.
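
To make the convention concrete, here is a minimal sketch of how a backend might undo the cover-up. It assumes the proxy sets X-Forwarded-Host and X-Forwarded-For (as Apache's mod_proxy does) and that the backend is a WSGI app; the name restore_forwarded is made up for illustration. It is only reasonable when nothing but the proxy can reach the backend.

```python
def restore_forwarded(app):
    """WSGI middleware (hypothetical name) that copies proxy-supplied
    values back into the places the application normally looks."""
    def middleware(environ, start_response):
        # Assumption: the proxy put the client's Host in X-Forwarded-Host.
        host = environ.get("HTTP_X_FORWARDED_HOST")
        if host:
            # A chain of proxies appends values; the first is the original.
            environ["HTTP_HOST"] = host.split(",")[0].strip()
        # Assumption: the proxy put the client's IP in X-Forwarded-For.
        forwarded_for = environ.get("HTTP_X_FORWARDED_FOR")
        if forwarded_for:
            environ["REMOTE_ADDR"] = forwarded_for.split(",")[0].strip()
        return app(environ, start_response)
    return middleware
```

Wrapping an app as restore_forwarded(app) leaves requests without those headers untouched.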
I don't know of any standards that exist for remapping request paths.
I guess in theory you shouldn't need to do this, but in practice it is
pretty common. For instance, let's say you want to map /blog/* to
localhost:9999 where your blog app is running in a separate
process. Do you preserve the full path? Do you duplicate any
configuration for path mappings or virtual host settings in that
server on port 9999? All too often the target server isn't very
cooperative about this, and various hacks emerge to
work around the problems. Ideally I think it would be good to give
three pieces of information for the path:
1. The resource the upstream server expects to handle this request.
2. The base path that the upstream server used to determine what
   resource should handle the request.
3. The rest of the path, which is the responsibility of the downstream
   server.
2 and 3 are similar to SCRIPT_NAME and PATH_INFO. 1 is
something new. HTTP always leaves these all squished together.
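
As a rough illustration of keeping the three pieces apart, the sketch below splits a request path using a table of path mappings. The names RoutedPath and route, and the mapping table itself, are hypothetical; base paths are assumed to have no trailing slash.

```python
from typing import NamedTuple


class RoutedPath(NamedTuple):
    target: str  # 1: the resource chosen to handle the request
    base: str    # 2: the base path that selected it (like SCRIPT_NAME)
    rest: str    # 3: what is left for the target (like PATH_INFO)


def route(path, mappings):
    """Split path using a {base_path: target} table (hypothetical format)."""
    # Try the longest base first so "/blog/admin" wins over "/blog".
    for base in sorted(mappings, key=len, reverse=True):
        if path == base or path.startswith(base + "/"):
            return RoutedPath(mappings[base], base, path[len(base):])
    raise LookupError("no mapping for %r" % path)
```

With the mapping {"/blog": "http://localhost:9999"}, the path /blog/2007/01 splits into the target http://localhost:9999, the base /blog, and the rest /2007/01. How those three values would then travel to the blog app is exactly the missing convention.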
Also: potentially the request path has nothing to do with the
original path or domain. This can happen when you are aggregating
pieces of data from many sources (e.g., using SSIs, which get
pieces of content as subrequests, or HInclude, which composes content
in the client). If the output uses HTML then that HTML needs to be
written either with no
assumptions about what URL it is rendered under (i.e., all links are
fully qualified), or it needs to be smart about the real context it
will be rendered in. How to write relocatable HTML is a separate
issue, but there also don't seem to be any conventions about how to
tell the web app about the indirection that is happening.
There's a lot more information that can be passed through. For
instance, the upstream server may have authenticated the request
already. How does it pass that information through? Maybe there's
other ad hoc information. For instance, consider this rewrite rule:
    RewriteCond %{HTTP:Host} ^(.*)\.myblogs\.com$
    RewriteRule (.*) http://localhost:9999/blogapp/$1?username=%1 [P,QSA]
If you aren't familiar with mod_rewrite, this tells Apache to
take a request like http://bob.myblogs.com/archive?month=1 and
forward it to
http://localhost:9999/blogapp/archive?month=1&username=bob
This kind of works, but is clearly hacky.
Ideally we would pass a header like X-Blog-Username: bob. But
that opens up other issues...
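
Here is a small sketch of the header-based alternative, written as the logic a proxy could apply before forwarding. The function name username_header is hypothetical, and the pattern simply mirrors the myblogs.com rewrite condition above.

```python
import re

# Same idea as the RewriteCond: capture the subdomain of myblogs.com.
BLOG_HOST = re.compile(r"^(.*)\.myblogs\.com$")


def username_header(host):
    """Return the extra headers a proxy could add for this Host value."""
    # Drop any port and normalize case before matching.
    match = BLOG_HOST.match(host.split(":")[0].lower())
    if match is None:
        return {}
    return {"X-Blog-Username": match.group(1)}
```

The query string stays exactly as the client sent it, and the username moves out of band.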
We can start adding headers willy-nilly, and that's actually okay, but
it opens up security concerns. If you aren't certain that only trusted
clients can access your backend server, can you really trust the
headers? It's no good if anyone can connect to your server with
X-Remote-User: admin and then you trust that information. With no
concept of trusted and untrusted headers, we have to rely on ad hoc
configuration for security. This is both difficult to set up and
maintain, and doing it wrong can lead to a very insecure setup.
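
For what it's worth, the ad hoc version usually looks something like the sketch below: a WSGI middleware that throws away sensitive headers unless the connection comes from a known proxy address. The trusted address list, the header list, and the name drop_untrusted_headers are all assumptions for illustration, which is rather the point: every deployment has to invent its own.

```python
# Assumption: only a proxy on the same machine is trusted.
TRUSTED_PROXIES = frozenset({"127.0.0.1"})

# Assumption: these are the headers the application acts on.
SENSITIVE_KEYS = (
    "HTTP_X_REMOTE_USER",
    "HTTP_X_BLOG_USERNAME",
    "HTTP_X_FORWARDED_FOR",
    "HTTP_X_FORWARDED_HOST",
)


def drop_untrusted_headers(app, trusted=TRUSTED_PROXIES, keys=SENSITIVE_KEYS):
    """WSGI middleware (hypothetical) that strips sensitive headers
    from any request not sent by a trusted proxy address."""
    def middleware(environ, start_response):
        if environ.get("REMOTE_ADDR") not in trusted:
            for key in keys:
                # Remove the header if present; ignore it otherwise.
                environ.pop(key, None)
        return app(environ, start_response)
    return middleware
```

Forget one header in that list, or get the trusted addresses wrong, and X-Remote-User: admin sails straight through.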
The previous issues can all be resolved with conventions about new
HTTP headers. This one is much harder.
I'd like to use HTTP like a pipe. Really! None of the issues I've
brought up are new, but they also aren't well answered despite their
age. In comparison, FastCGI and SCGI actually answer most of these
problems right now.
If we're going to use HTTP this way -- and there are great reasons to
start doing this -- we need to work harder at coming up with a good
answer for these kinds of issues.