Since I'm doing some HTTP proxying I've been thinking about how it should work. And I really don't know -- it's all very vague. Much more vague than it should be. HTTP won't be a pipe if we have to rely on ad hoc configuration everywhere.

Some of the issues...

Host

Some information gets covered up. For instance, in many HTTP proxying situations the Host header is lost: it is changed to localhost or something else that doesn't accurately represent the initial request. I guess there's an informal standard, followed by Apache's mod_proxy among others, that X-Forwarded-Host contains the original Host value (X-Forwarded-Server, by contrast, carries the proxy's own hostname).

Also the remote IP address is lost, but there's a more widely used convention to put that information in X-Forwarded-For. This seems much less interesting than Host to me, but much more widely supported. Go figure.
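As a rough illustration of how a backend might recover both values, here is a minimal Python sketch for a WSGI application. It assumes the proxy sets X-Forwarded-Host and X-Forwarded-For and that the backend can only be reached through that proxy; the function name original_request_info is made up for this example.

```python
def original_request_info(environ):
    """Return (host, client_ip) as the client originally sent them.

    Assumption: a trusted proxy in front of this app sets
    X-Forwarded-Host and X-Forwarded-For. If clients can reach the
    backend directly, these values can be forged.
    """
    # WSGI exposes request headers as HTTP_* keys in the environ dict.
    forwarded_host = environ.get("HTTP_X_FORWARDED_HOST", "")
    forwarded_for = environ.get("HTTP_X_FORWARDED_FOR", "")

    # Both headers may hold a comma-separated list when several proxies
    # are chained; the first entry is the one closest to the client.
    host = forwarded_host.split(",")[0].strip() or environ.get("HTTP_HOST", "")
    client_ip = forwarded_for.split(",")[0].strip() or environ.get("REMOTE_ADDR", "")
    return host, client_ip
```

When neither header is present, the function falls back to the plain Host header and REMOTE_ADDR, so the same code also works without a proxy in front.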

Request path

I don't know of any standards that exist for remapping request paths. I guess in theory you shouldn't need to do this, but in practice it is pretty common. For instance, let's say you want to map /blog/* to localhost:9999, where your blog app is running in a separate process. Do you preserve the full path? Do you duplicate any configuration for path mappings or virtual host settings in that server on port 9999? All too often the target server isn't very cooperative about this, and various hacks emerge to work around the problems. Ideally I think it would be good to give three pieces of information for the path:

1. The resource the upstream server expects to handle this request.
2. The base path that the upstream server used to determine what resource should handle the request.
3. The rest of the path, which is the responsibility of the downstream server.

2 and 3 are similar to SCRIPT_NAME and PATH_INFO. 1 is something new. HTTP always leaves these all squished together.
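To make the SCRIPT_NAME and PATH_INFO analogy concrete, here is a small Python sketch of the split a backend would have to do if it received the full original path plus the base path the proxy matched on. How the base path would reach the backend is exactly the missing convention; split_path is a hypothetical helper, not part of any standard.

```python
def split_path(full_path, base_path):
    """Split full_path into (script_name, path_info) at base_path.

    base_path is the prefix the front server used to pick this backend;
    the remainder is what the backend is responsible for.
    """
    base = base_path.rstrip("/")
    # Only split on a whole path segment, so "/blogger" does not match "/blog".
    if full_path == base or full_path.startswith(base + "/"):
        return base, full_path[len(base):]
    raise ValueError("%r is not under %r" % (full_path, base_path))


script_name, path_info = split_path("/blog/archive/2007", "/blog")
```

Here script_name is "/blog" and path_info is "/archive/2007", which is the same division a WSGI or CGI application sees in SCRIPT_NAME and PATH_INFO.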

Also: potentially the request path has nothing to do with the original path or domain. This can happen when you are aggregating pieces of data from many sources (e.g., using SSIs, which get pieces of content as subrequests, or HInclude, which composes content in the client). If the output uses HTML then that HTML needs to be written either with no assumptions about what URL it is rendered under (i.e., all links are fully qualified), or it needs to be smart about the real context it will be rendered in. How to write relocatable HTML is a separate issue, but there also don't seem to be any conventions about how to tell the web app about the indirection that is happening.

All the other information

There's a lot more information that can be passed through. For instance, the upstream server may have authenticated the request already. How does it pass that information through? Maybe there's other ad hoc information. For instance, consider this rewrite rule:

RewriteCond %{HTTP:Host} ^(.*)\.myblogs\.com$
RewriteRule (.*) http://localhost:9999/blogapp/$1?username=%1 [P,QSA]

If you aren't familiar with mod_rewrite, this tells Apache to take a request like http://bob.myblogs.com/archive?month=1 and forward it to http://localhost:9999/blogapp/archive?username=bob&month=1

This kind of works, but is clearly hacky.

Ideally we would pass a header like X-Blog-Username: bob. But that opens up other issues...

Security of piped information

We can start adding headers willy-nilly, and that's actually okay, but it opens up security concerns. If you aren't certain that only trusted clients can access your backend server, can you really trust the headers? It's no good if anyone can connect to your server with X-Remote-User: admin and then you trust that information. With no concept of trusted and untrusted headers, we have to rely on ad hoc configuration for security. This is both difficult to set up and maintain, and doing it wrong can lead to a very insecure setup.
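For a sense of what that ad hoc configuration tends to look like, here is a minimal Python sketch of WSGI middleware that drops proxy-supplied headers unless the connection comes from a known proxy address. Everything in it is an assumption made for illustration: the class name, the list of trusted addresses, and the choice of headers to protect.

```python
# Assumption: the only trusted proxy runs on the same machine.
TRUSTED_PROXIES = frozenset(["127.0.0.1", "::1"])

# Headers that only the proxy should be allowed to set (WSGI key form).
PROXY_ONLY_HEADERS = (
    "HTTP_X_REMOTE_USER",
    "HTTP_X_BLOG_USERNAME",
    "HTTP_X_FORWARDED_HOST",
    "HTTP_X_FORWARDED_FOR",
)


class StripUntrustedHeaders:
    """WSGI middleware: remove proxy-only headers from untrusted peers."""

    def __init__(self, app, trusted=TRUSTED_PROXIES, headers=PROXY_ONLY_HEADERS):
        self.app = app
        self.trusted = trusted
        self.headers = headers

    def __call__(self, environ, start_response):
        # REMOTE_ADDR is the address of the immediate peer, which is the
        # proxy when the request was proxied.
        if environ.get("REMOTE_ADDR") not in self.trusted:
            for key in self.headers:
                environ.pop(key, None)
        return self.app(environ, start_response)
```

This only moves the trust decision into one place. The list of addresses and headers still has to be kept in sync with the proxy by hand, which is the maintenance problem described above.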

The previous issues can all be resolved with conventions about new HTTP headers. This one is much harder.

Conclusion

I'd like to use HTTP like a pipe. Really! None of the issues I've brought up are new, but they also aren't well answered despite their age. In comparison, FastCGI and SCGI actually answer most of these problems right now.

If we're going to use HTTP this way -- and there are great reasons to start doing this -- we need to work harder at coming up with a good answer for these kinds of issues.

