Considering clusters of clusters

The Session approach to implementing log in, along with session affinity and session sharing techniques, has worked just fine for a large number of J2EE applications. However, in thinking about scalability for Artima's new architecture, which will serve up a network of sites, I became concerned about the Session approach. As I described in my previous weblog entry, Database Partitioning for Scale, we are planning to make it possible to scale our database horizontally by partitioning sites data into multiple databases. We haven't decided whether we will actually deploy multiple databases, but we wanted to make it easy to do so if we later decide to take that approach. Each sites database (whether we have one or a hundred) will be accessed by a “mini-cluster,” a small cluster of servers, and different mini-clusters may or may not be driven by different load balancers.

On top of this cluster of clusters network topology, we want to implement single sign-on across the network. If you log into one site in our network, and then follow a link to another site in our network, we want the second site to recognize you. In other words, you wouldn't log into a site in the Artima network, you'd log into the entire network. In addition, we don't want a server crash or reboot to log our users off. We will likely deploy new software quite often, perhaps weekly, and when we do that we may need to reboot servers. Though being logged off unexpectedly is likely only slightly annoying to users, kicking users off the network while they are actively using it seems unprofessional to me.

Given our cluster of clusters network topology, the single sign-on requirement, and our desire that authentication sessions survive server restarts, I began wondering if there was a better way than the usual Sessions approach for log in. The main source of my concern was that the technique of session affinity is most effective when every server participating in the authentication session belongs to the same cluster, and that wouldn't necessarily be true in our cluster of clusters topology. For example, if you log into a site served by mini-cluster A, a server in A would create the session ID and return it. Thereafter, requests to that site, and any other site hosted by mini-cluster A, would go back to the load balancer fronting A, which would direct the requests back to the originating server that created the Session. However, if the user clicked on a link to a site served by mini-cluster B, that request might end up at a different load balancer (the one for mini-cluster B) that has never heard of the session ID. In that case, the server in mini-cluster B that receives the request would need to get the session state from the appropriate server in mini-cluster A. Thereafter, the load balancer for B could redirect all requests with that session ID to the same server.

The thousand servers thought experiment

One of the techniques I have used to design for scalability has been to ask myself how I would implement our requirements across 1000 servers. If it will scale to 1000 servers, I figured, it should work on 10, or 100, or however many we may ever need to deploy. The cluster of clusters topology mixed with the single sign-on requirement adds a bit of complexity to Session sharing, and a bit more drag on scalability, but I believe it could be done. When thinking about single sign-on across 1000 servers partitioned into multiple mini-clusters, however, I began to wonder if a better way existed. Is there a way to implement our requirements that doesn't require Session data at all?

One potential Session-less approach to log in is HTTP authentication. When using this aspect of the HTTP standard, users provide a username and password to the browser, and the browser sends these credentials in one form or another to the server on subsequent requests. The server can authenticate the user by essentially logging them in again on each request, for example, by computing a hash of the password sent via HTTP Basic Authentication and comparing the supplied username and hash to those stored in a shared database. (In our case, since we want single sign-on across our network, we had already decided to provide usernames and password hashes to all servers via a shared database.) Unfortunately, although HTTP authentication would help with scalability by eliminating the need to share any state between servers beyond the usernames and password hashes that support log in, it brings up several other concerns, which I described in an earlier weblog entry, HTTP Authentication Woes.
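To make the "log them in again on each request" idea concrete, here is a minimal Python sketch of checking an HTTP Basic `Authorization` header against stored password hashes. The `USERS` dictionary is a hypothetical stand-in for the shared database, and the salted PBKDF2 hash is my assumption for illustration, not a description of how Artima actually hashes passwords.

```python
import base64
import hashlib
import hmac

def hash_password(password: str, salt: bytes) -> bytes:
    # Assumed hashing scheme for this sketch: salted PBKDF2-HMAC-SHA256.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

# Hypothetical stand-in for the shared user database: username -> (salt, password hash).
SALT = b"example-salt"
USERS = {"alice": (SALT, hash_password("s3cret", SALT))}

def authenticate_basic(authorization_header: str):
    """Return the username if the Basic credentials check out, else None."""
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None  # malformed base64 or not valid UTF-8
    username, sep, password = decoded.partition(":")
    if not sep or username not in USERS:
        return None
    salt, stored_hash = USERS[username]
    # Re-hash the supplied password on every request and compare in constant time.
    if hmac.compare_digest(hash_password(password, salt), stored_hash):
        return username
    return None
```

Every request pays for a hash computation here, which is part of what makes a cache in front of the check attractive.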

Despite HTTP authentication's problems, however, I wondered if I could potentially take a cue from it. HTTP authentication does not require data other than a username and password hash to be shared across servers primarily because it includes the username and password, in some form, with every request in the authentication session. In other words, once you use HTTP authentication to log into a realm, an area of a site, the credentials will be sent in some form along with every request to that realm until the browser is exited. By including some form of user log in credentials in the session ID placed in a cookie or embedded in URLs, I mused, a server could determine the identity of a user solely by inspecting the session ID that arrived with the request, and referring to user data that's already shared among the servers via our shared database. I'll call this the “embedded” approach, because the session ID is not just a hard-to-guess string, but a hard-to-guess string that includes embedded user credentials.

In the embedded approach, the server could still store identity data in a hash table using the session ID (or part of it) as a key. However, such state is now transformed into a cache that serves merely as a performance optimization. The server-side state is not necessary. If memory gets low on a server because too many users have logged in, some of those cached identities, probably the least recently used, could be evicted from the cache. If a request later arrives that includes one of the session IDs whose identity data had been evicted, the server could use the credentials embedded in the session ID itself to re-authenticate the user. Moreover, given that a server can authenticate a user solely from the session ID and shared user data, any cached identity information need not be shared between servers.

For example, if a user logs into a site in the Artima network hosted by mini-cluster A, the load balancer driving cluster A, because of session affinity, will subsequently send all requests to the server that handled the log in. Most likely, that server will contain the identity data in a cache keyed by the session ID. This means that most often authentication of a request will be efficiently performed through a simple hash table lookup. If a request arrives containing a session ID for which the server has evicted the identity data from the cache because of low memory, the server will have to once again perform “full-blown” authentication by extracting the credentials from the session ID and comparing them against the shared user data. If the full-blown authentication succeeds, identity data can once again be placed in the cache, possibly squeezing out of the cache some other now least recently used identity information.

If the server crashes, the load balancer will direct subsequent requests with that session ID to a different server in mini-cluster A. When the first such request arrives at this “fail-over” server, the session ID will not exist in its cache. The fail-over server will therefore need to perform full-blown authentication. On subsequent requests, however, the fail-over server will likely have the identity data in its cache, yielding efficient authentication via hash table lookup. If the user at some point clicks on a link to a site hosted by mini-cluster B, the receiving server will on the first request need to perform full-blown authentication using the credentials embedded in the session ID, and thereafter will likely be able to use hash-table-lookup authentication.
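The cache-with-fallback behavior described in the last few paragraphs can be sketched as a small least-recently-used cache in Python. The class name `IdentityCache` and the `full_authenticate` callback are hypothetical; the callback stands for whatever full-blown authentication against the shared user data turns out to be.

```python
from collections import OrderedDict

class IdentityCache:
    """Least-recently-used cache of identity data, keyed by session ID."""

    def __init__(self, capacity, full_authenticate):
        self.capacity = capacity
        # Hypothetical callback: session_id -> identity data, or None on failure.
        self.full_authenticate = full_authenticate
        self.entries = OrderedDict()
        self.full_auth_count = 0  # how often the slow path was taken

    def authenticate(self, session_id):
        if session_id in self.entries:
            # Fast path: hash table lookup; mark the entry as recently used.
            self.entries.move_to_end(session_id)
            return self.entries[session_id]
        # Slow path: evicted entry, fail-over server, or first visit to this cluster.
        self.full_auth_count += 1
        identity = self.full_authenticate(session_id)
        if identity is not None:
            self.entries[session_id] = identity
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used
        return identity
```

Because a miss only costs one full-blown authentication, the cache never has to be replicated: a fail-over server or a server in another mini-cluster simply starts with an empty one.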

Security implications

Probably the main weakness of using session IDs stored in cookies or rewritten URLs for authentication is that session IDs can be rather easily hijacked. For example, if a session ID is transmitted in plain text over HTTP (not HTTPS), it can be intercepted by monitoring the network traffic between you and the server. A session ID may also be hijacked via a cross-site scripting attack, or by the person who uses a public computer after you walk away from it without logging out. Or, if your session ID shows up in a rewritten URL that you email to a friend, and they click on the URL while your session is still active, they get logged in as you. (This is one way in which URL rewriting is less secure than cookies.) To reduce the potential for session hijacking, it is useful if the session ID times out after a period of time. If someone does obtain a session ID from one of your active sessions, they can only act as you while that session is active. After you log out, or after your session times out due to inactivity, the session ID is worthless.

Another potential problem with session IDs is that they may contain encoded information, such as server IP addresses or session ID counts, that should not be revealed to the public. Also, the session ID generation algorithm, if poorly designed, could enable an attacker to guess neighboring session IDs from a few legitimately obtained session IDs. Lastly, session IDs must be assigned such that they are unique across the entire cluster. If two servers assigned the same session ID for you and your archenemy, and a request from your archenemy ended up at your server, he would assume your identity. To avoid these problems, session IDs should be assigned in a way that is difficult to guess and unique across the cluster, with any encoded information either already public knowledge or very difficult to extract.

Implementing the embedded credentials approach

With these security issues in mind, my thought was that we could replicate a pair of “session passwords” across all servers, replacing one of the two every 15 minutes in alternation. For example, if session password 0 is replaced at 2:00, session password 1 would be replaced at 2:15, session password 0 would again be replaced at 2:30, session password 1 again at 2:45, and so on. Each session password, therefore, would be in force for 30 minutes. The session ID could then include some hard-to-guess user credential encrypted with the latest session password. We would, therefore, also have to replicate a 0 or 1 to indicate which session password to use to encrypt user credentials, and this 0 or 1 could be included in the session ID as well. For the credential, we could use all or part of the user's password hash. (We don't store user passwords in the database, just a hash of the password, so we couldn't use the user's password as this credential.) But to avoid letting any string derived from the user's password back out the door, and to simplify matters when users change their password, I figure we'll generate a unique hard-to-guess string for each user and store it in the shared database when they create their account. This string could then be used as the credential to encrypt with the session password.
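A rough Python sketch of the alternating replacement schedule follows. The `SessionPasswords` class is hypothetical, drawing passwords from Python's `secrets` module is my assumption about how they might be generated, and tying slot numbers to wall-clock minutes is just one way to reproduce the 2:00 / 2:15 / 2:30 example above.

```python
import secrets

ROTATION_MINUTES = 15  # one of the two passwords is replaced every 15 minutes

def index_replaced_at(minutes_since_midnight: int) -> int:
    """Which slot (0 or 1) is replaced at the rotation boundary containing this time."""
    return (minutes_since_midnight // ROTATION_MINUTES) % 2

class SessionPasswords:
    """The three replicated values: two passwords plus the index of the latest."""

    def __init__(self):
        self.passwords = [secrets.token_bytes(32), secrets.token_bytes(32)]
        self.latest = 1  # index of the most recently replaced password

    def rotate(self) -> int:
        """Replace the older password, mark it as latest, and return its index."""
        older = 1 - self.latest
        self.passwords[older] = secrets.token_bytes(32)
        self.latest = older
        return older
```

With this mapping, `index_replaced_at(120)` (2:00) is slot 0 and `index_replaced_at(135)` (2:15) is slot 1, so each password lives for two rotation periods, or 30 minutes.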

To ensure each session ID is unique without requiring servers to communicate with each other (except through the already shared user data), I figured we could include some non-encrypted string that is unique for each user in the session ID. If we decide that we don't mind letting the public know each user's ID (the primary key in the database), we could derive a string from that. Or, alternatively, we could generate yet another unique string when a user creates their account, save that in the shared database, and use it as the user identifier in the session ID.
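As a small sketch of what generating these per-user strings at account creation might look like in Python, assuming the `secrets` module as the source of randomness: the function name and field names are hypothetical. A random identifier is only unique with very high probability, so a real deployment would presumably also rely on a uniqueness constraint in the shared database.

```python
import secrets

def new_account_secrets() -> dict:
    """Generate the two per-user strings stored in the shared database."""
    return {
        # Public: appears unencrypted in the session ID. Hex digits only, so it
        # can never contain whatever separator character the session ID uses.
        "user_identifier": secrets.token_hex(8),
        # Secret: only ever leaves the servers in encrypted form.
        "credential": secrets.token_urlsafe(32),
    }
```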

For example, one form such a session ID could take is:

<session password index (0 or 1)> <user identifier> <separator character> <encrypted credential>
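Here is one way the format above could be assembled and taken apart in Python. Two things in this sketch are my assumptions rather than part of the design: the separator is a period, and the "encryption" is swapped for an HMAC-SHA256 of the credential keyed with the session password. Since servers only ever compare the resulting string and never decrypt it, a keyed one-way function serves the same purpose here.

```python
import hashlib
import hmac

SEPARATOR = "."  # assumed separator; must not occur in user identifiers

def protect_credential(credential: str, session_password: bytes) -> str:
    # Stand-in for "encrypt the credential with the session password".
    return hmac.new(session_password, credential.encode("utf-8"), hashlib.sha256).hexdigest()

def build_session_id(index, user_identifier, credential, session_passwords):
    token = protect_credential(credential, session_passwords[index])
    return f"{index}{user_identifier}{SEPARATOR}{token}"

def parse_session_id(session_id):
    """Return (index, user_identifier, token), or None if the ID is malformed."""
    if len(session_id) < 2 or session_id[0] not in "01":
        return None
    user_identifier, sep, token = session_id[1:].partition(SEPARATOR)
    if not sep or not user_identifier or not token:
        return None
    return int(session_id[0]), user_identifier, token
```

For a user whose identifier is `42`, an ID built with session password 1 starts with `142.` followed by 64 hex digits.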

Each server could maintain a cache of user identities (and possibly other things such as roles and permissions) and the user's credential encrypted by one or both of the latest session passwords, keyed by the user identifier. When a request arrives at a server that includes a session ID, the server can extract the user identifier from the session ID, and use it to look up this user's data in the cache. If found, the server would then compare the encrypted credential from the session ID with the corresponding one stored in the cache. If it matches, then the user is authenticated. If the corresponding credential does not appear in the cache, the server would encrypt the user credential with the session password identified by the index (0 or 1) in the session ID and place it in the cache. If the resulting string matches the encrypted credential in the session ID, the user is authenticated. If not, the user will have to log in again.

Several scenarios exist in which a session ID will arrive at a server for which the user's identity data is not stored in the cache. The data may have been in the cache earlier, but evicted because of low memory. Or, the data may have been in a different server that crashed, and the load balancer chose this server for fail-over. Or the user may have clicked on a link to a site hosted by this server's mini-cluster on a page obtained from a server in a different mini-cluster. For whatever reason, if a session ID arrives at a server and it finds no identity data in its cache, it will attempt to perform full-blown authentication. It will figure out which user the request is supposedly from (by looking at the user identifier in the session ID), and which password that user's credential was supposedly encrypted with (from the session password index in the session ID). It will encrypt that user's credential with the specified password and compare it with the encrypted credential in the session ID. If it matches, the user is authenticated. If not, the user will need to log in again.
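The lookup-then-fall-back flow from the last two paragraphs can be sketched in Python as below. As assumptions for illustration only, `user_db` is a plain dictionary standing in for the shared user database, the cache is a dictionary keyed by user identifier and password index, the separator is a period, and an HMAC stands in for encryption.

```python
import hashlib
import hmac

def protect(credential: str, session_password: bytes) -> str:
    # Stand-in for encrypting the credential with the session password.
    return hmac.new(session_password, credential.encode("utf-8"), hashlib.sha256).hexdigest()

def authenticate(session_id, session_passwords, user_db, cache):
    """Return the user identifier if the session ID is valid, else None."""
    if len(session_id) < 2 or session_id[0] not in "01":
        return None
    index = int(session_id[0])
    user_identifier, sep, presented = session_id[1:].partition(".")
    if not sep:
        return None
    key = (user_identifier, index)
    expected = cache.get(key)
    if expected is None:
        # Full-blown authentication: recompute from the shared user data.
        credential = user_db.get(user_identifier)
        if credential is None:
            return None
        expected = protect(credential, session_passwords[index])
        cache[key] = expected
    if hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8")):
        return user_identifier
    return None  # mismatch: the user has to log in again
```

One detail this sketch leaves out: when a session password is replaced, every cached entry computed with the old password in that slot has to be dropped, or stale IDs would keep matching the cache.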

Design tradeoffs

The advantage of the embedded approach over the traditional Session approach is that it minimizes the amount of data that needs to be shared among the servers. Instead of sharing all the sessions of all the logged in users, it requires sharing only three pieces of data: two session passwords and the index (0 or 1) of the latest password. Not only is this less data, it is less dynamic. This data changes, and those changes must therefore be replicated, only once every 15 minutes. Other than the session passwords, the only other data that needs to be shared between servers is data already available via our shared user database.

One downside from a security perspective is that you can't invalidate a session ID by logging off. When you log off in this scheme, you essentially log off your browser by removing the session ID cookie or removing the session ID from the URLs. If someone has hijacked your session ID in the meantime, they would still be able to use it if they make a request before the session password used to encrypt your credential is replaced (up to 30 minutes). One such scenario is if you were logged in via URL rewriting on a public computer, and didn't close the browser or otherwise purge the history after logging off. Until that session password is replaced, the person using the computer after you could log back in as you by clicking the back button and returning to one of those rewritten URLs. (This scenario highlights another way that URL rewriting is less secure than cookies for holding session IDs.) By contrast, if you log out of a server using the traditional Session approach, you not only log out of the browser by removing the session ID from the cookies or URL, you also log out of the server by removing the Session object.
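To put numbers on that exposure window, here is a small Python calculation. It assumes, purely for illustration, that rotations happen on 15-minute clock boundaries and that the boundary at minute 15k since midnight replaces slot k mod 2, which matches the 2:00 and 2:15 example given earlier.

```python
ROTATION_MINUTES = 15

def minutes_until_invalid(now_minutes: int, index_in_id: int) -> int:
    """Minutes until the password slot named in a session ID is next replaced."""
    boundary = now_minutes // ROTATION_MINUTES + 1  # number of the next rotation boundary
    if boundary % 2 != index_in_id:
        boundary += 1  # the next boundary replaces the other slot; wait one more
    return boundary * ROTATION_MINUTES - now_minutes
```

Under these assumptions, a session ID carrying index 0 that is abandoned at 2:10 stays usable for another 20 minutes, until 2:30. One carrying index 1 at the same moment has only 5 minutes left. The worst case is an ID issued just as its password is created, which stays usable for the full 30 minutes.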

On the other hand, one security advantage of the embedded approach over the traditional Session approach is that it makes less harmful one kind of potential denial of service attack: a script that repeatedly logs a user in. Such an attack attempts to fill server memory with Sessions. In the embedded case, the memory impact of repeatedly logging in the same user would be that the user's identity data gets cached once on each server that receives the log in requests.

One important difference between the two approaches is that an embedded credential is only a token that can be used to authenticate a user. A traditional session ID, by contrast, is a token that can be used to authenticate a user and a handle for any other arbitrary state, such as conversational state, that may be stored on the server in the Session object. I will discuss conversational state in a later weblog post, but in our case conversational state need only be shared within a mini-cluster. Because of our single sign-on requirement, however, we need to recognize session IDs across all servers in all mini-clusters. So in our case, it seems natural to handle authentication and conversational state differently anyway.

In the embedded approach, at any one time two hard-to-guess strings exist that will authenticate a user when presented as a session ID in a cookie or rewritten URL. You can find out what the latest of those two strings is simply by logging in over HTTPS with your username and password. If cookies are enabled, the session ID will be stored as a cookie. If not, it will show up in rewritten URLs. On subsequent requests, if the session password has changed since the previous request, the response of the new request will contain a different session ID. So as you use the network, every 15 minutes or so, you'll get a new session ID. Most of the time the session ID will be transmitted in the clear over HTTP, so it has the same hijacking potential as traditional session IDs.
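The re-issuing step could look roughly like this in Python. The sketch assumes the request has already been authenticated, that the separator is a period, and that an HMAC stands in for encryption; the function name is hypothetical.

```python
import hashlib
import hmac

def refreshed_session_id(session_id, latest_index, credential, session_passwords):
    """Return a new session ID if the presented one uses the older password, else None.

    Assumes session_id has already been authenticated by the server.
    """
    if int(session_id[0]) == latest_index:
        return None  # already based on the latest session password
    user_identifier = session_id[1:].partition(".")[0]
    token = hmac.new(
        session_passwords[latest_index], credential.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"{latest_index}{user_identifier}.{token}"
```

A server would send the returned value back in a fresh cookie, or use it when rewriting URLs in the response, so an active user is always carried forward before the older password expires.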

I recognize that the session passwords need to be hard to predict, and I'm not sure yet how to create them. We could use an intelligently seeded pseudo random number generator, or perhaps base them on some snapshot of server traffic, or a combination of the two. The user credential should also be difficult to predict. And of course an important part of the security is that the user credentials and session passwords, and the algorithm used to encrypt one with the other, be kept secret.

Although I don't like the loss of ability to log out of a server, I feel it is an acceptable design tradeoff since in exchange we have less state to share between our servers, and hence an easier path to scalability. In general, session IDs sent over HTTP are a rather insecure means of authentication, but they are good enough for a lot of purposes. My feeling is that the embedded approach is good enough to apply to the same sorts of authentication problems as the traditional Session approach, but with less drag on scalability. For certain situations, however, I feel that neither approach is sufficient and that other security and authentication mechanisms should be layered on top of the session ID. I'll talk more about that in a later weblog post.

What's your opinion? Have you heard of anyone taking an embedded approach for authentication in practice? Have you any experiences to relate about security or scaling in either this or the traditional Session approach?

Charles Miller

Posts: 1014
Nickname: carlfish
Registered: Feb, 2003