Web Cache for Grid HTTP Access
Duarte Meneses, Fabrizio Furano
April 1, 2014

1 Introduction

Sites usually have several jobs and people accessing files in the grid. Even though files are often on site, they may come from several storage elements, which can be located at any other site.

Due to recent developments, many storage elements now have an HTTP interface, implemented, for example, with XRootd or with a WebDAV server. Thus, the files can be accessed with this widely known protocol using any HTTP client, instead of forcing the use of grid-specific protocols. The use of HTTP could also enable easy caching of data using existing web caching products. In this document we explore this possibility further.

The HTTP resources (files) can be re-fetched several times within the same site, either by the same user or by different users. This can be especially frequent with batch jobs that periodically pull the same files from the network.

To optimize access to the data and reduce the load on the network, we are investigating the potential advantages and implications of deploying HTTP caches at the sites. Accessing a cached file within the site could bring a significant performance boost, particularly for sites that constantly consume external resources, i.e. resources located at other sites.

Web caches are a solution often adopted by websites to optimize access to their contents by caching them (possibly at several locations, close to the users). These web caches are deployed in reverse mode, and the DNS is usually pointed to them.

Figure 1: Reverse proxy

Accessing the grid over HTTP brings many particular requirements, because its use case differs from the typical one, as explained next. Several web cache products are evaluated to understand how they could fit our specific case.

In this document, "Web server", "backend" and "origin server" refer to the same thing.
Relevant work has already been done in this area, and a fully working proxy cache has already been developed for grid data access [1]. Still, there are a few reasons to revisit the problem:

• HTTP interfaces: Several storage elements now support HTTP access. This simplifies the system, as web caches may access the storage elements directly.
• Security constraints could perhaps be relaxed. Read access could be given to the web caches, which would hold a certificate of their own, instead of proxying the users' permissions.
• Cache performance: some cache features might significantly improve performance; for example, streaming contents directly to the client while they are being cached would greatly reduce latency when a cache miss occurs.

2 Web access

Files in the grid are identified by a unique logical name, which is registered in the grid's file catalogue, and may exist physically at several locations, in storage elements (SEs). The physical copies of a file are called replicas of the file. The size of files in the grid can range from a few bytes to several gigabytes.

In the grid, users can request files using HTTP by pointing an HTTP client to a web server and providing the file's logical name. The users authenticate to the HTTP servers by connecting over SSL and providing a user certificate signed by a certificate authority trusted by the grid. When the server authenticates the client, it generates a token for that particular request. The token is only valid for that client and for a limited amount of time.

At the same time, the server tries to find a replica of the file and, if successful, redirects the client to the replica. The server finds replicas by checking its local file store or by consulting a file catalogue (local or remote), which can return the location of the replicas. The redirect is done through the HTTP response "302 Found" [2], which includes a Location header with the new URL that the client should follow.
This redirect includes the token generated during authentication, as an argument in the URL.

Figure 2: Web Access

Because very large files exist, one important HTTP feature commonly used by the clients is the Range header. With this field, it is possible to get chunks (pieces) of a file without downloading the entire file, by specifying one or more byte ranges.

Some servers might only be able to find the file if it is in their local storage element, while others have global knowledge about the grid and may find files located at other storage elements. These servers might consult the grid file catalogue, be the front server of other servers or belong to a federation, having knowledge about the federated resources. Therefore, the replica that a server finds might or might not be located on the same server, so the client might be redirected to another server. Also, redirections might happen several times before the client reaches the actual replica.

After following the redirects, the client will eventually reach the final location and get the contents in a normal HTTP "200 OK" response. Figure 2 shows the exchange between the HTTP client and server in a simple case, where the replica of the file is located at the same server to which the first request was made.

2.1 Backend

Depending on the resource requested, different servers might need to be contacted. By requesting resources from an LFC (with HTTP support) or a federator (a server with global knowledge about a federation), the job of finding resources is passed to that server. That server will try to locate the resource and issue an appropriate redirect. With this strategy, the web cache can be configured with a single backend server, as long as it supports redirects, possibly to other servers.

Federations are preferable over LFC/HTTP because they are much faster. Also, LFC keeps an index of SRM interfaces.
The translation of an SRM address to an HTTP URL is sometimes not direct, and LFC oversimplifies it, sometimes producing invalid HTTP addresses.

Most web caches are flexible and allow choosing, among several pre-configured backends, which one to contact in a given situation. This allows the system to balance the load across the backends and to remain unaffected if one of the backends goes offline. For the reasons previously stated, and for simplicity, we will nevertheless assume (without loss of generality) that it is enough to have a single backend server, to which all requests are sent when a cache miss occurs.

2.2 Cache keys and redirects

Even though it is the final HTTP request, with the URL of the replica, that returns the actual contents of the file, the file's contents must be cached using its logical name as the cache key, which is unique in the grid and common to all its replicas. Caching the file under the replica's path is a bad idea, because a subsequent request for the same file (using the same logical name) can end in a redirect to another replica. The cache would hold the contents from the first request but would not find them under a different replica name, resulting in a cache miss and a refetch of the same file from the other replica.

For a similar reason, the cache key must not include the token generated for each request. The token is supposed to be unique for each request, and including it in the cache key would prevent any cache hit.

Several web caches allow specifying the cache key for each request, but that is not enough. Either the cache tunnels the redirects to the client and is able to connect information between its requests, so that it caches the replica request under the URL of the first request, or it needs to resolve the redirect itself.

Figure 3: Resolution of redirect by the web proxy

To relate the two requests, HTTP headers can be used.
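The cache-key rule of this section (key on the logical name of the first request, never on the replica URL or the token) can be illustrated with a minimal Python sketch. It is not taken from any particular web cache product. The query parameter name `token`, the function name `cache_key` and the example URLs are hypothetical, and dropping the host from the key assumes the single-backend setup described earlier.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical name of the query parameter that carries the per-request token.
TOKEN_PARAM = "token"


def cache_key(first_request_url: str, token_param: str = TOKEN_PARAM) -> str:
    """Derive a cache key from the URL of the first (logical-name) request.

    The host and the token are dropped, so every request for the same
    logical name maps to the same key, whichever replica serves it.
    """
    parts = urlsplit(first_request_url)
    # Keep any other query arguments, in a stable order, but never the token.
    query = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != token_param
    )
    key = parts.path
    if query:
        key += "?" + urlencode(query)
    return key
```

Two requests for the same logical name that carry different tokens then produce the same key, so the second request can be served from the cache even if the backend would have redirected it to a different replica.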
The web cache can set an HTTP cookie or add data to the URL, for example, indicating the original URL. This strategy would require that all the HTTP clients used in the grid support not only redirects but also cookies.

Since the clients are supposed to be close (in terms of latency) to the web cache, there is no big difference in performance between the two solutions. The chosen option should be the one that integrates more easily with the existing products.

2.3 Security

WLCG uses the GSI (Grid Security Infrastructure), which is based on grid proxies extended by VOMS attributes for authentication. This allows differentiated read and write access permissions on each file for every user.

Ideally, the web proxy would use the same system: the user creates a grid proxy and delegates their privileges to the web cache, and the web cache acts on behalf of the user, using the proxied credentials when fetching data from resources. This system has, in practice, important drawbacks:

• Typically, web proxies or caches don't support GSI. This further complicates the use of web cache products, by requiring substantial extensions or modifications to them.
• Cached data would only be valid for the user who fetched it. When other users request it, the web proxy would need to contact the backend server and check the credentials of each user. Then the data in cache would be returned only if the credentials of the user are valid for that file.