Web caches for grid HTTP access

Duarte Meneses ([email protected])
Fabrizio Furano ([email protected])

April 1, 2014

1 Introduction

Sites usually have several jobs and people accessing files in the grid. Even though files are often available on site, they may come from several storage elements, which can be located at any other site. Due to recent developments, many storage elements currently have an HTTP interface, implemented for example with XRootd or with a WebDAV server. Thus, the files can be accessed with this widely known protocol using any HTTP client, instead of forcing the usage of grid-specific protocols. The use of HTTP could also enable easy caching of data using existing web caching products. In this document we explore this possibility further.

The HTTP resources (files) can be re-fetched several times within the same site, either by the same user or not. This can be especially frequent with batch jobs, which periodically pull the same files from the network. To optimize the access to the data and reduce the load on the network, we are investigating potential advantages and implications of deploying HTTP caches at the sites. Accessing a cached file within the site could bring a significant performance boost, particularly for sites that constantly consume external resources, i.e. resources located in other sites.

Using web caches is a solution often adopted by heavily accessed websites to optimize access to their content by caching it (possibly at several locations, close to the users). These web caches are deployed in reverse mode, and the DNS record usually points to them.


Accessing the grid using HTTP entails several requirements that are quite particular, because the use case is distinctive from the typical web use case, as will be explained next. Several web cache products are evaluated to understand how they could fit our specific case. In this document, "backend" and "origin server" refer to the same thing.

Relevant work has already been done in this area, and a fully working proxy cache has already been developed for grid data access [1]. Still, there are a few reasons to revisit the problem:

• HTTP interfaces: several storage elements now support HTTP access. This simplifies the system, as web caches may access the storage elements directly.

• Security constraints could perhaps be relaxed. Read access could be given to web caches, which would themselves possess a certificate, instead of proxying the users' permissions.

• Cache performance: some cache features might significantly improve performance; for example, streaming content directly to the client while it is being cached vastly improves latency when a cache miss occurs.

2 Web access

Files in the grid are identified by a unique logical name, which is registered in the grid's file catalogue, and may exist physically at several locations, in storage elements (SEs). The physical copies of a file are called replicas of the file. The size of files in the grid can range from a few bytes to several gigabytes.

In the grid, users can request files using HTTP by pointing an HTTP client to a webserver and providing the file's logical name. The users authenticate to the HTTP servers by connecting over SSL and providing a user certificate that is signed by a certificate authority trusted by the grid. When the server authenticates the client, it generates a token for that particular request. The token is only valid for that client and for a limited amount of time. At the same time, the server tries to find a replica of the file and, if successful, it redirects the client to the replica. The server finds replicas by checking its local file store or by consulting a file catalogue (local or remote), which can return the location of the replicas. The redirect is done through the HTTP response "302 Found" [2], which includes a Location header with the new URL that the client should follow. This redirect includes the token generated during authentication, as an argument in the URL.

Figure 2: Web Access
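To make the exchange concrete, the following is a minimal sketch of this flow using Python's requests library. The hostnames, paths and token format are illustrative assumptions, not real endpoints.

import requests

# The client authenticates over SSL with its grid user certificate.
session = requests.Session()
session.cert = ("/path/to/usercert.pem", "/path/to/userkey.pem")
session.verify = "/etc/grid-security/certificates"  # CAs trusted by the grid

# Request the file by its logical name; don't follow redirects automatically,
# so that the exchange stays visible.
r = session.get("https://federator.example.org/grid/lfn/data/file.root",
                allow_redirects=False)
assert r.status_code == 302            # "302 Found"
replica_url = r.headers["Location"]    # replica URL, token appended as argument
print(replica_url)  # e.g. https://se.example.org/path/file.root?token=...

# Following the redirect yields the file contents in a "200 OK" response.
r = session.get(replica_url)
assert r.status_code == 200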

Due to the existence of very large files, one important HTTP feature that is commonly used by the clients is the Range header. With this field, it is possible to get chunks (pieces) of a file without having to download the entire file, by specifying one or more byte ranges.

Some servers might only be able to find the file if it is in their local storage element, while others have a global knowledge about the grid and may find files located at other storage elements. These servers might consult the grid file catalogue, be the front server of other servers, or belong to a federation, having knowledge about the federated resources.

Therefore, the replica that a server finds might or might not be located on the same server, so the client might be redirected to another server. Redirections might also happen several times before the client reaches the actual replica. After following the redirects, the client will eventually reach the final location and get the file's contents in a normal HTTP "200 OK" response. Figure 2 shows the exchange between the HTTP client and server, in a simple case where the replica of the file is located at the same server where the first request was made.

2.1 Backend

Depending on the resource requested, different servers might need to be contacted. By requesting resources from an LFC (with HTTP support) or a federator (a server with global knowledge about a federation), the job of finding resources is passed to that server. That server will try to locate the resource and issue an appropriate redirect. With this strategy, the web cache can be configured with a single backend server, as long as it supports redirects, possibly to other servers.

Federations are preferable over LFC/HTTP because they are much faster. Also, the LFC keeps an index of SRM interfaces; the translation of an SRM address to an HTTP URL is sometimes not direct and the LFC oversimplifies it, sometimes resulting in invalid HTTP addresses.

Most web caches are flexible and allow choosing, amongst several pre-configured backends, which one to contact in a given situation. This allows the system to balance the load across the backends and to not be affected in case one of the backends goes offline. For the reasons previously stated, and for simplicity, we will nevertheless assume (without loss of generality) that it is enough to have a single backend server, to which all requests are made when a cache miss occurs.

2.2 Cache keys and redirects

Even though it is the final HTTP request, with the URL of the replica of the file, that returns the actual contents of the file, the file's contents must be cached using its logical name as cache key, which is unique in the grid and common to all its replicas. Caching the file using the replica's path is a bad idea, because a subsequent request for the same file (using the same logical name) can end in a redirect to another replica of it. The cache would have the contents of the first request in cache but wouldn't find them under a different replica name, resulting in a cache miss and in a refetch of the other replica of the same file. Moreover, for a similar reason, the cache key must not include the token generated for each request. The token is supposed to be unique for each request, and including it in the cache key would prevent any cache hit from happening.

Several web caches allow specifying the cache key for each request, but that is not enough. Either the cache tunnels the redirects to the client and is able to connect information between its requests, so that it caches the replica request using the URL of the first request, or it needs to resolve the redirect itself. To relate two requests, HTTP headers can be used: the web cache can set an HTTP cookie or add data to the URL, for example, indicating what the original URL was. This strategy would require that all the HTTP clients used in the grid support not only redirects, but also cookies. Since the clients are supposed to be close (in terms of latency) to the web cache, there is no big difference between the two solutions in terms of performance. The chosen option should be the one that integrates most easily with the existing products.

Figure 3: Resolution of redirect by the web proxy
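As an illustration, here is a hedged sketch of the cache-key derivation: the key is taken from the first request (the logical name) with the per-request token stripped, so that all replicas and all tokens map to the same cached object. The URL layout is an assumption.

from urllib.parse import urlsplit

def cache_key(original_url: str) -> str:
    """Key cached content by the logical name, never by replica URL or token."""
    parts = urlsplit(original_url)
    # Drop the query arguments (the token is unique per request and would
    # prevent any cache hit) and keep only the logical path.
    return parts.path

key = cache_key("https://federator.example.org/grid/lfn/data/file.root?token=abc123")
# -> "/grid/lfn/data/file.root": identical for every replica and every request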

2.3 Security

WLCG uses GSI (the Grid Security Infrastructure), which is based on grid proxies extended with VOMS attributes for authentication. This allows differentiated read and write permissions on each file for every user. Ideally, the web proxy would use the same system: the user creates a grid proxy and delegates its privileges to the web cache, and the web cache acts on behalf of the user, using the proxied credentials when fetching data from resources. This system has, in practice, important drawbacks:

• Typically, web proxies or caches don’t support GSI. This further complicates the usage of web cache products, by requiring substantial extensions or modifications to them.

• Cached data would only be valid for the user who fetched it. When other users request it, the web proxy would need to contact the backend server and check the credentials of each user. The data in cache would then be returned only if the user's credentials are valid for that file.

The last point negates the advantage of low latency on cache hits in the cases where different users access cached files within the site. Considering these difficulties, an alternative approach might be preferable. Most files in the grid belonging to a certain VO are readable by all users belonging to that VO. So one option is for the web cache itself to be registered in the VOMS - with minimal permissions - and to use its own credentials to fetch data. Only users belonging to the VO will be able to access the web cache; this can be enforced by the web cache itself, or within the site. Sensitive data, to which only certain users have access, won't be accessible by the web proxy with its minimal permissions. Users wanting to access it would have to bypass the proxy and use their own credentials.

2.4 Range requests

Ranges are frequently used by clients accessing files in the grid. Instead of requesting the entire file, clients request parts of it by specifying the range of bytes that they want, allowing fast random access to any specific piece of the file. The HTTP "Range" field can be used to specify several intervals of bytes, which can even intersect each other. This is an example of a ROOT program request:

[truncated] Range: bytes=9935544-9943311, 9962737-9965622, 9974976-9982123, 10058678-10087098, 10112599-10122414, 10159358-11263472, 11476161-11482985, 11494800-11496245, 11643684-19920592, 19928998-19930317, 19948371-20036026, 20045700-20069828,

The usage of ranges is, in general, problematic for most web caches. There are several strategies that can be followed:

1. simply tunnel these requests, and tunnel the response back to the client without accessing the cache;
2. if the chunk is in cache, return it in a 206 HTTP response. If it's not, first pull the entire file from the backend server (by removing the "Range" HTTP header) and cache it;
3. if the chunk is in cache, return it in a 206 HTTP response. If it's not, pull the entire file from the backend server while streaming it to the client once the download reaches the relevant part;
4. use the Range header as part of the web cache key. If this specific range of the file is not in cache, request it from the backend and cache the 206 HTTP response. Every different combination of range and resource will be cached at a different location;
5. cache chunks of a resource using a BitTorrent-like file, so that different pieces can be combined, and pull from the backend just the pieces that are missing in cache and that belong to the range requested by the client;
6. a combination of 1 and 3, depending on how far into the file the range requested by the client is.

Option 1 makes the web cache mostly useless if most of the requests are for chunks. Option 2 is problematic if the files are large: the client that requests a chunk (even a very small one) of a file that is not cached would have to wait for the entire file to be downloaded by the web cache; it could even time out in some situations, and it could make the proxy use a lot of space. Option 3 solves the problem associated with option 2 when the range request is for chunks near the beginning of the file: the client just needs to wait until the proxy downloads up to the point where the requested chunk starts. Then it starts to receive it as the proxy downloads from the backend. The proxy continues to download until the end, and therefore caches the entire file. Option 4 could allow a lot of duplicate data to be stored in the web cache, quickly filling it even with a few files, especially if the requested ranges are very varied. Option 5 is the one that would perform best, but it is by far the most complicated, and it doesn't fit the way web caches store objects; no web cache seems to support it.
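As an illustration of option 4, the following sketch makes a normalized Range header part of the cache key. The normalization step is our own assumption, added so that trivially equivalent range sets do not produce distinct keys.

from typing import Optional

def range_cache_key(logical_name: str, range_header: Optional[str]) -> str:
    """Cache key for option 4: logical name plus the normalized byte ranges."""
    if not range_header:
        return logical_name
    # Normalize: strip whitespace and sort the intervals, so that equivalent
    # range sets map to the same key (and to the same cached 206 response).
    unit, _, ranges = range_header.partition("=")
    intervals = sorted(part.strip() for part in ranges.split(","))
    return logical_name + "#" + unit.strip() + "=" + ",".join(intervals)

# Two equivalent requests hit the same cached 206 response:
assert (range_cache_key("/lfn/f.root", "bytes=0-99, 200-299")
        == range_cache_key("/lfn/f.root", "bytes=200-299,0-99"))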

2.5 Streaming

One of the disadvantages of many web caches is that, in the event of a cache miss, the file first has to be downloaded from the backend to the web cache (where it is stored in cache if it's cacheable). In these web caches, only after they download the entire file do they start to stream it to the end client. This behavior can result in a serious performance hit if the file is large. In some cases, the client might even time out while waiting for the first bytes of the requested resource. This was one of the factors in [1] that contributed negatively, in a significant way, to the performance tests of the web cache.

In the grid, files are often several gigabytes, so it would be very advantageous if, in the case of a cache miss, the web cache were able to stream the file to the client while it is downloaded from the backend server. An alternative solution to prevent the client from timing out is to send an HTTP "503 Service Unavailable" response to the client with the "Retry-After" field set. The first option is much preferable though, because not all HTTP clients support this functionality, and because it only protects against timeouts; it does not improve the latency on cache misses.

Figure 4: Direct streaming of a file after a cache miss
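A minimal sketch of this behavior, assuming a Python proxy handler built on the requests library: the proxy forwards each chunk to the client as soon as it arrives from the backend, while writing it to the cache at the same time. URLs and paths are illustrative.

import requests

def stream_and_cache(backend_url: str, cache_path: str, chunk_size: int = 64 * 1024):
    """Yield chunks to the client while persisting them to the cache."""
    with requests.get(backend_url, stream=True) as resp:
        resp.raise_for_status()
        with open(cache_path, "wb") as cache_file:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                cache_file.write(chunk)  # fill the cache...
                yield chunk              # ...and stream to the client at once

# e.g. a WSGI/ASGI handler would iterate over
# stream_and_cache("https://se.example.org/f.root?token=...", "/cache/f.root")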

2.6 Scalability

Several web caches can be combined at a site. This is a necessary feature to be able to scale a web cache, knowing that the resources of individual machines are limited. If we have many clients, a single web cache might not be able to serve all requests, at least not at high speed. In these cases, the storage access speed is usually the bottleneck. If the set of files being accessed is much larger than the cache size, there will also be a lot of cache misses, and the deployment of a web cache in these conditions has very limited benefits. As in our use case there are several backends, if the clients request resources that are spread across several backends, it can easily be the case that it would be faster for them to access the backend servers directly. When deploying several cache nodes, we can combine their capabilities, such as network bandwidth, storage access speed, storage size and request processing power.

Multiple web caches can be combined in two ways: cooperatively or independently. In the first case, they act as a single cache, in the sense that files are not stored more than once across all nodes. In the second case, the nodes act individually, and replication of storage may happen. Both strategies can also be combined.

For the web caches to cooperate, they need to communicate with each other. Most web caches offer this functionality and use an inter-cache communication protocol that enables the cooperation, such as the Internet Cache Protocol (ICP) [4]. Basically, when a client requests a file from a web cache and that particular node doesn't have the file (a cache miss occurs), it contacts each of its peers to see if any of them has the file. If one of them answers positively, the node either tunnels the file to the client or redirects the client to the cache node that contains the file. Otherwise, the cache node to which the client initially made the request pulls the file from the backend. With this strategy, files are never stored twice in the entire cluster of web caches, and thus the storage space of all web caches is combined in an optimal way. The main disadvantage is that if many clients request the same file, a single server (the one holding that particular file) has to handle all the requests. This could overload that server while all other servers remain idle.

The other way of combining web caches doesn't require any special feature to be supported by the web caches. A load balancer simply assigns each new request to one of multiple web caches. This strategy doesn't improve hit rates, but it manages the load. With a round-robin algorithm, the load will be evenly spread across all the cache nodes. This allows much better handling of situations where there are spikes with a lot of requests. Even if the same file is requested by many clients, it ends up being cached and served from many cache nodes, increasing the download speed for each client. The disadvantage is that the file would need to be downloaded several times from the backend. The caches' storage space isn't combined, so the hit rate is not improved: each web cache acts as if it were alone, and there is only a cache hit if the requested file is in its own cache.

In any case, we usually have a load balancer that assigns new clients to one of several nodes, so that several nodes share the load of handling the initial requests. The assignment can follow different algorithms: random, round robin, depending on the load, etc.
Load balancing also has the advantage that if a web cache goes offline, the load balancer can detect this and avoid assigning clients to it.
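For the independent deployment, a minimal sketch of the load balancer's assignment logic (node addresses are illustrative):

from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.offline = set()
        self._next = cycle(self.nodes)

    def mark_offline(self, node):
        # e.g. called after a failed health check on that cache node
        self.offline.add(node)

    def pick(self):
        """Return the next available cache node, spreading load evenly."""
        for _ in range(len(self.nodes)):
            node = next(self._next)
            if node not in self.offline:
                return node
        raise RuntimeError("no cache node available")

balancer = RoundRobinBalancer(["cache1:3128", "cache2:3128", "cache3:3128"])
print(balancer.pick())  # cache1:3128, then cache2:3128, ...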

3 Analysis of federated xrootd transfers

All data transfers from federated xrootd servers end up being registered in a database. The servers use a plugin so that, when an open operation finishes and the data transfer is complete, information about the transfer is inserted into an Oracle SQL database. This information can later be accessed and processed. With the complete registry of all data transfers available, it is possible, by processing it, to draw several conclusions about the usage pattern of applications or users accessing files in the grid through HTTP, and even to simulate what the performance of a web cache at a specific site would be.

It is very important to point out that most of the jobs at a site access data mostly from the local storage element using other servers, and these accesses don't get registered in this database. Sites schedule the transfer of data to their SE in a coordinated way, so that the data is available before the job starts. The xrootd federation is accessed when, for any reason, an open of a file in the local storage fails. Therefore, the conclusions drawn from this analysis are limited. They can naturally be used to assess the usage of a web cache serving content from federated xrootd servers only. The data volume and number of opens to federated xrootd servers is not insignificant, and there would be advantages in using a cache at several sites. The conclusions could also be relevant in evaluating the possible advantages of having sites use only a web cache to access the grid, replacing any existing local storage. In that case, we would need to assume that the accesses to the xrootd servers and to the local SE have a similar pattern, so that we can extrapolate the conclusions to all accesses at a site. These are two different possible use cases for deploying a web cache at a site.

There are also different possible advantages to benefit from when using web caches. One is to reduce the amount of data transferred between sites: the fact that clients access cached content located within the same site means that, potentially, less data will be transferred from other sites. The second advantage is reducing the delay and increasing the bandwidth when accessing the data; local connections usually have lower latency and higher bandwidth than WAN connections. In the case where the local storage is replaced by a cache, there could also be an advantage in simplifying maintenance or reducing costs.

The database has a dataset for all federated CMS accesses and another dataset for the federated ATLAS accesses. A program was developed not only to analyze the data but also to do a full simulation of what the performance of web caches with varying storage sizes would be. A simulation is done for each site, evaluating the performance of a web cache serving all the accesses coming from that site. It was done using Java and JDBC, and profiled and optimized to be able to analyze all the records from the last 90 days. Here, we consider that each domain is a site, and our data covers around 100 different domains. A total of around 33 million records was analyzed for CMS. A summary of the analysis of the CMS records is shown in the next table.

domain | total accesses | unique lfn | ratio repeated | working set (GB) | transferred (GB) | avg rate (MB/s) | max rate (MB/s) | avg ov rate (MB/s) | max ov rate (MB/s) | max concurrent clients | ratio size | hit 1 TB | hit 10 TB
hep.wisc.edu | 7584132 | 2419225 | 0.68 | 5223618.35 | 2065801.58 | 0.25 | 1787.64 | 308.75 | 113015.24 | 20939 | 0.13 | 0.12 | 0.18
rcac.purdue.edu | 3727433 | 733381 | 0.80 | 567169.38 | 3534124.41 | 0.28 | 3067.67 | 523.31 | 25007.31 | 8824 | 0.26 | 0.20 | 0.37
fnal.gov | 6268630 | 496135 | 0.92 | 1346183.80 | 3800905.69 | 0.33 | 1396.31 | 576.40 | 16285.56 | 21223 | 0.33 | 0.04 | 0.43
brunel.ac.uk | 812845 | 194763 | 0.76 | 106349.04 | 305691.62 | 0.18 | 1433.86 | 49.80 | 5887.24 | 1773 | 0.32 | 0.55 | 0.67
in2p3.fr | 696718 | 190248 | 0.73 | 167051.15 | 250701.11 | 0.11 | 292.10 | 40.01 | 1158.25 | 2137 | 0.23 | 0.31 | 0.59
unl.edu | 911978 | 177557 | 0.81 | 565434.99 | 478301.36 | 0.21 | 518.35 | 72.71 | 885.93 | 3377 | 0.32 | 0.59 | 0.65
t2.ucsd.edu | 250221 | 100477 | 0.60 | 323747.30 | 140436.35 | 0.15 | 200.22 | 23.90 | 399.13 | 1647 | 0.19 | 0.15 | 0.43
oeaw.ac.at | 3250241 | 88995 | 0.97 | 23159.76 | 36479.28 | 0.23 | 590.02 | 9.86 | 668.26 | 3536 | 0.07 | 0.15 | 0.97
knu.ac.kr | 1277256 | 79900 | 0.94 | 27202.39 | 201172.75 | 0.29 | 246.12 | 56.18 | 1437.02 | 5013 | 0.52 | 0.31 | 0.77
ultralight.org | 128466 | 74193 | 0.42 | 238761.76 | 139457.99 | 0.16 | 351.25 | 21.07 | 3283.32 | 1512 | 0.39 | 0.21 | 0.31
chtc.wisc.edu | 113735 | 73216 | 0.36 | 54812.53 | 61524.60 | 0.37 | 1388.24 | 53.01 | 2385.21 | 1157 | 0.71 | 0.17 | 0.24
kfki.hu | 489395 | 71348 | 0.85 | 15031.87 | 102399.90 | 0.40 | 390.92 | 21.88 | 733.57 | 722 | 0.31 | 0.79 | 0.85
cern.ch | 1138876 | 70916 | 0.94 | 35744.98 | 90251.98 | 0.05 | 121.72 | 16.90 | 1386.54 | 5043 | 0.12 | 0.70 | 0.93
kipt.kharkov.ua | 271863 | 63584 | 0.77 | 36758.13 | 84212.13 | 0.35 | 452.69 | 25.92 | 758.29 | 698 | 0.33 | 0.69 | 0.74
lal.in2p3.fr | 218114 | 62362 | 0.71 | 58357.43 | 85027.00 | 0.15 | 287.15 | 15.69 | 581.97 | 976 | 0.27 | 0.44 | 0.62
ts.infn.it | 105104 | 62079 | 0.41 | 16502.86 | 10602.07 | 0.16 | 131.92 | 7.09 | 414.24 | 526 | 0.33 | 0.30 | 0.41
desy.de | 106295 | 33990 | 0.68 | 85309.29 | 45716.00 | 0.18 | 610.19 | 14.39 | 14150.42 | 832 | 0.17 | 0.24 | 0.54
accre.vanderbilt.edu | 117455 | 33404 | 0.72 | 22538.79 | 25399.11 | 0.05 | 269.53 | 6.38 | 1194.88 | 2013 | 0.34 | 0.49 | 0.69
sscc.uos.ac.kr | 326940 | 32610 | 0.90 | 13608.78 | 26012.52 | 0.07 | 140.90 | 11.72 | 214.19 | 1819 | 0.16 | 0.45 | 0.90
cmsaf.mit.edu | 377240 | 27384 | 0.93 | 34461.09 | 19893.96 | 0.16 | 52.73 | 4.66 | 578.41 | 3746 | 0.21 | 0.70 | 0.92
csc.fi | 141906 | 23226 | 0.84 | 6619.52 | 20513.59 | 0.36 | 399.75 | 5.95 | 589.35 | 213 | 0.23 | 0.72 | 0.84
psi.ch | 39406 | 23020 | 0.42 | 8727.92 | 3056.58 | 1.56 | 78.66 | 13.66 | 294.54 | 64 | 0.21 | 0.04 | 0.42
icecube.wisc.edu | 55689 | 14153 | 0.75 | 7735.31 | 3282.52 | 0.16 | 176.02 | 4.54 | 383.29 | 246 | 0.38 | 0.74 | 0.75
mit.edu | 18450 | 12760 | 0.31 | 16005.87 | 4159.91 | 1.32 | 11.72 | 69.01 | 374.31 | 334 | 0.22 | 0.30 | 0.31
cr.cnaf.infn.it | 130563 | 11960 | 0.91 | 33295.48 | 59827.50 | 0.12 | 383.35 | 28.12 | 5836.59 | 2976 | 0.27 | 0.66 | 0.91
cs.wisc.edu | 18037 | 10525 | 0.42 | 7825.73 | 6460.86 | 0.37 | 875.97 | 6.96 | 934.72 | 145 | 0.73 | 0.41 | 0.42
grid.hep.ph.ic.ac.uk | 40935 | 9608 | 0.77 | 27664.66 | 18001.56 | 0.11 | 31.73 | 5.86 | 348.96 | 2323 | 0.37 | 0.76 | 0.76
nat.nd.edu | 1577618 | 9342 | 0.99 | 32656.63 | 199965.90 | 0.18 | 137.78 | 154.43 | 9165.66 | 4579 | 0.03 | 0.13 | 0.76
crc.nd.edu | 136232 | 7730 | 0.94 | 27014.13 | 32133.31 | 0.17 | 74.88 | 16.24 | 1028.76 | 1296 | 0.07 | 0.69 | 0.92
phy.ncu.edu.tw | 14351 | 6861 | 0.52 | 10053.01 | 6795.52 | 0.05 | 14.38 | 4.11 | 57.90 | 621 | 0.47 | 0.52 | 0.52
pi.infn.it | 30039 | 5741 | 0.81 | 16863.57 | 5503.48 | 0.02 | 166.32 | 1.41 | 167.24 | 1058 | 0.10 | 0.78 | 0.81
hip.fi | 11959 | 5403 | 0.55 | 1251.02 | 1714.24 | 0.16 | 240.44 | 0.96 | 354.18 | 72 | 0.39 | 0.55 | 0.55
physics.ucsd.edu | 4914 | 4543 | 0.08 | 14743.43 | 843.14 | 0.09 | 48.60 | 2.18 | 52.13 | 319 | 0.05 | 0.06 | 0.08
nlab.tb.hiit.fi | 19357 | 4384 | 0.77 | 5249.55 | 4700.05 | 0.09 | 22.13 | 5.19 | 36.79 | 198 | 0.16 | 0.68 | 0.77
ba.infn.it | 23297 | 4365 | 0.81 | 8423.48 | 13981.39 | 0.42 | 267.60 | 3.25 | 689.57 | 149 | 0.44 | 0.78 | 0.81
gridpp.rl.ac.uk | 22251 | 3727 | 0.83 | 10706.96 | 30032.36 | 0.45 | 47.98 | 21.67 | 246.31 | 790 | 0.51 | 0.13 | 0.83
unknown | 69931 | 3561 | 0.95 | 7047.43 | 5861.71 | 0.02 | 62.44 | 1.58 | 97.63 | 934 | 0.05 | 0.92 | 0.95
hep.kbfi.ee | 13026 | 2976 | 0.77 | 10429.47 | 7455.74 | 0.17 | 4.58 | 2.18 | 61.14 | 299 | 0.30 | 0.74 | 0.77
gridka.de | 31274 | 2465 | 0.92 | 5616.69 | 17567.95 | 0.13 | 181.26 | 13.95 | 585.28 | 1397 | 0.39 | 0.91 | 0.92
ihepa.ufl.edu | 6584 | 2331 | 0.65 | 6313.18 | 81.25 | 0.42 | 105.46 | 0.44 | 124.36 | 6 | 0.01 | 0.64 | 0.65
xlate.ufl.edu | 47963 | 2146 | 0.96 | 6353.17 | 29423.56 | 0.12 | 362.33 | 19.44 | 688.45 | 2807 | 0.64 | 0.95 | 0.96
grid.sinica.edu.tw | 84045 | 1930 | 0.98 | 804.15 | 37769.30 | 0.36 | 411.66 | 8.53 | 1272.25 | 755 | 0.64 | 0.98 | 0.98
jinr.ru | 22083 | 1794 | 0.92 | 357.77 | 3371.75 | 0.02 | 234.72 | 1.75 | 286.35 | 591 | 0.24 | 0.92 | 0.92
sprace.org.br | 9528 | 1745 | 0.82 | 6205.96 | 4742.47 | 0.09 | 9.84 | 2.81 | 35.25 | 323 | 0.32 | 0.71 | 0.82
ifca.es | 24488 | 1717 | 0.93 | 1624.17 | 1760.27 | 0.02 | 82.91 | 0.61 | 86.10 | 520 | 0.12 | 0.93 | 0.93
datagrid.cea.fr | 30761 | 1636 | 0.95 | 1964.81 | 2334.23 | 0.01 | 38.89 | 0.90 | 56.52 | 1232 | 0.08 | 0.95 | 0.95
physik.rwth-aachen.de | 8426 | 1521 | 0.82 | 4276.67 | 2527.39 | 0.16 | 28.96 | 4.86 | 254.06 | 350 | 0.28 | 0.81 | 0.82
acrc.bris.ac.uk | 37075 | 1502 | 0.96 | 437.30 | 3176.99 | 0.52 | 248.08 | 2.80 | 248.08 | 66 | 0.12 | 0.96 | 0.96
lnl.infn.it | 26517 | 1461 | 0.94 | 4364.07 | 1902.52 | 0.01 | 11.03 | 0.91 | 26.50 | 1144 | 0.05 | 0.94 | 0.94
hep.fiu.edu | 6918 | 1434 | 0.79 | 4707.24 | 4584.75 | 0.25 | 9.84 | 6.55 | 59.55 | 183 | 0.43 | 0.70 | 0.79

Table 1: Summary of CMS transfers from top domains

This table gives us clues about which sites could benefit from a web cache. If the total working set is relatively small, it means that all the files accessed at a site could fit in the cache's storage, and there would only be cache misses for the first fetch. This case corresponds to the best performance achievable by a cache, and would have the hit rate shown in the column "ratio repeated". This number is high for sites that have a lot of accesses to a relatively low number of unique files.

Figure 5: Good cache performance at crc.nd.edu

If the total working set size is much larger than the cache size, the hit rate will depend a lot on the time-locality of the file accesses. To be more precise, it depends on the sizes of the other files accessed between two consecutive accesses to the same file. If, at a site, the usage pattern consists of bursts of multiple accesses to the same file, the cache will be able to find the file in cache even with a relatively small storage, since it usually recycles its space using the Least Recently Used (LRU) algorithm. Two columns, "hit 1 TB" and "hit 10 TB", were generated, showing what the actual hit rate of caches with 1 TB and 10 TB of storage space, respectively, would be. At some sites a web cache with 1 TB of storage space would have very high hit rates.

One important characteristic of grid data access is that most open operations only read a fraction of the file - see the column "ratio size", which shows the average fraction of the file read in each open. While most caches support these vector reads, they will fetch and cache the entire file from the backend. This makes a cache miss especially expensive, causing high traffic between the cache and its backend even if a client requested only a small piece of the file. The analysis shows that in most situations, even with a good hit ratio, the traffic between the backend and the cache would be higher than the traffic between the cache and its clients. As a consequence, it is unlikely that a web cache will actually bring the benefit of reducing inter-site traffic, and the main benefit will be the reduced data access latency.

Figure 5 shows a site where a cache would perform well. We can see that with a relatively small cache, much smaller than the total size of all unique files, we already get a high hit rate, close to the optimal performance of an infinite-sized cache. This is probably due to a favourable pattern of file access, as explained. In Figure 6 we can observe the time evolution of the operations, also at crc.nd.edu, and notice some occasional peaks.

Another challenge for caches is dealing with peaks of usage. The column "max ov rate" shows the maximum rate at which the site potentially consumes data. This is an approximation given by the sum of the average rates of all transfers occurring at any given time, and is subject to erroneous records. Based on this data, there are very high peaks of usage, with rates reaching gigabytes per second. The cache would need to scale well, probably to several nodes, in order to cope with this occasional high demand.
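The simulation logic can be sketched as follows. This is a simplified reconstruction (in Python rather than the Java actually used), assuming a trace of (logical name, file size) records, and ignoring partial reads and concurrency.

from collections import OrderedDict

def simulate_lru(trace, capacity_bytes):
    """Replay an access trace through an LRU cache and return the hit rate."""
    cache = OrderedDict()   # lfn -> file size, ordered by recency of access
    used = 0
    hits = 0
    for lfn, size in trace:
        if lfn in cache:
            hits += 1
            cache.move_to_end(lfn)   # refresh recency on a hit
            continue
        # Cache miss: evict least recently used files until the file fits.
        while used + size > capacity_bytes and cache:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        if size <= capacity_bytes:
            cache[lfn] = size
            used += size
    return hits / len(trace) if trace else 0.0

trace = [("/lfn/a", 2e9), ("/lfn/b", 3e9), ("/lfn/a", 2e9)]
print(simulate_lru(trace, capacity_bytes=1e12))  # hit rate with a 1 TB cache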

Figure 6: Transfers at crc.nd.edu

However, when the access pattern is different, the web cache sometimes performs badly, even with a very large storage. This can happen for several reasons:

• the site accesses a large set of files;
• the ratio of repeated accesses is low, which means that the overhead of the first access weighs heavily on the overall hit rate;
• there is low locality in the access pattern. In other words, the files may be accessed repeatedly but very dispersed in time, making the cache purge the copy (following the least recently used algorithm).

Figure 7: Bad cache performance at nat.nd.edu

One of the cases where we can see low locality is shown in Figure 7. nat.nd.edu accesses 32 TB of files and has the potential for a 99% hit ratio with a big enough cache. However, a cache with 1 TB of storage space only achieves a hit rate of 13% - one of the lowest. The ratio improves greatly and steadily as the storage space increases. Files are very often accessed repeatedly in this domain, but with a lot of other accesses between the repeated accesses, causing a lot of cache misses in small caches and consequently making them inefficient.

4 Web cache products

Several widely used products were tested and evaluated. Based on the discussion above, the problematic requirements of our specific case are the following:

• support SSL connections, both with the clients and with the backend servers (or, even better, support GSI);
• stream files to the client while they are being downloaded;
• handle range requests well;
• be able to use custom cache keys;
• be extensible, so that we can implement some of the missing features and possibly solve the issue with redirects;
• be scalable, which usually means being able to set up a hierarchy of caches.

Special attention was given to these issues when trying out several web caches. The relevant characteristics of the four products tried are in Table 2. More details about them can be found in Appendix A.

Feature | Apache Traffic Server | Squid | Nginx | Varnish
Reverse cache mode | Yes | Yes | Yes | Yes
Forward proxy | Yes | Yes | Yes | No
Cache | Disk and memory | Disk and memory | Disk | Disk or memory
Backend | Dynamic | Dynamic | Dynamic | Fully dynamic
Parallelism | Threads (I/O and requests) | Worker processes | Worker processes | Thread pool
Selective caching | No | Yes, based on ACL, URL | No | Yes
Streaming | Yes | Yes | No | Yes
Custom caching time | Yes | Yes | Yes, not flexible | Yes
Cache headers handling | Yes | Customizable | Yes | Customizable
Change cache key | Yes | No | Yes | Yes
Follow redirect | No | No | No | Yes
SSL | Clients & backend | Clients & backend | ? | No
Negative caching | Yes | Yes | Yes | Yes
Overwrite headers | No | Yes, but hardcoded | Yes | Yes
Scalability | ICP hierarchy | ICP hierarchy | No | No
Range solutions | Fetch chunk | Fetch full file | Fetch chunk | Fetch full file
Extensibility | Plugins | None | Modules | None
Logging | Good | Good | Good | Excellent

Table 2: Web cache products

5 Solution using Varnish

None of the evaluated products offers, by itself, all the needed functionality. Nevertheless, it may be possible to combine several products and/or to extend them in a way that we end up with a solution that complies with the requirements. In this section we analyse some solutions that do not require extending anything. In Table 2 we can see that Varnish is a good candidate, but it lacks SSL/GSI support and is neither scalable nor extensible. By adding an extra layer that handles SSL/GSI, we would only need to address the scalability. A solution based on this idea was tested, but Varnish turned out to have some bugs that make it incompatible with some uses. These details are explained next.

5.1 Varnish + Apache httpd

We evaluated the combination of Apache and Varnish without any modifications to these programs. This would be a solution consisting of open source products. Apache sits in front (communicating both with the clients and with the backend servers), handling authentication with GSI, and Varnish sits behind, caching and handling redirects.

Apache listens on port 443, as an HTTPS server, and forces clients to provide a client certificate. When a client connects, Apache checks the provided certificate against its trusted certificate authorities (CAs). If the connection is accepted, it is proxied to Varnish. Apache is also bound to localhost port 8080, to receive requests from Varnish.

Varnish has Apache httpd configured as its backend (using plain HTTP). It is also configured to add the remote backend's hostname - the federator - to the HTTP requests that it makes to httpd. Thus, when a cache miss occurs, it sends a request to httpd in the form "GET https://federator host/file". httpd, acting here as a forward proxy, knows to which server it should proxy the request, as it is included in the HTTP request. It is configured to proxy SSL connections and, this being an https address, it establishes a connection to the remote backend over SSL, using the web cache's host certificate and key. The response is tunneled back to Varnish. In case of redirects, the host to which the request is redirected (which might be other than the federator) is already in the "Location" HTTP header, so Varnish just creates a request using that URL, always using httpd as its backend. Apache httpd will be able to fetch any resource requested by Varnish, either over HTTP or HTTPS.
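A minimal sketch of the httpd side of this setup, using standard mod_ssl and mod_proxy directives; hostnames, certificate paths and the Varnish port are illustrative assumptions.

# Front side: terminate SSL/GSI from clients and hand requests to Varnish.
Listen 443
<VirtualHost *:443>
    SSLEngine on
    SSLCertificateFile    /etc/grid-security/hostcert.pem
    SSLCertificateKeyFile /etc/grid-security/hostkey.pem
    SSLCACertificatePath  /etc/grid-security/certificates
    SSLVerifyClient require             # clients must present a grid certificate
    ProxyPass        / http://localhost:6081/   # Varnish (6081 is its default port)
    ProxyPassReverse / http://localhost:6081/
</VirtualHost>

# Back side: forward proxy for Varnish, fetching from the remote backends
# over SSL with the web cache's own host credentials.
Listen 127.0.0.1:8080
<VirtualHost 127.0.0.1:8080>
    ProxyRequests On                    # act as a forward proxy
    SSLProxyEngine on
    SSLProxyMachineCertificateFile /etc/grid-security/hostcertkey.pem
</VirtualHost>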

5.2 Scalability

Varnish does not support any inter-cache communication protocol, so it is impossible to configure a hierarchy of caches in which each node checks if one of its peers has a file. The reason is that Varnish is usually used right in front of a web server, and it is quicker to hand cache misses over to it than to find another node that could potentially have the file in cache. It is still possible to use the Apache frontend instance to spread the load over multiple Varnish instances, and also to specify a vertical cache configuration. This is limited though, as discussed in section 2.6.

The performance tests suggest that scalability is important. It can be achieved without cooperation between nodes by using a load balancer. For a more effective solution, we tried to federate several web caches. A federator is used to manage the cooperation between the web caches. The software chosen for this was the Uniform Generic Redirector (UGR). The internal workings of UGR are out of the scope of this document, but can be understood in detail from [3].

Figure 8: Scalable solution with Varnish

The basic idea is that, in the event of a cache miss, the cache relies on the UGR to check whether any other cache node has the file, before asking the remote backend. Consider the situation where a request was made to a particular cache and resulted in a cache miss. The cache first uses the UGR as its backend and requests the file. When UGR receives a request, it performs a HEAD request for that file to all the cache nodes. If any of the nodes answers with "200 OK" instead of "404 Not Found", it has the file, which means the file is within the set of caches. UGR then returns a redirect to the cache node that has the file, and the cache passes this redirect on to the client. The client then requests the file from the cache node that has it, and receives the cached copy. Otherwise, if no node has the file, UGR returns "404 Not Found" and the cache has to get it from the remote backend.

The objective of UGR's coordination between the caches is to minimize the accesses to the backend servers by avoiding fetching a file that is already in any of the caches. UGR also caches both positive and negative answers, with variable timings; thus, not all requests to UGR result in a broadcast to all nodes.

Special care has been taken to avoid loops. Varnish will not lock a request coming from the UGR together with the original request from the client - otherwise this would result in a deadlock. The UGR now uses a plugin that filters the requesting node out of the possible answers. If a node queries the UGR, it no longer has the requested file, and UGR should never give that same node as an answer, as it is necessarily an outdated answer and could be hiding a valid result.

The full flow chart of Varnish's behavior can be seen in Figure 9, and an example of a configuration file for Varnish can be found in Appendix B.
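The coordination logic can be sketched as follows: a simplified Python model of the behavior described above, not UGR's actual implementation; node addresses are illustrative.

import requests

def locate_in_caches(path, cache_nodes, requesting_node):
    """Return the URL of a cache node holding the file, or None."""
    for node in cache_nodes:
        if node == requesting_node:
            continue  # a querying node no longer has the file: never answer it
        try:
            # Probe the node's cache without triggering a backend fetch.
            r = requests.head("http://" + node + path,
                              headers={"Cache-Control": "only-if-cached"},
                              timeout=2)
        except requests.RequestException:
            continue
        if r.status_code == 200:   # "200 OK": this node has a cached copy
            return "http://" + node + path
    return None  # answer "404 Not Found": fetch from the remote backend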

Figure 9: Varnish flow chart

6 Another solution: Squid and Apache

Squid is one of the most widely used proxies; it is already used at several sites and is therefore known to their administrators. This is a big advantage to start off with. To make it work, we would need to find a way to follow redirects, as discussed previously: this is a feature that Squid does not support. Since Squid is not extensible, and since we would need to use Apache httpd anyway because of its GSI support, the solution would be to develop a plugin for Apache httpd to handle the redirects. This is not a trivial task, due to the complexity and architecture of httpd. httpd is extensible by hooking functions that are called at various stages of the processing of HTTP requests. To handle redirects, we need to be able to restart the processing of the request (with a different backend), or to resolve several backends in a way entirely independent of the processing of the original request. The easiest way might be to adapt mod_proxy to do it. Figure 10 shows how Squid would use httpd as its frontend.

Figure 10: Squid with Apache httpd

Compared with the solution using Varnish, we would lose the fully programmable handling of requests. On the other hand, Squid is very mature when it comes to scaling, offering a lot of options. It uses ICP to make several instances cooperate, so it can scale easily.

We were very interested in comparing the performance of several Squid web caches cooperating with ICP against our home-grown solution of Varnish and UGR. For these tests, we did not develop the Apache module required to have a fully working prototype. Instead, we introduced a layer between Squid and Apache that handles the redirects: a simple deployment of Varnish with a configuration that allows it to tunnel the requests and follow the redirects that come from Apache.

7 Tests

7.1 Functional

Both solutions were functionally tested. In particular, the streaming feature was tested in several situations; for example, the situation where a file is being streamed to a client and a second client requests the same file. Both caches behave correctly and stream the file to all clients, even if the transfer is already ongoing, without making additional requests to the backend. A few notes from the tests (including some performance remarks):

• Varnish needs to store all files that it streams. In other words, the storage space must be at least the size of the files being transferred to all clients at a given moment, or the requests will start getting error messages;
• From v3.4, Squid needs to be built with "--disable-arch-native" for it to work in virtual machines, even with full virtualization;
• Several configuration parameters need to be tuned for the caches to work properly with a high number of connections and with very large files. Varnish needs a number of threads much higher than the default;
• Even if each backend serves files much more slowly than a cache, it can still be quicker for the clients to fetch files from several backends than for all of them to fetch files from a single cache. Distributed file delivery combines memory, disk and network bandwidth;
• Scalability is important to improve the cache hit rate without severely compromising performance through disk I/O, and also to spread the load and increase the maximum transfer rate for each client.

7.2 ROOT analysis

ROOT is a framework used to analyze large amounts of data. It offers a large collection of objects and methods that can be used to fetch and process data for all domains of high energy physics computing. The goal of this test is to run a program within ROOT that has a work pattern typical of analysis programs written with ROOT, and see how the web caches perform. These programs do vector reads: they request several chunks of the file using the "Range" HTTP header.

Varnish failed the functional tests done with ROOT programs. It doesn't correctly support multi-range requests. Additionally, even for simple range requests, it has a bug when streaming the contents from the backend to the client. Some changes would need to be made in Varnish to make it work correctly with a typical grid client. The changes needed in the code are not very big and are reasonably localized, but they still require some familiarity with Varnish's source code and architecture.

7.3 Performance test

We were interested in comparing the performance of both solutions when the web caches are under load, in different conditions. For that, for each of the two caching solutions, the cache was installed on 4 machines. Three situations were tested: cache miss, cache hit and neighbour hit. The caches were configured in a way that each situation is continuously reproduced. For example, for the cache miss test, the caches were configured to consider the file not cacheable. For the neighbour hit test, one of the neighbours of the cache that receives the request has the file in cache.

The caches and the machines where they ran were properly configured to handle a high load and a high number of connections. Low-performance machines were used, to expose more easily the performance limitations of the software. These tests were done in the internal network at CERN. All the hosts are connected by gigabit connections, offering maximum practical throughputs of around 110-120 MB/s. The caches were installed in virtual machines with 2 GB RAM, 1 VCPU and a 20 GB disk.

After trying several tools, httpress was chosen to test the handling of a large number of connections. It is based on weighttp. It was slightly changed to handle timeouts in a different way. With this tool, several clients in parallel download a file repeatedly, as fast as they can. Even though the tool establishes a certain number of concurrent requests by independent clients, there is no guarantee that the server is handling that many connections in parallel. The file requested by all clients is very small - the entire response fits in one TCP packet - so that the bottleneck is not the network connection bandwidth.

The results are in Table 3; please check its notes for more information. Many of the tests are deliberately unrealistic, just to stress the cache nodes in a given situation. For example, it is extremely unlikely to have hundreds of cache misses in a row, since the contents are cached and subsequent requests to the same file return a cache hit. We can see that both caches perform well. The solution with UGR gives good results (as seen in the neighbour hit rows), partially because UGR caches the answers and doesn't need to query all servers, contrary to ICP.

Test | No. clients | Varnish | Squid
Cache miss | 10 | 181 rps (55.2 ms) | 133 rps (74.9 ms)
Cache miss | 100 | 371 rps (269.8 ms) | 164 rps (758.6 ms)
Cache miss | 500 | 424 rps (1178.3 ms) | 239 rps (2089 ms)
Cache miss, KA | 10 | 182 rps (52.7 ms) | 184 rps (54.3 ms)
Cache miss, KA | 100 | 327 rps (325.4 ms) | 355 rps (281.2 ms)
Cache miss, KA | 500 | 186 rps (1551.3 ms) | 175 rps (2842.4 ms)
Neighbour hit | 10 | 213 rps (46.8 ms) | 137 rps (73 ms)
Neighbour hit | 100 | 265 rps (377.2 ms) | 328 rps (304.2 ms)
Neighbour hit | 500 | 247 rps (2020.3 ms) | 49 rps (10137 ms)*
Neighbour hit, KA | 10 | 237 rps (42.1 ms) | 199 rps (50.2 ms)
Neighbour hit, KA | 100 | 255 rps (392.1 ms) | 107 rps (928.4 ms)
Neighbour hit, KA | 500 | 242 rps (2062.7 ms) | 58 rps (8480.6 ms)*
Cache hit | 10 | 3028 rps (3.3 ms) | 4454 rps (2.2 ms)
Cache hit | 100 | 2964 rps (33.7 ms) | 3522 rps (28.4 ms)
Cache hit | 500 | 1954 rps (255.8 ms) | 1558 rps (321 ms)
Cache hit, KA | 10 | 3313 rps (3.0 ms) | 9804 rps (1.0 ms)
Cache hit, KA | 100 | 4347 rps (23.1 ms) | 9410 rps (11 ms)
Cache hit, KA | 500 | 3574 rps (139.9 ms) | 3301 rps (156 ms)

Abbreviations:
• rps: requests per second;
• KA: keep-alive;
• *: failures happen due to timeout (15 s) or connection reset.

Notes:
1. Some tests were repeated when some slowdown happened, so results can be considered the best of 2 runs;
2. Failures usually happen due to timeout (15 s), but sometimes also due to connections being reset;
3. Tests of redirects with Varnish are incomplete; the redirect is not followed;
4. Some tests with keep-alive suffer from slowdown due to a delayed ACK;
5. On cache hits, the machine is under 100% CPU load;
6. The cache lock was disabled in the cache miss tests for Varnish; otherwise we get 18 rps (the backend performance with 1 client);
7. These results are somewhat volatile (they can vary quite a lot);
8. Redirects with Varnish could also be quicker without the lock, but that would put much more load on the UGR.

Table 3: Performance test results

8 Conclusion

The widespread use of HTTP to access files in the grid enables the deployment of web caches at sites, to serve local clients when they access data from outside the domain. This has the potential to optimize data access significantly.

The accesses to federated xrootd servers are registered in a database, and the records were analyzed. A program was developed for this purpose, which also did a full simulation of what the performance of caches deployed at various sites would be. The results show that several sites would have a high hit rate, even with very reasonable cache storage sizes. Usually, requests fetch only part of a file. Because of this, in most cases the caches would end up pulling more data from the backends than the clients do without them. So the main advantage of having a cache would be enabling local low-latency accesses, rather than reducing inter-site traffic. In some cases, a cache could replace the local storage element, with the benefit of reducing costs through a smaller storage capacity. It is hard to evaluate the feasibility of this, since we don't have a lot of data about the local accesses at sites.

There are several unique requirements that do not arise in typical web cache deployments. Even though several widely used products exist, none of them satisfies the needs of the grid's specific use case, which makes the direct usage of existing mainstream web cache products impossible without some adaptations. Some solutions were proposed and partially tested, using a combination of software or the extension of existing software. The first solution is based on a combination of Varnish, Apache httpd and UGR; it has the problem that Varnish doesn't fully support multi-range requests. The second solution uses Squid and Apache httpd, but would need a module for httpd to follow redirects. This module has not been developed, but tests were still done by simulating its behavior. Both options accomplish what we need, even though neither handles chunks in an optimal way, and both would involve development work.

The next step is to take one of the directions and build a fully working prototype. Afterwards, it can be deployed for testing at a site, and real HTTP grid applications can use the cache, to do a final evaluation of its performance under normal usage.

References

[1] Traian Cristian Cirstea, Jan Just Keijser, Oscar Arthur Koeroo, Ronald Starink, and Jeffrey Alan Templon. A scalable proxy cache for grid data access. Journal of Physics: Conference Series, 396(3):032027, 2012.

[2] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC 2616, Hypertext Transfer Protocol - HTTP/1.1, 1999.

[3] Fabrizio Furano. Dynamic Federations v1.0.3, 2013.

[4] D. Wessels and K. Claffy. RFC 2187, Application of Internet Cache Protocol (ICP), version 2, 1997.

A Caching software

A.1 Apache Traffic Server

Apache Traffic Server (ATS) was originally a commercial product developed and used by Yahoo! to deliver much of their content, and now belongs to the Apache Foundation. It caches objects both in RAM and on disk, with configurable substitution algorithms and sizes. ATS uses threads to process requests and a pool of threads to do I/O operations, relying on the pthread implementation. ATS is modular, meaning that it has a core composed of the fundamental features, and additional functionality can be developed as plugins. One plugin shipped with ATS allows rewriting the cache key.

ATS has over 300 configuration parameters, but each is mostly a yes/no choice, which makes it very easy to configure but also somewhat limited when compared to other products. A lot of its power comes from the fact that it is quite easy to extend with plugins, and ATS already ships with some. It is also very flexible in the way it decides whether objects are cacheable and for how long they should be cached. This decision can be made according to several parameters that can be overridden, such as whether the content is dynamic, whether there are cookies, whether "Expires", "Last-Modified" or "max-age" headers are present, when to revalidate stale cache entries, etc.

Several backends can be specified. The choice can be made based on the HTTP method, requested URL or IP address. ATS supports the configuration of a hierarchy of caches, so that if the cache search fails, it searches a parent cache before going to the origin server. The choice of the parent cache can be made from several parameters (URL, time, IP address, etc.), and using a round-robin algorithm.

A.2 Nginx

Nginx is a widely deployed web server that can also act as a proxy and cache responses. Nginx uses several OS processes to handle the requests, and a main process that manages the configuration and the other processes. It is easy to configure Nginx, and yet a lot of options are offered. Most of the configuration directives accept variables that are substituted by parameters of the HTTP message.

The cache is saved on the hard drive under the MD5 hash of a cache key that can be customized using variables. Therefore, it is very easy to cut the arguments out of the URL: one simply builds a key that doesn't depend on the arguments. There is native support for changing the text in the "Location", "Refresh" and "Set-Cookie" header fields. This way, it is possible to tunnel redirects from the origin server to the client without problems. It can be configured to ignore headers, so that it caches anyway if, for example, the Set-Cookie or Cache-Control headers are present, or to define conditional headers for which the cache is bypassed. It can also modify or add headers to the messages (hardcoded values only).

• Handling redirects: it is possible to trick Nginx into following redirects. This is done by creating an error page for 302 and pointing it to a location that passes the request on to $upstream_http_location (see the sketch after this list).
• Extensibility: modules can be written to define new variables and statements that can be used in the configuration.

• SSL: Nginx supports clients connecting through HTTPS, checks their client certificates against trusted CAs and provides the server certificates. It also supports connections over SSL to the backends, though without providing a client certificate.
• Scalability: several backends (an upstream) can be specified, with max_fails, weights, timeouts, ...
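A hedged sketch of the redirect-following trick mentioned in the list above; the upstream name and cache details are illustrative:

location / {
    proxy_pass http://backend;
    proxy_intercept_errors on;      # let error_page handle upstream 3xx/4xx/5xx
    recursive_error_pages on;
    error_page 302 = @follow_redirect;
}

location @follow_redirect {
    # $upstream_http_location holds the Location header of the 302 response;
    # a resolver directive may be needed when it contains a domain name.
    set $saved_location $upstream_http_location;
    proxy_pass $saved_location;     # fetch the redirect target directly
}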

A.3 Varnish

Varnish is a caching HTTP reverse proxy and only works in this mode (as a web accelerator). It can cache in memory or in disk files. It uses a few processes for management and a lot of worker threads to process the requests.

It has its own configuration language, the Varnish Configuration Language (VCL), which enables writing policies on how incoming requests should be handled. The requests pass through several states, and you can define functions that are called at each transition: for example, as soon as an HTTP GET is received, or when a cache hit or miss happens. These functions, which hook into Varnish's workflow, have access to objects representing the state and the HTTP request or response; they can manipulate them and control what the next state is. This means that it is possible to control and redefine the behavior of all stages. Varnish barely has configuration values and relies on this state machine to define its behavior. In that sense, Varnish can be seen as a tool for writing an HTTP reverse cache: it's the administrator who needs to write the caching policies, or anything else he wants. By default, Varnish already handles headers such as "If-Modified-Since", "Expires", etc.

It is possible to have several backends. The choice of the backend for a request can be made manually, depending on each request, or automatically, according to their availability.

Varnish has very good logging capabilities. It's possible to see the details of how each request is processed and which states it passes through, or just to have a log in the standard log format, with one line per HTTP request. Varnish also provides a tool to view real-time statistics about several cache parameters.

One of the main drawbacks of Varnish is the total lack of SSL support.

A.4 Squid

Squid is a very powerful web proxy that can be used as a reverse cache proxy. It caches both on disk and in memory, and it's possible to choose the cache size and replacement algorithm for both storages. There are a lot of very powerful configuration directives, which can usually be combined with each other. Most configuration directives can use access control lists (ACLs) to specify which clients they apply to. These ACLs are very powerful because they can be based on any combination of source/destination IP address, domains, ports, time, protocol, etc. This way, it's possible to specify a different configuration for different ranges of requests.

Several backend servers can be specified, and the traffic spread based on ACLs or using algorithms, for example based on the current load. It is also easily scalable, because one can define a hierarchy of proxies to share the load (and cache), so that when a cache miss occurs, it will try higher in the hierarchy. The selection of neighbours is also customizable, based on the ACLs or on several algorithms such as round-robin. Error responses can also be cached by Squid for a configurable amount of time (negative caching).

Squid respects cache-related HTTP headers, such as Cache-Control, by default. It's possible to override the default behavior to cache non-cacheable responses or to alter the lifetime of the cached objects. The lifetime can be absolute or a percentage of the object's age. This way, for example, we can force caching of objects even if no Expires header is set, or if "Cache-Control: no-store" is set. Directives can be used to remove, add or modify HTTP headers on both requests and responses, but this is not a very powerful feature, because it is impossible to do it based on the request or header itself: it can only be done by hardcoding a value for all messages.

The main problem of Squid for our use case is that it does not seem possible to change the cache key. Therefore, URLs that have arguments cannot be cached if these arguments are likely to change with every request. This functionality existed in version 2.7 (back from 2008) but was removed in version 3. Version 3 allows the URL of the requests to be entirely rewritten, but that would mean that the backend server wouldn't see the token id.

• Cache keys: version 3 doesn't support custom cache keys.
• Redirects: impossible to follow them internally.
• Extensibility: impossible to extend. Very few statements accept a script to do processing, such as rewriting the request URL.
• SSL: supports incoming and outgoing SSL requests.

B Varnish configuration file

An example of a configuration file to use UGR:

backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

backend ugr {
    .host = "dm-sl664-03.cern.ch";
    .port = "80";
}

sub vcl_recv {
    if (req.request != "HEAD") {
        if (req.http.ugrdone) {
            # Second pass: UGR has already been consulted, go to httpd.
            if (req.url ~ "^/.*") {
                set req.http.original-url = req.url;
                set req.url = "http://lxfsra04a04.cern.ch" + req.url;
                #set req.url = "http://dpmhead01.cern.ch" + req.url;
            }
            set req.backend = default;
        } else {
            # First pass: mark the request and ask UGR first.
            set req.http.ugrdone = "true";
            set req.backend = ugr;
        }
    }
    if (req.request == "HEAD") {
        # HEAD requests (e.g. from UGR) only probe the cache.
        set req.hash_ignore_busy = true;
        return (lookup);
    }
    #unset req.http.cookie;
    #std.log("req " + req.url);
}

sub vcl_miss {
    # Never fetch from the backend just to answer a cache probe.
    if (req.request == "HEAD" || req.http.Cache-Control ~ "only-if-cached") {
        error 404 "Not Found";
    }
}

sub vcl_fetch {
    # Stream the body to the client while it is being cached.
    set beresp.do_stream = true;
    #std.log("beresp status: " + beresp.status + " ttl: " + beresp.ttl);
    if (beresp.status == 302 || beresp.status == 301) {
        if (beresp.backend.name == "ugr") {
            # UGR redirects to the cache node holding the file:
            # strip the arguments and pass the redirect to the client.
            set beresp.http.Location = regsub(beresp.http.Location, "\?.*$", "");
            set beresp.ttl = 0s;
            return (hit_for_pass);
        }
        if (!req.http.original-url) {
            set req.http.original-url = req.url;
        }
        # Follow the backend's redirect internally (up to 5 restarts).
        set req.url = beresp.http.Location;
        if (req.restarts < 5) {
            #std.log("restarting");
            return (restart);
        }
        set beresp.ttl = 0s;
        return (hit_for_pass);
    }
    if (beresp.status == 404) {
        if (beresp.backend.name == "ugr") {
            # No cache node has the file: restart against the remote backend.
            return (restart);
        }
    }
    if (beresp.ttl > 0s) {
        set beresp.ttl = 1w;
    }
}

sub vcl_deliver {
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT " + obj.hits;
    } else {
        set resp.http.X-Cache = "MISS";
    }
}

sub vcl_hash {
    # Hash on the original (logical) URL, never on the replica URL.
    if (req.http.original-url) {
        #std.log("hash: " + req.http.original-url);
        hash_data(req.http.original-url);
    } else {
        #std.log("hash: " + req.url);
        hash_data(req.url);
    }
    return (hash);
}

sub vcl_error {
    if (obj.status == 404) {
        #std.log("delivering 404");
        return (deliver);
    }
}
