Robust Challenge -- Moving Data Between CERN and Tier-1 Centers

Milestone Assessment from Jon Bakken, U.S. CMS Tier-1 Center Manager

Reliable and automated transfers of data between the CERN Tier-0 and FNAL Tier-1 centers are a key component for successful operation of the CMS analysis framework. Much previous work on this front involved ad hoc independent tests checking a single aspect, typically emphasizing rate over reliability. It was decided to run an end-to-end robust challenge from CERN to all Tier-1 sites, CMS and ATLAS, and to concentrate on full system-level issues rather than optimizing single components. The global end-to-end robust challenge combines the specialized tests into an integrated framework. The rest of this report concentrates on efforts here at FNAL for this robust challenge. Many people worked hard on making this test a success.

Typically, specialized hardware was set up for previous tests. This resulted in good performance for that hardware. Unfortunately, most of the lessons learned didn't translate to the production systems. Therefore, building on the success of previous specialized tests, it was decided to use production dCache systems at FNAL for the robust challenge. This meant that the robust challenge had to co-exist with normal users' data transfers and not interfere with or impede any real work.

Many different pool designs were implemented and tested. The final design consisted of carving 100 GB of space from each existing pool on the production system and creating volatile pools for the transfer. This design allowed every production node to participate in the robust challenge, thereby increasing the overall data rate, and it provided for a normal user load on each node, something missing in all specialized hardware tests. An added benefit is that as problems were found and fixed in the robust challenge, the fixes were immediately deployed on the production system.
Due to other constraints and goals, CERN chose to set up ten independent nodes that were not part of any production system. Nine nodes were deployed for this test. Each node at CERN had 3 disk partitions, with ten 1 GB files on each disk, for a total of 9 nodes * 3 disks/node * 10 files/disk = 270 files immediately available for transfer from CERN to FNAL. The Castor system was not involved in this round of tests; the files were statically disk resident.

Data was received at FNAL by 32 dCache pools on 22 different pool nodes. Total space allocated for the test was about 3 TB. Since the data was random bytes, it was decided not to write it to tape. Therefore volatile lfsonly dCache pools were selected. These are normal dCache pools with two additional features: files are not copied to tape, and there isn't any guarantee of file lifetime. If more space is needed, it is made available immediately. Since the final rate for the challenge was more than 20 TB/day, these volatile pools were flushed with new data around seven times a day. Also note that user encp's to/from enstore were active on the same disks that the volatile pools were using. Hence, choosing not to write the challenge data to tape had little impact on the overall test environment.

The default CERN-FNAL network path is over ESNET. One of the goals of other groups for this robust challenge was to demonstrate rates of 500 MB/s, something impossible using the ESNET network path. Therefore, the networking groups at CERN and FNAL set up network paths to use the Caltech Starlight POP in Chicago for data traffic in this challenge. This worked flawlessly and transparently to the dCache production system, so well, in fact, that it is hoped that all traffic between CMS and CERN can soon be routed over Starlight and not over ESNET anymore.

Much work was done in trying to optimize WAN rate. Previous tests on specialized hardware concentrated on large TCP buffers and a few streams.
This proved to be a disaster in a normal production environment, as expected. TCP parameters are system-wide parameters, and with hundreds of user transfers per node, large TCP buffers quickly exhausted memory and caused machines to crash. In the end, all WAN parameters were set back to the kernel defaults for the different OS's. This provided for stable operations, but sub-optimal data rates. The number of transfers and the TCP buffer size were theoretically calculated and empirically checked to increase throughput. The final design had 2 MB TCP buffers and 20 parallel streams for each transfer. This provided a rate of 2 MB/s per stream, giving an overall data rate of 40 MB/s for each file. The current set of hardware combined with the dCache IO yields a maximum rate of 50-60 MB/s for each node. Independent tests using optimized C code can achieve 70-80 MB/s for each node. Therefore, the 40 MB/s per file transfer was deemed acceptable at this point of development. [Note: I have asked Don P for assistance from CCF in helping to set these WAN parameters in a user-production environment.]

Many bugs were fixed over the first weeks of the challenge. This was the real goal of the challenge. Timur worked very hard in fixing issues as they arose and making new code available quickly. Here is a partial list of fixes:

Bad user and host certs caused global system hangs. The certificate handling was rewritten to run in a thread to prevent global blocks.

Requests were duplicated. The first copy finished, but the second did not. This eventually filled all queues and stopped all transfers. The code was fixed.

The SRM was submitting requests to pools before updating the pool cost. This caused much 'transfer clumping': all requests went to a single pool. This code was rewritten after a suggestion by Patrick.

Kills of transfers by users didn't propagate through the entire thread chain (down to the actual gsiftp on the pool).
This caused many transfers to happen even though they were canceled, and also caused some hangs because transfers could not possibly complete (that is why the user canceled them). This code was rewritten.

The entire transfer chain is complex. Although it worked, it was time consuming and difficult to debug failed transfers. The code has been rewritten and dramatically simplified.

The SRM code was mixed along with other dCache components in one jar file. This made updates extremely difficult because the jar had to be updated across all nodes in the dCache system. The SRM code was extracted and put into its own jar, which only the SRM cell accesses. This made updates and restarts very easy without affecting users.

There isn't any concept of priority for a transfer. Since all users get mapped to 'cmsprod', real user transfers of data got mixed with the robust challenge transfers. This is OK, except during the part of the robust challenge where I flooded the SRM with requests to check the queuing, which pushed real requests so deep in the queue that they didn't get processed for hours. This is still an open issue. The workaround for the challenge was to map the challenge user to a username different from 'cmsprod'. The SRM has existing, working code that allows only a limited number of transfers for each user and then looks for transfers from other users before submitting more from the first user.

Much of the SRM configuration is hard-coded in a startup batch file, and it is not possible to tune the system without a restart. For example, it's not possible to change the number of streams per transfer. This is hard-coded as a startup parameter. Now that the "best" numbers have been found, this is probably OK. But in the long run, a mechanism for tuning parameters without restarting will be beneficial.

There are occasional problems with automated directory creation, and with setting the correct permissions on the directories.
I simply pre-created the directories and avoided the issue during the robust challenge. The actual error has not been reproducible since the initial tests.

It's not possible to set the number of ftp transfers per pool independently of other user (dccp) transfers. This is an open issue and needs resolving. It didn't affect this challenge since all transfers were gsiftp to volatile pools, and the users were using the tape-back-end read and write pools.

The SRM was written assuming that the pools in the system would be robust and transfer the files. Many pools crashed, and these transfers failed without the SRM re-scheduling them. This was fixed.

When the SRM crashed or restarted, all information on current and pending jobs was lost. This made it impossible to resume work after a restart and did not make for a very reliable system. This was fixed by adding a database to the SRM. This dramatically improved robustness.

The database slowed down interactive spy dCache response. This was rewritten to keep a copy of active transfers in memory (as before).

The SRM database tables need to be re-worked. Currently the SRM scans every item in the database to check what work needs to be rescheduled. This includes scanning the thousands of completed transfers, some from weeks ago. This can cause a delay of more than 5 minutes for a complete start of the SRM.

There is a 4 AM drop-off in rate. This is still an unresolved issue, but we now know it is due to decrypt errors on delegated proxy credentials. Timur is trying to get Globus to help. This is the largest problem remaining and is now affecting user transfers as well.

WAN parameters caused system crashes when local users transferred files (see earlier description).

Little monitoring information is available, especially for users. This is an open issue.

There were also some simple startup issues, like missing files, gridmap files being overwritten, and authorization failures. These were quickly fixed.
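
The 'transfer clumping' fix in the list above (updating the pool cost before the next request is submitted) can be illustrated with a minimal sketch. This is not the SRM or dCache code: the class name, the cost model (one unit per assigned transfer), and the pool names are all hypothetical and chosen only to show the difference between selecting on stale and on updated costs.

```python
class PoolSelector:
    """Toy least-cost pool chooser; hypothetical, not the dCache cost module."""

    def __init__(self, pool_names):
        # Assumption: cost is simply the number of transfers assigned to a pool.
        self.cost = {name: 0 for name in pool_names}

    def pick_stale(self, n_requests):
        """Clumping pattern: pick for a whole batch without updating any cost."""
        cheapest = min(self.cost, key=self.cost.get)
        return [cheapest] * n_requests

    def pick_updated(self, n_requests):
        """Fixed pattern: raise the chosen pool's cost before the next pick."""
        chosen = []
        for _ in range(n_requests):
            cheapest = min(self.cost, key=self.cost.get)
            self.cost[cheapest] += 1
            chosen.append(cheapest)
        return chosen


pools = ["pool_a", "pool_b", "pool_c"]
stale = PoolSelector(pools).pick_stale(6)
updated = PoolSelector(pools).pick_updated(6)
print(stale)
print(updated)
```

With stale costs every request in the batch lands on the same pool; with the cost updated after each assignment the six requests spread evenly over the three pools.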
Initially we were going to use SRM-to-SRM transfers. CERN was concentrating on the issues of the challenge and didn't have their SRM fully deployed. We used srmcp (pulls) from FNAL and communicated with independent gridftp servers at CERN. Adding the SRM at CERN would have allowed them to push data to FNAL, something that doesn't seem to be the best mode of operation for data transfers. Experience has shown that mastering the transfers at the destination is more practical and robust than having the source master the transfer.

Reliable and robust running was achieved after several weeks of debugging. Typically there were more than 150 srmcp's in the queue at any one time, with each srmcp requesting 3 to 10 files. The volatile queues were configured to allow 6 active gsiftps per pool. The rate fluctuated a lot, but was typically close to 300 MB/s, ranging from 250 to 350 MB/s. The fluctuation was due to the random nature of getting files from CERN. Transfers were submitted, and sometimes many dozens would depend on a single node at CERN. This dramatically affected performance, but not reliability. A more global data handler scheduler, and dCaches on both sides of the transfer (for pool-to-pool copies), would improve performance.

To check rate throughput with a better scheduler, a special gridftp-only script was written that explicitly matched files at CERN with pools here at FNAL to optimize throughput. Only 1 file from each disk at CERN was used (3 per node), possibly leading to some memory caching at CERN. Files were initially only written to memory here at FNAL, providing a rate of 500 MB/s. Files were then written to dCache pool disks (but by gridftp only, not through the dCache), and the rate was 400 MB/s.

Much good work was accomplished, and the robustness goal of the challenge was clearly met. Work is now starting to repeat these tests to Tier-2 sites, starting with Florida. For these tests, we will use SRM-to-SRM transfers in a fully dCache environment.
After these tests work, it is expected we will start transferring real data using the CMS data handling system.
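
The rate figures quoted in this report can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only numbers stated above; it assumes decimal units (1 TB = 1,000,000 MB), and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # assumption: decimal units

# Files staged at CERN: nodes * disks/node * files/disk
files_at_cern = 9 * 3 * 10

# Per-file rate in MB/s: 20 parallel streams at 2 MB/s each
per_file_rate = 20 * 2

# Daily volume in TB at the typical sustained rate of 300 MB/s
typical_tb_per_day = 300 * SECONDS_PER_DAY / MB_PER_TB

# Pool turnover: 20 TB/day passing through about 3 TB of volatile pool space
flushes_per_day = 20 / 3

print(files_at_cern)
print(per_file_rate)
print(round(typical_tb_per_day, 1))
print(round(flushes_per_day, 1))
```

At 300 MB/s the sustained volume is roughly 26 TB/day, consistent with the "more than 20 TB/day" figure, and 20 TB/day through about 3 TB of pool space works out to between six and seven flushes per day.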