
# Getting Data and MC with Rucio

- [Rucio users guide](https://rucio.readthedocs.io/en/latest/)

Soon you will find that the input/output sandboxes for regular grid jobs are no longer big enough, and you will need to start using the distributed mass storage systems on the grid. Rucio is ATLAS' distributed data management system and provides a convenient way to manage your files.

The basic unit in Rucio is a DID. A DID is nothing but a registered file, dataset, or container. These DIDs are stored at certain grid sites (like CERN or BNL) and are registered in a central location (the DDM central catalogues). The logical mapping of the files in a dataset to their physical location on these grid sites is done by distributed file catalogues local to each site. (Multiple datasets can be aggregated into containers, but we will not cover this right now.)

So let's set up Rucio and get going:

## Using the Web Interface

- [rucio-ui.cern.ch](https://rucio-ui.cern.ch)

You will need your grid certificate in a compatible browser to access the UI. To retrieve information about a given scope or DID, enter it into the search field at the top and navigate through the results.

You are encouraged to move on to the command-line section below, and then revisit the web page once you have completed that section, to compare the two approaches.

## Using the Command Line

1\. When using Rucio, it is strongly recommended to use a fresh, clean shell, separate from the one you are using for Athena etc. If you do need to set up Rucio in the same session, either set it up at the time you call `asetup`, or run `lsetup 'rucio -w'`.

To load the environment on lxplus do:

```
lsetup rucio
voms-proxy-init -voms atlas
```

Enter your grid password when `voms-proxy-init` prompts for it, and answer *yes* to any questions (if the client software asks any).
This will set up a proper Python environment to work with Athena, enable the grid environment, and set up and configure the Rucio clients. (Note that this will always set up the latest Rucio client version.)

Don't forget to create your ATLAS proxy if you didn't do this in the first step:

`voms-proxy-init -voms atlas`

In the examples below, we will use the environment variable `USER`. On lxplus this is the same as the nickname associated with your Rucio account. On other systems, use the variable `RUCIO_ACCOUNT` to achieve the same effect.

### The Basics

First, let's ask a few simple questions about Rucio, and what Rucio knows about us. Type:

```
rucio ping
rucio whoami
```

This should return first the version of Rucio that you have set up, and second, information about the Rucio account you are using. Note that Rucio accounts can represent users (i.e. you), groups (e.g. Higgs) or activities (e.g. tier0), and your credential (e.g. your grid certificate) can map to several of these accounts if required (e.g. for group production roles).

### Scopes

Just like *namespaces* in C++, Rucio has the concept of a **Scope**, which helps to organise datasets etc. To see that there is already a scope for you, type:

```
rucio list-scopes | grep $USER
```

(To see all available scopes, just type `rucio list-scopes`.)

All items known to Rucio (e.g. files, datasets, containers) are created within a scope with a name, the Data IDentifier (DID), which is unique within the scope. Everything can then be identified by a scope and a name: `scope:name`. This is a new feature for those familiar with DQ2.

### Listing datasets

Now let's try to list some interesting information. Remembering that all datasets (etc.) exist within a scope, we can try to list all the known items within your scope:

```
rucio list-dids "user.${USER}:*"
```

(Note: depending on the type of shell you are using, the quotes may or may not be important.)
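
The note about quotes can be illustrated without touching Rucio at all. The sketch below is a minimal demonstration, assuming bash and a scratch directory; the file name `user.alice:notes.txt` is hypothetical and exists only to give the shell something to match. It shows that an unquoted `*` may be expanded by the shell before the command ever sees it, whereas a quoted pattern is passed through unchanged.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the demonstration touches nothing else.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical local file whose name happens to match the pattern.
touch "user.alice:notes.txt"

# Unquoted: the shell expands the wildcard against local file names.
unquoted=$(echo user.alice:*)
# Quoted: the pattern is passed to the command literally.
quoted=$(echo "user.alice:*")

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"

# Leave the scratch directory and clean up.
cd "$OLDPWD" || exit 1
rm -rf "$demo_dir"
```

In bash, an unquoted pattern that matches no local file is passed through unchanged, which is why the unquoted form often appears to work; other shells (zsh with its default settings, for example) report an error instead. Quoting the DID pattern avoids both surprises.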
If this is the first time you are using the grid, listing your own scope may well not show anything.

Let's now try to find some data. We took data in 2015, and that data was (mostly) taken at 13 TeV, so we are interested in the scope `data15_13TeV`. (Note that there may not be a direct correspondence between the year and the number after `data`. This is especially true for MC datasets.)

- **Exercise**: Try to find the list of all DIDs for this scope.

<details>
<summary>Reveal Answer</summary>

--------------------------------------------------------------

```
rucio list-dids "data15_13TeV:*"
```

--------------------------------------------------------------

</details>

Note that even by now, we have produced many output files based on the data collected.

- **Exercise**: Now try to limit the returned set of items to those of type *dataset*, from run **276329**, and of datatype **AOD**.

<details>
<summary>Hint</summary>

--------------------------------------------------------------

Previously you used the wildcard `*` to search for all names in the scope. You can also search for patterns, e.g. `*Main*`. Additionally, you can supply an extra argument to the command to filter the results, e.g.

```
rucio list-dids "data15_13TeV:*" --filter type=dataset,datatype=AOD
```

--------------------------------------------------------------

</details>

<details>
<summary>Reveal Answer</summary>

--------------------------------------------------------------

There may be a lot more, but here is a taste of what you might see:

| SCOPE:NAME | \[DID TYPE\] |
| ----------------------------------------------------------------------------------------------- | ------------ |
| data15\_13TeV:data15\_13TeV.00276329.express\_express.merge.AOD.x349\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.calibration\_PixelBeam.merge.AOD.c896\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_CosmicCalo.merge.AOD.f620\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_Late.merge.AOD.f620\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_ZeroBias.merge.AOD.x349\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_ZeroBias.merge.AOD.f620\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.express\_express.merge.AOD.f620\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_CosmicCalo.merge.AOD.x349\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_L1Calo.merge.AOD.f620\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_Standby.merge.AOD.x349\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.debugrec\_hlt.merge.AOD.g49\_f620\_m1480 | DATASET |
| data15\_13TeV:data15\_13TeV.00276329.physics\_Main.merge.AOD.f620\_m1480 | DATASET |

--------------------------------------------------------------

</details>

Now if we enter the name of a non-existent dataset, we get an error in return:

```
rucio list-dids data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m14808
SCOPE:NAME [DID TYPE]
--------------------------------------------------------------------------------------------------------------
2015-05-13 15:55:37,770 ERROR [Data identifier not found.
Details: Data identifier 'data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m14808' not found]
```

### Listing Files

From the datasets we found above, let's choose `data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480` and try to list its contents:

```
rucio list-files "data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480"
+--------------------------------------------------------------------------------------+--------------------------------------+-------------+------------+----------+
| SCOPE:NAME                                                                           | GUID                                 | ADLER32     | FILESIZE   | EVENTS   |
|--------------------------------------------------------------------------------------+--------------------------------------+-------------+------------+----------|
| data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480._lb0147._0001.1 | 49A115FF-FF9F-A247-935B-DBBB0D7507FB | ad:6c68d477 | 2069949717 | 7430     |
| data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480._lb0147._0002.1 | 763ED9E6-14D2-9D42-809C-AF6D251F0483 | ad:bce90c69 | 2036473377 | 7303     |
| data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480._lb0147._0003.1 | 50E3BEFC-09A3-714D-8AA4-0D79D3E696A9 | ad:39015b5a | 2027058215 | 7268     |
| data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480._lb0147._0004.1 | 872B1BE9-65CD-2B48-B4DE-22A556A6860F | ad:5baadd8a | 2077551382 | 7487     |
...
| data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480._lb0565._0002.1 | A5B51E50-0D63-9E44-B5A9-81F7142C742B | ad:4143c879 | 2357565730 | 9526     |
+--------------------------------------------------------------------------------------+--------------------------------------+-------------+------------+----------+
Total files : 1474
Total size : 3403951473359
Total events : 12972955
```

To list the contents of a container (e.g.
to get a list of datasets), try:

```
rucio list-content user.serfon:user.serfon.test.1234.31052013.214
```

### Listing sites

Sometimes it is useful to know where a particular dataset (or replica) is physically stored. To get the list of all sites (referred to as Rucio Storage Elements, RSEs) known to Rucio, type:

```
rucio list-rses
AGLT2_CALIBDISK
AGLT2_DATADISK
AGLT2_LOCALGROUPDISK
...
ZA-WITS-CORE_LOCALGROUPDISK
ZA-WITS-CORE_PRODDISK
ZA-WITS-CORE_SCRATCHDISK
```

It is possible to further restrict the responses by adding an expression argument, e.g. `rucio list-rses --expression "cloud=UK"`.

To find the locations of particular files, datasets or containers (in fact any DID), the following commands can be used:

```
rucio list-dataset-replicas scope:name
# or
rucio list-file-replicas scope:name
```

In fact, the `list-file-replicas` command should work for any DID (e.g. files, datasets, ...).

- **Exercise**: Find the locations of the above dataset from `data15_13TeV`.

<details>
<summary>Hint</summary>

--------------------------------------------------------------

```
rucio list-dataset-replicas data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480
```

--------------------------------------------------------------

</details>

<details>
<summary>Reveal Answer</summary>

--------------------------------------------------------------

Note: your results may vary from those shown below, depending on when the command was run.
```
DATASET: data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480
+--------------------------+---------+---------+
| RSE                      | FOUND   | TOTAL   |
|--------------------------+---------+---------|
| AUSTRALIA-ATLAS_DATADISK | 1474    | 1474    |
| CERN-PROD_DERIVED        | 1474    | 1474    |
+--------------------------+---------+---------+
```

--------------------------------------------------------------

</details>

### Downloading Files

One of the primary uses of Rucio will be to download user and production files. To download only a small number of randomly selected files (e.g. 1), type:

```
rucio download --nrandom 1 data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480
```

To download all files, you would remove the `--nrandom 1`.

Because everything that Rucio knows about is a *DID*, it is just as easy to download a specific file using the *scope:name* syntax, e.g.

**NOTE: Please don't actually do this; the command is shown, without `--nrandom 1`, for completeness.**

```
rucio download data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480._lb0147._0001.1
```

### Exercise

Try to download 1 randomly selected file from the dataset we have been using all week.

**Remember the `--nrandom 1` option; please do not download the whole dataset.**
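
One way to make it harder to forget `--nrandom` is to build the download command through a small wrapper. The sketch below is only an illustration, assuming bash: `build_download_cmd` is a hypothetical helper, not part of Rucio, and it prints the command rather than running it. It checks that its argument has the `scope:name` form and always includes `--nrandom`, defaulting to 1 file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Rucio): prints a download command that
# always limits the number of randomly selected files.
build_download_cmd() {
    local did="$1"
    local nfiles="${2:-1}"   # default to a single file

    # A DID is expected to look like scope:name.
    case "$did" in
        ?*:?*) ;;
        *) echo "error: expected scope:name, got '$did'" >&2
           return 1 ;;
    esac

    printf 'rucio download --nrandom %s %s\n' "$nfiles" "$did"
}

# Dry run: show the command for the dataset used in this section.
build_download_cmd "data15_13TeV:data15_13TeV.00276329.physics_Main.merge.AOD.f620_m1480"
```

Printing the command first lets you inspect it before running it by hand in a shell where Rucio is set up.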