# Basics of using the Grid

To run analysis jobs on the Grid you need a mechanism to send your software and files to a Grid site that holds the datasets you want to run over, and a tool to monitor the progress of those jobs. In this tutorial we will show you how to submit your jobs to the Grid using a tool called PanDA.

Below are some web links that can help you track the status of your jobs:

- [Big panda](http://bigpanda.cern.ch/): The new and default PanDA monitoring site for your jobs.
- [PandaJEDI](https://twiki.cern.ch/twiki/bin/view/PanDA/PandaJEDI): Detailed information.
- [ADC monitoring tools](http://adc-monitoring.cern.ch/): Contains all the tools you could need to monitor Grid operations.

We will use these tools throughout the tutorial to check on the status of the jobs.

# Submitting jobs to the GRID using PanDA

## The simplest possible PanDA job using prun

The [PanDA client package](https://twiki.cern.ch/twiki/bin/view/PanDA/PandaTools) contains a number of tools you can use to submit and manage analysis jobs on PanDA. While `pathena` is used to submit Athena user jobs to PanDA, more general jobs (e.g. ROOT and Python scripts) can be submitted to the Grid using `prun`. Finally, `pbook` is a Python-based bookkeeping tool for all PanDA analysis jobs. Detailed information about each of the tools can be found on the [DAonPanda](https://twiki.cern.ch/twiki/bin/view/PanDA/PandaTools#Documentation_on_individual_tool) page.

On lxplus, the client is already installed, so to use it you only need to do:

```
setupATLAS
lsetup panda
```

Here we have set up the cvmfs software environment and asked for the PanDA clients to be set up.

**We will now try to run a 'Hello world' job with prun!**

Create a new directory with `mkdir prunTest` and go into it with `cd prunTest`. This is important, as prun sends (almost) all files in, *and below*, your current directory to the Grid.

Now create a Python script called `HelloWorld.py` (using your favourite editor) that contains the following lines:

```
#!/usr/bin/python
print("Hello world!")
```

We can make the Python script executable with:

```
chmod u+x HelloWorld.py
```

And run it locally using:

```
./HelloWorld.py
```

Now that we've tested the job locally (it's **always** important to do this), we can use prun to run this script on the Grid:

`prun --outDS user.<nickname>.pruntest --exec HelloWorld.py`

Here `<nickname>` is your *grid* nickname/grid name (which is the same as your lxplus username).

This will queue two jobs: one build job that recreates your job environment, and one corresponding to the actual Hello World job. The build job executes first, and once it has finished the Hello World job is executed. When the job has finished, we will try to find the 'Hello world' message in the output!

In the terminal output of the submission, find the number associated with `jediTaskID`; we will need it in a minute.

To monitor the progress and check the log file output of a job, we can use the *big* panda monitor <http://bigpanda.cern.ch>. Scroll down to the field for `Task ID`, enter the number we noted above, and click on search. This takes you to a page associated with this task, which shows that there are two jobs (in some stage of running). Have a look at this page to see the various pieces of information provided.

Now we will try to look at the output. We will need both jobs to have finished. If the jobs do not seem to have started running after a few minutes, carry on with the tutorial and check back frequently on the jobs' status.

On the web page, search for the link labelled `job list (access to job details and logs)` and click on it. (If you want to get to this page directly, you can enter the URL `http://bigpanda.cern.ch/jobs/?jeditaskid=4203786` and replace the task ID with your own number.)

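As a small convenience, the job-list URL quoted above can be built from the task ID in a few lines of Python. This is only a sketch: the function names are hypothetical, and the assumption that the submission output contains the text `jediTaskID=<number>` (as in the sample line below) should be checked against what you actually see in your terminal.

```python
import re

# Assumption: the task ID appears in the submission output as "jediTaskID=<number>".
TASK_ID_PATTERN = re.compile(r"jediTaskID\s*=\s*(\d+)")


def extract_task_id(prun_output):
    """Return the first jediTaskID found in the text, or None if there is none."""
    match = TASK_ID_PATTERN.search(prun_output)
    return int(match.group(1)) if match else None


def job_list_url(task_id):
    """Build the job-list URL for a task, following the URL pattern quoted in the text."""
    return "http://bigpanda.cern.ch/jobs/?jeditaskid={}".format(int(task_id))


if __name__ == "__main__":
    # Hypothetical output line, used for illustration only.
    sample = "INFO : succeeded. new jediTaskID=4203786"
    print(job_list_url(extract_task_id(sample)))
```

Pasting the printed URL into a browser should bring you to the same job list that the `job list (access to job details and logs)` link leads to.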
Click on a particular job (the run part of your recent task, rather than the build job). You should now see 'Logs'; hover over this and click 'Log Files'. The log containing the Hello World output is called `payload.stdout`. Open it and try to find the "Hello World" message.

This forms the basis of simple debugging of jobs that fail on the Grid. As you will have tested the job locally first, a problem may well be a transient Grid error, but it is useful to know how to search for problems in the output files. Note that it is also possible to download the log files as a dataset using dq2/rucio.

Because we did not need to compile any code to run this job (it's just a simple Python script), we do not actually need the build stage of the job. To run prun without the build stage, type the following command:

`prun --noBuild --outDS user.<nickname>.pruntest --exec HelloWorld.py`

Now only the script will be run. Usually, though, you will probably want to compile some code first. You can read more about the [PanDA tools here](https://twiki.cern.ch/twiki/bin/view/AtlasComputing/PandaTools).

### Using the Big Panda monitor / ATLAS Dashboard to monitor the job

So far, only the basics of the [ATLAS dashboard](http://bigpanda.cern.ch) have been described. As an optional exercise, see if you can find the link that shows all of **your** jobs. This is a good page to bookmark and come back to in future.

### Retrieving the log file from the Grid

Above, we saw how to find and open the log file within the web browser. But what if we wanted to download it? Here we can use the Rucio tools. Once one of the above jobs has completed, we will find and download the log file.

Note: when using rucio, it is almost always better to use a separate terminal from the one where you are running your code or submitting Grid jobs, in order to minimise potential conflicts between different Python versions.

Set up the rucio tools if you haven't already:

`lsetup rucio`

Go back to the big panda web page and find the page for the task ID that we found previously. Search for the box labelled "Output containers" and note the log file container name, e.g. `user.aparker.pruntest.log`. Back in the terminal, we will try to find this log file on the Grid.

```
$ rucio list-dids user.aparker:*pruntest*log*
+--------------------------------------------------+--------------+
| SCOPE:NAME                                       | [DID TYPE]   |
|--------------------------------------------------+--------------|
| user.aparker:user.aparker.pruntest.log           | CONTAINER    |
| user.aparker:user.aparker.pruntest.log.340520924 | DATASET      |
+--------------------------------------------------+--------------+
```

This gives us two options:

1) download the container, and all log files within it (e.g. if the task contained many subjobs), or
2) download just the dataset specific to the single set of jobs.

```
rucio download user.aparker:user.aparker.pruntest.log.340520924
cd user.aparker.pruntest.log.340520924/
tar -xvf user.aparker.pruntest.log.23186476.000001.log.tgz
```

This gives you access to the log file (and in fact much more information) related to your job, which can be useful for debugging. There will be a lot of information in here, but once you have extracted the logs, the file you are probably looking for is `payload.stdout`.

## Running a simple ROOT script using prun

It is possible to set up ROOT and run ROOT-based code on the Grid too. First you should log out of lxplus, and then log back in.

We will create a new directory area:

`mkdir -p tutorial/grid/RootGridTest`

`cd tutorial/grid/RootGridTest`

`setupATLAS`

Next we will set up a standalone version of ROOT (and also the PanDA client tools). You can see which versions are available by typing:

`lsetup 'root --help'`

In our case, we will use the following version (note that we added *panda* as well, so that all steps are configured together):

```
lsetup "root 6.20.06-x86_64-centos7-gcc8-opt" panda
```

Next we will create a simple macro that creates a ROOT file and a histogram, fills the histogram with random values, and writes the output. Create a file called `HistTest.C` and copy and paste the following lines into it:

```
void HistTest() {
  // Open (or overwrite) the output file
  TFile * foo = TFile::Open("foo.root","recreate");
  // Histogram with 30 bins between -5 and 5
  TH1D * h = new TH1D("h_gaus","h_gaus",30,-5,5);
  TRandom3 rand(0);
  // Fill with Gaussian random numbers (mean 0.2, width 1.0)
  for (unsigned int i=0; i< 100000; ++i) {
    h->Fill(rand.Gaus(0.2,1.0));
  }
  h->Write();
  foo->Close();
}
```

As usual, we first check that the code runs normally locally:

`root -b -q HistTest.C`

You should see that a ROOT file called *foo.root* was created, containing a single histogram.

We will now run the same command on the Grid and retrieve the output into a rucio dataset. (**Note**: remove the local `foo.root` file you just made before executing.) Run the command:

```
prun --exec="root -b -q HistTest.C" --nJobs=1 --outputs=foo.root --outDS=user.<nickname>.prunroottest1 --rootVer=6.20/06 --cmtConfig=x86_64-centos7-gcc8-opt
```

**Replace `<nickname>` with your grid nickname again.**

We have had to add two arguments to the prun command, the ROOT version (`--rootVer`) and the config (`--cmtConfig`), so that the same version of ROOT is set up on the Grid worker node. Note that certain files (such as ROOT files, e.g. foo.root) are not automatically uploaded to the Grid with the Grid job.

Once the job completes, you should be able to use rucio to download the dataset containing the ROOT file output.
Unfortunately, the job will probably take longer than the length of the tutorial, but if it does finish and you are having trouble with this step, let us know.
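The `--rootVer` and `--cmtConfig` values in the prun command above mirror the version string passed to `lsetup`. The sketch below shows that correspondence in Python. It is an illustration only: the helper name `prun_root_options` is hypothetical, and the conversion rule (release `6.20.06` becomes `6.20/06`, and everything after the first hyphen becomes the config) is inferred from the single example in this tutorial, so verify it before relying on it for other versions.

```python
def prun_root_options(lsetup_version):
    """Split an lsetup ROOT version string into prun's ROOT-related arguments.

    Assumption: the string has the form '<major>.<minor>.<patch>-<platform>',
    as in the example used in this tutorial.
    """
    release, _, platform = lsetup_version.partition("-")
    if not platform:
        raise ValueError("expected '<release>-<platform>', got: " + lsetup_version)
    major, minor, patch = release.split(".")
    return [
        "--rootVer={}.{}/{}".format(major, minor, patch),
        "--cmtConfig={}".format(platform),
    ]


if __name__ == "__main__":
    # The version string used with lsetup earlier in this section.
    print(" ".join(prun_root_options("6.20.06-x86_64-centos7-gcc8-opt")))
```

Keeping both sets of arguments derived from one string is a simple way to avoid the local and Grid ROOT versions drifting apart when you change release.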