The future belongs to the few of us still willing to get our hands dirty.
Do we need to get our hands dirty and start processing the raw data? After all, we did our fair share when we were post-docs. Just sit back, relax and listen to Erik Satie, while students create DSTs through blood, sweat and tears.
But OK, I know, I know: you’re one of the small minority of scientists who can’t wait, who need to process a subset of data quickly and define and customize data processing conditions, or you think you do, or (more accurately) feel like you do. I am happy to exploit that feeling. I want to be clear though: I am not here to tell you what you want. Still, something in our science DNA compels us to be honest about this: you should follow the instructions below to start data processing yourself.
Where do we start?
First you have to decide whether a) you are going to use the Clara CLAS12 common (official) installation or b) you want to have your own installation (e.g. at your home institution or on your own laptop).
I want to have my own installation
Of course you do… with the entire CLAS12 code base on your laptop. No problem!
No problem can withstand the onslaught of unrelenting thinking.
Just kidding… no thinking is required.
Installing Clara (including the CLAS12 plugin) is a simple, one-step process that is described here.
I want to use the common installation
Ok then, we have to perform the following steps:
- Set the environment variable CLARA_HOME to point to the common installation directory. Make sure you have read and execute permissions in that file system.
- Set the environment variable CLARA_USER_DATA to point to a directory where you will store Clara application service composition and data-set description files, as well as data processing logs, and possibly actual data files.
- Run the Clara CLI.
It is recommended that the $CLARA_USER_DATA directory have a specific file structure:
├── config
├── data
│   ├── input
│   └── output
└── log
It is OK not to create the structure by hand. The Clara CLI (command line interface) will check for and create the necessary file structure at startup.
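If you would rather prepare the directory yourself before the first launch, a minimal bash sketch could look like the one below. The fallback location under a temporary directory is a placeholder chosen only for illustration (it is not a path used by Clara); point CLARA_USER_DATA at a directory you own.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical fallback location, used only when CLARA_USER_DATA is not set yet.
export CLARA_USER_DATA="${CLARA_USER_DATA:-$(mktemp -d)/clara_user_data}"

# Recreate the recommended layout: config, data/input, data/output, log.
mkdir -p "$CLARA_USER_DATA/config" \
         "$CLARA_USER_DATA/data/input" \
         "$CLARA_USER_DATA/data/output" \
         "$CLARA_USER_DATA/log"

# List the directories that now exist.
find "$CLARA_USER_DATA" -type d | sort
```

Running it a second time is harmless, since mkdir -p leaves existing directories untouched.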
Let us start Clara CLI by typing the following:
$ $CLARA_HOME/bin/clara-shell
Warning: CLARA_USER_DATA environmental variable is not assigned.
It will be set to point to the CLARA_HOME.
Note that you might face permission exceptions.
Oops… I forgot to set the CLARA_USER_DATA environment variable.
Let’s try it one more time.
$ $CLARA_HOME/bin/clara-shell
 ██████╗██╗      █████╗ ██████╗  █████╗
██╔════╝██║     ██╔══██╗██╔══██╗██╔══██╗ 4.3
██║     ██║     ███████║██████╔╝███████║
██║     ██║     ██╔══██║██╔══██╗██╔══██║
╚██████╗███████╗██║  ██║██║  ██║██║  ██║
 ╚═════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝
Run 'help' to show available commands.
clara>
It’s a success…
Now we can play by typing
help and learn more about this interactive
shell that makes data processing (including application customization,
data-set selection, application deployment, scaling, and monitoring,
and much more) simply an enjoyable experience.
Since we are not getting any younger, let us save some time and show how to reconstruct a given data file locally, i.e. on your own computer.
Processing a data file
Clara uses default locations for data files, the application
service-composition file (usually services.yaml, though it can have any name),
the data-set description file (e.g. files.txt, a text file containing the
names of all data files), and log files (for every data processing run
Clara usually creates two log files: the DPE log and the workflow-manager log).
To see the default locations and settings, type the following command in the CLI:
clara> show config
servicesFile: "$CLARA_HOME/config/services.yaml"
fileList: "$CLARA_USER_DATA/config/files.txt"
inputDir: "$CLARA_USER_DATA/data/input"
outputDir: "$CLARA_USER_DATA/data/output"
outputFilePrefix: "out_"
logDir: "$CLARA_USER_DATA/log"
session: "gurjyan"
description: "clara"
farm.cpu: 0
farm.memory: 0
farm.disk: 5
farm.time: 1440
farm.os: "centos7"
farm.node: ""
farm.exclusive: ""
farm.stage: NO VALUE
farm.track: "debug"
farm.scaling: 0
farm.system: "jlab"
show config will show the full paths, with environment variables resolved.
Note that these are the default settings, and the user can change them to point to different locations, with the exception of the logDir during farm deployment.
During farm deployment the logDir is defined by the $CLARA_USER_DATA environment variable. If $CLARA_USER_DATA/log is not defined or not accessible, Clara will default to $CLARA_HOME/log.
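The fallback rule for the log directory can be sketched in a few lines of bash. This is only an illustration of the rule as described above, not Clara's own code; the function name resolve_log_dir is hypothetical, and "accessible" is assumed here to mean "an existing, writable directory".

```shell
#!/usr/bin/env bash
set -eu

# Illustrative helper (not part of Clara): choose the log directory using the
# fallback rule described in the text.
resolve_log_dir() {
  local user_data=$1 clara_home=$2
  if [ -n "$user_data" ] && [ -d "$user_data/log" ] && [ -w "$user_data/log" ]; then
    printf '%s\n' "$user_data/log"
  else
    printf '%s\n' "$clara_home/log"
  fi
}

# Demo in a throwaway sandbox so nothing real is touched.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/user/log" "$sandbox/home/log"

resolve_log_dir "$sandbox/user" "$sandbox/home"   # the user-data log directory
resolve_log_dir ""              "$sandbox/home"   # falls back to the home log directory
```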
Now let us switch to another workspace, or simply exit the Clara CLI.
The first thing we have to do is copy (or jget) our data file into the input directory:
$ cp /work/clas12/data/clas_004013.0.hipo $CLARA_USER_DATA/data/input/
Next we create the files.txt file in the $CLARA_USER_DATA/config directory and add a single line to it:
$ cat $CLARA_USER_DATA/config/files.txt
clas_004013.0.hipo
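For more than a handful of files, typing the list by hand gets tedious. A minimal bash sketch for generating files.txt from whatever sits in the input directory is shown below. It runs in a throwaway sandbox with empty placeholder files (the file names are borrowed from this tutorial); in practice you would point it at your real $CLARA_USER_DATA.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway sandbox standing in for a real user-data directory.
CLARA_USER_DATA=$(mktemp -d)
mkdir -p "$CLARA_USER_DATA/config" "$CLARA_USER_DATA/data/input"

# Empty placeholders for real data files.
touch "$CLARA_USER_DATA/data/input/clas_004013.0.hipo" \
      "$CLARA_USER_DATA/data/input/clas_004013.1.hipo"

# Write one bare file name per line, matching the files.txt format shown above.
( cd "$CLARA_USER_DATA/data/input" && printf '%s\n' *.hipo ) \
  > "$CLARA_USER_DATA/config/files.txt"

cat "$CLARA_USER_DATA/config/files.txt"
```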
For this exercise we will be using the official (commonly distributed)
CLAS12 reconstruction application service composition file.
That’s it. Let us start the Clara CLI again and launch the processing by typing run local:
$ $CLARA_HOME/bin/clara-shell
 ██████╗██╗      █████╗ ██████╗  █████╗
██╔════╝██║     ██╔══██╗██╔══██╗██╔══██╗ 4.3
██║     ██║     ███████║██████╔╝███████║
██║     ██║     ██╔══██║██╔══██╗██╔══██║
╚██████╗███████╗██║  ██║██║  ██║██║  ██║
 ╚═════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝  ╚═╝
Run 'help' to show available commands.
clara> run local
Distribution : clara-cre-4.3.5
CLAS12 plugin : coatjava-5c.7.5
==========================================
CLARA FE/DPE
==========================================
Name = 18.104.22.168%7220_java
Session = gurjyan_clara
Start time = 2019-01-31 16:35:22
Version = 4.3
Lang = Java
Pool size = 10
Proxy Host = 22.214.171.124
Proxy Port = 7220
==========================================
==========================================
CLARA Orchestrator
==========================================
Front-end = 126.96.36.199%7220_java
Start time = 2019-01-31 16:35:23
Threads = 20
Input directory = /u/group/da/vhg/testbed/clara/work/data/input
Output directory = /u/group/da/vhg/testbed/clara/work/data/output
Output file prefix = out_
Number of files = 4
==========================================
2019-01-31 16:35:23.692: Waiting for local node...
2019-01-31 16:35:23.841: Start processing on 188.8.131.52...
2019-01-31 16:35:23.841: Searching services in 184.108.40.206...
2019-01-31 16:35:23.851: Deploying services in 220.127.116.11...
2019-01-31 16:35:24: started container = 18.104.22.168%7220_java:gurjyan
[HipoDataSync] ---> dictionary size = 120
[HipoDataSync] ---> dictionary size = 120
2019-01-31 16:35:24: started service = 22.214.171.124%7220_java:gurjyan:DataManager pool_size = 1
SVT geometry constants loaded ? true
2019-01-31 16:35:24: started service = 126.96.36.199%7220_java:gurjyan:HipoToHipoReader pool_size = 1
2019-01-31 16:35:24: started service = 188.8.131.52%7220_java:gurjyan:HipoToHipoWriter pool_size = 1
2019-01-31 16:35:25: started service = 184.108.40.206%7220_java:gurjyan:FTHODO pool_size = 20
2019-01-31 16:35:25: started service = 220.127.116.11%7220_java:gurjyan:MAGFIELDS pool_size = 20
2019-01-31 16:35:25: started service = 18.104.22.168%7220_java:gurjyan:FTOFHB pool_size = 20
2019-01-31 16:35:25: started service = 22.214.171.124%7220_java:gurjyan:FTEB pool_size = 20
2019-01-31 16:35:25: started service = 126.96.36.199%7220_java:gurjyan:DCHB pool_size = 20
2019-01-31 16:35:25: started service = 188.8.131.52%7220_java:gurjyan:FTCAL pool_size = 20
2019-01-31 16:35:25: started service = 184.108.40.206%7220_java:gurjyan:CTOF pool_size = 20
...
...
If you get these console printouts, then congratulations!!! You can go and have a drink now. No, seriously. This will take a while. Go…
Are you back? While you were drinking and having fun, I was working. That’s fine… I like working, because it mesmerizes me. I can sit and look at it for hours :)
The processing is complete. If everything went smoothly, we will get benchmark results on the CLI console, indicating the total and average processing times in each service engine of the reconstruction application, the average time spent by the entire application, and the time spent by the workflow management system (orchestrator).
2019-02-01 09:19:09.169: Benchmark results:
2019-02-01 09:19:09.170: READER 2000 events total time = 0.21 s average event time = 0.10 ms
2019-02-01 09:19:09.171: MAGFIELDS 2000 events total time = 0.02 s average event time = 0.01 ms
2019-02-01 09:19:09.171: FTCAL 2000 events total time = 0.35 s average event time = 0.18 ms
2019-02-01 09:19:09.172: FTHODO 2000 events total time = 0.55 s average event time = 0.27 ms
2019-02-01 09:19:09.173: FTEB 2000 events total time = 0.18 s average event time = 0.09 ms
2019-02-01 09:19:09.173: DCHB 2000 events total time = 1088.39 s average event time = 544.20 ms
2019-02-01 09:19:09.174: FTOFHB 2000 events total time = 3.19 s average event time = 1.60 ms
2019-02-01 09:19:09.175: EC 2000 events total time = 2.57 s average event time = 1.29 ms
2019-02-01 09:19:09.176: CVT 2000 events total time = 93.01 s average event time = 46.51 ms
2019-02-01 09:19:09.176: CTOF 2000 events total time = 4.03 s average event time = 2.02 ms
2019-02-01 09:19:09.177: CND 2000 events total time = 17.64 s average event time = 8.82 ms
2019-02-01 09:19:09.177: HTCC 2000 events total time = 0.46 s average event time = 0.23 ms
2019-02-01 09:19:09.178: LTCC 2000 events total time = 0.43 s average event time = 0.21 ms
2019-02-01 09:19:09.178: RICH 2000 events total time = 0.92 s average event time = 0.46 ms
2019-02-01 09:19:09.179: EBHB 2000 events total time = 4.43 s average event time = 2.22 ms
2019-02-01 09:19:09.179: DCTB 2000 events total time = 281.84 s average event time = 140.92 ms
2019-02-01 09:19:09.180: FTOFTB 2000 events total time = 3.93 s average event time = 1.97 ms
2019-02-01 09:19:09.180: EBTB 2000 events total time = 6.48 s average event time = 3.24 ms
2019-02-01 09:19:09.181: WRITER 2000 events total time = 6.76 s average event time = 3.38 ms
2019-02-01 09:19:09.181: TOTAL 2000 events total time = 1515.40 s average event time = 757.70 ms
2019-02-01 09:19:09.182: Average processing time = 41.02 ms
2019-02-01 09:19:09.182: Total processing time = 82.04 s
2019-02-01 09:19:09.182: Total orchestrator time = 89.80 s
2019-02-01 09:19:09.183: Processing is complete.
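The benchmark is a quick way to spot where the time goes: in this run DCHB alone accounts for most of the summed service time. If you keep the console output in a file, a small bash/awk sketch like the one below can pick out the slowest service. The helper names (benchmark_lines, slowest_service) are made up for this example, and the parsing assumes the line format shown above, with the service name in the third field and the total time in the ninth.

```shell
#!/usr/bin/env bash
set -eu

# A few lines copied from the benchmark printout, standing in for a saved log.
benchmark_lines() {
  cat <<'EOF'
2019-02-01 09:19:09.170: READER 2000 events total time = 0.21 s average event time = 0.10 ms
2019-02-01 09:19:09.173: DCHB 2000 events total time = 1088.39 s average event time = 544.20 ms
2019-02-01 09:19:09.176: CVT 2000 events total time = 93.01 s average event time = 46.51 ms
2019-02-01 09:19:09.179: DCTB 2000 events total time = 281.84 s average event time = 140.92 ms
2019-02-01 09:19:09.181: TOTAL 2000 events total time = 1515.40 s average event time = 757.70 ms
EOF
}

# Print the service with the largest total time, skipping the TOTAL summary row.
slowest_service() {
  awk '/events total time =/ && $3 != "TOTAL" {
         if ($9 + 0 > max) { max = $9 + 0; name = $3; t = $9 }
       }
       END { print name, t }'
}

benchmark_lines | slowest_service   # prints: DCHB 1088.39
```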
Where is my output?
The reconstructed file will be physically stored in the outputDir (remember the CLI show config command?):
clara> show outputDir
total 81M
-rw-r--r-- 1 gurjyan da 21M Feb 1 09:18 out_clas_004013.0.hipo
The data processing log files will be stored in the logDir:
clara> show logDir
total 325M
-rwxr-xr-x 1 gurjyan da 4.0K Jan 30 15:06 220.127.116.11_gurjyan_clara_orch.log
-rwxr-xr-x 1 gurjyan da 89K Jan 30 15:12 18.104.22.168_gurjyan_clara_fe_dpe.log
Now a little bit about the file naming convention. As you can see the
reconstructed file name is created by adding the
out_ prefix to the
actual input data file name. This prefix is configurable, and can be
set/changed using the following CLI command:
clara> set outputFilePrefix myOwnPrefix_
clara> show config
Clara log files are critical for data preservation, monitoring and debugging. That is why the log file names carry some important information, such as the node where the processing was performed (in our example, node IP 22.214.171.124), the data processing session (by default the session is set to the user name; in our example session=gurjyan), and the data processing description (for this example description=clara). The session and description are CLI-configurable options and can be set with the following commands:
clara> set session myOwnSession
clara> set description myOwnDescription
clara> show config
We recommend that the session and the description be unique for every new data-set processing run.
We can do a lot without exiting the Clara CLI. For example, we can analyse the log files using the following commands:
clara> show logDPE
clara> show logOrchestrator
While the CLAS12 reconstruction application is best-in-class :), it won’t do you any good if it is missing one key feature. Clara provides flexible application customization to fit your needs. For example, you may want to pass engine-specific configuration options, or change the application service composition by adding or removing service engines. All this is possible with Clara. You will understand me if you have ever played with LEGO. OK, let us look at an example. Below is the official CLAS12 reconstruction application service composition, which can be accessed through the following command:
clara> show services
io-services:
  reader:
    class: org.jlab.clas.std.services.convertors.HipoToHipoReader
    name: HipoToHipoReader
  writer:
    class: org.jlab.clas.std.services.convertors.HipoToHipoWriter
    name: HipoToHipoWriter
services:
  - class: org.jlab.clas.swimtools.MagFieldsEngine
    name: MAGFIELDS
  - class: org.jlab.rec.ft.cal.FTCALEngine
    name: FTCAL
  - class: org.jlab.rec.ft.hodo.FTHODOEngine
    name: FTHODO
  - class: org.jlab.rec.ft.FTEBEngine
    name: FTEB
  - class: org.jlab.service.dc.DCHBEngine
    name: DCHB
  - class: org.jlab.service.ftof.FTOFHBEngine
    name: FTOFHB
  - class: org.jlab.service.ec.ECEngine
    name: EC
  - class: org.jlab.rec.cvt.services.CVTReconstruction
    name: CVT
  - class: org.jlab.service.ctof.CTOFEngine
    name: CTOF
#  - class: org.jlab.service.cnd.CNDEngine
  - class: org.jlab.service.cnd.CNDCalibrationEngine
    name: CND
  - class: org.jlab.service.htcc.HTCCReconstructionService
    name: HTCC
  - class: org.jlab.service.ltcc.LTCCEngine
    name: LTCC
  - class: org.jlab.rec.rich.RICHEBEngine
    name: RICH
  - class: org.jlab.service.eb.EBHBEngine
    name: EBHB
  - class: org.jlab.service.dc.DCTBEngine
    name: DCTB
  - class: org.jlab.service.ftof.FTOFTBEngine
    name: FTOFTB
  - class: org.jlab.service.eb.EBTBEngine
    name: EBTB
configuration:
  io-services:
    writer:
      compression: 2
# settings below are for GEMC, compatible with 4a.2.4
  services:
    MAGFIELDS:
      solenoidMap: Symm_solenoid_r601_phi1_z1201_13June2018.dat
      torusMap: Symm_torus_r2501_phi16_z251_24Apr2018.dat
      solenoidShift: "0.0"
      torusXShift: "0.0"
      torusYShift: "0.0"
      torusZShift: "0.0"
    DCHB:
      useStartTime: "true"
      wireDistortionsFlag: "false"
      geomDBVariation: may_2018_engineers
    DCTB:
      geomDBVariation: may_2018_engineers
mime-types:
  - binary/data-hipo
The command above printed the content of the Clara application composition YAML file, the location of which is configurable through the following command:
clara> set servicesFile /myOwnDir/myOwnService.yaml
clara> show config
Notice that I am calling show config after every set command just
to remind you about this useful command, which shows the data processing
application configuration options. It has no other purpose
than that (arguably useful, ha ha).
Ok, let us examine the official CLAS12 reconstruction service composition file,
the default location being:
clara> show config
...
servicesFile: "$CLARA_HOME/config/services.yaml"
...
The service composition file consists of the following sections:
Stream builder and stream consumer services. These are IO services that access a data source and create a stream of data quanta that are dispatched to data processing services.
io-services:
  reader:
    class: org.jlab.clas.std.services.convertors.HipoToHipoReader
    name: HipoToHipoReader
  writer:
    class: org.jlab.clas.std.services.convertors.HipoToHipoWriter
    name: HipoToHipoWriter
Data processing services.
services:
  - class: org.jlab.clas.swimtools.MagFieldsEngine
    name: MAGFIELDS
  - class: org.jlab.rec.ft.cal.FTCALEngine
    name: FTCAL
We describe a service by defining the service engine class and giving it a name. This means that we can have multiple services with different names sharing the same engine. Maybe not so useful, but one can configure the same engine differently, then build a limited cycle loop and stream events through it (note: no programming is necessary).
Section describing configuration options for services. In this section users can describe configuration parameters for IO services, global parameters shared by all processing services, and configuration options for a specific service (e.g. the parameter useStartTime for the DCHB service).
configuration:
  global:
    magnet:
      torus: -1
      solenoid: -1
    ccdb:
      run: 101
      variation: custom
    runtype: mc
    runmode: calibration
  io-services:
    writer:
      compression: 2
  services:
    DCHB:
      useStartTime: "true"
      wireDistortionsFlag: "false"
      geomDBVariation: may_2018_engineers
    DCTB:
      geomDBVariation: may_2018_engineers
Data type of the streaming event (data quantum).
mime-types:
  - binary/data-hipo
So you can add a new service to the application by providing the class
of an engine and an arbitrary name (preferably something descriptive).
Removing a service is as simple as deleting or commenting out the two lines in the YAML file that describe it.
E.g. below is a modified CLAS12 reconstruction application, where we keep three standard services from the official reconstruction application and add two services sharing the same engine. Here we demonstrate testing and debugging a new reconstruction engine with two different configuration options.
clara> set servicesFile /myOwnDir/myOwnService.yaml
clara> show services
io-services:
  reader:
    class: org.jlab.clas.std.services.convertors.HipoToHipoReader
    name: HipoToHipoReader
  writer:
    class: org.jlab.clas.std.services.convertors.HipoToHipoWriter
    name: HipoToHipoWriter
services:
  - class: org.jlab.clas.swimtools.MagFieldsEngine
    name: MAGFIELDS
#  - class: org.jlab.rec.ft.cal.FTCALEngine
#    name: FTCAL
  - class: org.jlab.rec.ft.FTEBEngine
    name: FTEB
  - class: org.jlab.service.dc.DCHBEngine
    name: DCHB
  - class: my.own.service.dc.DCHBEngineVersion
    name: myDCHB1
  - class: my.own.service.dc.DCHBEngineVersion
    name: myDCHB2
configuration:
  io-services:
    writer:
      compression: 2
# settings below are for GEMC, compatible with 4a.2.4
  services:
    myDCHB1:
      useStartTime: "true"
      wireDistortionsFlag: "false"
      geomDBVariation: may_2018_engineers
    myDCHB2:
      useStartTime: "false"
      wireDistortionsFlag: "true"
      geomDBVariation: feb_2019_engineers
mime-types:
  - binary/data-hipo
The Clara application composition YAML file represents a directed graph, in which data flows from the services listed at the top to those at the bottom.
Running on a farm
Old McDonald sends a farm job e-i-e-i-o…
Yup, it is that easy. You do not have to write farm submission scripts, and farm deployment can be done without leaving the Clara CLI. Let us show it with an example.
Before we continue, remember that a farm submission must have a unique session. This is critical in case the farm control system lands your jobs on the same node.
I have a data set consisting of 4 files (I can see my data set using the
show files command; see below), and I want to process it on the JLAB farm.
clara> show files
clas_004013.0.hipo
clas_004013.1.hipo
clas_004013.2.hipo
clas_004013.3.hipo
The following settings will configure my farm deployment:
Make sure user data is accessible from the farm mounted file system. This includes input/output data directories, application and data-set description files, log directory and farm-job PBS/SLURM scripts.
clara> set inputDir /work/clas12/gurjyan/Testbed/clara/data/input
clara> set outputDir /work/clas12/gurjyan/Testbed/clara/data/out
clara> set servicesFile /work/clas12/gurjyan/Testbed/clara/services.yaml
clara> set fileList /work/clas12/gurjyan/Testbed/clara/files.txt
clara> set logDir /work/clas12/gurjyan/Testbed/clara/log
You can minimize manual settings in the CLI (using default settings) by defining the
CLARA_USER_DATA environment variable, pointing to a user-data directory that is visible to the farm system. Note that this must be done prior to running the Clara CLI.
Define the data-processing session and the description.
clara> set session gurjyanSession
clara> set description gurjyanDescription
Set the vertical scaling parameter (so-called multi-threading, i.e. how many threads should process the data in parallel).
clara> set farm.cpu 8
Request the memory and the disk space for the job.
clara> set farm.memory 30
clara> set farm.disk 10
These are typical settings for 8-core jobs that will work on all JLAB farm nodes.
That’s it. Now we can launch the farm job. Still,
it is good practice to check the settings before a farm deployment
(I am sure you remember the CLI command show config).
To run a farm job execute the following command in the CLI:
clara> run farm
Parsing script ... (it may take while)
<jsub><request><index>30283608</index><jobIndex>63238147</jobIndex></request></jsub>
Now we can monitor the job submission by:
clara> show farmStatus
JOB_ID    USER     STAT  QUEUE     EXEC_HOST  JOB_NAME         SUBMIT_TIME   CPU_TIME  WALLTIME  ACCOUNT
63238147  gurjyan  A     priority  --         ...urjyan-clara  Feb 05 13:55  --        --        clas12
You can examine the actual farm submission and shell executable scripts
created by Clara in the $CLARA_USER_DATA/config directory
(.jsub and .sh files).
$ cd $CLARA_USER_DATA/config
$ ls
farm_gurjyan_clara.jsub farm_gurjyan_clara.sh files.txt service.yaml
There are more farm-job control parameters that help users to further customize farm deployments. E.g. the user can process a given data set on multiple farm nodes in parallel (find more details here).
clara> set farm.scaling 4
This command will divide the entire data set into groups of 4 files and will process each group on a different farm node. This is known as Clara horizontal scaling.
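To get a feel for how many farm nodes a given setting will ask for, the grouping arithmetic can be sketched in bash. This is plain ceiling division based on the description above, not Clara's own code; the function name count_groups and the 10-file data set are made up for the example.

```shell
#!/usr/bin/env bash
set -eu

# Number of groups when n_files are cut into chunks of at most group_size files
# (integer ceiling division).
count_groups() {
  local n_files=$1 group_size=$2
  echo $(( (n_files + group_size - 1) / group_size ))
}

count_groups 4 4    # the 4-file data set from this tutorial: prints 1
count_groups 10 4   # a hypothetical 10-file data set: prints 3
```

So with farm.scaling set to 4, the 4-file data set used here forms a single group, while a 10-file data set would form three (4 + 4 + 2 files).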
Farm node flavors
The user can also request data processing on a specific node flavor of the farm.
The JLAB scientific computing farm consists of the following hardware systems:
- farm18, 6148 CPU @ 2.4 GHz, cores = 80
- farm16, E5-2697 v4 @ 2.3 GHz, cores = 72
- farm14, E5-2670 v3 @ 2.3 GHz, cores = 48
- farm13, E5-2650 v2 @ 2.6 GHz, cores = 32
- qcd12s, E5-2650 0 @ 2.0 GHz, cores = 32
To run data processing on specific farm hardware you need to set the farm.node parameter. E.g. the command below will request data processing only on farm18 nodes.
clara> set farm.node farm18
The examples above use the farm in so-called sharing mode, where multiple jobs owned by multiple users might be running on a single farm node. In this mode data processing performance cannot be properly optimized, due to unpredictable resource allocation requests. However, Clara can request an exclusive node for data processing, where only your job will be running. In this case Clara will perform hardware-level optimizations to achieve maximum performance.
E.g. the command below requests exclusive access to a farm18 node:
clara> set farm.exclusive farm18
The exclusive mode works only for SLURM-controlled farm nodes. For the exclusive mode on the SLURM farm you do not have to define the farm.memory and farm.cpu parameters, since Clara will set these values for you to guarantee maximum performance.
How can I then get a farm node exclusively under PBS, you might ask?
The farm.exclusive parameter is inactive for PBS jobs, but you can
request an exclusive node on PBS by setting proper memory and core requests.
E.g. request the farm18 exclusive node on PBS:
clara> set farm.node farm18
clara> set farm.cpu 80
clara> set farm.memory 00
clara> set farm.disk 25
or request the farm16 exclusive node on PBS:
clara> set farm.node farm16
clara> set farm.cpu 72
clara> set farm.memory 60
clara> set farm.disk 25
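In both PBS examples the farm.cpu request equals the core count of the node flavor. A small bash sketch that looks up the core count from the hardware list above and prints the matching CLI lines could look like this. The function names are hypothetical, and the memory and disk requests are deliberately left out, since they have to be chosen per flavor.

```shell
#!/usr/bin/env bash
set -eu

# Core counts per node flavor, copied from the hardware list in this section.
cores_for_flavor() {
  case "$1" in
    farm18)        echo 80 ;;
    farm16)        echo 72 ;;
    farm14)        echo 48 ;;
    farm13|qcd12s) echo 32 ;;
    *) echo "unknown flavor: $1" >&2; return 1 ;;
  esac
}

# Print the Clara CLI lines that request every core of the given flavor.
pbs_exclusive_commands() {
  local flavor=$1 cores
  cores=$(cores_for_flavor "$flavor") || return 1
  printf 'set farm.node %s\nset farm.cpu %s\n' "$flavor" "$cores"
}

pbs_exclusive_commands farm16
```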
One other important parameter to consider is farm.stage. It tells the Clara workflow management system to stage files one by one in the local file system of the farm computing node. This operation is critical for IO optimization, since local IO is considerably faster than shared file system IO. E.g.
clara> set farm.stage /scratch/clara/gurjyan
/scratch/clara is a directory created on all JLAB farm nodes.
For proper staging and file transfers the user must request a subdirectory
specific to his/her processing (in the example the subdir is gurjyan).
In case you are running on the JLAB farm, the easiest and safest option is to request the default staging directory: /scratch/clara/$USER.
clara> set farm.stage default
More information on farm deployment parameters is available here.
Choosing between PBS and SLURM
Your actions and Clara CLI command sets stay the same whether you launch
data processing jobs on a farm controlled by the PBS or the SLURM batch control system.
Clara is transparent in this sense. If you want your jobs to end up on
SLURM or PBS controlled farms, the only thing you need to do is change the
PATH variable in your startup script:
$ set path = ( /site/scicomp/auger-slurm/bin $path )
That’s it. Now you are ready to process data on a SLURM-controlled farm.
The default farm job .out and .err files are no longer copied to the home
directory of the user (as is the case with PBS); instead they are written directly to the central
location under /lustre/expphy/farm_out/
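The PATH line above is written in csh/tcsh syntax. If your login shell is bash, an equivalent sketch for your startup file (for example ~/.bashrc, which is an assumption about your setup) would be the following. It prepends the directory only once, so sourcing the file repeatedly does not keep growing PATH.

```shell
#!/usr/bin/env bash

# Directory taken from the csh example above.
slurm_bin="/site/scicomp/auger-slurm/bin"

# Prepend it to PATH unless it is already there.
case ":$PATH:" in
  *":$slurm_bin:"*) ;;
  *) export PATH="$slurm_bin:$PATH" ;;
esac

# Show the search path, one directory per line.
printf '%s\n' "$PATH" | tr ':' '\n'
```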
There are useful SLURM commands that are not ported into the Clara CLI
(there is no intention to make Clara a jack of all trades),
such as slurmQueues, that help you see available nodes,
running jobs and their statuses. Please refer to the
scicomp web site for more information.
This entire time we were typing CLI commands in the Clara shell. What if I exit the shell? Do I need to retype all the commands that I discovered and tested when I start a new Clara shell?
If this is the case you might suggest watering me twice a day.
But wait… do not try to get easy exercise by jumping to conclusions.
Before exiting the CLI I recommend you save whatever you typed in the shell as a clara script by:
clara> save myTest.clara
The next time you start a shell you can type something like:
$ $CLARA_HOME/bin/clara-shell myTest.clara
You can find more on Clara scripting here.
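You can also prepare such a script by hand. The bash sketch below writes one into a temporary directory, under the assumption that a Clara script is simply the sequence of CLI commands, one per line; the exact file that save produces may look different. The commands themselves are the farm settings used earlier in this tutorial.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway location for the example script.
workdir=$(mktemp -d)
script="$workdir/myTest.clara"

# One CLI command per line (assumed script format).
cat > "$script" <<'EOF'
set session gurjyanSession
set description gurjyanDescription
set farm.cpu 8
set farm.memory 30
set farm.disk 10
show config
EOF

cat "$script"

# To replay it (requires a Clara installation):
#   $CLARA_HOME/bin/clara-shell "$script"
```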