This document is now obsolete. For more up to date documentation see https://plone3.fnal.gov/SAMGrid/Wiki/D0Documentation/

Storing Group Data to SAM

 

1. Introduction

Many users and groups would like to store the data they produce in the SAM system in order to track it and make it available to other members of their group, as well as to Dzero in general. It is important to make this data distinguishable from data produced on the production farms or as a production effort by Dzero. It is also important to specify allocations of tapes and robot slots for each group so that this kind of storage activity can be easily controlled. Groups may want to recover storage space taken up by older data, so there must be a mechanism for recovering the space used by deleted or obsolete files. All these issues are addressed below, along with the procedure and details of storing group data.

2. Directing Data to Specific Tapes

Data is directed to specific sets of tapes using a mechanism called SAM autodestination. Within this system, information contained in your description file is used to map each file to a predetermined location on tape. Users do not need to know what this location is, but a SAM administrator needs to establish it before data can be stored successfully. We propose that the storage locations be split into two separate branches, one for real data and one for MC data. While we are using M2 tapes, we will not mix the different kinds of files on a given tape. However, we will begin using more reliable LTO storage media soon, and this requirement may then be relaxed. There are 13 group file family types below, which we would like to map to 7 or fewer tape families, as shown by the numbers in parentheses. We specify separate tiers for re-reconstructed and re-root-tupled data just to make them more identifiable. You will always be able to tell from the parentage of a file whether it came from a reco or root-tuple file, but having this in the tier information is more obvious. We propose that the tiers and file families be as in the following table, with the special designator "-bygroup" indicating that the data was entered under the authority of a particular physics group.
 
 
Data Tier              Tape File Family, e.g. top group (tape #)  Comment
---------------------  -----------------------------------------  ---------------------------------------
generated              -                                          production effort
generated-bygroup      group-phase1_top_mc_generated(1)           mc generated by group
simulated              -                                          production effort
simulated-bygroup      group-phase1_top_mc_simulated(2)           mc d0gstar by group
digitized              -                                          production effort
digitized-bygroup      group-phase1_top_mc_digitized(3)           mc d0sim by group
reconstructed          -                                          production effort
reconstructed-bygroup  group-phase1_top_mc_reconstructed(4),      mc, data, reco'd by group
                       group-phase1_top_data_reconstructed(5)
root-tuple             -                                          production effort
root-tuple-bygroup     group-phase1_top_mc_root-tuple(6),         mc, data, root-tupled by group
                       group-phase1_top_data_root-tuple(7)
trigsim                -                                          production effort
trigsim-bygroup        group-phase1_top_mc_reconstructed(4)       trigger simulation by group
raw                    -                                          production raw data file
raw-bygroup            group-phase1_top_mc_raw(8)                 raw data as streamed or picked by group
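
The naming pattern in the table can be illustrated with a short Python sketch. It is not the SAM autodestination code; the function name tape_file_family and the SHARED_FAMILY mapping are hypothetical and only reproduce the phase_group_kind_tier pattern visible in the table, including the fact that trigsim shares the reconstructed MC family.

```python
# Illustrative only: composes names in the pattern shown in the table.
# This is NOT the SAM autodestination implementation.

# Tiers whose files share another tier's tape family
# (in the table, trigsim-bygroup goes to the reconstructed MC family).
SHARED_FAMILY = {"trigsim": "reconstructed"}


def tape_file_family(phase, group, kind, tier):
    """Return a family name such as 'group-phase1_top_mc_generated'.

    kind is 'mc' or 'data'; tier must be a group tier ending in '-bygroup'.
    """
    suffix = "-bygroup"
    if not tier.endswith(suffix):
        raise ValueError("group data needs a tier ending in '-bygroup': %r" % tier)
    if kind not in ("mc", "data"):
        raise ValueError("kind must be 'mc' or 'data': %r" % kind)
    base = tier[:-len(suffix)]          # e.g. 'generated'
    base = SHARED_FAMILY.get(base, base)
    return "_".join([phase, group, kind, base])


print(tape_file_family("group-phase1", "top", "mc", "generated-bygroup"))
print(tape_file_family("group-phase1", "top", "data", "root-tuple-bygroup"))
print(tape_file_family("group-phase1", "top", "mc", "trigsim-bygroup"))
```

The tape numbers in parentheses in the table are assigned by a SAM administrator and are deliberately not modelled here.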

3. Using Phase to Recycle Tapes

Every file stored to tape has a phase assigned to it. A general set of group phases will be used for all groups, named "group-phase1", "group-phase2", and so on. The "group-" prefix distinguishes them from any production processing. When a group decides to move to the next phase, a new set of tapes will be assigned with the new phase in the file family name. Any tape carrying the old phase can then be recycled completely, recovering the entire tape. In the future we may have more extensive tools for removing individual files and squeezing the tape, but not at this time.
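
As a rough illustration of how a phase change carries through to the file family names, the following Python sketch increments a phase string and rewrites a family name accordingly. The helper names next_phase and rephase_family are hypothetical; the actual assignment of new tapes is done by a SAM administrator, not by user code.

```python
import re


def next_phase(phase):
    """Return the phase that follows, e.g. 'group-phase1' -> 'group-phase2'."""
    match = re.fullmatch(r"group-phase(\d+)", phase)
    if match is None:
        raise ValueError("not a group phase: %r" % phase)
    return "group-phase%d" % (int(match.group(1)) + 1)


def rephase_family(family, new_phase):
    """Replace the leading phase field of a file family name."""
    # The phase is the first '_'-separated field of the family name.
    _old_phase, rest = family.split("_", 1)
    return new_phase + "_" + rest


new_phase = next_phase("group-phase1")
print(new_phase)
print(rephase_family("group-phase1_top_mc_generated", new_phase))
```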
 

4. Creating the Required Metadata

Data is added to the SAM system by providing the actual data together with a description file that completely describes it. The description file is a Python-formatted file, and the attributes it contains are used to map each file to a unique physical location within the SAM domain. Monte Carlo files have additional information in the form of an ASCII parameter file, which can contain details that the user may want to retrieve later.

The easiest method for adding data is to use the mc_runjob framework to generate the metadata and parameter files for you. You can also write your own description files based on the examples below. If you plan to have these files processed through the external production processing farms, consult the mc_runjob documentation for the rules regarding file names.

If you are creating files using d0tools or other d0framework programs, you can specify that the output be stored directly into SAM, and the metadata will be taken care of automatically. However, autodestination is not implemented for this mode at present, so it is not yet a viable way for users to store data.

The parameter files may contain any useful information you would like to save with the file. mc_runjob generates the parameter files for you; if you are not using mc_runjob, you will need to generate them yourself.
 

5. Storing or Declaring  Files to SAM


Each output file in a processing chain must be recorded in SAM, in the order in which the files were processed. However, if you do not wish to store the physical data, you can store just the metadata with a "declare". A typical example of a processing chain is gen > sim > dig > reco > root-tuple. To add a data file to SAM, use:

sam store --descrip=description_file.py --source=/path/to/file/ [--resubmit]

The --source flag specifies a source path, so you do not need to be in the directory where the files are in order to store them. However, the data file, the description file, and the parameter file must all be in the same directory for the store to work. If your working directory is the location of these files, you must still supply the flag "--source=.". The --resubmit flag is used to store a file whose previous store failed part way through. Use it if a store fails with an error message you do not fully understand, and a second attempt then reports that the file is already in SAM.
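
For users who script their stores, the following Python sketch assembles the store command line from the flags described above and checks that the required files sit together in the source directory. It only builds the argument list and does not run sam. The function names build_store_command and missing_from_source are hypothetical helpers, not part of SAM.

```python
import os


def build_store_command(description_file, source_dir=".", resubmit=False):
    """Return the 'sam store' argument list using only the flags shown above."""
    command = ["sam", "store",
               "--descrip=" + description_file,
               "--source=" + source_dir]
    if resubmit:
        # Only for a file whose earlier store failed part way through.
        command.append("--resubmit")
    return command


def missing_from_source(source_dir, file_names):
    """Return the names that are not regular files inside source_dir."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(source_dir, name))]


print(" ".join(build_store_command("description_file.py", "/path/to/file/")))
print(" ".join(build_store_command("description_file.py", ".", resubmit=True)))
```

Calling missing_from_source with the data file, description file, and parameter file names before a store gives an early warning if one of them is in the wrong directory.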

To just add the metadata, use:

sam declare description_file.py --source=/path/to/file/

This is sometimes needed to provide SAM with the file parentage information when you do not want to store the actual parents to tape.
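
The choice between store and declare along a processing chain can be sketched as follows. This is a minimal Python illustration, assuming the group tier names from the table in section 2 and the chain order gen > sim > dig > reco > root-tuple; the function plan_chain and its arguments are hypothetical.

```python
# Chain order from the text: gen > sim > dig > reco > root-tuple,
# written with the group tier names from section 2.
CHAIN_ORDER = ["generated-bygroup", "simulated-bygroup", "digitized-bygroup",
               "reconstructed-bygroup", "root-tuple-bygroup"]


def plan_chain(description_by_tier, tiers_to_store):
    """Return (action, description_file) pairs in processing order.

    Tiers listed in tiers_to_store are stored to tape; every other tier
    that is present is only declared, so that parentage is still recorded.
    """
    plan = []
    for tier in CHAIN_ORDER:
        if tier not in description_by_tier:
            continue
        action = "store" if tier in tiers_to_store else "declare"
        plan.append((action, description_by_tier[tier]))
    return plan


descriptions = {
    "root-tuple-bygroup": "store_reco_analyze_file.py",
    "generated-bygroup": "store_generated_file.py",
    "reconstructed-bygroup": "store_reconstructed_file.py",
}
for action, description in plan_chain(descriptions, {"root-tuple-bygroup"}):
    print(action, description)
```

In this example only the root-tuple file goes to tape, while the generated and reconstructed parents are declared first so the parentage is known to SAM.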
 

6. MC_runjob file naming convention


To have files acceptable for farm operation, you need to use a standard naming convention. The character "_" is reserved in mc_runjob, so using it in the wrong place can cause problems. An example filename is

gen_test02_p10.06.00_lancs_pythia_hitv00+03+05+qcd-null-PtGt100.0_mb-none_282144137_2001
1   2      3         4     5      6                               7       8         9

1, tier of code used to produce the file: gen for four-vectors
2, phase: this would be group specific, as given by Lee
3, release version of the code used to produce the file
4, production facility
5, generator
6, cardfile and decay string; no underscores, use + or -
7, mb (minimum bias) overlaid; must be mb-none for gen files
8, request id (doesn't mean much yet) time stamp
9, wrid, number.....
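
Because "_" separates the nine fields, a filename can be checked by splitting on it. The Python sketch below is an illustration under that assumption; the field labels in FIELDS and the function parse_mc_name are hypothetical names chosen to match the list above, not identifiers from mc_runjob.

```python
# Hypothetical labels for the nine '_'-separated fields listed above.
FIELDS = ["tier", "phase", "release", "facility", "generator",
          "cardfile_and_decay", "minbias", "request_id", "wrid"]


def parse_mc_name(name):
    """Split an mc_runjob-style filename into its nine named fields."""
    parts = name.split("_")
    if len(parts) != len(FIELDS):
        # A stray '_' inside a field changes the field count.
        raise ValueError("expected %d fields, got %d: %r"
                         % (len(FIELDS), len(parts), name))
    fields = dict(zip(FIELDS, parts))
    if fields["tier"] == "gen" and fields["minbias"] != "mb-none":
        raise ValueError("gen files must use mb-none: %r" % name)
    return fields


example = ("gen_test02_p10.06.00_lancs_pythia_"
           "hitv00+03+05+qcd-null-PtGt100.0_mb-none_282144137_2001")
for label, value in parse_mc_name(example).items():
    print(label, "=", value)
```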
 
 

7. Examples of description files for several data tiers.

Following are working examples of description files used to store files. Please note that each description file (except Example 7) ends with a line containing "-bygroup". This is required. You can check the syntax of the files you create by running them through python with the following procedure:

setup sam
python my_description_file.py

If this test produces no errors, then you are ready to store or declare with the commands given in section 5.
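
The python check above does not verify the tier itself. As a complementary check, the following Python sketch scans the text of a description file for tier assignments and confirms that each one ends in "-bygroup". It is a plain text scan that does not import SAM; the function names are hypothetical, and the regular expression only recognises simple quoted assignments like those in the examples below.

```python
import re

# Matches assignments such as  x.tier="generated-bygroup"  or  tier = 'raw-bygroup'
TIER_ASSIGNMENT = re.compile(r"""\btier\s*=\s*(['"])(.*?)\1""")


def tiers_in_description(text):
    """Return every quoted tier value assigned in the description text."""
    return [match.group(2) for match in TIER_ASSIGNMENT.finditer(text)]


def has_only_bygroup_tiers(text):
    """True if at least one tier is set and all of them end in '-bygroup'."""
    tiers = tiers_in_description(text)
    return bool(tiers) and all(tier.endswith("-bygroup") for tier in tiers)


good = 'gen_file_import.tier="generated-bygroup"\n'
bad = 'gen_file_import.tier="generated"\n'
print(has_only_bygroup_tiers(good))
print(has_only_bygroup_tiers(bad))
```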

Example 1. store_generated_file.py

from import_classes import *
#
# Generated by runMCwin
#
my_generator   = AppFamily( "generator","p07.00.05a","pythia" )

class MyProcess(ProcFamily):
    group="higgs"
    origin_location="FNAL"
    origin_facility="d0mino"
    produced_for="Qizhong Li"
    phase="group-phase1"
    def __init__(self, stream, param_file, produced_by):
        self.stream=stream
        self.param_file=param_file
        self.produced_by=produced_by

class Generator(MyProcess):
    appfamily=my_generator

channel = Channel("bbh","bbbb")
minbi = MinBias("none","0.0")

gen_fil=Generator(stream="notstreamed",
                  param_file="generator_test185201919.params",
                  produced_by="Avto Kharchilava")

gen_file_import = PrimaryMCFile("pythia_bbh_bbbb1.dat",
    gen_fil, 1234, Events(1, 500, 500),
   "07/02/2001 17:44", "07/03/2001 05:23",
    1.960, channel)

gen_file_import.tier="generated-bygroup"
 
 

Example 2. store_d0gstar_file.py

from import_classes import *
#
# Generated by runMCwin
#
my_d0gstar   = AppFamily( "simulator","p07.00.05a","d0gstar" )

class MyProcess(ProcFamily):
    group="higgs"
    origin_location="FNAL"
    origin_facility="d0mino"
    produced_for="Qizhong Li"
    phase="group-phase1"
    def __init__(self, stream, param_file, produced_by):
        self.stream=stream
        self.param_file=param_file
        self.produced_by=produced_by

class Simulator(MyProcess):
    appfamily=my_d0gstar

channel = Channel("bbh","bbbb")
minbi = MinBias("none","0.0")

d0g_fil=Simulator(stream="notstreamed",
                  param_file="d0gstar_test185201919.params",
                  produced_by="Avto Kharchilava")

d0g_file_import =SimulatedFile("d0g.pythia_bbh_bbbb1.dat",
    d0g_fil, 65123, Events(1, 500, 500),
   "07/03/2001 17:44", "07/04/2001 05:23",
    "pythia_bbh_bbbb1.dat", 1, 1, channel)

d0g_file_import.tier="simulated-bygroup"
 

Example 3. store_d0sim_file.py

from import_classes import *
#
# Generated by runMCwin
#
my_d0sim   = AppFamily( "digitizer","p07.00.05a","d0sim" )

class MyProcess(ProcFamily):
    group="higgs"
    origin_location="FNAL"
    origin_facility="d0mino"
    produced_for="Qizhong Li"
    phase="group-phase1"
    def __init__(self, stream, param_file, produced_by):
        self.stream=stream
        self.param_file=param_file
        self.produced_by=produced_by

class Digitizer(MyProcess):
    appfamily=my_d0sim

channel = Channel("bbh","bbbb")
minbi = MinBias("none","0.0")

dig_fil=Digitizer(stream="notstreamed",
                  param_file="d0sim_test185201919.params",
                  produced_by="Avto Kharchilava")

dig_file_import = DigitizedFile("sim.d0g.pythia_bbh_bbbb1.dat",
    dig_fil, 90346, Events(1, 500, 500),
   "07/04/2001 17:44", "07/05/2001 05:23",
    "d0g.pythia_bbh_bbbb1.dat", 1, 1, channel, minbi)

dig_file_import.tier="digitized-bygroup"
 

Example 4.  store_reconstructed_file.py

from import_classes import *
#
# Generated by runMCwin
#
my_reco   = AppFamily( "reconstruction","p08.12.00","d0reco" )

class MyProcess(ProcFamily):
    group="higgs"
    origin_location="FNAL"
    origin_facility="d0mino"
    produced_for="Qizhong Li"
    phase="group-phase1"
    def __init__(self, stream, param_file, produced_by):
        self.stream=stream
        self.param_file=param_file
        self.produced_by=produced_by

class Reconstruction(MyProcess):
    appfamily=my_reco

channel = Channel("bbh","bbbb")
minbi = MinBias("none","0.0")

rec_fil=Reconstruction(stream="notstreamed",
                  param_file="d0reco_test185201919.params",
                  produced_by="Avto Kharchilava")

rec_file_import = ReconstructedMCFile("reco.sim.d0g.pythia_bbh_bbbb1.dat",
    rec_fil, 253859, Events(1, 500, 500),
   "07/05/2001 17:44", "07/06/2001 05:23",
    "sim.d0g.pythia_bbh_bbbb1.dat", 1, 1, channel, minbi)

rec_file_import.tier="reconstructed-bygroup"
 

Example 5.  store_reco_analyze_file.py

from import_classes import *
#
# Generated by runMCwin
#
my_recoA   = AppFamily( "analysis","p08.12.00","reco_analyze" )

class MyProcess(ProcFamily):
    group="higgs"
    origin_location="FNAL"
    origin_facility="d0mino"
    produced_for="Qizhong Li"
    phase="group-phase1"
    def __init__(self, stream, param_file, produced_by):
        self.stream=stream
        self.param_file=param_file
        self.produced_by=produced_by

class Reconstruction(MyProcess):
    appfamily=my_recoA

channel = Channel("bbh","bbbb")

recA_fil=Reconstruction(stream="notstreamed",
                  param_file="recoA_test185201919.params",
                  produced_by="Avto Kharchilava")

recA_file_import = RecoAnalyzedMCFile("recoA.reco.sim.d0g.pythia_bbh_bbbb1.root",
    recA_fil, 21600, Events(1, 500, 500),
   "07/05/2001 17:44", "07/06/2001 05:23",
    "reco.sim.d0g.pythia_bbh_bbbb1.dat", 1, 1, channel)

recA_file_import.tier="root-tuple-bygroup"
 
 

Example 6. store_reco_analyze_file.py (for multiple parents)

from import_classes import *
#
# Generated by runMCwin
#
my_recoA   = AppFamily( "analysis","p08.12.00","reco_analyze" )

class MyProcess(ProcFamily):
    group="higgs"
    origin_location="FNAL"
    origin_facility="d0mino"
    produced_for="Qizhong Li"
    phase="group-phase1"
    def __init__(self, stream, param_file, produced_by):
        self.stream=stream
        self.param_file=param_file
        self.produced_by=produced_by

class Reconstruction(MyProcess):
    appfamily=my_recoA

channel = Channel("bbh","bbbb")

recA_fil=Reconstruction(stream="notstreamed",
                  param_file="recoA_test185201919.params",
                  produced_by="Avto Kharchilava")

recA_file_import = RecoAnalyzedMCFile("recoA.reco.sim.d0g.pythia_bbh_bbbb1.root",
    recA_fil, 21600, Events(1, 500, 500),
   "07/05/2001 17:44", "07/06/2001 05:23",
    ["parent1.dat","parent2.dat","parent3.dat"], 1, 1, channel)

recA_file_import.tier="root-tuple-bygroup"

Example 7. store_raw_data_file.py

from import_classes import *
TheFile = ProcessedFile(
         name = 'pick_w32.dat',
         sizeK = 17544,
         events = Events(3654425, 3878001, 99),
         stream = 'pick-event',
         tier = 'raw-bygroup',
         start_time = '03/04/2002 23:07:57',
         end_time = '03/04/2002 23:29:16',
         pid = 578442,
         parents = ['all_0000142851_030.raw', 'all_0000142851_031.raw',
 'all_0000142851_032.raw', 'all_0000142851_033.raw', 'all_0000142851_034.raw',
 'all_0000142851_035.raw', 'all_0000142851_036.raw', 'all_0000142851_037.raw',
 'all_0000142851_038.raw', 'all_0000142851_039.raw', 'all_0000142851_040.raw',
 'all_0000142851_041.raw', 'all_0000142851_043.raw', 'all_0000142851_044.raw',
 'all_0000142851_045.raw', 'all_0000142851_046.raw', 'all_0000142851_047.raw',
 'all_0000142851_048.raw', 'all_0000142851_049.raw', 'all_0000142872_002.raw',
 'all_0000142872_003.raw', 'all_0000142872_004.raw', 'all_0000142872_005.raw'])

Note: This example is slightly different from the others, as it is a description file generated by SAMManager, which is produced automatically when you run a SAM project, for example with d0tools. In this case, you put your stream name and tier in the rcp parameters (see the documentation on running SAM projects). The tier name must end in "-bygroup". The "pick-event" stream is special for files produced by pick event.