
Automated Data Reduction Workflows for Astronomy



1. 2.2. Design of data organisation and data processing systems. The data organisation discussed in Sec. 2.1 and the data processing that follows the data organisation both describe relations among different categories of data. These relations are interdependent, in the sense that a change in the selection of data might require some change in the data processing, and vice versa. The question therefore arises as to whether the best architecture is to derive these relations from a common source, or whether the information recorded in these relations is sufficiently different to warrant their independent implementation. In general, a specific selection of data does not uniquely specify the data processing sequence; very different data processing workflows can be constructed to use a given selection of data. Only the most basic data processing follows the data organisation process one-to-one, but this case is rarely used in practice. The data processing part of the workflows is in general more complex than the data organisation, and is also more frequently subject to change and optimization during the data reduction process. The purpose of a file records an aspect of the selection criteria used to include this file in a data set. It is up to the workflow design to decide how this i
2. [Table 2, continued: association rules]
   product processed_dark { REFLEX.CATG = processed_dark }
   action proc_flat
     select files as bias from inputFiles where REFLEX.CATG == bias
     select files as processed_dark from inputFiles where REFLEX.CATG == processed_dark and inputFile.EXPTIME == EXPTIME
     product processed_flat { REFLEX.CATG = processed_flat }
   action proc_image
     select files as bias from inputFiles where REFLEX.CATG == bias
     select files as processed_dark from inputFiles where REFLEX.CATG == processed_dark and inputFile.EXPTIME == EXPTIME
     select files as processed_flat from inputFiles where REFLEX.CATG == processed_flat and inputFile.FILTER == FILTER
Notes. The table lists an example of a set of OCA rules that can produce the data organisation shown in Fig. 2.
be referred to by other rules. A simple example is that the proc_flat action needs a combined bias frame and produces a product MasterFlat. These rules are sufficient to describe the data graphs discussed in Sec. 2.1 and shown in Figs. 1 and 2. In Tab. 2 we show, as a specific example, the rules that describe the data organisation for an image that needs a flatfield and a dark frame for its processing, as discussed in Sec. 2.1 and shown in Fig. 2. The first block classifies available files as science_image, flat, or dark based on the header keyword TYPE, and as bias based on the fact that the value of the header keyword EXPTIME is 0. The next three se lect st
3. Hook, R. N., Maisala, S., Oittinen, T., et al. 2006, Astronomical Data Analysis Software and Systems XV, 351, 343
Jung, Y., Ballester, P., Banse, K., et al. 2004, Astronomical Data Analysis Software and Systems (ADASS) XIII, 314, 764
Ludascher, B., Altintas, I., Berkley, C., et al. 2005, Concurrency and Computation: Practice and Experience, 18, 1039
McFarland, J. P., Verdoes Kleijn, G., Sikkema, G., et al. 2013, Experimental Astronomy, 35, 45
McKay, D. J., Ballester, P., Banse, K., et al. 2004, Proc. SPIE, 5493, 444
Nastasi, A., Scodeggio, M., Fassbender, R., et al. 2013, A&A, 550, A9
Peron, M. 2012, Astronomical Data Analysis Software and Systems XXI, 461, 115
Rose, J., Akella, R., Binegar, S., et al. 1995, Astronomical Data Analysis Software and Systems IV, 77, 429
Schaaff, A., Verdes-Montenegro, L., Ruiz, J. E., & Santander-Vela, J. 2012, Astronomical Data Analysis Software and Systems XXI, 461, 875
Schmithuesen, O., Erben, T., Trachternach, C., Bomans, D. J., & Schirmer, M. 2007, Astronomische Nachrichten, 328, 701
Scodeggio, M., Franzetti, P., Garilli, B., et al. 2005, PASP, 117, 1284
Scott, D., Pierfederici, F., Swaters, R. A., Thomas, B., & Valdes, F. G. 2007, Astronomical Data Analysis Software and Systems XVI, 376, 265
Tody, D. 1993, Astronomical Data Analysis Software and Systems II, 52, 173
Tucker, D. L., Kent, S., Richmond, M. W., et al. 2006, Astronomische Nachrichten, 327, 821
Tsapras, Y., Street, R., Horn
4. Fig. 5. The Kepler user interface loaded with the Reflex workflow for ESO's X-Shooter instrument. The top section defines the input directories and user preferences; it is usually sufficient to specify the raw data directory to run the workflow on a new data set. The execution of a workflow is started with the run button in the top left panel. The workflow includes 8 recipe executers that run the recipes necessary to reduce X-Shooter data. Actors with an orange background include interactive steps that can display the result of the recipe and allow for the optimization of recipe settings. The workflow includes a specially implemented actor called Flat Strategy that is specific to the X-Shooter workflow; this actor allows the user to select a flatfielding strategy. Depending on the chosen strategy, files will be routed differently.
The operation trim modifies a purpose by removing the last action from a purpose that consists of at least two concatenated actions, and sets a purpose that consists of a single action to universal. Workflows are designed so that the input to any recipe consists of files with a single identical purpose. These operations are sufficient to design workflows that collect al
5. lection rules are defined in the action called proc_science. Then the bias frame is selected for this flatfield based on properties of the flatfield, e.g., observing date or read-out mode; this selection is defined in the action proc_flat. The purpose of this bias frame, as well as of the flatfield, is then proc_science|proc_flat, while the bias frame that matches the properties of the science frame, as well as the science frames themselves, have the purpose proc_science. The other biases in this example have the purposes proc_science|proc_flat|proc_dark and proc_dark. The different biases have different purposes, so that the workflow can process them separately. A given file might have several different purposes if it is selected multiple times by the rules (see Fig. 1). An example of this is when the same bias frame matches the selection rules for both the flatfield and the science frames.
2.4. Data processing workflows
There are different ways to carry out the task of reducing astronomical data, even when the applications used for individual reduction steps are fixed. One approach is to sort data by category and process each category in sequence. For example, one might start by processing all the bias frames for all data sets as the very first step, then proceed to subtract combined biases from all relevant data, and continue with, for example, producing flatfields. A different approach is to
6. fully process a single data set per forming all necessary steps to see the final result for the first data set in the shortest possible time Each intermediate product such as a combined bias is produced only when it is needed The former approach has the advantage that it simplifies book keeping in that the only necessary initial sorting is by file type Operations of the same kind are all performed together The parameters for every task are optimized by repeatedly inspecting results Once a good set of parameters is found it is applied to all files of the same kind This approach is efficient in the sense that identical operations are carried out only once while the ef fort for bookkeeping is minimal It is therefore often used when workflows are manually executed by scientists that call individ ual steps in sequence and book keeping is carried out ad hoc without software tools The advantage of the latter approach is that it allows for eas ier inspection of the impact of any change in parameters or pro cedures on the quality of the final target product of a workflow This is particularly important when data reduction strategies are still experimental and being tested This approach also delivers the results faster in that it only executes the steps that are needed for a given data set and thereby more quickly produces the target product for the first data set The advantages of both of these approaches can be combined with the foll
7. keywords and may include logical and arithmetic expressions. The labels in the association rule are used for logging purposes and are usually set to the category of the file defined in the rule; the labels in the rules are arbitrary, and a change of those names should not impact the workflow execution.
3.3. Reflex actors
In order to implement these principles, Reflex provides 17 essential actors. A complete list of actors is given in Appendix A, and they are described in detail by Forchi (2012). The actors can be grouped into the data organiser, actors to process and direct tokens, actors to execute data reduction recipes written in one of several supported languages, and actors that provide interactive steps in a workflow. In this section we discuss these features and options to illustrate how the principles laid out above can be implemented in concrete software modules.
3.3.1. Data organiser and rule syntax
For Reflex we opted to implement a program (DataOrganiser) that carries out the organisation of local data fully automatically, using a set of user-supplied, human-readable rules. The input of the DataOrganiser is a set of FITS files and the classification rules; the output is a collection of data sets. The rules are based on the principles discussed above. It should be re-iterated that, while definitions of actions could be used to define a data structure in sufficient detail to allow automatic derivation of a simple data processing workflow, this
9. out by the data processing actor. What remains is the task to select, among all of the files of a given category, those that should be processed together by the recipe. In Reflex this is implemented as an actor (SOFCombiner) that bundles the different input files for a recipe into a single SOF. The SOFCombiner has two input ports: one is for mandatory files and another one for optional files. Both of them are multiple ports, i.e., several relations can be connected to either input port. The tokens sent via different relations to a multiple port are in different channels of the port. The SOFCombiner creates a single output SOF that includes input files selected by purpose. The selection rule is that only files with a purpose that is present at each of the input channels at the mandatory input port are passed. The desired selection of all files with the appropriate purpose is achieved when at least one of the input channels includes only files that are the necessary input for the recipe, typically the trigger for the recipe. All other channels can include any file, and the SOFCombiner automatically selects the correct input for the recipe. The algorithm used by the SOFCombiner uses comparison of purposes as the only method and consists of the following two steps:
1. Find purposes that are present at each input channel of the mandatory input. A universal purpose counts as a match to any other purpose.
2. Send all files, both fro
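The two-step selection can be made concrete with a minimal Python sketch. The function and variable names below are purely illustrative and not part of the Reflex API; an SOF channel is modelled as a list of (filename, purpose) records, and "universal" stands for the wildcard purpose described in the text.

    # Minimal sketch of the SOFCombiner purpose-matching algorithm.
    UNIVERSAL = "universal"

    def matches(p, q):
        # A universal purpose counts as a match to any other purpose.
        return p == q or UNIVERSAL in (p, q)

    def combine(mandatory_channels, optional_channels):
        # Step 1: purposes present at every mandatory input channel.
        selected = [p for (_, p) in mandatory_channels[0]
                    if all(any(matches(p, q) for (_, q) in ch)
                           for ch in mandatory_channels)]
        # Step 2: pass on every file (mandatory or optional) whose
        # purpose matches one of the selected purposes.
        out = []
        for ch in mandatory_channels + optional_channels:
            out += [(f, p) for (f, p) in ch
                    if any(matches(p, s) for s in selected)]
        return out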
10. Fig. 2. Data graph for a data set to process images, as described in the text. The symbols used are the same as in Fig. 1. In the case shown here, each file has a unique purpose and therefore no dotted lines are used. The action proc_dark is used to select different darks for the flat frame and the science image; therefore it appears twice in the graph.
gets of the workflow. In the current example, the flatfield itself needs a dark frame for its processing, and this dark frame needs to match the exposure time of the flatfield, not that of the science frame. The science frame, flatfield frame, and dark frame in turn might all require their own bias frame for reduction. The actions in this case are given specific labels, namely proc_dark, proc_flat, and proc_image. Note that the action proc_dark is shown twice, reflecting the fact that it is used twice: once to select darks for the flatfield, and a second time to select darks for the image. We note that the topology of the graph might differ between data sets, even for the same kind of data. For example, in one data set the input dark frames for the science and flat frames might be identical; in another one they might differ. The task of data organisation is to create such a graph for each individual data set
11. reduction process is therefore still essential to obtain sufficiently high-quality results, even from fairly routine observations.
The general concept of astronomical data reduction that does not employ a fully integrated pipeline has not substantially changed in the past decades. Researchers organise their data and use a mixture of general-purpose and highly specialized tools, inspecting the results of each step. Such tools are available in environments such as MIDAS (Banse et al. 1983), IRAF (Tody 1993), and IDL, or as stand-alone programs. What has changed is the number, complexity, and interdependence of the steps needed to accomplish the data reduction. In this situation, the efficiency of the data reduction process can be vastly improved by automating the previously manual workflow of organising data, running individual steps, and transferring results to subsequent steps, while still using the same routines to carry out the individual reduction steps.
The most commonly used approach to automate a data reduction workflow by individual researchers is to employ a scripting language such as Python (e.g., 2013). This approach works well with a relatively small number of reduction steps and in situations where the data organisation and bookkeeping are fairly simple. In more complex situations, such scripts are themselves complex programs that cannot easily be modified. In this paper we describe the usage of a general workflow engine to automat
12. set the file router 4 that directs dif ferent categories of files to their destinations a SOFCombiner that bundles the input for the science step and a data filter and product renamer 8 and D respectively that organise the output products from the workflow exists then the reduction step is not executed and instead the pre vious results are re used We call the feature to re use products created by previous executions of a procedure the lazy mode There might be cases when such a re use of products is not de sired A lazy mode should therefore always be an option of each individual step in a workflow An example for the efficiency gain from using this mode is a set of combined biases that are used by the science and flatfield frames of a data set and in addition by the calibration frame of another data set The combination of biases is carried out only once and is used in three different places One advantage of our approach is that it is as efficient as the first of the above approaches but produces the science results quickly and provides the user experience of the second approach Another advantage is that subsequent runs of the workflow can use this database of intermediate products to redo the reduction with changed parameters in a very efficient manner If a param eter or input file for any step changes then the result for this step will change The change in one of the intermediate products might require the re executio
13. such as calibration files that are rou tinely collected for a given instrument Hereafter we refer to the input files for the data processing as raw files as opposed to files that are created during the data processing and that we will refer to as products We use the term calibration file for any raw file that is not a science file In order to discuss data reduction in general terms we in troduce the following terminology The goal of data reduction is to process sets of files which we refer to as the targets of a data reduction workflow The result of this processing is to cre ate a target product In most cases the targets of a data reduction workflow will be the science files and the target product is then the science data product to be used for scientific analysis The target files can be naturally grouped into sets that are reduced together Such a group of target files together with other files needed to process them is referred to as a data set A data set is complete when it contains all necessary files to reduce the targets and incomplete if some of those files are missing Data organisation is the process of selecting data sets from a larger collection of files and recording information on the type of files and the reasons for selecting them This larger collection of files might be the result of a pre selection process that as sures that low quality or defective data are not considered at
14. arXiv:1311.5411v1 [astro-ph.IM] 21 Nov 2013
Astronomy & Astrophysics manuscript no. ms22494, November 22, 2013
Automated data reduction workflows for astronomy: The ESO Reflex environment
W. Freudling, M. Romaniello, D. M. Bramich, P. Ballester, V. Forchi, C. E. García-Dabó, S. Moehler, and M. J. Neeser
European Southern Observatory, Karl-Schwarzschild-Str. 2, 85748 Garching, Germany; e-mail: wfreudli@eso.org
To appear in Astronomy & Astrophysics, Volume 559 (November 2013), A96
ABSTRACT
Context. Data from complex modern astronomical instruments often consist of a large number of different science and calibration files, and their reduction requires a variety of software tools. The execution chain of the tools represents a complex workflow that needs to be tuned and supervised, often by individual researchers that are not necessarily experts for any specific instrument.
Aims. The efficiency of data reduction can be improved by using automatic workflows to organise data and execute a sequence of data reduction steps. To realize such efficiency gains, we designed a system that allows intuitive representation, execution, and modification of the data reduction workflow, and has facilities for inspection and interaction with the data.
Methods. The European Southern Observatory (ESO) has developed Reflex, an environment to automate data reduction workflows. Reflex is implemented as a package of customized components for the Kepler workflo
16. a host of associated information Apart from the intended signal such raw data include signatures of the atmosphere and the instrument as well as noise from var ious sources Before any scientific analysis of the data a pro cess called data reduction is used to remove the instrumental signature and contaminant sources and for ground based obser vations remove atmospheric effects Only then can the signal of the target source be extracted In general data reduction also includes a noise model and error propagation calculations to es timate uncertainties in the extracted signal In recent years astronomical data reduction and analysis has become increasingly complex The data from modern instru ments can now comprise dozens of different data types that in clude both science and calibration data For example the reduc tion of data from ESO s X Shooter instrument uses almost 100 different data types for its three simultaneously working arms The reduction of such data in general includes a large number of complex high level algorithms and methods The data types and methods are interdependent in a complex web of relations It is therefore increasingly difficult for an individual researcher to understand execute and optimize a data reduction cascade for complex instruments This situation has led to the appear ance of specialized highly integrated data reduction pipelines that are written by specialists and can reduce data with
17. active actor to inspect products produced during a current or previous run of the workflow.
PythonActor: interface to configure and execute Python scripts.
RecipeExecuter: interface to configure and execute CPL recipes.
RecipeLooper: actor to implement looping over one or several recipes.
SofCreator: actor to create a Reflex set-of-files (SOF) token from a directory with files.
SopCreator: actor to create Reflex set-of-parameters (SOP) tokens.
SOFAccumulator: actor to create a single SOF out of several input SOFs that arrive in sequence.
SOFCombiner: actor to create a single SOF out of several SOFs that are available simultaneously.
SOFSplitter: actor to split an SOF by file category.
18. age Reflex to implement the design discussed in this paper using Kepler workflows. Reflex consists of a collection of actors that support the execution of astronomical applications. A shared characteristic of commonly used astronomical applications is that they read data and metadata from FITS files, are configurable with parameters, and produce output FITS files called products. Reflex supports any application of this kind that can be started from a command line; hereafter, we refer to such applications as recipes. The primary task of Reflex is to route the necessary input files to the recipes. This includes both files in a data set and files created during execution of the data processing workflow. In addition, Reflex is able to create and send lists of parameters to recipes. To achieve these tasks, Reflex uses two kinds of objects, called set of files (hereafter SOF) and set of parameters (hereafter SOP). These objects are used as tokens in a workflow. A SOF contains a list of files; the record for each file consists of the file name, the checksum, the category, and a list of purposes for that file. A SOP contains a list of parameters, and the record for each parameter consists of its name and value. Reflex actors use and process these objects. The construction of an input SOF to be fed to a recipe needs to consider the category and the purpose of a file. For every file in a data set, these file properties are determined d
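The token contents described above can be pictured with a minimal Python sketch; the class names are illustrative only and do not reflect the actual Reflex implementation.

    # Minimal sketch of the two Reflex token types; names are illustrative.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FileRecord:
        name: str            # file name
        checksum: str        # used, e.g., to detect changed inputs
        category: str        # e.g. "bias", "flat", "science_image"
        purposes: List[str]  # concatenated action names, or "universal"

    @dataclass
    class SOF:               # set of files, passed between actors as a token
        files: List[FileRecord] = field(default_factory=list)

    @dataclass
    class Parameter:
        name: str
        value: str

    @dataclass
    class SOP:               # set of parameters for a recipe
        parameters: List[Parameter] = field(default_factory=list)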
19. atements define the three actions proc_dark proc_flat and proc_image and their triggers flat dark and sci ence image What follows are the association rules that spec ify that the action proc_dark needs a bias as input and out puts a product called processed_dark the action proc_flat needs this processed_dark and a bias and outputs a pro cessed_flat Finally the action proc_image needs a bias the processed_dark and the processed_flat The association rules also specify that darks are selected to match the exposure time of the science image or the flat and flats are selected to match the filter of the science_image The application of these rules can lead to a data set organized as shown in Fig 2 However the same rules can also lead to a data graph with different topology for a different data set For example if both the science image and the flat have the same exposure time the application of the rules might select the same dark frame for both the flat and the science image The power of the data organiser is to use such abstract rules to select optimal data sets based on the metadata of the available files 3 3 2 Actors for data processing The purpose of a data processing workflow is to execute a series of recipes The recipes can be written in any language but must accept the basic input and provide
20. [Fig. 3 flow-chart boxes, continued:] Identify the raw files to trigger any previously identified actions, using criteria that are specified in the rules. Identify other raw files and products needed by those actions, using criteria that are specified in the rules. Add all identified raw files to the data sets. Is the action for each product identified?
Fig. 3. High-level flow chart of a data organiser. If any step in a shaded box fails for any given data set, then this data set is marked as incomplete.
2.3. Functionalities of a rule-based data organiser
A software program that uses rules to organise data, as advocated above, can produce the data graph discussed in Sec. 2.1 by the set of steps shown in the flow chart of Fig. 3. The output of the data organisation is a list of data sets. A data set is marked as complete if there are files that satisfy the criteria used in the steps shown in shaded boxes in Fig. 3; it is marked as incomplete if any one of those criteria is not satisfied by any existing file. Each file in the output data sets is described by the file name, the category of the file as defined in the rules, and the purpose of the file. The purpose of the file is recorded as the concatenation of the names of the actions that link the file to the target action. In the example discussed in Sec. 2.1, the flatfield is selected based on properties (in this case the filter) of the science frame. The se
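The organiser loop sketched in the flow chart can be rendered schematically in Python. This is an illustrative pseudo-implementation only: the helper functions (classify, find_triggers, find_associated) and the attributes of their results stand in for the evaluation of the classification, organisation, and association rules and are not part of the text.

    # Illustrative sketch of a rule-based data organiser.
    def organise(files, rules):
        datasets = []
        classified = [classify(f, rules) for f in files]              # assign REFLEX.CATG
        for target in find_triggers(classified, rules, target_only=True):
            dataset, complete = [target], True
            actions = [rules.target_action]
            while actions:
                action = actions.pop()
                needed = find_associated(action, classified, rules)   # raw files and products
                if needed is None:                                    # a required file is missing
                    complete = False
                    break
                dataset += needed.raw_files
                actions += needed.producing_actions                   # actions for missing products
            datasets.append((dataset, complete))
        return datasets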
21. egories based on any logical combination of conditions on FITS keywords. The syntax of the classification rules is given in row 1 of Tab. 1. The classification defines the keyword REFLEX.CATG as the category of the file, and this keyword can be used like any other FITS keyword in the header by other rules. A simple example for the usage of a classification rule is to assign to a file the category bias if the header keyword EXPTIME is set to the value 0. The classification rules also define whether a set of files is the target of the workflow or not.
2. Organisation Rules. Organisation rules define actions and the groups of files that trigger them. The rules define a name for each action so that it can be referred to by other rules. The syntax of the organisation rules is given in row 2 of Tab. 1. The rules include an optional specification of the minimum number of files needed to trigger the action; this minimum number is used to determine whether a data set is complete or not. There is no maximum number, because there are no defined criteria to select among files that match the condition. A simple example is to group at least 3 dome flat frames by filter and trigger an action called proc_flat that combines flatfields.
3. Association Rules. Association rules define associated files, i.e., input files and products that are needed by an action in addition to the trigger. The syntax of the association rules is give
22. e, K., et al. 2009, Astronomische Nachrichten, 330, 4
Wells, D. C., Greisen, E. W., & Harten, R. H. 1981, A&AS, 44, 363
White, R. L., & Greenfield, P. 2002, The PyRAF Tutorial, stsdas.stsci.edu/stsci_python_sphinxdocs_2.13/docs/pyraf_tutorial.pdf
Zampieri, S., Chuzel, O., Ferguson, N., et al. 2006, Astronomical Data Analysis Software and Systems XV, 351, 196
Zampieri, S., & Forchi, V. 2012, OCA User Manual, ESO internal document VLT-MAN-ESO-19000-4932
Zhou, G. 2004, Master Thesis, University of California at Berkeley

Appendix A: List of Reflex actors
A complete description of Reflex actors is given by Forchi (2012). Here we list the standard Reflex actors in alphabetical order.
DataFilter: interactive actor to inspect and select FITS files.
DataOrganiser: implementation of the rule-based data organiser as described in the text.
DataSetChooser: interactive actor to inspect files in a data set, edit the selection, and select data sets to be reduced.
FitsRouter: actor to route files by category.
IDLActor: interface to configure and execute IDL scripts.
IsSofEmpty: actor that checks whether an SOF contains files; this actor is used to implement different data flows depending on the availability of some data.
ObjectToText: actor to present Reflex tokens in human-readable form.
ProductRenamer: actor for renaming FITS files based on keywords of the file.
ProvenanceExplorer: inter
23. e the data reduction workflow for astronomical observations. While this approach is relatively new for the field of astronomy (e.g., Ballester et al. 2011; Schaaff et al. 2012), it has been widely used in other fields of science, including biology and chemistry (Abhishek & Sekhar 2007), meteorology (Barseghian et al. 2009), and economics (Ludascher et al. 2005). For that reason, we discuss in detail the methods and functionalities that are necessary to use such a system for astronomical data reduction, and present ESO's new Recipe flexible execution workbench (Reflex) environment as a specific implementation of such a system.
The structure of the paper is as follows. In Sec. 2 we describe the main principles and architecture of our design, independent of a particular implementation. In Sec. 3 we discuss how these principles can be implemented in a specific workflow application, using our Reflex implementation as an example. Finally, in Sec. 4 we conclude with a discussion of the impact of performing data reduction in this way.
2. Architecture of astronomy data reduction workflows
2.1. Data organisation
Astronomical data consist of collections of files that include both the recorded signal from extraterrestrial sources and metadata such as instrumental, ambient, and atmospheric data. Such a collection of files is the raw output from one or several observing runs and consists of science files that contain the primary
24. ement conditional and or iterative branches in a workflow and stand alone actors to manipulate the category or purpose of a file are either provided by Reflex or can easily be implemented e g as a Python script As em phasized earlier in any manipulation of the purpose the purpose should never explicitly be called by name within the workflow to avoid unnecessary dependencies of the workflow on syntax choices in the rules For example the case discussed above that a flatfield file is selected to be taken close in time to the science spectrum but is used to flatfield the flux calibration file can eas ily be implemented with such a customized purpose processing script 3 3 4 Interactive actors One reason why automated workflows are an efficient way of data reduction is that the user can intercept the processing at any stage and interact with the workflow A major contribution to the interactive user experience comes from the workflow applica tion that provides tools to monitor pause and modify the work flow itself Additional tools are needed to provide application specific ways to inspect and influence the execution of the work flow Reflex provides several interactive actors and a Python li brary to implement actors that can create customized plots and allow recipe parameters to be modified during the execution of a workflow An example of an interface created with this library is shown in Fig 7 Ready to use interactive actors
25. emented some of the concepts discussed in this paper in several open source work flow engines In the end we decided to use the Kepler workflow application to implement Reflex because of its large suite of available components and its robust sup port for conditional branching looping and progress monitor ing In this section we introduce the terminology and summa rize the most important features of Kepler For more details see the Kepler User Manual 3 1 The Kepler workflow engine A workflow application is a software system designed to run se quences of stand alone programs where the programs depend on the results of each other Components of a workflow are rep resentations of these programs as well as elements that manage the results and communication between them In Kepler compo nents of the workflow are called actors In the graphical inter face actors are represented by green boxes see Fig 5 Actors have named input and output ports to communicate with each other The communication is implemented by exchanging ob jects called tokens that travel along connections between the output port of one actor to the input port of another actor These connections are called relations and are represented by lines Output ports emit tokens after the execution of an actor is fin ished and input ports consume tokens when execution of the actor starts The availability of tokens is a crucial factor in deter m
26. essary re runs of steps Recognizing those steps and re executing them is fully automated in Reflex The execution time of the data organiser strongly depends on the complexity of the rules the total number of files and the files in each category For data from a typical observing run the first time data organization might take on the order of a minute on a typical desktop workstation In subsequent runs the lazy mode will reduce this time by a very large factor The execution of the data processing workflow itself adds a fraction of a second to the stand alone execution of each recipe The default mem ory allocation for Reflex is 1536 MB in addition to the memory requirement of the recipes This allocation can be reduced for simple workflows if necessary So far Reflex workflows have been developed for the most commonly used instruments on ESO s Very Large Telescope VLT namely FORS2 SINFONI UVES VIMOS and X Shooter as well as the newly commissioned KMOS They are 11 Freudling et al Data reduction workflows distributed to users to provide them with a pre packaged work flow that works out of the box to reduce VLT data All ESO Reflex workflows are intuitive to understand as each includes a detailed tutorial and a comprehensive demonstration data set As such even novice users can easily modify and experiment with the workflows Reflex workflows are bundled with the corre sponding instrument pipelines and the Reflex envir
27. his is helped by clearly separating the data organi sation from the data reduction steps We therefore advocate a design that not only separates the implementation of the two steps but also uses a different methodology to define the two tasks Each of them should be geared towards the specific needs of each step The data organi sation is usually closely related to the instrument properties the observing strategy and the calibration plan The strategy for data organisation therefore rarely changes after the observations have taken place Interactivity in that part will create overheads that do not outweigh the expected benefits On the other hand the data processing is in general highly interactive and experimen tal and the final strategy is rarely known at the time of obser vation The best values for data reduction parameters and even the chosen strategy might depend on the properties of individual data sets An efficient way to implement a data organisation is there fore a rule based system that can accommodate complex instrument specific rules and can be run to organise either lo cally stored data or data extracted from an archive repository using pre defined rules The syntax of the rules must be able to describe the method of creating data graphs such as the ones discussed in Sec 2 1 Such data organisation is particularly efh cient if it is carried out by raw data archives that have any poten tially useful calibration file a
28. ining the sequence of triggering the actors. The relations between actors themselves do not define the temporal sequence of the execution of actors; a scheduler is required to trigger the execution of each actor. A scheduler in a Kepler workflow is called a director. The terminology of Kepler follows the metaphor of film making, where a director instructs actors that carry out their parts. Reflex uses the Dynamic Data Flow (DDF) director, which allows the workflow execution to depend on the results of actors and supports looping and iterating. The basic algorithm used by the DDF director is to repeatedly scan all actors and identify those that can be executed because all of the necessary input is available; it then selects one of the actors for execution based on minimizing unused tokens and memory. The details are extensively discussed in Zhou (2004). It should be noted that an actor of a workflow can itself be a sub-workflow; such composite actors might include their own directors.
[Footnote 3: https://code.kepler-project.org/code/kepler-docs/trunk/outreach/documentation/shipping/2.4/UserManual]
The Kepler workflow application provides a graphical interface to create, edit, and execute workflows. A large number of general-purpose actors are bundled with the environment. There are several ways to monitor the progress of a workflow, pause it, or stop it.
3.2. The Reflex environment
We have produced the software pack
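The scheduling loop described above can be sketched as follows; this is an illustrative sketch of a data-flow scheduler in Python, not Kepler's actual implementation, and the methods has_all_inputs(), fire(), and the cost() criterion are assumptions for the example.

    # Illustrative sketch of the repeated scan-and-fire scheduling loop;
    # cost() stands for the unused-token/memory criterion mentioned in the text.
    def run_workflow(actors, cost):
        while True:
            ready = [a for a in actors if a.has_all_inputs()]  # fireable actors
            if not ready:
                break                                          # nothing left to execute
            actor = min(ready, key=cost)                       # minimise unused tokens/memory
            actor.fire()                                       # consume inputs, emit output tokens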
29. is not the approach that we adopt here. Instead, the rules are used only to organise the data, while the workflow to reduce the data is not constrained to using the selected data in any particular manner.
The DataOrganiser is the first actor after the initialization in any Reflex workflow. It organises the input FITS files according to the workflow-specific input rules, and the output are data sets that are either marked complete or incomplete (see Sec. 2.1). The execution of a Reflex workflow is triggered by sending an input token to the DataOrganiser actor. The DataOrganiser recognizes rules that use the syntax of a special language called OCA. The OCA language has been developed at ESO and is designed to describe the Organisation, Classification, and Association of FITS files based on their FITS header keywords (Wells et al. 1981). OCA is used for multiple purposes within ESO's data flow system (Peron 2012), and interpreters are embedded in a number of applications; therefore, OCA rules to organise data are available for most instruments on ESO's telescopes. The details of the language are described in Zampieri & Forchi (2012). The language has all of the features needed to define rules for data organisation. Here we summarize a subset of the OCA language that is useful for the data organisation discussed in this paper. The OCA language recognizes three types of rules. They are:
1. Classification Rules. Classification rules define file cat
30. l any recipe that has no impact on the data selection should pass through the purpose of the input files. Finally, the set-to-universal operation can be used for files with a unique category that can be processed independently of their usage in the workflow. For example, a bad pixel map that is used by many different recipes in a workflow can be given the purpose universal to simplify the workflow design.
These three operations allow an efficient and elegant assembly of input files for recipes. Different operations might be necessary under special circumstances, and a flexible system will allow these to be implemented. An important design principle for any operation on the purpose of a file is that it should never explicitly use the name of the purpose. The names assigned by
Table 1. OCA rules syntax.
rule type / syntax:
Classification: if condition then REFLEX.CATG = category; REFLEX.TARGET = T|F
Organisation: select execute actionname from inputFiles where conditions group by keyword-list [minRet = i]
Association: action actionname [minRet = i, maxRet = j]; select files as label from inputFiles where conditions [closest_by keyword]; product label REFLEX.CATG = category
Notes. The table lists a simplified version of the OCA rules syntax appropriate for data organisation in Reflex workflows. The conditions define categories of FITS files by their header
31. l necessary input files for each recipe by selecting all files with identical purposes to be processed. The usage of these operations can best be explained with examples. The most commonly used operation is trim. In Sec. 2.3 we already used the example of bias frames with the purposes proc_science|proc_flat and proc_science, and flatfield and science frames with the purpose proc_science. When the flatfield is to be processed, the workflow selects the file with the category flat and all files with identical purpose; in our example, these are the bias and flatfield files with purpose proc_science|proc_flat. The output product, i.e., the processed flatfield, should be assigned the trimmed input purpose; in our case, the purpose proc_science|proc_flat is reduced to proc_science. In a subsequent step, a science recipe collects all the files with purpose proc_science for the input. This will include the processed flatfield file, the bias frame selected to match the properties of the science frame, and the science frame itself.
The operation pass-through is used for recipes that only use intermediate products as inputs. If such a recipe is needed in the chain, e.g., to smooth the flatfields in the above example, this recipe should pass through the purpose of its input file to the product file, so that the purpose of the smoothed flatfields is still proc_science. In genera
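The three purpose operations can be summarized in a minimal Python sketch. Here a purpose is modelled as a list of action names; the separator shown in the example and the function names are illustrative choices, not the actual Reflex notation.

    # Illustrative sketch of the three standard purpose operations.
    # A purpose is modelled as a list of action names, e.g.
    # ["proc_science", "proc_flat"]; ["universal"] is the wildcard purpose.
    UNIVERSAL = ["universal"]

    def trim(purpose):
        # Drop the last action; a single-action purpose becomes universal.
        return purpose[:-1] if len(purpose) > 1 else UNIVERSAL

    def pass_through(purpose):
        # Products inherit the purpose of their input unchanged.
        return purpose

    def set_to_universal(_purpose):
        # For files usable by any recipe (e.g. a bad pixel map).
        return UNIVERSAL

    # Example: a flatfield product made from inputs with purpose
    # ["proc_science", "proc_flat"] gets the trimmed purpose ["proc_science"].
    assert trim(["proc_science", "proc_flat"]) == ["proc_science"]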
32. low engine. The key advantages of automated workflows over alternative methods, such as scripting or monolithic data processing programs, are the built-in tools for progress monitoring and the ability to modify the data organisation and data flow efficiently. The specific advantages of our Reflex implementation include:
1. Selecting and organising the input data is a significant task for any astronomical data reduction. A rule-based data organiser is used to formalize the selection criteria and to fully automate the organisation of data. The automated data organisation can be followed by an interactive step to inspect and modify the chosen data sets.
2. Reflex allows users to monitor the progress of data reduction, interact with the process when necessary, and modify the workflow. A graphical user interface can be used to develop and experiment with workflows. At the same time, workflows can be executed in a completely non-interactive batch mode to allow processing of large data sets and/or computationally time-intensive processing.
3. Re-reduction after a change in input files or parameters is efficiently carried out by only re-running those steps that are affected by this change. A modern reduction process might use hundreds of files with dozens of different categories and any number of data reduction steps. Changing a single parameter in one of the steps or switching a single input file might trigger a complex cascade of nec
33. m the mandatory and optional port that match any purpose found in step 1 to the output SOF. Again, a universal purpose counts as a match to any other purpose.
This simple but powerful algorithm assures that the files of the same purpose in the output SOF are necessary and sufficient to run the intended data processing recipe. This fact is then used by a combination of two actors called SOFSplitter and SOFAccumulator. The former splits an input SOF by purpose and emits a separate SOF for each purpose. The latter collects several SOFs that arrive at the same port and combines them into a single SOF. These actors are used in combination with, and always bracket, a data processing actor (see Fig. 6). The net effect of this combination is that the recipes called by the data processing actor are executed multiple times, and the result is a single SOF that includes all of the products. This SOF can then be used by the next SOFCombiner to select the files for the next data processing step.
The algorithms discussed above are an elegant and efficient way to implement the most common routing needs without repetition of information. In addition, explicit operations on file properties can be used to implement special needs. A common application is conditional routing of files, i.e., workflows in which files are routed differently depending on some data properties or user choices. Kepler provides a large number of general-purpose actors to impl
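The splitter/accumulator bracket around a recipe can be sketched as follows; this is illustrative Python only, with an SOF modelled as a list of (filename, purpose) records and run_recipe() standing in for the actual data processing step.

    # Illustrative sketch of the SOFSplitter / recipe / SOFAccumulator bracket.
    def split_by_purpose(sof):
        groups = {}
        for name, purpose in sof:
            groups.setdefault(purpose, []).append((name, purpose))
        return list(groups.values())           # one SOF per purpose

    def bracketed_execution(sof, run_recipe):
        products = []
        for sub_sof in split_by_purpose(sof):   # SOFSplitter
            products += run_recipe(sub_sof)     # recipe runs once per purpose
        return products                         # SOFAccumulator: single output SOF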
34. n in row 3 of Tab. 1. There is an unlimited number of select statements that define conditions to select files. In addition to the conditions, a closest_by statement can be used to select those files that have a value for a given keyword that is as close as possible to that of the trigger. If there is no closest_by statement, then the time of observation will be used to select among several files that satisfy the conditions. Each select statement can be preceded by an optional specification of the minimum and maximum number of files needed for each category. This mechanism allows one to define optional input files that are not essential for a workflow, but that will be used if present. The association rules also define names for categories of products that can
Table 2. Simple example of OCA rules.
Classification rules:
if TYPE == OBJECT then REFLEX.CATG = science_image; REFLEX.TARGET = T
if TYPE == FLAT then REFLEX.CATG = flat
if TYPE == DARK then REFLEX.CATG = dark
if TYPE == CALIB and EXPTIME == 0 then REFLEX.CATG = bias
Organisation rules:
select execute proc_dark from inputFiles where REFLEX.CATG == dark
select execute proc_flat from inputFiles where REFLEX.CATG == flat
select execute proc_image from inputFiles where REFLEX.CATG == science_image
Association rules:
action proc_dark
  select files as bias from inputFiles where REFLEX.CATG == bias
35. n of some but not all of the subse quent steps The database can be used to automatically identify products that can be re used from previous runs and the steps that need to be repeated The implementation of this workflow design requires three levels of grouping of data A schematic diagram of such a work flow is shown in Fig 4 The highest level of grouping are the data sets as discussed in Sec This task is carried out by a data organiser step 2 in Fig 4 Subsequently the files in each data set are sorted by category and directed to the reduc tion steps that need this particular category of files a step that Freudling et al Data reduction workflows is performed by a file router step 4 This is the level that de scribes the data reduction strategy and is shown in the design of a workflow Each reduction step might be called repeatedly with different input files For that purpose a third level of grouping is needed to group files that are processed together with separate calls of the reduction step This is part of the functionality of the reductions steps 5 and M 3 Implementation While the principles discussed in this paper do not depend on a specific software implementation it is useful to discuss them in the context of and with the terminology used in a specific environment Several software environments to design and exe cute workflows exist e g 2008 For the Reflex project we evaluated and partially impl
36. nformation is used For example a category of files such as a flatfield might be selected to match the date of the science frames and is there fore assigned a corresponding purpose This does not neces sarily mean that these flatfields are exclusively used to flatfield the science frames but the data processing workflow might also use them to flatfield standard star flux calibration data For spec troscopy it is not always clear whether the best flatfields for the flux calibrator are those that are taken close in time to the target spectrum or those taken close in time to the flux calibrator This decision depends on a complex set of circumstances A work flow might include conditional and or interactive parts to help the user make that decision Another difference between data organisation and data pro cessing is that while some steps in the data processing are closely related to a specific selection of data others are com pletely independent of it For example a step that only modifies intermediate products has no impact on the data selection or or ganisation Steps that make small adjustments to intermediate products are often added or removed during data reduction Any system that mixes the data selection and data processing work flows is then necessarily much more complex than either of the two components individually One design goal for a workflow system is to make modification of the data reduction as simple as possible T
37. onment. The whole package can be installed with a single installation script available at http://www.eso.org/reflex. ESO expects to develop Reflex workflows for all future VLT instruments.
Acknowledgements. The Kepler software is developed and maintained by the cross-project Kepler collaboration, which is led by a team consisting of several of the key institutions that originated the project: UC Davis, UC Santa Barbara, and UC San Diego. We acknowledge useful discussions with Reinhard Hanuschik on some concepts discussed in this paper.
References
Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., & Mock, S. 2004, in Scientific and Statistical Database Management, 16th International Conference, 423-424
Abhishek, T., & Sekhar, A. 2007, Computational Biology and Chemistry, 31, 305
Ballester, P., Bramich, D., Forchi, V., et al. 2011, Astronomical Data Analysis Software and Systems XX, 442, 261
Banse, K., Crane, P., Grosbol, P., et al. 1983, The Messenger, 31, 26
Barseghian, D., et al. 2009, Ecol. Inform., 5, 3
Biretta, J. A., Baggett, S. M., MacKenty, J. W., Ritchie, C. E., & Sparks, W. B. 1994, Calibrating Hubble Space Telescope, 8
Cavanagh, B., Jenness, T., Economou, F., & Currie, M. J. 2008, Astronomische Nachrichten, 329, 295
Curcin, V., & Ghanem, M. 2008, in Biomedical Engineering Conference (CIBEC 2008), Cairo International, 1
Forchi, V. 2012, Reflex User Manual, ESO internal document VLT-MAN-ESO-19000-5037
38. out supervision (e.g., Biretta et al. 1994; 2004; Schmithuesen et al. 2007; Cavanagh et al. 2008; Tsapras et al. 2009). For efficient large-scale data reduction, such pipelines often run in custom-made environments. For example, ESO employs a system for quality control that automatically associates calibration data and processes them as soon as they arrive from the telescopes; the results are then stored in its data archive. Other examples of such event-driven data reduction environments are NOAO's High Performance Pipeline (Scott et al. 2007), STScI's OPUS system (Rose et al. 1995), and the Astro-WISE pipeline (McFarland et al. 2013).
Automatic pipelines work best for data from long-term projects that use stable instrumentation, aim for a well-defined set of similar targets observed at similar signal-to-noise ratio, and in situations where the impact of ambient conditions is relatively small and highly predictable. These conditions are often met, for example, in space-based telescopes. However, the situation is often different for the reduction of data from ground-based observatories. The reasons for this include the complexity of the general-purpose instruments that are now routinely employed, the rapid upgrade pace necessary to exploit advances in technology and science goals, and the variety of effects imposed by varying atmospheric conditions. In many cases, supervision of and interaction with the data
39. owing design. As in the second approach, data are processed one data set at a time. Data reduction steps that need to be executed several times with different input files from the same data set are carried out in succession. For example, the step to combine bias frames is executed for the biases to be applied to the science frame, and immediately afterwards the bias frames to debias the flatfields are processed, and so on. The inputs and outputs of each individual data reduction step are stored in a database for re-use later. Whenever a reduction step is called, this database is checked for previous calls to the reduction step with the same input files and parameters. If such a previous call
Fig. 4. Example of a basic Reflex workflow. The figure uses the graphical elements of a Kepler workflow (Sec. 3). The lines indicate the flow of files and are labelled by their contents. The optional science files are files that are used to process the science data, but the processing can proceed even if they are not available (see Sec. 3.3.3). The workflow includes two data processing steps, one for calibration and one for science processing (labelled D and M, respectively). The elements of the workflow are an initialization (I) that sends the input directories to the data organiser, the data organiser (2), a data set chooser (B) that allows interactive selection of a data
40. pes receive files of the same but arbitrary purpose. Actors compare and manipulate, but never decode, the purpose. There are three standard operations on the purpose; they are called pass-through, set-to-universal, and trim. The operation pass-through simply reads the purpose of a file and passes it on without any modification. The operation set-to-universal replaces an existing purpose with a new one with the protected name universal. The universal purpose is a wildcard that may be substituted by any other defined purpose, depending on the circumstances.
[Screenshot text from Fig. 5: the Kepler interface showing the "X-shooter Workflow for Physical Mode Slit Data Reduction v. 2.2.2".]
41. properties of the target product have no impact on the data selection.
Each raw file is the origin of at least one path along the direction of the links that lead to the target action. This reflects the fact that data sets only include raw files that are needed to process the targets of the workflow. A path runs either directly from the files to the target action, or passes through other actions on its way. We refer to such a path as one of the purposes of a file. The purposes of a file are important information for the data processing (see Sec. 2.3).
In Fig. 2 we show the data graph for a specific example, with the same symbols used in Fig. 1. The simple example is an image that needs a bias frame, a flatfield, and a dark frame for its processing. The flatfield needs to be taken with the same optical filter as the science frame, whereas the dark frame needs to be taken with the same exposure time as the science frame. Therefore, flatfields and dark frames with these properties must be identified among the available files, and one of each must be selected according to criteria such as the closeness of the time of observation to that of the science frame. After this step, more calibration files need to be added to the data set that are used to reduce the calibration files. The selection criteria for those files depend on properties of the calibration files instead of the tar
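The notion of a purpose as a path of actions can be illustrated with a small Python sketch. This is a toy representation of the data-organisation graph using the action names of the image example; the dictionary layout and function name are illustrative only.

    # Toy representation of a data-organisation graph: each action records
    # the action it feeds (None for the target action). The purpose of a
    # file selected by an action is the chain of action names from the
    # target action down to that action.
    feeds = {"proc_image": None, "proc_flat": "proc_image", "proc_dark": "proc_flat"}

    def purpose(action, graph=feeds):
        chain = []
        while action is not None:
            chain.append(action)
            action = graph[action]
        return list(reversed(chain))

    # A bias selected for the dark that calibrates the flat would get
    # the purpose ["proc_image", "proc_flat", "proc_dark"].
    print(purpose("proc_dark"))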
42. s are provided that can be added to any program.
3.3.3. Actors for file routing
The top-level task when designing a data processing workflow is to decide on the cascade of file processing, i.e., the routing of files by category. In Reflex, users of a workflow are presented with a visual diagram that shows the directional flow of files with different categories to the corresponding data processing actors (see Fig. 4). The data sets created by the DataOrganiser are SOFs that contain a full set of file categories. An actor is needed to direct the different categories of files in a data set to the respective data processing actors; in Reflex this actor is called the FitsRouter. It takes a single data set (SOF) as input and creates SOFs that contain input files selected by category from the data set. Different output SOFs are emitted from separate ports that are connected to data processing actors. For each output port, one or several file categories sent to this port are explicitly specified by name. The primary use of the FitsRouter is to select the categories of raw files in a data set that are needed for each data processing actor, whereas products needed as input arrive directly from the data processing actor via dedicated relations. The routing by category assures that a recipe receives all necessary file categories at the time that it is executed. If there are files with categories that are not needed by the recipes, they can be filtered
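A minimal sketch of such category-based routing is shown below; the Python is illustrative only, and the port-to-category mapping is an assumption made for the example.

    # Illustrative sketch of routing a data-set SOF by file category.
    # Each output "port" is configured with the categories it should receive.
    ports = {"to_proc_dark": ["dark"],
             "to_proc_flat": ["flat"],
             "to_proc_image": ["science_image"]}

    def route(dataset_sof, ports):
        # dataset_sof: list of (filename, category) records.
        out = {port: [] for port in ports}
        for name, category in dataset_sof:
            for port, wanted in ports.items():
                if category in wanted:
                    out[port].append((name, category))
        return out   # one SOF per output port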
science observations to be analysed.

IDL is a trademark of Research Systems Inc., registered in the United States.

[Fig. 1 appears here.]

Fig. 1. Example of a simple data set and its organisation. The data set contains all files necessary to produce the science data product of the workflow. This includes the science-associated calibration files. These files are organised using a set of actions that are shown as shield-shaped symbols. The target files are directly connected to the target action that is the root of the graph. Files that are connected to an action with a solid line are the trigger for that action. Properties of the triggers are used to select associated files for an action. The associated files are connected to an action with dashed or dotted lines. To highlight files that are connected to more than one action, a dashed line is used for one of these connections and a dotted line for the other one. The purpose of a file is the connection between the file and the target action. Symbols with tinted background indicate files that have multiple purposes, i.e. there are multiple paths from the file to the target action.

In addition, it might include files that are not directly related to the current observations
used for the processing are stored in the database. For each file, the file name and the checksum are recorded. The two main uses of the database are the implementation of the lazy mode described in Sec. , i.e. the keeping track of products for later re-usage, and the organisation of the output files in a user-friendly way. For the lazy mode, checksums and creation dates can be used to detect changes in input files of the same name.

The main output of a workflow are the files produced by the final science processing step, i.e. the science data products of a workflow. Intermediate products produced by previous steps are often needed to evaluate the science data products, troubleshoot, or investigate the optimization of the products. For that purpose, each science data product should be associated with the input data and parameters used to generate it. The input files might themselves include products from previous steps that are associated to the input of that step. At the conclusion of a workflow, all files used and produced during its execution can be organised in a directory tree that can be browsed either with a specialized tool or with a file browser.

4. Summary and conclusions

In this paper, we describe how a workflow application can be used to automate an astronomical data processing workflow. We propose a specific design for such a workflow and present the application Reflex that implements this design within the Kepler workflow
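A minimal sketch of how such lazy-mode book-keeping could work, assuming a small SQLite table that records, for each product, the name, checksum, and creation date of its input files; the schema, the MD5 choice, and the helper names are assumptions for illustration, not the actual Reflex database.

```python
# Hedged sketch of the lazy-mode check described above: record the name,
# checksum, and creation date of each input file of a product, and re-use
# an existing product only if none of its inputs have changed.
import hashlib
import os
import sqlite3

db = sqlite3.connect("bookkeeping.db")
db.execute("""CREATE TABLE IF NOT EXISTS input_files
              (product TEXT, name TEXT, checksum TEXT, created REAL)""")

def checksum(path: str) -> str:
    """MD5 checksum of a file, used to detect changed content."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def can_reuse(product: str, inputs: list[str]) -> bool:
    """True if the recorded inputs of a product match the current files."""
    rows = db.execute("SELECT name, checksum FROM input_files WHERE product = ?",
                      (product,)).fetchall()
    recorded = dict(rows)
    return (set(recorded) == set(inputs)
            and all(checksum(name) == recorded[name] for name in inputs))

def record(product: str, inputs: list[str]) -> None:
    """Store the inputs used to create a product."""
    db.execute("DELETE FROM input_files WHERE product = ?", (product,))
    for name in inputs:
        db.execute("INSERT INTO input_files VALUES (?, ?, ?, ?)",
                   (product, name, checksum(name), os.path.getmtime(name)))
    db.commit()
```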
set, and we refer to them as actions. The targets of the workflow connect directly to the root node (action 1), which is therefore called the target action. Each action has several incoming files connected to it. Some of those are used to define selection properties of other input files to that action. For example, an action might specify to select flatfield images that use the same filter as the science image. We use the notation that the files that are used to define properties of other files (in our example the science files) are the trigger for that action, and their links are shown as solid lines in Fig. 1. The trigger of the target action are the targets of the workflow.

All actions other than the target action have one or several outgoing links that connect them to subsequent actions. These outgoing links pass on metadata that are extracted from the input files to the next actor. They are therefore called products of an action. These products do not necessarily correspond to actual physical products produced during data reduction, and the actual physical products created during data reduction do not necessarily appear in the data organisation graph. Instead, the products in the data organisation graph are used as a logical scheme to define the selection of data. For example, for the purpose of data organisation it is not necessary to define a target product, even when the data processing workflow creates one. This is because the nature and
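The following sketch shows one possible in-memory representation of such an organisation graph, loosely following the bias/dark/flat example; the Action class and its fields are hypothetical and much simpler than the OCA rules that actually describe actions.

```python
# Illustrative data structure for the organisation graph described above.
# The class and field names are hypothetical, not Reflex internals.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                                             # e.g. "proc_flat" or the target action
    trigger: list[str] = field(default_factory=list)      # categories that trigger the action
    associated: list[str] = field(default_factory=list)   # categories/products selected via the trigger
    product: str | None = None                            # logical product passed to subsequent actions

proc_bias = Action("proc_bias", trigger=["bias"], product="processed_bias")
proc_dark = Action("proc_dark", trigger=["dark"],
                   associated=["processed_bias"], product="processed_dark")
proc_flat = Action("proc_flat", trigger=["flat"],
                   associated=["processed_bias", "processed_dark"], product="processed_flat")
target = Action("proc_image", trigger=["science_image"],
                associated=["processed_bias", "processed_dark", "processed_flat"])
```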
that have been developed for Reflex include the DataSetChooser to interactively inspect and select data sets, and the DataFilter to inspect and filter

[Fig. 7 appears here: screenshot of the X-shooter interactive Object Reduction window (NIR arm, slit nodding), with panels for the linear extracted and merged spectrum, the linear extracted, merged, and flux-calibrated spectrum, and the 2D merged object spectrum (flux versus wavelength in nm, position along the slit in arcsec), next to a recipe-parameter panel with Continue Wkf, Re-run Recipe, and Help buttons and the name of the current data set.]

Fig. 7. Example of an interactive interface created with the Reflex Python library. The plots on the left-hand side can be interactively manipulated to inspect the data products in different ways. The panel on the right-hand side allows the user to modify recipe parameters and re-execute the recipe, or continue with the
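The interaction pattern behind Fig. 7 can be pictured as a loop: show the products, let the user change a recipe parameter, then either re-run or continue. The sketch below is only a schematic stand-in using matplotlib and a fake run_recipe(); the real windows are built with the Reflex Python library and are far richer.

```python
# Schematic sketch of the interaction pattern of Fig. 7: display the products,
# let the user adjust a recipe parameter, then either re-run the recipe or
# continue with the workflow. run_recipe() is a hypothetical stand-in.
import matplotlib.pyplot as plt
import numpy as np

def run_recipe(params: dict) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for a recipe execution returning a 1D spectrum."""
    wave = np.linspace(1200.0, 2400.0, 500)  # nm
    flux = np.exp(-0.5 * ((wave - 1800.0) / params["width"]) ** 2)
    return wave, flux

params = {"width": 200.0}
while True:
    wave, flux = run_recipe(params)
    plt.plot(wave, flux)
    plt.xlabel("Wavelength (nm)")
    plt.ylabel("Total flux (arbitrary units)")
    plt.title("Linear extracted and merged spectrum")
    plt.show()
    answer = input("Re-run recipe with a new width [r] or continue [c]? ")
    if answer.strip().lower() == "r":
        params["width"] = float(input("New width (nm): "))
    else:
        break
```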
the basic output information needed to run the workflow. In particular, the recipes must accept FITS files that are categorized, and generate products as FITS files and the information to categorize them. Reflex provides three actors to execute recipes. They are called PythonActor, IDLActor, and RecipeExecuter. The PythonActor is used to run Python scripts that in turn can call, for example, shell commands, IRAF tasks via the PyRAF interface (White and 2002), or MIDAS programs via the pyMIDAS interface (Hook et al. 2006). The IDLActor is used to run IDL programs, and the RecipeExecuter executes CPL recipes. The basic function of all three actors is to filter and send the files of the input SOF to the recipe, and to create and emit the output SOF with the products of the recipe. The purpose of the product files is constructed from those of the input files using one of the standard operations described in Sec. All CPL recipes can be queried to report their input and output in a well-defined format. This feature is used by Reflex to automatically generate parameter lists and ports. For Python and IDL, simple interface

[Fig. 6 appears here: a data processing actor with input port sof_in and output port sof_out, embedded between a SOFSplitter and a SOFAccumulator that split the input into groups and regroup the outputs.]

Fig. 6. A data processing actor embedded in a SOFSplitter and SOFAccumulator to manage repetitive executions for files with different purposes.
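The split/accumulate pattern of Fig. 6 can be sketched as follows, under the assumption that each SOF entry carries a (filename, category, purpose) triple and that the embedded processing step is executed once per purpose group; process_group() is a made-up stand-in for the actual recipe actor.

```python
# Illustrative sketch of the SOFSplitter/SOFAccumulator pattern of Fig. 6:
# split one SOF into groups by purpose, run the processing step once per
# group, then accumulate the per-group outputs into a single SOF again.
from itertools import groupby

SOF = list[tuple[str, str, str]]   # (filename, category, purpose)

def sof_split(sof: SOF) -> list[SOF]:
    """Group the files of a SOF by their purpose."""
    key = lambda entry: entry[2]
    return [list(group) for _, group in groupby(sorted(sof, key=key), key=key)]

def process_group(group: SOF) -> SOF:
    """Hypothetical processing step executed once per purpose group."""
    purpose = group[0][2]
    return [(f"product_{purpose}.fits", "PROCESSED", purpose)]

def sof_accumulate(groups: list[SOF]) -> SOF:
    """Merge the per-group output SOFs back into one SOF."""
    return [entry for group in groups for entry in group]

sof_in = [("flat_1.fits", "FLAT", "proc_flat"),
          ("flat_2.fits", "FLAT", "proc_flat"),
          ("dark_1.fits", "DARK", "proc_dark")]
sof_out = sof_accumulate([process_group(g) for g in sof_split(sof_in)])
print(sof_out)
```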
this stage. Organising data is a complex and time-intensive procedure that is typically among the first tasks of a data reduction workflow (e.g. Scodeggio et al. 2005). Hereafter, we will refer to the whole data reduction workflow, including data organisation, simply as a workflow, whereas we will use the term data processing workflow for the processing of data that follows the data organisation.

The first step in data organisation is to classify files, i.e. to determine the data content of each file from its metadata. The goal of classification is to assign a category to each file. An example of such a category is flatfield for filter I. The next step is to identify the targets and group them into data sets that are incomplete at this stage. Subsequently, calibrations are added to the data sets. Calibration files for each data set are selected by analysing the metadata of the targets and that of other available files that potentially qualify for inclusion in a data set.

This cascade of selection criteria naturally maps into a data graph, as illustrated in Fig. 1. The links between elements of the graph show the flow of metadata that originates from the raw files. The graph is directed, i.e. links between elements have a direction to distinguish between incoming and outgoing information. The nodes of the graph define necessary procedural steps in the assembly of a data set
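As an illustration of the classification step, the sketch below assigns a category from a few header keywords, in the spirit of the classification rules shown earlier; the keyword TYPE and its values, and the use of astropy to read the header, are assumptions for the example.

```python
# Hedged sketch of the classification step: assign a category to each file
# from its metadata. The keyword names follow the OCA example (TYPE, EXPTIME),
# but the TYPE values used here are assumptions.
from astropy.io import fits

def classify(filename: str) -> str | None:
    """Return a category for the file based on selected header keywords."""
    header = fits.getheader(filename)
    if header.get("EXPTIME") == 0:
        return "bias"
    file_type = header.get("TYPE", "")
    if file_type == "SCIENCE":
        return "science_image"
    if file_type == "FLAT":
        return "flat"
    if file_type == "DARK":
        return "dark"
    return None   # file does not match any classification rule
```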
during file organisation according to the pre-defined rules. For products, these properties have to be determined during the execution of the workflow. It is important to note that these two file properties are handled differently. Every recipe needs to be aware of the file category of its input files. For example, a recipe that combines flatfields might use dome flat exposures and bias frames as input files, and these files need to be identified to the recipe. The mechanism to identify files to the recipes is different in different environments. For example, IRAF uses different input parameters for different file types, whereas ESO's CPL recipes use text files with file tags to identify the file types. In both cases, Reflex uses the category to identify these file types. Reflex workflows therefore need to explicitly use the exact names known to the recipes for their categories. The data organisation rules have to generate these exact names.

In contrast, recipes are oblivious to the purpose of a file. The recipe to combine flatfield frames does not need to know how and where the combined flatfields will be used. Therefore, the processing of the purpose is completely handled by workflow actors. As discussed in Sec. 2.3, a purpose is a concatenation of actions used to organise the input data. The name of an action is arbitrary, and therefore it is never used explicitly in the workflow. Instead, Reflex uses the overriding principle that recipes
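As a concrete illustration of identifying files to a recipe by category, the sketch below writes a plain-text set of files in the path-plus-tag style used by CPL recipes; the category names and the exact formatting details are assumptions here.

```python
# Sketch of writing a "set of files" text file that identifies input files to
# a CPL recipe by their category tag, one "path tag" pair per line (treat the
# exact format as an assumption for this example).

def write_sof(path: str, files: list[tuple[str, str]]) -> None:
    """Write (filename, category) pairs as a text SOF for a recipe."""
    with open(path, "w") as sof:
        for filename, category in files:
            sof.write(f"{filename} {category}\n")

write_sof("proc_flat.sof",
          [("flat_01.fits", "FLAT"),
           ("flat_02.fits", "FLAT"),
           ("master_bias.fits", "MASTER_BIAS")])
```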
available for retrieval. For example, ESO offers an archive service, calselector, that selects, organises, and provides access to data in a manner similar to the one described above.

In contrast, data processing after the data organisation benefits from interactive, graphical, and dynamic elements. An efficient way to provide this is to use a workflow application that allows the implementation of workflows that can be easily modified for experimentation and optimization during data processing. It is important that these interactive features can be turned off once a workflow has been tuned and optimized, in order to allow time-intensive processing to be carried out in a batch mode.

http://www.eso.org/sci/archive/calselectorinfo.html

[Flowchart: steps performed by the data organiser]
- Identify the targets for a workflow from the classification rules.
- Group the targets that need to be processed together, as defined in the rules for the target action. Each group of targets defines a data set.
- Identify calibration files needed by the target actions, using criteria that are specified in the rules. These files can be mandatory or optional.
- Identify products that the target actions need, using criteria that are specified in the rules. These products can be mandatory or optional.
- Identify the action to define each product, using criteria that are specified in the rules. Actions are optional if their product is the optional input to another action.
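A much-simplified sketch of the assembly logic behind these steps: starting from the target action, walk the actions that provide the required products and collect the raw-file categories a data set needs. The rule tables below are invented stand-ins for the OCA rules, kept deliberately close to the bias/dark/flat example.

```python
# Hedged sketch of the assembly loop outlined above: follow the actions from
# the target action and collect every raw-file category the data set needs.
# Each action is (name, required raw categories, required products).
ACTIONS = {
    "proc_image": ("proc_image", ["bias", "science_image"],
                   ["processed_flat", "processed_dark"]),
    "proc_flat":  ("proc_flat",  ["flat", "bias"], ["processed_dark"]),
    "proc_dark":  ("proc_dark",  ["dark", "bias"], []),
}
PRODUCT_OF = {"processed_flat": "proc_flat", "processed_dark": "proc_dark"}

def required_categories(target_action: str) -> set[str]:
    """Collect every raw-file category a data set needs for the target action."""
    needed, pending = set(), [target_action]
    while pending:
        _, raw, products = ACTIONS[pending.pop()]
        needed.update(raw)
        pending.extend(PRODUCT_OF[p] for p in products)
    return needed

print(required_categories("proc_image"))   # {'bias', 'dark', 'flat', 'science_image'}
```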
workflow engine. Kepler provides the graphical user interface to create an executable, flowchart-like representation of the data reduction process. Key features of Reflex are a rule-based data organiser, infrastructure to re-use results, thorough book-keeping, data progeny tracking, interactive user interfaces, and a novel concept to exploit information created during data organisation for the workflow execution.

Results. Automated workflows can greatly increase the efficiency of astronomical data reduction. In Reflex, workflows can be run non-interactively as a first step. Subsequent optimization can then be carried out while transparently re-using all unchanged intermediate products. We found that such workflows enable the reduction of complex data by non-expert users and minimize mistakes due to book-keeping errors.

Conclusions. Reflex includes novel concepts to increase the efficiency of astronomical data processing. While Reflex is a specific implementation of astronomical scientific workflows within the Kepler workflow engine, the overall design choices and methods can also be applied to other environments for running automated science workflows.

Key words. Methods: data analysis – Techniques: miscellaneous – Astronomical databases: miscellaneous – Virtual observatory tools

© ESO 2013

1. Introduction

Astronomical observations produce data streams that record the signal of targets and carry associated metadata that include observational parameters and
workflow SOFs, and the ProvenanceExplorer to inspect the provenance of a product and its history from repeated runs of a workflow. All interactive actors and features can easily be turned off when starting Reflex, to allow a workflow to be run in batch mode once it has been adapted and optimized.

3.4. Modularity of Reflex workflows

Kepler provides an easy way to create modular workflows. A composite actor is an actor that itself contains a workflow, and composite actors can be nested to arbitrary depth. Placing each data processing actor together with its supporting actors into a composite actor leads to a clean and intuitive view of the whole data processing workflow. The layout of a workflow, whether it is modular or not, does not uniquely define a sequence of actor executions. For example, a scheduler might decide to alternate processing of actors contained in different composite actors. However, workflow execution is more intuitive when each composite actor is completed before processing proceeds to other actors. In Kepler, this can be achieved by placing an appropriately configured director into each composite actor.

3.5. Book-keeping and product organisation

The efficiency of the workflow execution relies on rigorous book-keeping that stores relevant information in a database for easy retrieval and processing. During execution of the workflow, the input and output files of each step in the workflow, as well as all parameters used
