
Notes from the first CAP meeting on Monday, 9 February 2015, 4-6 PM CET. Slides are available at https://indico.cern.ch/event/371432/

Clarification: this is the meeting to take the next steps in the development of the Data Analysis Preservation Framework [DAPF; now called CAP]. This is a platform that allows capturing data, software, documentation and whatever else might be needed for your analysis in a CLOSED setting. There will be measures to allow users to publish captured materials easily (e.g. through the Open Data Portal), but the default setting is CLOSED.

For the time being we use [DAPF] for Data Analysis Preservation Framework. Terminology is under discussion.

Introduction

In summer 2014, a first meeting took place to gather feedback on the development of the Analysis Preservation Framework, and a simple prototype was made available. Towards the end of the year the team focused on the launch of the Open Data Portal (November 2014), so the time is now right to restart the DAPF development.

Current status: a first clickable prototype is available. Back then, first submission forms were developed, tailored to the needs of the individual collaborations; they need to be revisited.

It is envisioned that the development of this service will happen throughout 2015. Similar to the development of the Open Data Portal, the development shall happen “in the open”, i.e. on GitHub. When integrating internal (confidential) tools, this will of course be taken into account. Everyone interested in contributing to the development is very much invited to do so. This could happen at various levels, e.g. technical/coding, metadata definitions or testing (just to name some examples).

Plans and requirements from collaborations and partners:

ALICE:

  • started discussions last year to define the analysis steps to be stored; this needs to be revisited now

  • want to have the possibility to change the software (for the analysis)

  • will provide more detailed feedback on the existing submission form and on the content expected to come

  • have a student experimenting with the different steps

  • in parallel there is a big public data release ongoing for the CERN Open Data Portal

ATLAS:

  • formed a panel to define the goals and discuss different types of methodologies. A report is expected to be shared in summer. The panel is mandated to reach out to other collaborations and to the DAPF team to study the next steps and to support the decision making.

  • interest in having the Higgs legacy analysis as a benchmark

  • interest in building use cases, possibly narrower ones than those available through DAPF. There is also interest in intermediate tools like RIVET and RECAST

  • working on a control centre for Rivet analyses; now prototyping a full ATLAS example, including simulation and reconstruction, which does not require the original data but only some restricted final products. DAPF could serve as a test bed for more complex cases.

CMS:

  • pragmatic approach: different analyses as use cases, ranging from the open data example analysis (very simplified - no MC, but complete) to complete, complex analyses.

  • hoping to get an attractive tool for physicists that goes beyond existing platforms in CMS

  • well-defined storage for intermediate data available at the time of active analysis work

  • search functionalities (final states, triggers) should be available

  • this requires sophisticated metadata. Such metadata is usually available, but not necessarily exposed; we would need to work on standardized preparation and exposure (see the illustrative sketch after this list).

  • flexibility from full reproducibility to just final analysis steps is required
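
To make the metadata requirement above more tangible, the following is a minimal sketch of what a searchable analysis record could look like. All field names, trigger paths and locations below are invented for illustration and do not represent an agreed CAP/DAPF schema.

```python
# Purely illustrative: a hypothetical metadata record for one captured analysis.
# None of these field names, trigger paths or locations are defined by CAP/DAPF;
# they only show the kind of information a final-state/trigger search would need.
analysis_record = {
    "analysis_id": "cms-example-001",        # invented identifier
    "collaboration": "CMS",
    "final_states": ["2 muons", "2 jets"],   # searchable physics content
    "triggers": ["HLT_ExampleTriggerPath"],  # invented trigger path name
    "intermediate_data": {
        "location": "eos://some/path/ntuples",  # placeholder storage URI
        "format": "ROOT ntuple",
        "size_gb": 120,
    },
    "access": "collaboration-only",          # default setting is CLOSED
}

def matches(record, final_state=None, trigger=None):
    """Return True if the record contains the requested final state/trigger."""
    if final_state is not None and final_state not in record["final_states"]:
        return False
    if trigger is not None and trigger not in record["triggers"]:
        return False
    return True

# Example query over a (tiny) in-memory list of records.
records = [analysis_record]
hits = [r for r in records if matches(r, final_state="2 muons")]
print([r["analysis_id"] for r in hits])  # -> ['cms-example-001']
```

However the eventual implementation turns out, a search by final state or trigger would essentially filter on fields of this kind, which is why they need to be prepared and exposed in a standardized way.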

LHCb:

  • many people are asking for a solution for archiving code and data after the analysis

  • interest not only in archiving, but also in an area where the preserved code can be run

  • provides detailed feedback on the existing submission form

  • would like to discuss what is feasible: how much data can we archive, at which level, how much code, can we produce and archive a complete VM environment, and what are the resources?

DASPOS:

  • big metadata effort, working closely with the CODP (CERN Open Data Portal) group

  • from the technical side: trying to understand what to put behind the workflows (definition of a workflow container).

  • One core question for DASPOS work: How does one instantiate workflows using DAPF, RECAST or RIVET?

  • continued interest in contributing to the underlying infrastructure to complement what IT is working on

  • received a workshop grant to run cross-disciplinary, multi-stakeholder workshops. Topics will also include knowledge preservation and representation.

DPHEP:

  • important area to work on, requested by funders ⇒ DPHEP is very interested in contributing

  • we need to be clear about scope, resources, expectations

  • do we preserve or capture (and connect to preservation services)? Needs to be clarified.

  • don’t hard-code details

RECAST:

  • RECAST (in its latest version) was just released on https://recast-demo.cern.ch

  • RECAST will “engage” with the data available on the DAPF, thus using it for requests. Work has already started.

  • RECAST will need a place to preserve the outputs; DOIs will also need to be minted for the results. [CERN mints DOIs for data]

  • do we need an analysis catalogue? Where would that sit? For openly accessible content, it could for example be done through INSPIRE, which is a content aggregator anyway.


To be done now:

Each collaboration: please provide us with the specifications of the content that should go into this (closed) framework:

  • What should be connected?

  • Where does this content [data, code, text, documentation, discussion] live currently?

  • What will we find there? [size, file formats, what metadata available?]

  • What functionalities would you need? [actions on demand, automatic operations]

  • Access restrictions?

Please provide this within the next 2 weeks. We are happy to help you in putting this list together (a purely illustrative sketch of what such a specification could look like is given below).

Partner initiatives: please let us know if you require specific interfaces/APIs/… and please share your timeframes with us, so we can take that into account. And, of course, let us know if you wish to contribute to specific tasks.
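
For orientation only, the sketch below shows how one collaboration might structure its answers to the questions above. Every field name and value is invented; nothing here is a prescribed CAP/DAPF schema or format, and plain text or a spreadsheet would serve just as well.

```python
# Purely illustrative answer to the questions above for one (invented) analysis.
# Field names and values are not a prescribed CAP/DAPF schema; they only show
# how a collaboration could structure the requested information.
content_specification = {
    "what_to_connect": ["analysis code", "n-tuples", "internal note", "discussion pages"],
    "where_it_lives": {
        "code": "collaboration code repository",       # placeholder location
        "data": "experiment storage system",           # placeholder location
        "documentation": "internal wiki / note server",
    },
    "what_we_will_find": {
        "approx_size": "a few hundred GB",             # invented figure
        "file_formats": ["ROOT", "PDF", "plain text"],
        "metadata_available": ["dataset names", "trigger paths", "software version"],
    },
    "needed_functionality": ["on-demand re-running", "automatic metadata harvesting"],
    "access_restrictions": "collaboration members only (default CLOSED)",
}
```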



Further concrete actions for the time until the next meeting:

  • change the URL to reflect the “closed” environment and to distinguish it more clearly from the CERN Open Data Portal [update on Feb 10: now available at http://analysis-preservation.cern.ch/]

  • reflect on name for the framework/portal

  • reach out to Rivet to understand potential collaboration

  • set up an e-group for everyone on the mailing list. It should be noted that during the development it will be necessary to create sub-groups, e.g. to discuss metadata challenges. If this is done, a note should always be sent to the overall group.

The next general meeting will take place in four weeks' time [date or doodle to come]. S. will compile a list of overall requirements based on the lists provided by the collaborations and partners. In addition, we will create some ‘user stories’ to better illustrate the functionalities of the framework. It is planned to circulate the more detailed requirements openly ahead of the next meeting.