RO evolution


Comments by Pique in blue.

Comments by GK in green

Comments by Marco in orange

Terminology

Research Object Evolution refers to the ability to manage changes in a research object and its aggregated resources by creating and maintaining different versions of the research object during its life-cycle. It provides a detailed description of the changes in these resources, tracking contributions reused from other sources. It thus enables tracing the progress of a Research Object and accessing concrete versions of a Research Object (and their aggregated resources) or of individual resources (notably workflows).

The Research Object Evolution model enables the representation of the different stages of the Research Object life-cycle, their dependencies, as well as the corresponding versions of Research Objects and their aggregated resources, with the associated changes in these resources. The concrete realization of this model is the roevo ontology, which is built on top of the core ro ontologies.

A Research Object Version is a specific form or variant of a Research Object that is normally created after applying changes to an existing variant. A version has an associated state, in our case, snapshot or archived. Versions are related to each other via direct relationships (e.g., priorVersion) or via contribution dependencies (e.g., derivedFrom, relatesTo).  
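For illustration, using the roevo/PROV terms adopted later on this page, two versions might be related as follows in Turtle (the `ex:` URIs are hypothetical; the namespace URIs are assumed):

```turtle
@prefix roevo: <http://purl.org/wf4ever/roevo#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix ex:    <http://example.org/ro/> .

ex:snapshot-v2 a roevo:SnapshotRO ;
    prov:wasRevisionOf  ex:snapshot-v1 ;     # direct version relationship
    prov:wasDerivedFrom ex:external-ro .     # contribution dependency
```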

Research Object Life-cycle refers to the stages that a research object transitions through, from its conception until its conclusion. A Research Object can be in one of the following stages:

  • LiveRO: represents work in progress. Live ROs are thus mutable, as the content or state of their resources may change. That is:
    • They are private to the researcher, or they are shared with other collaborators via email, in a common space/server, or with the help of collaborative tools like Dropbox, SVN, Github, etc.
    • They may be versioned with tools like Github or SVN (e.g., as it happens with educational resources as described in http://mashe.hawksey.info/2012/03/do-you-git-it-open-educational-resourcespractices-meets-software-version-control/)
    • They are potentially under the control of multiple owners and may fall under mixed stewardship, raising issues of security and access control; however, these issues are addressed by the collaborative and versioning tools used for maintaining LiveROs (out of scope of Wf4Ever).
    • When a LiveRO at a certain point in time during its evolution needs to be disseminated or preserved, a SnapshotRO or ArchivedRO needs to be made (see below). 
  • SnapshotRO: intended as a record of past activity, ready to be disseminated as a whole. Snapshot ROs are immutable and reflect the state of the Live RO at a certain time.

    My work with Marco suggests that snapshots *are* mutable, in that they are taken to provide a stable base, independent of ongoing research activity, for some purpose of reviewing and/or creating a publication.  This would be analogous to a branch in a SCM repository.  If you really mean for a snapshot to be immutable, then we may need to find another term ("baseline", "branch"?) to describe a copy of a Live RO to be used for some purpose other than ongoing research. Perhaps we should specify the purpose of the snapshot and its life cycle, if it has one. In the case of the review/publication process, a snapshot starts a life of its own, bouncing back between reviewer and researcher. Should we give this type of SnapshotRO a different name (see PublicationRO below)? Should we reserve guaranteed immutability only for archived ROs?

    • Successive snapshots are related to each other via versioning relations. (Is this necessarily so?)
    • They are stored in a Digital Library (RO DL and potentially Journal DL infrastructure) (Is this necessarily so?)
    • They are shared among a group of selected users, for instance with the research group for internal review or with external referees for external review, but they are not publicly available by default 
    • Because some of their components or the whole RO may be shared among restricted users, they are raising issues of security and access control. (Is access control not something to be considered as applying to all kinds of RO? - yes, I think we should have a fine-grained model that could be applied to all types of ROs and their components: any user domain may have different customs for applying them) - The point here was to highlight that LiveRO control access mechanisms are out of our scope, they are managed by other tools (e.g., Git, SVN)
    • They must be referenceable 
  • ArchivedRO: represents the final stage of a Research Object: it has either reached a version that the author prescribes to be stable and meaningful and that is appropriate for publication and long-term preservation, in which case it is immutable, with no further changes or versions allowed, or it has been deprecated (deprecated? why? - Maybe later in the project, after doing the major types, we can think of an ontology by which we would like to be able to label ROs; 'deprecated' could be just one of them).

    (I think this conflates at least two issues:  archival, meaning preserved for the long term, and publication, meaning made publicly available and citable - I am not sure if the conflation was intended, but I agree that we best keep archival and publication separate.)

    • They are usually stored in a Digital Library (RO DL or Journal DL infrastructure) (*softened)
    • They are publicly available (typically, but they may also be restricted to a certain group of people) 
    • Desirably curated (as described in http://www.dcc.ac.uk/digital-curation/what-digital-curation)
    • They must be citable 
    • They must be preserved 
  • PublicationRO  (See also the Review/publication showcases. I took the liberty of adding this here hoping to help distinguish snapshotRO/archivedRO/PublicationRO. I am not sure how much difference there would be between publication RO types and curation RO types. - Marco)
    Publication here represents the process by which a value of trust is added to a scientific investigation represented by a RO. The process conforms to common practices within a research community, which is typically more than simply making the RO publicly accessible. In biology (and astronomy?) this means some form of peer review.
    During review, additional RO types may be discriminated (Marco: I don't know if you would call these versions of the PublicationRO or ROs with a first-class type):
    • ToBePublishedRO (owner: researcher): represents an RO that will be prepared for publication. It is initially a Snapshot of a LiveRO, but starts an independent life, bouncing back and forth between researcher(s) and internal reviewers before it is submitted. It may or may not be archived/preserved.
    • ReadyForPublicationRO (owner: researcher): represents an RO that is ready to undergo the publication process (typically by external review). It may or may not be archived/preserved.
    • SubmittedRO (owner: researcher): represents the RO that is sent off for review (curation?). It is based on the ReadyForPublicationRO.
    • UnderReviewRO (owner: reviewer): represents the RO when it is in the hands of reviewers (typically external reviewers). The reviewers may or may not make copies. Its reference is not for the general public by default.
    • UnderRevisionRO (owner: researcher): represents the RO that is back in the hands of the researcher after review, when it needs revisions. The researcher may revise this RO or (if that is more practical) restart from a LiveRO (the typical case when a SubmittedRO was rejected).
    • AcceptedRO (owner: researcher): the RO that is accepted for publication.
    • PublishedRO (owner: publisher): the RO that is published. This may use the ArchivedRO paradigm to guarantee its preservation. It has the added value of having been positively reviewed (curated?).

My comments here are informed in part by work I do with the Oxford Bodleian Libraries; among other things they keep "dark archives" of materials that are strictly *not* for publication: politicians' memoirs, etc.  This could include private data underpinning healthcare studies.  Separately, a researcher may decide to publish something on the web without making provision for its preservation.  I think there's a reasonable expectation that both published and archived ROs are immutable.

Maybe we should be more specific about mutable/immutable. I think we should allow ROs to be annotated after preservation, without this necessarily meaning that we are mutating the original RO. People tend to extend the models by which they annotate over time.

Clarification (from Pique): There has been some confusion around the published stage and the versioning/evolution of ROs, because "published" has been understood as a publicly available and final stage of an RO, as is the case with printed publications. But because we understand Wf4Ever also as a working platform, where Live ROs are "published" as Snapshots while they grow, and shared under very specific conditions with a restricted community/group, we have adopted the term Snapshot in the evolutionary model to replace the term Published. That is: Live RO -> RO Snapshot -> Archived RO is the same as Live RO -> Publication RO -> Archived RO.

Marco checked until here (except one comment under lifecycle) - 7/4/2012

Lifecycle

  • The life of a Research Object starts with the creation of a Live RO. 
  • During the dynamic life of the LiveRO, i.e. during the maturation of a LiveRO while a researcher is designing and executing her investigation, 0 to many SnapshotROs are made from the LiveRO. SnapshotROs have (at least) four purposes:
    • A frozen copy of the state of an investigation that can be preserved for the researcher and/or research team.
    • Internal review (e.g. between supervisor and student). The evaluation of the snapshot is then used for the further development of the LiveRO. Typically, the SnapshotRO is kept by the researcher for later reference.
    • The basis for an archivedRO (or do we go directly from LiveRO to ArchivedRO?)
    • The starting point of the publication process (see Review+publication showcases). This is one of few cases where a snapshotRO starts a life of its own.
  • An Archived RO is an endpoint in the life of an RO. It does not evolve further. However, earlier stages of the RO may have produced clones (forks). Typically, it is a stable and meaningful version appropriate for public release and long term preservation. It may at some point acquire a status of 'deprecated' or abandoned. (I think deprecation and abandonment needs more thought, imo the essential aspect is preservation)
  • In a typical scenario, a Live RO has produced a number of Snapshots before producing an Archived RO.
  • A NEW Research Object (with a new Identity) may be created (Should the creation of a duplicate that may have its own life be considered the birth of a new RO? E.g. the snapshot that is used for publication may have a life of its own.)
    • from scratch
    • by forking a Live RO, e.g., splitting/replicating a Live RO in order to explore different hypotheses
    • by forking a Snapshot RO, e.g., recover and re-use an old Snapshot to produce a different Archived RO, or 
    • by re-using an Archived RO, e.g., continuing the research after an Archived RO is produced to produce new results and make progress in the research
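Using the contribution properties adopted in roevo v0.4 (see the change log on this page), such a fork or re-use could be recorded minimally as follows (hypothetical `ex:` URIs, assumed namespace URIs):

```turtle
@prefix roevo: <http://purl.org/wf4ever/roevo#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix ex:    <http://example.org/ro/> .

# A new Live RO forked from a snapshot of an older RO;
# the derivation link preserves the connection to the ancestor.
ex:new-live-ro a roevo:LiveRO ;
    prov:wasDerivedFrom ex:old-snapshot .
```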

These new ROs should keep the history of their ancestors when they are created by forking or re-use by the same author. Re-use by another user would imply the creation of a new RO with no history associated with this new author. There is a history related to the "author-RO couple", in which the user is very interested (evolution of integrity, stability, completeness, quality, etc.), and a history related to the previous lives of an RO. It should be possible to recover an old RO Snapshot following this "author-RO couple" history.

The life cycle of ROs during a review/curation process is addressed elsewhere in more detail. E.g. see ROpublicationLifeCycle.pdf in https://www.dropbox.com/home/Wf4Ever/Taskforces/User/UserMaterial

Specification

RO Evolution scenario

Each RO may have a label/tag*, providing more detailed information about its state, e.g., 'internally-reviewed', 'in progress of internal revision', 'in progress of modifications', 'peer-reviewed', 'ready-to-publish', etc. See for example the above diagram instantiated for an RO going through a peer-reviewed publication process.

* Would it be necessary to make it more formal (semantically)? I have the same question. Does it also relate to the question which are virtual copies of ROs and which are real with potentially a different owner.

As a comparison to software evolution, in the diagram above the main Live RO is similar to the trunk, each Snapshot RO is similar to a tag, and the two forked Live ROs are similar to branches. However, in contrast to software evolution, each fork becomes a new RO; they have a different purpose (e.g., to explore different hypotheses).

Workflow Evolution

Evolution in Vistrails

  • Vistrails captures changes to parameters values and to workflow definitions (i.e., changes to workflow instances and workflow specifications). In particular, the following change operations are captured: 
    • adding, replacing and deleting a module
    • adding and deleting connections
    • setting parameter values
  • So, instead of storing a set of related dataflows, they store the operations or actions that are applied to the dataflows
  • The semantics of an action are not considered, i.e., add a concrete module (e.g., volume renderer or isosurface extraction) instead of just add a module.
  • A vistrail (visualization trail) captures the evolution of a dataflow --- all the trial-and-error steps followed to construct a set of visualizations.
  • A vistrail consists of a collection of dataflows --- several versions of a dataflow and its instances.
  • A vistrail is essentially a tree in which each node corresponds to a version of a dataflow, and each edge between nodes P and C, where P is the parent of C, corresponds to one or more actions applied to P to obtain C.
  • A vistrail node (a version) can optionally have a name (a tag that describes the version).
  • They use an XML schema to represent dataflow information.

Hence, in Vistrails each action creates a version, but only versions with special meaning to the user are tagged. This process is similar to ontology change management approaches (see my thesis). However, I see the following cons for Wf4Ever:

  • This applies for the design/execution environment of the workflow (e.g., Taverna), which is not the focus of Wf4Ever
  • The set of operations identified is very simple and focused on data exploration through visualization. In our case, operations on scientific workflows are potentially more numerous and more complex. This is being analyzed by Pinar.
  • Vistrails evolution is only about dataflows, not the whole experimentation.

Evolution in Workflow Evolution Framework - EVF (Trident)

  • In EVF two dimensions of workflow evolution are addressed: 
  1. Direct Evolution: happens when a user of the workflow performs one of the following actions:
    1. Changes the flow and arrangements of the components
    2. Changes the components within the workflow
    3. Changes inputs and/or output parameters or configuration parameters to different components within the workflow
  2. Contributions: track the components re-used from previous systems (e.g., a module from other workflow, new branches for new lines of work derived from a particular research)
  • Each workflow has links to the direct evolution (unless it’s the first workflow in the evolution), which will point to the next version of the workflow, if any, and to the contributions.
  • A new version of a workflow is saved, creating the next version, when the user explicitly decides to save the workflow; information about workflow instances and data products, however, is saved automatically. Hence versioning of workflows and related artifacts is done at three separate stages:
    • the user explicitly saves the workflow;
    • the user closes the workflow editor;
    • a workflow is executed in the editor.
  • This granularity may not capture all edits
  • EVF framework can work with different versioning systems to support different versions of the data products

Similar to the previous approach, this applies mainly to the design/execution environment. We couldn't find an explicit change model.

Workflow Development primitives for Taverna Workflows 

Within the Taverna workbench workflows are created and evolved through the application of the following operations:

  • Add/Delete Workflow input/output ports
  • Add/Delete Processor
  • Add/Delete Processor input/output ports
  • Add/Delete data flow links between ports
  • Add/Delete Mergers of multiple data flow links
  • Add/Delete control flow links ("run after")
  • Update processor 
    • Update Processor Name and Annotation
    • Update Processor execution configuration (set parallelization, looping, retry, list handling strategies)
    • Update Processor security configuration (for web service type processors)
  • Update Input/Output ports
    • Update Port Name and  Annotation

Unlike the Vistrails and Trident systems, the Taverna workbench is not instrumented for tracking changes in workflows, and does not provide a workflow versioning function. In practice, people use their own versioning scheme for keeping track of versions while editing / completing a workflow. I recommend using Subversion for the editing phase, before sharing, and myExperiment versioning after sharing and for the more significant changes: as described below, myExperiment keeps versions and gives you a unique URL reference for each version. This influences how myExperiment can be used. You may find smaller updates in myExperiment versions, because workflow creators may be unfamiliar with dedicated versioning systems and because, if people are like me, they always spot the little mistakes just after an upload ;-)

An Empirical Analysis of Workflow Evolution in the myExperiment Repository

myExperiment is the largest repository of scientific workflows. It contains workflows from various systems, with the majority being the product of the Taverna workbench. There are two functionalities in myExperiment that allow users to flag workflow evolution and re-purposing:

  • Versioning: myExperiment provides versioning support. Workflows are published with an initial version of 1, which is incremented with each subsequent publication. We should note that these workflow versions are not working versions like the ones in Vistrails or Trident. Scientists publish their work to myExperiment when they deem that the workflow either 1) performs a desired function as part of their investigation, or 2) demonstrates a domain-specific or generic data analysis pattern to serve as an example or as a training artifact. Versioning in myExperiment is similar to the "Direct Evolution" of the Trident system, but only over published/snapshot workflows.
  • Attributions: myExperiment allows for generating attributions among artifacts such as Workflows, Files and Packs. Even though the attribution relationship is not typed, one usage of attribution is to refer to the workflows from which a contribution has occurred. Contributions can be in the form of re-use and re-purposing. Re-use refers to copying and pasting fragments of workflows from one to the other, whereas re-purposing refers to taking one workflow as a base and adjusting it to fit another purpose. This is somewhat similar to the "Contributions" type of evolution in the Trident system.

In this empirical analysis we have looked at versioning data within myExperiment. We have extracted existing workflow data from myExperiment using the REST API (as per 20 Feb 2012). The data comprises a list of all workflows and their versions. When users publish new versions of workflows they typically provide textual comments, namely "revision comments", on the changes that have been made. The user-supplied descriptions have been analyzed, and the goals of users in changing their workflows fall into the following categories:

  • FIX
    • GENERAL MAINTENANCE (due to changing external resources) 40%
    • BUG FIX 11%
  • IMPROVEMENT
    • DOCUMENTATION/METADATA IMPROVEMENT 18.5%
    • SAMPLE/TEST DATA ADDITION 2%
    • FEATURE ADDITION
      • WORKFLOW SIGNATURE EXPANSION (NEW OPTIONAL INPUT FORMATS, NEW OUTPUT FORMATS) 12%
      • USABILITY IMPROVEMENT (addition of human interaction steps) 6.5%
      • OTHER FEATURE ADDITIONS (making wf less platform specific, data cleaning, data dissemination steps) 6.5%
    • PERFORMANCE IMPROVEMENT 1%
    • REFACTORING SIMPLIFICATION 2.5%

Changes are generally performed 1) to repair a broken workflow, or 2) to improve an existing operational workflow.

FIXES and REPAIRS: When we look at the distribution of changes we see that the majority are "General Maintenance" activities. These are attempts to fix workflows that are broken due to a changing environment. The changes in external services and resources are due to the following:

  • Identification, Naming schemes introduced (Consolidation in identification).
  • Nomenclature or Data Schema Changes.
  • Prototype Data Set --> Full Data Set
  • Staging Server --> Production Server 
  • Grid infrastructure upgrades
  • Local tooling version upgrades e.g. local library version
     

General Maintenance type changes would typically cause a Processor in the workflow graph to be replaced by another one referring to the new/updated resource. In addition to general maintenance, we also observe bug fixes: scientists can detect, or can be notified of, bugs in their workflows after they have published them, and upload new versions with fixes.

IMPROVEMENTS: Another category of changes are improvements to a workflow. Improvements can be in several aspects, but the majority are about providing better documentation for the workflow that has been published. It is likely that workflow creators perform this documentation incrementally as a response to the information demands of other users of the workflow. Other than documentation, another significant category of improvement is expanding the workflow signature so that it accepts inputs or produces outputs in several formats. Other improvements, such as refactoring, usability improvements and performance improvements, are observed much less frequently.

(Maybe see my notes under the previous paragraph, they are in line with the myExperiment evaluation I think. An important point to make is that usage statistics are confounded by the effect of the features that tools like Taverna and myExperiment offer and how they offer them, i.e. they do not give an unbiased view of what users would do.  E.g. Taverna's annotation features are not easy to reach, which is one aspect of under-annotation of its workflows.)

Exploiting Workflow Evolution Information for Workflow Preservation

In the above sections we have reviewed how evolution information is collected either at development time in a fine-grained manner as in the case of Vistrails and Trident or at publishing time in a user-driven coarse-grained manner as in the case of myExperiment. In the context of the project goal, i.e. workflow preservation, evolution information could be put to use in the following ways:

1. Repair propagation. Certain evolution information is a signal of change in external resources, and  also a record of repair. For example, a particular WSDL Processor in a workflow is replaced by another one. The WF Repository or the RO DL could use this information to automatically flag other workflows, which contain the replaced processor, as potential holders of dangling service references.

2. Workflow dependency maintenance. In the project requirements it has been stated that the RO would contain typed links between entities that are aggregated by the RO. One category of those links is dependency information. When a workflow is aggregated within an RO, it is contextualized; one activity in contextualization is the extraction of workflow dependencies, i.e., links to the external resources that the workflow consults. Evolution information could be used for maintaining these dependencies.

3. Other ways???

Implementation 

Summary of v0.6.1

Same as v0.6, but removing import of PROV ontology

Summary of v0.6

Added labels and descriptions

Changes in v0.6

  • Added rdfs:label to each roevo entity
  • Added rdfs:comment to each roevo entity

Summary of v0.5

Minor changes compared to v0.4.

Changes in v0.5

  • Added import of PROV ontology (in evaluation)
  • roevo:fromVersion subproperty of prov:used
  • roevo:toVersion subproperty of prov:generated

Summary of v0.4

Version v0.4 of roevo has two main goals: (i) align with the provenance vocabulary (PROV ontology) and (ii) simplify the taxonomy of changes based on user feedback at the latest plenary meeting (Manchester 06-2012). On the one hand, we reused, where possible, classes and properties from the PROV ontology, including the definition of subsumption relationships (subclassOf & subpropertyOf). On the other hand, we simplified changes into additions, removals and modifications, which are the ones absolutely required by users, and left the option of creating extensions of roevo to model more detailed taxonomies of changes for specific resources, such as workflows and annotations.

Changes in v0.4

  • rename class roevo:Version to roevo:VersionableResource.
  • change class roevo:VersionableResource equivalent to union of ro:Resource, ro:AggregatedAnnotation, roevo:SnapshotRO and roevo:ArchivedRO.
  • added class roevo:ChangeSpecification to aggregate changes between versions.
  • added property roevo:fromVersion 
  • added property roevo:toVersion
  • added property roevo:wasChangedBy (roevo:VersionableResource to roevo:ChangeSpecification) 
  • added property roevo:wasSnapshotedBy (roevo:SnapshotRO to prov:Agent) as subproperty to prov:wasAttributedTo
  • added property roevo:wasArchivedBy (roevo:ArchivedRO to prov:Agent) as subproperty to prov:wasAttributedTo
  • added property roevo:snapshotedAtTime (subpropertyOf prov:generatedAtTime) to roevo:SnapshotRO
  • added property roevo:archivedAtTime (subpropertyOf prov:generatedAtTime) to roevo:ArchivedRO
  • replace foaf:Agent for prov:Agent
  • replace roevo:contribution properties: 
    • rename roevo:contribution to prov:wasDerivedFrom
    • remove roevo:relatesTo
    • use subproperties of prov:wasDerivedFrom: prov:hadOriginalSource, prov:wasQuotedFrom
  • remove property roevo:hasPreviousVersion. Use prov:wasRevisionOf 
  • remove property dcterms:created for roevo:Change. Use properties prov:startedAtTime and prov:endedAtTime
  • remove property dcterms:creator for roevo:Change. Use prov:wasAssociatedWith 
  • remove property dcterms:creator for roevo:VersionableResource. Use prov:wasAttributedTo
  • change property roevo:hasChange domain (roevo:ChangeSpecification)
  • change property roevo:relatedResource domain and range (roevo:Change to roevo:VersionableResource) 
  • simplification of change taxonomy: 
    • remove previous subclasses of roevo:Change
    • add subclasses roevo:Addition, roevo:Modification, roevo:Removal
    • two extensions of roevo, for workflows and annotations, which specialize roevo:Modification - (to be discussed: what can be a ro:Resource, e.g., ports, processors, etc.).
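A minimal sketch of the simplified v0.4 change model in Turtle (the `ex:` resources are hypothetical; the namespace URIs are assumed):

```turtle
@prefix roevo: <http://purl.org/wf4ever/roevo#> .
@prefix ex:    <http://example.org/ro/> .

# The set of changes leading from snapshot v1 to snapshot v2
ex:changes-v1-v2 a roevo:ChangeSpecification ;
    roevo:fromVersion ex:snapshot-v1 ;
    roevo:toVersion   ex:snapshot-v2 ;
    roevo:hasChange   ex:change-1, ex:change-2 .

ex:change-1 a roevo:Addition ;               # or roevo:Modification / roevo:Removal
    roevo:relatedResource ex:workflow-1 .

ex:change-2 a roevo:Removal ;
    roevo:relatedResource ex:old-dataset .

ex:snapshot-v2 roevo:wasChangedBy ex:changes-v1-v2 .
```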

Mapping of RO evolution model to PROV ontology 

  • roevo:VersionableResource subclassOf prov:Entity
  • roevo:LiveRO subclassOf Prov:Entity
  • roevo:SnapshotRO subclassOf prov:Entity
  • roevo:ArchivedRO subclassOf prov:Entity
  • roevo:ChangeSpecification subclassOf prov:Activity
  • roevo:Change subclassOf prov:Activity
  • roevo:wasChangedBy subpropertyOf prov:wasGeneratedBy
  • roevo:wasSnapshotedBy subpropertyOf prov:wasAttributedTo 
  • roevo:wasArchivedBy subpropertyOf prov:wasAttributedTo 
  • roevo:snapshotedAtTime subpropertyOf prov:generatedAtTime
  • roevo:archivedAtTime subpropertyOf prov:generatedAtTime
  • replace foaf:Agent for prov:Agent
  • replace roevo:contribution properties to prov:wasDerivedFrom and subproperties prov:hadOriginalSource, prov:wasQuotedFrom
  • use property prov:wasRevisionOf instead of roevo:hasPreviousVersion
  • use properties prov:startedAtTime and prov:endedAtTime instead of dcterms:created for roevo:Change
  • use property prov:wasAssociatedWith instead of dcterms:creator for roevo:Change
  • use property prov:wasAttributedTo instead of dcterms:creator for roevo:VersionableResource 
  • roevo:relatedResource subpropertyOf prov:used.
  • use properties prov:startedAtTime and prov:endedAtTime for roevo:ChangeSpecification
  • roevo:fromVersion subproperty of prov:used
  • roevo:toVersion subproperty of prov:generated
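As a sketch of this alignment in practice (hypothetical `ex:` URIs, assumed namespace URIs): because the roevo properties below are subproperties of prov:wasAttributedTo and prov:generatedAtTime, a PROV-only consumer can still interpret the statements.

```turtle
@prefix roevo: <http://purl.org/wf4ever/roevo#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:    <http://example.org/ro/> .

ex:alice a prov:Agent .

# roevo:SnapshotRO is a subclass of prov:Entity
ex:snapshot-v1 a roevo:SnapshotRO ;
    roevo:wasSnapshotedBy ex:alice ;                               # -> prov:wasAttributedTo
    roevo:snapshotedAtTime "2012-07-04T10:00:00Z"^^xsd:dateTime .  # -> prov:generatedAtTime
```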

roevo Ontology

See full size of this version (v0.4) (or previous versions v0.3 or v0.2 or v0.1).

The ontology is available in OWL and Turtle

Implementation Details

roevo is built upon the ro models and prov-o, but v0.4 does not import them (owl:imports). It reuses only the statements needed, especially for performance reasons (e.g., importing the wf4ever ontology requires importing 12 ontologies: 3 directly and 9 indirectly). In v0.5 and v0.6 we import the PROV ontology because roevo is an extension of PROV, but this is still under consideration. Hence, we keep v0.6.1, which is the same as v0.6 without the import.

Extensions

Workflow Evolution Extension

The ontology is available in OWL and Turtle.

Annotations Evolution Extension

The ontology is available in OWL and Turtle.

Usage of lifecycle properties for the different RO types

Live RO

The live research object SHOULD have the following properties:

  • roevo:hasSnapshot, for each snapshot
  • roevo:hasArchive, for each archive

It MAY additionally have the following properties:

  • prov:wasDerivedFrom or its subproperties (prov:hadOriginalSource, prov:wasQuotedFrom, prov:wasRevisionOf)

Snapshot RO

The snapshot research object SHOULD have the following properties:

  • roevo:isSnapshotOf
  • roevo:snapshotedAtTime
  • roevo:wasSnapshotedBy
  • prov:wasRevisionOf, if a previous snapshot exists

It MAY additionally have the following properties:

  • prov:wasDerivedFrom or its subproperties (prov:hadOriginalSource, prov:wasQuotedFrom, prov:wasRevisionOf)

Archived RO

The archived research object SHOULD have the following properties:

  • roevo:isArchiveOf
  • roevo:archivedAtTime
  • roevo:wasArchivedBy
  • prov:wasRevisionOf, if a previous snapshot exists

It MAY additionally have the following properties:

  • prov:wasDerivedFrom or its subproperties (prov:hadOriginalSource, prov:wasQuotedFrom, prov:wasRevisionOf)
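Putting the three sections above together, a minimal Turtle sketch of one RO's lifecycle (hypothetical `ex:` URIs and timestamps; namespace URIs assumed):

```turtle
@prefix roevo: <http://purl.org/wf4ever/roevo#> .
@prefix prov:  <http://www.w3.org/ns/prov#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:    <http://example.org/ro/> .

ex:live-ro a roevo:LiveRO ;
    roevo:hasSnapshot ex:snapshot-v1, ex:snapshot-v2 ;
    roevo:hasArchive  ex:archive-1 .

ex:snapshot-v2 a roevo:SnapshotRO ;
    roevo:isSnapshotOf ex:live-ro ;
    roevo:snapshotedAtTime "2012-07-04T12:00:00Z"^^xsd:dateTime ;
    roevo:wasSnapshotedBy ex:alice ;
    prov:wasRevisionOf ex:snapshot-v1 .     # a previous snapshot exists

ex:archive-1 a roevo:ArchivedRO ;
    roevo:isArchiveOf ex:live-ro ;
    roevo:archivedAtTime "2012-08-01T09:00:00Z"^^xsd:dateTime ;
    roevo:wasArchivedBy ex:alice ;
    prov:wasRevisionOf ex:snapshot-v2 .
```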

Sources

Jira Issues

Discussions

Last discussion summary:

1.  Marco: Should we disambiguate the term 'version' itself here?  For instance, do we distinguish versions related to edits of an RO (a bug fix, or adding something for completeness), versions related to the status of an RO (as below), or 'versions' denoting the relation between an RO that was based on another, but with a new goal. Also some edits have more value to users than others. For instance, this week I discussed the removal of a whole branch of a workflow with my student Eleni, which removed one type of analysis from an experiment. For editing/debugging workflows I use subversion, but for these types of design decisions, I would use myExperiment. I imagine that Snapshots would be used in these cases. I would prefer that we model the types of relations between 'versions' of ROs and then possibly use the (ambiguous) term 'version' only in user interfaces. 

Raul: see Research Object Version definition at the top

2. All Snapshots should be cited, even Abandoned ROs. (What is meant here by "Citable"/"cited"? To me this has implications of something far stronger than mere reference; Pique, can you say why you think all snapshots should be cited/referenced?) In my view, they should be 'referenceable', such that we can refer to them in our own digital notebooks or in another LiveRO (i.e. within a group of collaborators, not the public). Archived and reviewed ROs should be citable by the general public. - CHANGED TO REFERENCEABLE (I think we can remove the discussion of this point)

3. (I don't agree that archival ROs are necessarily curated - of course, it depends on what you mean by "curated". - I agree: reserve archival for archival; make it a best practice that archived should be curated; I think this keeps it flexible for any user domain. NB Graham: have we conflated curation/review and publication in the review showcase?)

Old discussion summary:

Marco: I can imagine that one Live RO can be used for more than one published RO. Both options seem thinkable:

  • Closing a live RO upon publishing, and spinning off several published ROs from a long-living RO. 
  • It may be good as a best practice to close a live RO when you publish it.

In both cases the spun-off RO for archival/publication is a copy that starts a life of its own. I imagine it is up to the user to decide what to do with the live RO after that. A published RO should not depend on a Live RO.

  • Pique: If we decide to bring an Archived RO back to life to re-use it for a new purpose (this is the general case), we will get a different Live RO for every one of them, because it is the purpose of the new experiment that establishes the identity of the new Live RO. 
  • Marco: Should we just take from this discussion that our tools and models should be flexible enough to accommodate different scenario's? E.g. allow users to open and close live RO's as they see fit?
  • Marco: ...'best practices' that each domain would have for using ROs
  • Question: can the published RO (snapshot in diagram) as well be a publication?

Pique: the published RO is just a snapshot to track versioning; it does not have the status of a final "serious" publication. The archived RO has this status of immutable template (preservation, science museum, mammoth). The published RO is related to a Live RO that evolves (conservation, zoo, baby elephants :) Personally I am not very happy with the terminology used for the snapshots (published RO); it invites confusion with the published-paper analogy

Raul: So, in short, for you the published RO could not be published (to the community), if I understand correctly, right?

  • Pique: in the diagram there is no difference between creating a snapshot and publishing the work
  • Question: What about this situation:

A scientist is working on the Live RO and, after some time, creates a snapshot for internal review (as in the diagram). He then continues working on the Live RO and, after some more work, publishes it as a work-in-progress/position RO paper (as in the diagram). He receives reviews, makes some more changes, and publishes a final version of the RO; but instead of moving it into an archived RO, he only makes another snapshot (published RO). He keeps working on the Live RO and, once it reaches a mature level, repeats a similar process for a conference/journal publication; only at the end of this process does he archive it.

Pique: It should be up to the user to decide when an RO is mature enough to be archived. I would say that feedback on a snapshot comes from inside the research group and its collaborators, while reviews are provided by the whole community, and only for archived ROs

Raul: This is fine, but the answer to the question is still not clear to me. So, you think a published RO cannot be published as a work-in-progress/position paper?

  • Question: What about this situation:

The publications (snapshots) cover only parts of the RO (e.g., due to the scope of the conference/journal). So the scientist starts as in the diagram but makes two (or more) publications (snapshots) covering the different parts of the RO, and after he has published everything, he archives it.

Pique: If I understand well, this case covers "forking" of ROs: publishing different parts of the RO (you use the word "different") in several journals. I would say this is re-use of a Live RO, giving life to different new Live ROs that are published later. Because the purpose of the experiment is different (the scope of the conference/journal), the ROs are also different (even if they are similar) and have different identities. So, I guess this question raises the concept of similarity among ROs and provenance by re-use.

Raul: Good, that is, you will have a Live RO for each purpose (different scopes of the same experiment). Mainly I had this question because in the golden exemplar you provided, some documents derived from the RO were listed as "Papers that have made use of this RO or some of its components", which is a slightly different view

Dave: Here some functions that have to do with Inter-RO evolution and PROVENANCE OF DESIGN

  • In myExperiment, when people upload a workflow they have the opportunity to indicate where it came from – perhaps they had previously downloaded someone else's workflow, or were inspired by others, and this is about recording the provenance of the workflow design and giving due credit. I would very much like Wf4ever to make this process "assistive" so that when a workflow/pack/RO is uploaded the system provides recommendations for the provenance metadata.  There are many ways this could be done.  
    • One, in the case of Taverna, is that the workflow representation itself carries an internal identifier which helps establish its provenance. 
    • It could also be done by matching the uploaded workflow against the content (RO store) and suggesting the antecedents of the design, for example using subgraph matching.  For those of you concerned with RO internals, please note this requirement for metadata that helps establish the provenance of the design.  I deliberately say "assistive" rather than "automated" - someone could download a workflow and edit it into something completely different, in which case the link to the original workflow might not be valid.
  • An important adjacent use case is the tooling to support software development.  This permits multiple people to develop software components with excellent change tracking, and with an established culture of paying due attention to copyright, licence and intellectual rights flow. One piece of software may be an assembly of parts with very different provenance (think RO). I believe there is a close parallel to our work. Indeed, as a thought experiment at least, it's worth thinking how we'd do Wf4Ever using tools like Github.  If we are not already, perhaps we should also be drawing on the work on provenance in software engineering (e.g. at Irvine).
  • Another adjacent use case is of course academic papers and the culture of authorship, citation and acknowledgement (I find it useful to think of ROs as both software and papers!)  This links into the work on provenance in scientific discourse, which is a world I believe Wf4Ever is already well connected with. Indeed we mustn't forget that external graph – e.g. There are papers citing myExperiment workflows, which we track by self-reporting and by a discovery process (a.k.a. Google!)  I met this week with the British Library about getting DOIs for myExperiment workflows (and potentially Wf4Ever ROs) which will help this interconnection.
  1. Feb 20, 2012

    Esteban García Cuesta says:


    Reading all the comments, I think it would be good to define the different states and how to get from one to the other, and moreover to link them to RO properties and resources. The question goes to Pique: what RO properties or restrictions should an RO have to make a scientist decide that it is ready to be archived? Reproducibility? Or ready for a snapshot? Repeatability?

    I think this type of question can help to ground the ideas and define the states in a more concrete way than just "because the author wants to", since otherwise the states of different research objects are not going to be equivalent. Does it make sense to you?

         

    1. Mar 13, 2012

      Jose Enrique Ruiz says:


      A scientist may decide to 'copy' a Live RO as an RO Snapshot because

      • of simple backup needs
      • he thinks the current version may be helpful in the future, so he makes a Snapshot before taking the Live RO in another, slightly different direction
      • he wants to share/exchange it with someone in a "clean and more understandable" state
        • he needs feedback/internal review from collaborators (reproducibility is needed, but more important is that it should be easily understood) 
        • he needs collaborators or students to continue making progress in the RO
        • to provide feedback or progress asked by collaborators/supervisor
      • he thinks the current version introduces substantial changes with respect to the previous one, which makes it worth keeping a new, formally cleaned-up version
        • backup
        • the new version makes things in a slightly different way
          • then he would like to compare results issued from these two versions
          • reproduce results with this version in order to compare them with future versions
      • to abandon the experiment and keep it safe somewhere; someday he will have time to finish it :)
      • to keep in a formal way a mockup of an idea for a future experiment
      • to share/expose a tutorial on how to do things with a complex technology, archive, algorithm, service, etc..
      • ...