- Introduction
- Main Goals
- Main Outcomes
- Story:
- SPARQL endpoint
- Demo1: Algorithm to order a provenance trace by querying wfprov
- Demo2 : Algorithm to query provenance and obtain workflows with a common process
- Demo3 : Some general SPARQL queries
- Adding data to the 4Store endpoint
- Generating a RO including wfprov information from Wings (taken from Graham's mail)
- Validation
- Discussion review/reflection
- Links
Introduction
This page summarizes showcase 22 of the sprint 2 phase, which corresponds to the Jira story WFE-204.
Several types of provenance are being worked on in the project [26] [27]. Among others, it is important to capture the trace of the workflow run in order to determine some characteristics or properties of the research object, or to allow it to be reviewed. The terminology for this type of provenance can be found at [27]; it relies on the wfdesc and wfprov ontologies [28]. Some examples can also be found at [29], indicating where and how each of the ontologies is applied.
The main purpose of this showcase is to provide a way of querying the provenance of workflow results through a SPARQL endpoint. This data will be used later, among other functionalities, to address some of the issues related to evaluating the quality of a research object and some of its properties (e.g. reproducibility), or to provide deeper information about a specific execution of a scientific experiment (e.g. to list other workflows that also use some of its processes).
There are also relations between this provenance and other types of provenance but the study of these relations is out of the scope of the showcase.
Regarding the conversion from the two main workflow sources, PROV-O [31] was chosen as a bridge for the Taverna -> wfprov conversion, and a direct OPMW -> wfprov conversion was chosen for WINGS, though the OPMW -> PROV-O alignment has also been studied.
Main Goals
- Export one example of workflow execution/run provenance from Taverna and Wings. (A longer-term goal would be to populate the wf4ever portal with data on a large scale.)
- Allow the imported examples to be queried through a SPARQL endpoint
- Identify the examples to be exported
- Selection/identification of the models to be used
- Incorporate the selected provenance into a concrete RO
- Provide a simple visualization format for the provenance of a workflow execution/run
Main Outcomes
- WINGS -> OPMW -> wfprov [25]
- Taverna -> PROV-O -> wfprov (set of tools for conversion [5])
- Set of tools for ordering wfprov data [23]
- Identification of a provenance scenario [7]
- Populate the repository with 2 examples of wfprov and 1 complete RO [13][21] (populating the wf4ever portal should now be more straightforward)
- Initial SPARQL endpoint [8]
- Queries over the examples [21]
- Algorithm to order the traces of provenance using wfprov format [30]
Story:
A workflow example that has been used in many demonstrations is a Protein Discovery workflow. It follows the basic text mining procedure to produce proteins that were found together with the terms in the input query in the abstracts of biomedical papers. It may be a useful example for proofs of concept: http://www.myexperiment.org/workflows/74.html. Make sure to set the maximum number of abstracts to parse to a low number for quick results. An example input is given in the input description.
Story: As a researcher, I will be able to select the output of a workflow run and then obtain the information that shows that this output was used to provide output on the 12th of December 2011, that it did so as part of the protein discovery workflow, that the purpose of this service was to suggest biological concepts involved in Metabolic Syndrome, and that the steps taken to achieve those results were optimize_for_medline, aida_retrieve_documents_in_parts, etc. This could, for instance, follow after clicking on a workflow run reference in uvp1.1 [7].
Technical implementation: The different items of the RO are listed with their URIs in a human-readable form. One of the items of the RO is the provenance of the workflow execution, which is an annotation that includes the associated file. The provenance information is captured with the wfprov ontology and describes the run of the experiment. As a user, I would like to be able to see the different steps of the execution in order, together with the intermediate and final results that were generated. All the processes and results have to be accessible through a SPARQL endpoint and also have to be displayed in a printer-friendly format (or similar) for review by the end user, showing some information related to the run (timestamp and the agent who executed it).
SPARQL endpoint
We have installed and used a SPARQL endpoint (4store [18]) to test different content related to the showcase (especially wfprov). The endpoint holds information about three different workflow executions: two of them come from WINGS [15][16] and the other one is a Taverna workflow. The SPARQL endpoint is accessible at [8], and users can test some queries through its UI [17].
Demo1: Algorithm to order a provenance trace by querying wfprov
1) As a researcher, I want to get information about ROs, select one of them to see/review its provenance, then choose one of its executed processes and see which other workflows also contain that process.
In order to test this story, which makes use of wfprov, we have developed an algorithm and a small Java app that shows a provenance trace ordered by process execution.
The algorithm retrieves all the executed processes of a workflow execution. Once we have them, we get the inputs and outputs of each process. Then we establish the relations between them and start to build a provenance trace based on the inputs that are available: each executed process generates outputs, which become available inputs for the following processes that need them, and so on. The source code is stored on GitHub [19]. The app allows the user to select one of the three execution cases stored in the repository and shows the results step by step. The executable app is also available on GitHub [20].
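The ordering step described above can be illustrated with a minimal Python sketch. It is not the Java implementation from [19]; it only assumes that the inputs and outputs of each executed process have already been retrieved (for example via wfprov:usedInput and wfprov:wasOutputFrom) into two dictionaries. The function name order_trace and the process/artifact names in the example are hypothetical.

```python
from typing import Dict, List, Set


def order_trace(used: Dict[str, Set[str]],
                generated: Dict[str, Set[str]]) -> List[str]:
    """Order processes so each one comes after the producers of its inputs.

    used:      process -> artifacts it consumed
    generated: process -> artifacts it produced
    """
    pending = set(used) | set(generated)

    # Artifacts produced inside the run; anything else that is consumed
    # is treated as an initial input and is available from the start.
    produced = set().union(*generated.values())
    available: Set[str] = set()
    for process in pending:
        available |= used.get(process, set()) - produced

    ordered: List[str] = []
    while pending:
        # A process is ready when all of its inputs are available.
        ready = sorted(p for p in pending
                       if used.get(p, set()) <= available)
        if not ready:
            raise ValueError("trace is cyclic or has missing producers")
        for process in ready:
            ordered.append(process)
            # Its outputs become available inputs for later processes.
            available |= generated.get(process, set())
            pending.discard(process)
    return ordered


# Hypothetical three-step run, deliberately listed out of order.
example_used = {
    "extract": {"documents"},
    "retrieve": {"query"},
    "optimize": {"raw_query"},
}
example_generated = {
    "extract": {"proteins"},
    "retrieve": {"documents"},
    "optimize": {"query"},
}
print(order_trace(example_used, example_generated))
```

Processes that become ready in the same round are sorted by name only to make the result deterministic; the trace itself does not define an order between them.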
The queries and results can be found at [32], and at [35] for the protein case obtained from Taverna; the output of following the workflow run step by step is also shown next. Each step is a process, listed with the inputs it used and the outputs it produced.
Provenance trace: Ligand Binding Sites Comparison
Step 1:
PROCESS: http://wings.isi.edu/opmexport/resource/ProcessInstance/COMPARELIGANDBINDINGSITESV211332778606534
INPUTS:
http://wings.isi.edu/opmexport/resource/ArtifactInstance/68ED7970F5C2CA17B6F867B0F223D194
http://wings.isi.edu/opmexport/resource/ArtifactInstance/272CE70CCBB30666D4310D15280C405B
http://wings.isi.edu/opmexport/resource/ArtifactInstance/66D929138F1D484C8D8ADC86D5BE7477
OUTPUTS:
http://wings.isi.edu/opmexport/resource/ArtifactInstance/6B7AB2E53A9186CACAD94833DD34EF8E
http://wings.isi.edu/opmexport/resource/ArtifactInstance/65E39FED3439AA6650E6AE55314BE6AD
Step 2:...
...
Step 6:
PROCESS: http://wings.isi.edu/opmexport/resource/ProcessInstance/RAWINTERACTIONNETWORKMERGER1332778606534
INPUTS:
http://wings.isi.edu/opmexport/resource/ArtifactInstance/38BB806F498FA762C244979C80412F0D
http://wings.isi.edu/opmexport/resource/ArtifactInstance/0FD08D3BDAC0216F1F892D8BB9B3E7E1
OUTPUTS:
http://wings.isi.edu/opmexport/resource/ArtifactInstance/CC53D47E72A3C6C144DEEF5272CEF5BA
Demo2 : Algorithm to query provenance and obtain workflows with a common process
The demo shows the following scenario: as a researcher, I want to get information about ROs, select one of them to see/review its provenance, then choose one of its executed processes and see which other workflows also contain that process [7]. The queries and one example can be found at [22].
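As a rough illustration of this scenario, the lookup can be thought of as two steps: find the process description of the selected executed process (wfprov:describedByProcess), then find the workflows that contain that description (wfdesc:hasProcess). The Python sketch below runs those two steps over an in-memory list of triples; it is not the set of queries stored at [22], and the function name, the exclude parameter, and the sample identifiers are assumptions made for the example.

```python
DESCRIBED_BY_PROCESS = "wfprov:describedByProcess"
HAS_PROCESS = "wfdesc:hasProcess"


def workflows_sharing_process(triples, process_run, exclude=None):
    """Return the workflows that contain the process behind process_run.

    triples:     iterable of (subject, predicate, object) strings
    process_run: identifier of the selected executed process
    exclude:     optional workflow to leave out (e.g. the current one)
    """
    # Step 1: process descriptions the executed process refers to.
    descriptions = {o for (s, p, o) in triples
                    if s == process_run and p == DESCRIBED_BY_PROCESS}
    # Step 2: workflows that declare any of those descriptions.
    workflows = {s for (s, p, o) in triples
                 if p == HAS_PROCESS and o in descriptions}
    workflows.discard(exclude)
    return sorted(workflows)


# Hypothetical data: two workflows share the "align" process.
sample = [
    ("run1/align-exec", DESCRIBED_BY_PROCESS, "proc/align"),
    ("wf/A", HAS_PROCESS, "proc/align"),
    ("wf/B", HAS_PROCESS, "proc/align"),
    ("wf/C", HAS_PROCESS, "proc/merge"),
]
print(workflows_sharing_process(sample, "run1/align-exec", exclude="wf/A"))
```

Against the endpoint, the same two steps would correspond to a single graph pattern joining wfprov:describedByProcess and wfdesc:hasProcess on the shared process variable.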
Demo3 : Some general SPARQL queries
Here we show some SPARQL queries (also available at [34]) that might be useful for users who want to test the content of the repository with respect to the updated provenance data.
Prefixes used:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wfprov: <http://purl.org/wf4ever/wfprov#>
PREFIX wfdesc: <http://purl.org/wf4ever/wfdesc#>
1) Get all the wfRuns stored at the endpoint
SELECT ?wfRun WHERE {
?wfRun a wfprov:WorkflowRun}
2) Get all the processes used in a wfRun
SELECT DISTINCT ?process WHERE {?process wfprov:wasPartOfWorkflowRun <ID WfRun>.}
3) Get all the inputs/outputs that take part in a wfRun
SELECT DISTINCT ?input ?output WHERE {
{?output wfprov:wasOutputFrom ?process. ?process wfprov:wasPartOfWorkflowRun <ID WfRun>}
UNION
{?process wfprov:usedInput ?input. ?process wfprov:wasPartOfWorkflowRun <ID WfRun>}}
4) Get all the inputs of a specific executed process
SELECT ?input WHERE {<ID Process> wfprov:usedInput ?input}
5) Get all the outputs of a specific executed process
SELECT ?output WHERE {?output wfprov:wasOutputFrom <ID Process>}
6) Get the inputs needed to produce a specific output
SELECT ?input WHERE { <ID Output> wfprov:wasOutputFrom ?process.
?process wfprov:usedInput ?input}
7) Get the wfRuns where a specific input has been involved
SELECT DISTINCT ?wfRun WHERE {?process wfprov:usedInput <ID Input>.
?process wfprov:wasPartOfWorkflowRun ?wfRun}
8) Get the wfRuns and their executed processes where a specific input has been involved
SELECT DISTINCT ?wfRun ?process WHERE { ?process wfprov:usedInput <ID Input>.
?process wfprov:wasPartOfWorkflowRun ?wfRun}
9) Get the wfRun and the executed process that generated a specific output
SELECT DISTINCT ?wfRun ?process WHERE { <ID Output> wfprov:wasOutputFrom ?process.
?process wfprov:wasPartOfWorkflowRun ?wfRun}
10) Get the process that describes an executed process
SELECT ?process WHERE { <ID Process execution> wfprov:describedByProcess ?process}
11) Get all the wfRuns where a process has been involved
SELECT DISTINCT ?wfRun WHERE { ?wfRun wfprov:describedByWorkflow ?wf.
?wf wfdesc:hasProcess <ID Process>}
12) Get the workflows that have a process
SELECT DISTINCT ?wf WHERE { ?wf wfdesc:hasProcess <ID Process>}
13) Get all the processes that are part of a workflow
SELECT DISTINCT ?process WHERE { <ID Workflow> wfdesc:hasProcess ?process}
14) Get all the processes of a workflow and their executions
SELECT DISTINCT ?process ?processRun WHERE { <ID Workflow> wfdesc:hasProcess ?process.
?processRun wfprov:describedByProcess ?process}
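For users who prefer to run these queries from a script rather than through the UI, the following Python sketch shows one possible way to send a SELECT query to a SPARQL endpoint over HTTP GET. It assumes the endpoint follows the standard SPARQL protocol (a query parameter, with JSON results requested through the Accept header); the helper names build_query_url and run_select are hypothetical, and the endpoint at [8] may no longer be online.

```python
import json
import urllib.parse
import urllib.request

# Endpoint [8] from this page; availability is not guaranteed.
ENDPOINT = "http://test-wf4ever.isoco.com/sparql/"

# Subset of the prefixes listed above, prepended to every query.
PREFIXES = (
    "PREFIX wfprov: <http://purl.org/wf4ever/wfprov#>\n"
    "PREFIX wfdesc: <http://purl.org/wf4ever/wfdesc#>\n"
)


def build_query_url(endpoint, query):
    """Build the GET URL for a query, URL-encoding prefixes and body."""
    params = urllib.parse.urlencode({"query": PREFIXES + query})
    return endpoint + "?" + params


def run_select(endpoint, query, timeout=30):
    """Send a SELECT query and return the list of result bindings.

    Assumes the server answers in the SPARQL JSON results format.
    """
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return results["results"]["bindings"]
```

For example, query 1 could be run with run_select(ENDPOINT, "SELECT ?wfRun WHERE { ?wfRun a wfprov:WorkflowRun }"), and each returned binding would hold the run URI under row["wfRun"]["value"].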
Adding data to the 4Store endpoint
These are some notes from Aleix on how to upload content to the repository. The method he used cannot be used by others because special permissions are needed. If you want to upload more RDF files, it may be possible to use the methods explained at [x2].
How to add RDF files to the 4store repository:
Given RDF files that contain the needed information about a workflow (wfprov and wfdesc), we can update our database by adding them to the repository.
Stop the running 4store HTTP server and the backend:
killall 4s-httpd
sudo pkill -f '^4s-backend wf4ever$'
Add the RDF files:
4s-import -v wf4ever Name1.rdf Name2.rdf ... NameX.rdf
Restart the backend and the server:
sudo 4s-backend wf4ever
4s-httpd -p 8000 wf4ever
More information about the repository can be found at [x1] and about the SPARQL server at [x2].
[x1] http://4store.org/
[x2] http://4store.org/trac/wiki/SparqlServer
Generating a RO including wfprov information from Wings (taken from Graham's mail)
The sequence of queries used to extract wfdesc/wfprov information for the Wings workflow example has been automated using a bash script and cURL.
It's all part of the ro-catalogue entry at https://github.com/wf4ever/ro-catalogue/tree/master/v0.1/WingsProvenanceExample
The files needed to re-run the process are:
- getWingsData.sh (shell script)
- prefixes.sparql (file of common prefixes for SPARQL queries)
The scripts used to create an RO structure with these data are:
- makeresourcelists.sh (uses Jena tools for querying local RDF)
- make.sh (uses ro-manager to create RO structure for the wings example data)
These files are all part of the resulting RO.
[36][37] are two examples created by this procedure.
Validation
The validation and feedback provided by Marco can be found at: http://www.wf4ever-project.org/wiki/display/docs/2012/04/08/RO+provenance+query+tests+by+users+%28Showcase+22+validation+by+Marco%29
Discussion review/reflection
The following is a summary of the review/reflection:
In general, all the team members agreed on the following:
- Need for better coordination and definition of tasks at the beginning (some lack of communication initially)
- End-user stories were more difficult to match than expected
- Towards the end, sprinting seemed to be working a lot better
- Problem of scarce resources over-committed elsewhere (Stian, Marco).
- Stand-ups worked well, though some people were not able to attend (overcommitted) but caught up later.
- An open Skype chat seemed to work very well for updates and quick one-off questions. It turned out that it was sometimes the best way of sharing knowledge.
- Very good overall collaboration and team work.
Identified points that can be applied as improvements in the next sprints:
- Hold stand-ups, optionally with a telco too if it seems necessary
- Keep the Skype chat open to allow updates, catching up, and quick one-off questions.
- Add the stand-up schedule to Google Calendar
Some technical issues:
- Should resources of resources of an RO be linked directly, by including them as part of the RO, or indirectly through the resource? (Open question from Khalid.) IMO indirectly, to avoid duplicates and to avoid overloading the RO with extra data, which might make it too big.
- Taverna RO links for the I/O of the executed processes are not created (a techie approach should be adopted here)
- Sharing is very ad-hoc (inconsistencies in Wings data extraction - suggest more use of scripting for repeatability)
- We would like to see more use of GitHub
Agreed post-sprint actions:
- Finish the user feedback by developing a simple visual interface to show the two demos
- Wrap up everything, especially the technical tools which have been used to create the dataset of provenance of workflow results, and the procedures used for the demos/testing
- Gather proposals from team members for possible next actions in order to achieve:
- Assisting detection and explanation of workflow decay
- RO checklisting (including wfprov possibilities)
- Provenance support for first steps towards replayability and reproducibility
Links
[1] http://www.wf4ever-project.org/wiki/display/docs/Review+and+publication+with+ROs
[2] http://www.wf4ever-project.org/wiki/display/docs/RDF+encoding+using+the+RO+Model
[3] https://github.com/wf4ever/ro-catalogue/tree/master/v0.1
[4] http://wind.isi.edu:10035/catalogs/java-catalog/repositories/WINGSTemplatesAndResults (WINGS ENDPOINT)
[5] http://www.wf4ever-project.org/wiki/display/docs/RO+Examples
[6] http://www.wf4ever-project.org/wiki/display/docs/RDF+encoding+using+the+RO+Model
[7] http://www.wf4ever-project.org/wiki/display/docs/Selection+1+of+User+View+on+Provenance
[8] http://test-wf4ever.isoco.com/sparql/
[9] http://myexperiment.org/workflows/74
[10] https://github.com/wf4ever/ro-catalogue/blob/master/v0.1/simple-requirements/minim-checklist.sh
[11] http://www.mygrid.org.uk/dev/wiki/display/developer/5.+Running+Taverna+from+Eclipse
[12] https://jira.man.poznan.pl/jira/browse/WFE-327
[13] https://github.com/wf4ever/ro-catalogue/tree/master/v0.1/wf74
[14] https://github.com/wf4ever/ro/tree/master/mapping/prov-o
[15] http://wings.isi.edu/opmexport/page/resource/Account/ACCOUNT1332778606534
[16] http://wings.isi.edu/opmexport/page/resource/Account/ACCOUNT1332778615941
[17] http://test-wf4ever.isoco.com/test/
[18] http://4store.org/
[19] https://github.com/wf4ever/testing-wfprov
[20] https://github.com/wf4ever/testing-wfprov/tree/master/wfprov%20Executable
[21] https://github.com/wf4ever/ro-catalogue/tree/master/v0.1/WingsProvenanceExample
[23] https://github.com/wf4ever/testing-wfprov
[24] http://www.wf4ever-project.org/wiki/display/docs/RO+interoperability
[25] http://www.wf4ever-project.org/wiki/display/docs/Research+Object+Provenance+Types
[26] http://www.wf4ever-project.org/wiki/display/docs/Varieties+of+Provenance
[28] http://www.wf4ever-project.org/wiki/display/docs/Research+Object+Vocabulary+Specification+v0.1
[29] http://www.wf4ever-project.org/wiki/display/docs/Workflow+templates%2C+instances+and+runs
[30] https://github.com/wf4ever/testing-wfprov
[31] http://dvcs.w3.org/hg/prov/raw-file/tip/ontology/ProvenanceFormalModel.html
[32] https://github.com/wf4ever/testing-wfprov/blob/master/Demo%20results/demo1.txt
[33] https://github.com/wf4ever/testing-wfprov/blob/master/Demo%20results/Demo2.txt
[34] https://github.com/wf4ever/testing-wfprov/blob/master/Demo%20results/General_queries.txt
[35] https://github.com/wf4ever/testing-wfprov/blob/master/Demo%20results/Demo3%20%28taverna%29.txt
[36] https://github.com/wf4ever/ro-catalogue/tree/master/v0.1/WingsProvenanceExample
[37] https://github.com/wf4ever/ro-catalogue/tree/master/v0.1/wf74