Funding year 2017 / Scholarship Call #12 / Project ID: 2418 / Project: Decentralised Data Provenance based on the Blockchain
The Blockchain is not a silver bullet for every problem, and like any new technology it brings its own limitations. One such limitation, specific to data provenance, is the threat of duplication. This kind of attack is best explained by a simple example. Imagine a scientist working with provenance-enabled software. The scientist performs some experiments, which produce results. Alongside these results, the provenance-enabled software also produces provenance data, which is saved to the Blockchain. So far everything is fine, as can be seen in the figure below.
However, the scientist is not entirely happy with the results. By manipulating the experiment and rerunning it, the scientist produces a new set of results backed by a new set of provenance data, which is also stored in the Blockchain. This can be seen in the following figure.
The scientist has now essentially duplicated the provenance data and hidden one set of provenance data behind another, as can be seen in the next figure.
To detect the duplication attempt, one would need to crawl the Blockchain for such provenance duplications. Although this is theoretically possible when the provenance data itself is saved in the chain, it becomes very difficult for solutions that use off-chaining and store only a hash in the chain, since we cannot tell what data the hash represents or whether it is provenance data in the first place.
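The difference between the two cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chain is modelled as a plain list of entries, and the field names (`resource`, `payload`, `hash`) as well as the functions `find_duplicates` and `off_chain` are hypothetical and not part of any real Blockchain API.

```python
import hashlib
import json
from collections import defaultdict


def find_duplicates(chain):
    """Group on-chain provenance records by the resource they describe.

    Entries that only carry a hash are returned separately as opaque,
    because nothing on-chain tells us what they refer to.
    """
    by_resource = defaultdict(list)
    opaque = []
    for entry in chain:
        payload = entry.get("payload")
        if payload is None:
            opaque.append(entry)
        else:
            by_resource[payload["resource"]].append(payload)
    # A resource with more than one record is a candidate duplication.
    duplicates = {r: recs for r, recs in by_resource.items() if len(recs) > 1}
    return duplicates, opaque


def off_chain(record):
    """Replace a record with the SHA-256 hash of its JSON form."""
    data = json.dumps(record, sort_keys=True).encode("utf-8")
    return {"hash": hashlib.sha256(data).hexdigest()}


# Hypothetical provenance records for the original and the manipulated run.
first_run = {"resource": "experiment-1", "run": 1}
second_run = {"resource": "experiment-1", "run": 2}

on_chain = [{"payload": first_run}, {"payload": second_run}]
hashed = [off_chain(first_run), off_chain(second_run)]

dups_on_chain, _ = find_duplicates(on_chain)
dups_hashed, opaque_hashed = find_duplicates(hashed)
```

With the full records on-chain, both runs show up under the same resource. With hashes only, the crawler sees two unrelated opaque entries and cannot flag anything.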
To solve this issue, among others, we introduced provenance networks. Provenance networks are inspired by trust networks from the domain of trust propagation. Like trust networks, they are directed weighted graphs, where each node is a smart contract. These smart contracts store provenance information and propagate trust towards other smart contracts in the provenance network; we also call them provenance contracts. We call the edges between provenance contracts links. Each link has a weight, which expresses a trust level, and a direction, which expresses from which contract the trust is propagated towards which other contract.
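As a rough mental model, the structure described above can be written down as an ordinary in-memory graph. The following Python sketch is not the smart contract code of the prototype; the class names, the method names, and the choice of a floating-point weight between 0 and 1 are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceContract:
    """A node of the network: holds provenance records (hypothetical model)."""
    name: str
    provenance: list = field(default_factory=list)


@dataclass(frozen=True)
class Link:
    """A directed, weighted edge: `source` propagates trust towards `target`."""
    source: str
    target: str
    weight: float  # assumed trust level; the value range is an assumption


class ProvenanceNetwork:
    def __init__(self):
        self.contracts = {}
        self.links = []

    def add_contract(self, name):
        contract = ProvenanceContract(name)
        self.contracts[name] = contract
        return contract

    def add_link(self, source, target, weight):
        if source not in self.contracts or target not in self.contracts:
            raise KeyError("both contracts must exist before linking")
        self.links.append(Link(source, target, weight))

    def incoming(self, name):
        """All links that propagate trust towards the given contract."""
        return [link for link in self.links if link.target == name]


network = ProvenanceNetwork()
network.add_contract("A")
network.add_contract("B")
network.add_link("A", "B", 0.9)  # A trusts B with an assumed level of 0.9
incoming_b = network.incoming("B")
```

Because links are directed, `network.incoming("A")` is empty here: B receiving trust from A does not imply that A receives any trust from B.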
The figure above shows an example of a provenance network. The blogger contract in this example "trusts" the contract of the informatics institute, since it has some reputation and is also trusted by the TU Wien main contract, which essentially makes it a part of the TU Wien. The informatics institute, however, only "knows" the blogger's provenance contract, since no reliable source propagates any trust towards this particular contract. Propagating trust is important because anybody can create new provenance contracts. This means that malicious users could simply try to replicate contracts or whole parts of the network. However, well-known anchor points with a high reputation, such as the TU Wien contract in our example, can be used to propagate trust towards correctly behaving contracts. Furthermore, even without any anchor points, users searching the whole provenance network will always be able to find all the provenance regarding a certain resource, and domain experts will then be able to distinguish the correct provenance data from the duplication attempt.
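The two ideas in this paragraph, searching the whole network and using an anchor point, can be sketched as two graph traversals. The Python example below is a simplified illustration, not the prototype's algorithm: the "Copycat" contract, the resource name `dataset-1`, the concrete weights, the modelling of "knows" as a link with weight 0, and the reduction of trust propagation to plain reachability over positive-weight links are all assumptions.

```python
from collections import deque

# (source, target, weight); a weight of 0.0 models "knows" without trust.
links = [
    ("TU Wien", "Informatics Institute", 1.0),
    ("Blogger", "Informatics Institute", 1.0),
    ("Informatics Institute", "Blogger", 0.0),
    ("Copycat", "Informatics Institute", 1.0),  # anyone may create a link
]

# Hypothetical provenance records stored in two of the contracts.
provenance = {
    "Informatics Institute": [{"resource": "dataset-1", "note": "original"}],
    "Copycat": [{"resource": "dataset-1", "note": "replicated"}],
}


def find_provenance(start, resource):
    """Breadth-first search over all links, ignoring direction and weight."""
    seen, queue, hits = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for record in provenance.get(node, []):
            if record["resource"] == resource:
                hits.append((node, record))
        for src, dst, _ in links:
            if node in (src, dst):
                other = dst if node == src else src
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return hits


def trusted_from(anchor):
    """Contracts reachable from the anchor along positive-weight links."""
    seen, queue = {anchor}, deque([anchor])
    while queue:
        node = queue.popleft()
        for src, dst, weight in links:
            if src == node and weight > 0 and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


hits = find_provenance("Blogger", "dataset-1")
trusted = trusted_from("TU Wien")
suspicious = [name for name, _ in hits if name not in trusted]
```

The search returns both records for `dataset-1`, so nothing is hidden. Only the record held by a contract reachable from the anchor counts as trusted in this sketch; the other one is left for a domain expert to judge. The copycat's own outgoing link does not help it, because trust flows only in the direction of a link.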
Provenance networks also fulfill a second very important task: they allow us to search for provenance information across use cases and domains. Different provenance use cases often have different provenance models, which makes cooperation between them hard. Provenance networks, however, do not depend on one specific provenance model, allowing us to create a common search space across different provenance domains and use cases. For more details on this topic, refer to my master's thesis or to the project prototype on GitHub.
Svetoslav Videnov
My master's thesis aims to combine the advantages of the blockchain with data provenance. The blockchain is a distributed ledger that allows data to be persisted in an immutable way. Data provenance is an approach to tracking what happened to data, thereby building trust in that data.