Friday, March 7, 2014

Some thoughts on linking data sources / Bringing down the data silos

Agriculture and silos are two terms that play nicely together when referring to agricultural products: silos provide a good means of storing large volumes of harvested crops and a controlled environment for their post-harvest management. However, when referring to agricultural data, one may safely claim that data silos are dead. Of course, they still exist, but it is only a matter of time before they are either linked to existing backbones or disappear altogether. Nikos Manouselis has already presented this "data silos" issue very nicely in a really interesting presentation - don't you agree?


Let me share my personal experience here. My first contact with EU-funded educational and research projects was the Organic.Edunet eContentPlus project, which managed to create a network of content providers on organic agriculture, agroecology and other green topics. These content providers followed a single, common methodology for creating metadata records for their educational resources (= harmonization), and these metadata became available through a single point of access, the Organic.Edunet Web portal. This was a case of harmonization, networking and public exposure.




Then came other projects (ICT-PSP, FP7) in which I was also involved, like VOA3R, Organic.Lingua and agINFRA. What do these projects have in common? All of them were based on, or at least included, large volumes of work on metadata harmonization, linking between different data sources, and making data and metadata public. They managed to interconnect various digital data sources like institutional repositories, digital libraries, databases and educational repositories by applying a harmonization layer (e.g. a common metadata standard/schema, common vocabularies and other KOSs) and by providing a linked data layer for linking heterogeneous data sources and aggregating data and metadata from the homogeneous ones. In fact, this linked agricultural data layer is, in my opinion, one of the most interesting and important outcomes of the agINFRA project. Using KOSs (Knowledge Organization Systems) as the backbone, various heterogeneous data sources can be linked as long as they are published online. Another related case was the mapping between the Organic.Edunet ontology and the AGROVOC thesaurus, which took place in the context of the Organic.Lingua project and was another step in the direction of linked data. I am also really glad to be (even partially) involved in work that is taking place towards the publication of germplasm and other biodiversity data as linked data, something that will allow the linking of these resources to other types of data like bibliographic and educational resources.
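To make the idea of a harmonization layer plus a KOS backbone a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how any of the projects above actually implemented it: the source names, field names, keywords and the example.org concept URIs are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical mappings from each source's native fields to a common schema.
FIELD_MAPS = {
    "repository_a": {"dc_title": "title", "dc_subject": "keywords"},
    "library_b": {"name": "title", "tags": "keywords"},
}

# Hypothetical KOS: local keywords mapped to shared concept URIs.
# The example.org URIs are placeholders, not real vocabulary identifiers.
KOS = {
    "organic farming": "http://example.org/kos/organic-agriculture",
    "organic agriculture": "http://example.org/kos/organic-agriculture",
    "composting": "http://example.org/kos/compost",
}


def harmonize(source, record):
    """Rename native fields to the common schema and attach KOS concepts."""
    mapping = FIELD_MAPS[source]
    common = {
        target: record[native]
        for native, target in mapping.items()
        if native in record
    }
    common["source"] = source
    # Keep only keywords the KOS knows about, as a sorted list of unique URIs.
    common["concepts"] = sorted(
        {KOS[k.lower()] for k in common.get("keywords", []) if k.lower() in KOS}
    )
    return common


def link_by_concept(records):
    """Group record titles under every shared concept URI they refer to."""
    index = defaultdict(list)
    for rec in records:
        for concept in rec["concepts"]:
            index[concept].append(rec["title"])
    return dict(index)


records = [
    harmonize(
        "repository_a",
        {"dc_title": "Intro to organic farming", "dc_subject": ["Organic farming"]},
    ),
    harmonize(
        "library_b",
        {"name": "Soil health handbook", "tags": ["organic agriculture", "composting"]},
    ),
]
links = link_by_concept(records)
```

The two records come from sources with different schemas and different keywords, yet both end up under the same concept in `links`, because the vocabulary, not the source, decides what is related.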



There are also cases of linking on a higher, global level compared to the project-based one. One such case is the Research Data Alliance (RDA), which aims to enhance the accessibility of research data and enable all stakeholders to get access to them. RDA provides a means for projects like the ones mentioned earlier and for other initiatives and organizations (like FAO, CIARD, IFPRI and INRA, just to mention a few) to join forces, share effort and resources, and make a leap forward. Another case is the Global Food Safety Partnership (GFSP), which aims to provide a centralized means of access to food safety capacity building by engaging stakeholders from both the public and the private sector. Global Open Data for Agriculture and Nutrition (GODAN) is another global initiative, which aims to support global efforts to make agricultural and nutritionally relevant data available, accessible, and usable for unrestricted use worldwide through the participation of public and private sector bodies. The G8 International Conference on Open Data for Agriculture, which took place in April 2013, boosted the development and progress of such initiatives by highlighting the need for opening access to data related to agriculture, setting the landscape and defining possible next steps in this direction. Among other things, it managed to identify the needs and engage key stakeholders.


Taking all this into consideration, it is hard for anyone to believe that in this era of linking and interlinking there is still room for data silos. While there are cases where data cannot be publicly exposed and shared (e.g. patents, privately funded research work and personal data, to name a few), linking and openly publishing data seems to be the only way to ensure the sustainability of these data and the involvement of all stakeholders. In the end, it is up to each data manager individually to decide whether to jump on the train and be part of the future or just remain part of history. ;-)