Whyqd data interoperability platform

23 September 2023

Marseille | Research Data Alliance and EU Open Science Cloud

whyqd (/wɪkɪd/) is a curatorial toolkit intended to produce well-structured and predictable data for research analysis. It provides an intuitive method for schema-to-schema data transforms for research data reuse, and for restructuring ugly data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, ensure schema interoperability for tabular data using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.


Spreadsheets are the most common form of data storage and exchange in the public and private sectors. A lack of standardised nomenclature, especially in highly complex organisations, often produces hard-to-reconcile data structures. Before these diverse, poorly structured and scattered tabular data can be reused, they must first be transformed to conform to a standardised structural schema. The time and complexity of transformation at scale is a major obstacle to discovering whether these data are useful in the first place.

Current data tidying toolkits are data-centric, promoting workflows where curators restructure data directly at row and column level, potentially interacting directly with database environments. This is labour- and skill-intensive. It is often accomplished through time-consuming development of data structuring scripts and source code, which are themselves sensitive to small format changes.

Our solution

whyqd was created to serve a continuous data wrangling process, including collaboration on more complex messy sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our database, back to source. It is a spinout of the work performed as part of our openLocal.uk commercial location data project.

We generalised the work demonstrated here:

whyqd’s focus is on auditable restructuring of complex and scattered tabular data to conform to a single destination schema. Validation is supported, but is not its purpose. It is available both as a Python package and as a “no-code” visual web-based application.
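To make the idea of an auditable, schema-to-schema restructuring more concrete, here is a minimal sketch in plain Python. It does not use the whyqd package or reproduce its API. The column names, the `CROSSWALK` structure, the `RENAME` action and the `apply_crosswalk` helper are all hypothetical, chosen only for illustration. The sketch shows the general principle: the transformation is declared as data, applied without modifying the source, and recorded alongside a checksum of the source so the output can be traced back.

```python
import csv
import hashlib
import io

# Hypothetical messy source, as it might arrive in a spreadsheet export.
SOURCE_CSV = (
    "Local Authority Code,Billing Ref.,RV (GBP)\n"
    "E06000001,A1,12000\n"
    "E06000002,B2,8500\n"
)

# Hypothetical destination schema: the fields every source must end up with.
DESTINATION_FIELDS = ["la_code", "ba_ref", "rateable_value"]

# Hypothetical crosswalk: declarative (action, source column, destination field) steps.
CROSSWALK = [
    ("RENAME", "Local Authority Code", "la_code"),
    ("RENAME", "Billing Ref.", "ba_ref"),
    ("RENAME", "RV (GBP)", "rateable_value"),
]


def apply_crosswalk(source_text, crosswalk, destination_fields):
    """Apply the crosswalk to CSV text; return (rows, audit record)."""
    rows = list(csv.DictReader(io.StringIO(source_text)))
    output = []
    for row in rows:
        new_row = {}
        for action, source_col, dest_field in crosswalk:
            if action != "RENAME":
                raise ValueError(f"Unsupported action in this sketch: {action}")
            new_row[dest_field] = row[source_col]
        # Every destination field must be populated by some crosswalk step.
        missing = [f for f in destination_fields if f not in new_row]
        if missing:
            raise ValueError(f"Destination fields not populated: {missing}")
        output.append(new_row)
    # The audit record ties the output to the exact source bytes and steps used.
    audit = {
        "source_sha256": hashlib.sha256(source_text.encode("utf-8")).hexdigest(),
        "crosswalk": [list(step) for step in crosswalk],
        "rows_in": len(rows),
        "rows_out": len(output),
    }
    return output, audit


restructured, audit_trail = apply_crosswalk(SOURCE_CSV, CROSSWALK, DESTINATION_FIELDS)
```

Because the crosswalk is data rather than a script, it can be stored, reviewed and re-run when a new version of the source arrives. A changed checksum in the audit record signals that the source itself has changed.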

whyqd received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from EOSC Future through the RDA Open Call mechanism, based on evaluations of external, independent experts.

Outcomes

whyqd went live in January 2024 and is now core to our openLocal.uk project. It has also been instrumental in other RDA-EOSC projects, supporting data interoperability. Our work was presented as a paper at the European Spreadsheet Risks Interest Group conference held in London in 2024, and appears in its proceedings.

Related projects

RDA MOMSI multi-omics metadata standards dashboard
31 January 2025

The Multi-Omics Metadata Standards Integration (MOMSI) Research Data Alliance Working Group wanted to build a machine-actionable, query-based, interactive dashboard. The dashboard will render information from their existing Landscape Review, currently held in a Google Sheet.

Assessment of impact of proposed UK business rates
24 January 2025

In its Autumn Budget, the UK Government committed to transforming business rates over the course of the parliament into a fairer system that supports investment and is fit for the 21st century. Businesses have raised concerns that the business rates system disincentivises investment and is slow to respond to changing economic conditions, and have called for a response.

RDA FairTracks schema interoperability
30 November 2024

Omnipy and whyqd (/wɪkɪd/) are independently developed Python libraries offering general functionality for auditable and executable metadata mappings. In this project, we will integrate Omnipy and whyqd to develop executable mappings that transform existing metadata from biodiversity projects, such as ERGA, to conform to the FGA-WG metadata model, kickstarting the process of FAIRifying genome annotation GFF3 files.
