Spreadsheets are the most common form of data storage and exchange in the public and private sectors. Lack of standardisation in nomenclature, especially in highly complex organisations, often produces hard-to-reconcile data structures. Before these diverse, poorly structured and scattered tabular data can be reused, they must first be transformed to conform to a standardised structural schema. The time and complexity of transformation at scale are a major obstacle to discovering whether these data are useful in the first place.
Current data tidying toolkits are data-centric, promoting workflows where curators restructure data directly at row and column level, potentially interacting directly with database environments. This is labour- and skill-intensive. It usually requires time-consuming development of data-structuring scripts and source code, and the results are often sensitive to small format changes.
Our solution
whyqd was created to serve a continuous data wrangling process. That process includes collaborating on more complex messy sources, ensuring the integrity of the source data, and producing a complete audit trail from the data imported into our database back to its source. It is a spinout of the work performed as part of our openLocal.uk commercial location data project.
We generalised the work demonstrated here:
whyqd’s focus is on auditable restructuring of complex and scattered tabular data to conform to a single destination schema. Validation is supported, but is not its purpose. It is available both as a Python package and as a “no-code” visual web-based application.
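To make the idea of auditable, schema-centric restructuring concrete, here is a minimal sketch in plain Python. It does not use whyqd's API and is not how whyqd is implemented. The destination fields, the source column names, the RENAME action and the helper functions are all hypothetical, chosen only for illustration. The point it shows is that the transformation is stored as data (a crosswalk) and that checksums of the source, the crosswalk and the output together give a trail back to source.

```python
import hashlib
import json

# Hypothetical destination schema: the fields every source must end up with.
DESTINATION_FIELDS = ["site_name", "postcode", "floor_area_m2"]

# Hypothetical crosswalk for one messy source. The steps are declarative data,
# not an ad hoc script, so they can be stored and reviewed alongside the output.
CROSSWALK = [
    {"action": "RENAME", "source": "Property Name", "destination": "site_name"},
    {"action": "RENAME", "source": "Post Code", "destination": "postcode"},
    {"action": "RENAME", "source": "Area (sq m)", "destination": "floor_area_m2"},
]


def checksum(obj):
    """Return a SHA-256 hex digest of any JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apply_crosswalk(rows, crosswalk, destination_fields):
    """Restructure source rows to the destination schema.

    Returns the restructured rows and an audit record. The source rows
    are never modified.
    """
    mapping = {
        step["source"]: step["destination"]
        for step in crosswalk
        if step["action"] == "RENAME"
    }
    # Refuse to run if the crosswalk leaves a destination field unpopulated.
    missing = set(destination_fields) - set(mapping.values())
    if missing:
        raise ValueError(f"crosswalk does not populate: {sorted(missing)}")

    # Source columns not named in the crosswalk are dropped from the output.
    output = [
        {dest: row.get(src) for src, dest in mapping.items()}
        for row in rows
    ]
    audit = {
        "source_checksum": checksum(rows),
        "crosswalk": crosswalk,
        "output_checksum": checksum(output),
    }
    return output, audit


if __name__ == "__main__":
    source_rows = [
        {"Property Name": "Town Hall", "Post Code": "AB1 2CD", "Area (sq m)": 120, "Notes": "n/a"},
    ]
    restructured, audit_record = apply_crosswalk(source_rows, CROSSWALK, DESTINATION_FIELDS)
    print(restructured)
    print(json.dumps(audit_record, indent=2))
```

Because the crosswalk is a separate artefact from the data, a small format change in a source means editing one mapping entry rather than rewriting a script, and anyone holding the source file can re-run the steps and compare checksums.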
whyqd received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from EOSC Future through the RDA Open Call mechanism, based on evaluations of external, independent experts.
Outcomes
whyqd went live in January 2024 and is now core to our openLocal.uk project. It has also been instrumental in other RDA-EOSC projects, supporting data interoperability. Our work was published as a paper and presented at the European Spreadsheet Risks Interest Group Proceedings, held in London in 2024.