Until very recently, scientific manuscripts did not provide direct visibility into the underlying data. The result was a break in the chain between an assertion and the evidence on which it rests. One of the most difficult obstacles to scientific data sharing is academia's “publish or perish” culture, which gives researchers little incentive to share data. To address this, funding agencies are now implementing incentives that encourage data sharing and reuse. Another challenge is that electronic data collection methods support a wide range of forms and formats. The heterogeneity of scientific data, and therefore the magnitude of this problem, should not be underestimated. The work required to prepare data for sharing and reuse will be extensive and ongoing. The flexibility afforded by electronic data is a feature: it lets researchers pose questions without being limited in the kinds of data they collect to answer them. At the same time, it is a bug: it results in a body of scientific evidence that is essentially unshareable in its raw form. There are two pieces of good news: (1) computer programs can be written to harmonize data post hoc, and (2) all scientists use metrics that either are expressed in scientific units or are intended eventually to translate into them. New data sharing platforms create an ecosystem in which developers can require contributors to upload either already standardized data or raw data together with a method to standardize them. The scope of potential questions, however, will usually remain limited by the scope of the data available on that platform. In other words, data silos still exist, even if they are now wider and better connected.
We propose to create an e-commerce site where data creators and others can sell data recipes, each of which harmonizes a data set. Harmonizing can include tasks such as downloading, calibrating, resampling, appending and joining data. Data recipes will be transparent and repeatable, and all source data and metadata will be traceable back to their origins. We will provide tools so that authors can write the steps necessary to harmonize a data set (a recipe) and customers can purchase those recipes for use on their own cloud analysis platform. As long as customers have appropriate access to the data, the output will be as described.
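To make the idea of a recipe concrete, the following is a minimal sketch in Python, not the proposed implementation. It assumes a recipe can be modeled as an ordered list of named steps applied to tabular data, with a provenance log that records the sources and every step. The names `Step`, `Recipe`, `calibrate_f_to_c` and `resample_every_other`, the column names, and the source URI are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Table = List[Row]


@dataclass
class Step:
    """One named, repeatable transformation in a recipe (hypothetical type)."""
    name: str
    func: Callable[[Table], Table]


@dataclass
class Recipe:
    """An ordered list of steps plus the sources the data came from."""
    sources: List[str]
    steps: List[Step] = field(default_factory=list)

    def run(self, table: Table) -> Tuple[Table, List[str]]:
        """Apply every step in order; return the result and a provenance log."""
        log = [f"source: {s}" for s in self.sources]
        for step in self.steps:
            table = step.func(table)
            log.append(f"step: {step.name} -> {len(table)} rows")
        return table, log


def calibrate_f_to_c(table: Table) -> Table:
    """Convert an assumed 'temp_f' column (Fahrenheit) to 'temp_c' (Celsius)."""
    return [{"t": r["t"], "temp_c": (r["temp_f"] - 32.0) * 5.0 / 9.0} for r in table]


def resample_every_other(table: Table) -> Table:
    """Keep every second row, a simple stand-in for real resampling."""
    return table[::2]


# Illustrative raw data; in practice this would be downloaded from the source.
raw = [
    {"t": 0.0, "temp_f": 32.0},
    {"t": 1.0, "temp_f": 50.0},
    {"t": 2.0, "temp_f": 212.0},
]

recipe = Recipe(
    sources=["example://hypothetical/station-42.csv"],  # placeholder URI
    steps=[
        Step("calibrate_f_to_c", calibrate_f_to_c),
        Step("resample_every_other", resample_every_other),
    ],
)

harmonized, provenance = recipe.run(raw)
```

Because each step is a pure function of its input table, running the same recipe on the same source data yields the same output, which is the sense in which a recipe would be repeatable. The provenance log lists the sources and the steps applied, so a customer could trace the harmonized output back to its origins.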
Our system will create an incentive for people to perform the labor-intensive task of data harmonization. It is built on open-source software packages that are both well established and supported by active communities. We have limited the custom-written portions to those necessary to make the system work. As a result, our system will be durable and relatively inexpensive to maintain. Once operational, it will be funded through user fees (a percentage of the cost of each purchase), so we can interact with our users in a fully transparent way. This funding model will also make it easier to operate without sustained funding from government or philanthropic sources.