We were founded in 2014 with a mission to address the problem of low reproducibility in scientific studies. Our premise was that, if researchers had better tools to communicate their actual results (as opposed to summaries), they could correct misunderstandings and mistakes faster. Through market research, however, we discovered an underlying sociological problem: scientists don’t want to share data. Researchers are tasked with generating new knowledge, and the system we use to assess whether an academic scientist is doing that job is known as “publish or perish.” If we momentarily set aside the external communication requirements of academic life (grant writing, meetings, teaching, publications), the research cycle can be characterized by two primary phases: data generation and data analysis. The relatively new field of data science overlaps with the data analysis phase of science (an unfortunate naming collision). A problem arises because data scientists come primarily from computer science, while scientists are subject matter experts. Scientific data sets tend to be highly variable in structure, using different formats, labels and units. In addition, raw data often employ alternate (e.g., uncalibrated) metrics and non-standard conventions for flagging mistakes or missing data. Without the proper tools, or the skills to use them, a scientist will find it difficult to analyze “other” data, even when those data are freely available. Because data scientists are faster at standard analyses and are not burdened by the other requirements of academic life, academic researchers face a strong disincentive to participate in a data sharing economy.
The Data Science Platform for Scientists (DSP) is designed to serve the needs of the scientific community. Our software automates data wrangling, enabling those without extensive computer skills to mine, refine and enrich data sets that are available over the internet. Users can augment their own data, explore new hypotheses, build AI/ML pipelines and more easily identify gaps in the knowledge base. We provide the tools to search, download and harmonize supported data sets, as well as basic tools to analyze and visualize the results. Once ready, users can download the data and results to their own computers for additional analyses. The DSP also provides a mechanism to collaborate and share results, and a publication-agnostic mechanism to cite data. The DSP is built around projects that tackle complex, multidisciplinary questions to which members of a community can donate their expertise. Individuals receive credit for their data and analyses, while the entire community benefits from their advances. The DSP also includes a mechanism by which small businesses or entrepreneurial labs can benefit from the data economy.
The design and features of the DSP were developed through market research, including participation in an NSF I-Corps short course and interviews with Chief Data Officers and Program Managers in the Federal Government. We also attended meetings, held one-on-one conversations with individual academic scientists and with government and non-profit researchers, and trialed a number of technological solutions. Along the way, we developed pilot projects and garnered support from researchers in disciplines including the Earth sciences and the neurosciences. Despite all of this work, however, the DSP itself remains unfinished. To complete the system, we are seeking partnership with, or sponsorship by, the Government. We anticipate that the DSP will require continual growth and maintenance, and its success depends on sustainability. The Government has a large number of established intramural data generators. Because Government researchers are not under the same pressures as academic researchers, and now operate under the requirements of the Evidence Act, their data can serve as the critical mass upon which multidisciplinary problem sets can be built. A wider range of data offerings enables a wider range of data explorations.
Comparable data wrangling technology is now available commercially; however, users should be mindful of the cost model. The cost model employed by the vast majority of commercial systems is unsuitable for the sciences because it relies on collecting and selling user data. Recommender engines were initially developed to address the challenges of big data and are most easily recognized in industries such as social media. The results returned to the user have been filtered and sorted by algorithms that predict what the user wants to see. This technique generally makes it easier for users to find what they want, and it has been extremely profitable for marketing. The problem for science is that users are presented with an unknown unknown: they do not know which data they are not seeing, and they do not know why. This is antithetical to the concept of controlled experimentation. Imagine a situation in which a pharmaceutical company seeking approval for its drug works with a for-profit data company, XYZ. The pharmaceutical company has a financial incentive to pay XYZ to de-emphasize unfavorable data, placing only favorable data in front of the FDA regulators. It is left to the reader to decide which features of XYZ are acceptable, but the design of the Data Science Platform for Scientists is meant to offer an alternative.
The DSP was conceived to support a data sharing economy in which the responsibility and credit for data generation belong to the data generator, while the responsibility and credit for the analysis belong to the analyst. In such a system, the role of the DSP is to ensure that the data are correctly and securely processed to user specification. As long as the source data are the same, the result is the same. Success metrics would include the establishment and routine examination of independent audit trails, along with customer questionnaires and surveys.
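The reproducibility guarantee above (identical source data yield identical results) can be made independently verifiable by fingerprinting inputs and outputs. The sketch below is illustrative, not the DSP’s actual implementation; the `process` step and the toy data are hypothetical stand-ins for any deterministic analysis.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable SHA-256 fingerprint of a JSON-serializable object."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def process(data):
    """A stand-in for any deterministic analysis step."""
    return sorted(data)

data = [3, 1, 2]
audit_record = {
    "input_hash": fingerprint(data),
    "output_hash": fingerprint(process(data)),
}
# Re-running the same analysis on identical source data reproduces both
# hashes exactly, so an auditor can confirm the processing was consistent
# without ever hosting or re-running the data themselves.
```

An audit trail built from records like this lets a third party check, at any time, that the same source data produced the same result.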
We are now focused on building better decision-making tools. One way to reduce bias of the type described above is to abandon recommender engine technology entirely. Before recommender engines became the norm, large-scale search relied on exhaustive algorithms such as MapReduce. From a bias standpoint, the MapReduce approach is advantageous because it returns all data fitting the search criteria, reducing the unknown unknown problem to a known unknown. Traditional implementations of MapReduce, however, do not take advantage of some important time-saving advances such as elastic cloud computing. Accordingly, some commercial offerings have implemented MapReduce-like algorithms in distributed cloud environments. Given the rate at which the internet continues to grow, we believe it is important to keep searching for opportunities to improve.
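The bias-reduction property described above can be illustrated with a minimal MapReduce-style sketch: every record matching the criterion appears in the output, and nothing is ranked or hidden. The records and the threshold criterion are invented for illustration.

```python
from functools import reduce

# Toy data set: each record is a (station, measurement) pair.
records = [
    ("station_a", 3.1),
    ("station_b", 4.7),
    ("station_a", 2.9),
    ("station_c", 5.0),
]

def map_phase(record):
    """Emit (key, value) pairs for records matching the search criteria."""
    station, value = record
    if value > 3.0:          # the search criterion
        return [(station, value)]
    return []                # non-matching records emit nothing

def reduce_phase(acc, pair):
    """Group matching values by key; every match is kept, none are ranked."""
    key, value = pair
    acc.setdefault(key, []).append(value)
    return acc

mapped = [pair for record in records for pair in map_phase(record)]
result = reduce(reduce_phase, mapped, {})
# result contains every record meeting the criterion, unfiltered:
# {'station_a': [3.1], 'station_b': [4.7], 'station_c': [5.0]}
```

Because the output is the complete match set, a user knows exactly what the search excluded (anything failing the stated criterion), which is the known unknown the text refers to.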
Modern AI models are typically structured as directed acyclic graphs (DAGs). During normal operation, information flows in one direction, from input to output. Deep learning models add complexity, but the fundamental flow of information remains the same. In (real) neural circuits, we know that complex behaviors can be encoded in relatively compact architectures. From a machine learning standpoint, a more compact architecture saves time and power.
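The one-directional flow described above can be shown with a toy DAG evaluated in topological order: each node depends only on earlier nodes, so no information ever flows backward. The node names and functions here are hypothetical, chosen only to make the structure concrete.

```python
# Each node lists its parents; edges point strictly forward.
dag = {
    "x":  [],            # input node
    "h1": ["x"],
    "h2": ["x"],
    "y":  ["h1", "h2"],  # output node
}

funcs = {
    "h1": lambda x: x * 2.0,
    "h2": lambda x: x + 1.0,
    "y":  lambda a, b: a + b,
}

def evaluate(dag, funcs, inputs):
    """Evaluate nodes in topological order: input-to-output flow only."""
    values = dict(inputs)
    for node, parents in dag.items():  # dict order here is topological
        if node not in values:
            values[node] = funcs[node](*(values[p] for p in parents))
    return values

out = evaluate(dag, funcs, {"x": 3.0})
# out["y"] == (3.0 * 2.0) + (3.0 + 1.0) == 10.0
```

Deep learning frameworks elaborate this same pattern with many more nodes and learned parameters, but the acyclic, feedforward evaluation is unchanged.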
During development of the DSP, we implemented a library of tools that we are now using to build biologically inspired neural-network architectures. We are implementing a distributed model of the auditory system to demonstrate how biomimetic circuits can filter data more efficiently than deep learning circuits. The overarching goal is to develop and test efficient, transparent mechanisms for searching large data sets, but the technology is applicable to a wide array of other applications.