Overview

Large amounts of data are produced each second; the challenge that many organizations face is how to provide quick and easy access to this data while keeping it in a safe and flexible format. The web is a convenient way to distribute data. SODA attempts to make that job easier by providing a tool that sits between distributed databases and the Internet, producing a generic query interface to these data sources. Additionally, it manages the metadata, and uses global parameters (called primitives) to semantically identify what is being stored. It can produce output to the screen for viewing, or create a flat file that can be downloaded and easily imported into your favorite statistical package. It provides a template-based data file upload mechanism so repositories are easy to maintain as new data is collected. Essentially, SODA produces a data warehouse of your scientific data, and publishes the metadata in a human- and machine-readable format.

The purpose of SODA is to make scientific monitoring data more accessible across the Internet. Because the codebase is Open Source, it is well suited for academic institutions or research laboratories that want to release their data to the scientific community, providing a greater opportunity for this data to be utilized, perhaps in global contexts. Data forms the basis for our scientific understanding, and this "base element" needs to be globally available.

The SODA module is a “plug-in” component that provides a web-based query interface and access to distributed scientific monitoring data. SODA resides behind a web portal (e.g. Drupal or Joomla!) and is a mediator to disparate data sources, providing access to the data for viewing or download purposes. New data sources are registered both through an interactive web interface, and by an automated harvester. The SODA central registration database is a repository of metadata about the distributed data sources. Subsequently, multiple sources can be queried and the results combined when the data has similar semantics.
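The idea of combining results from sources with similar semantics can be illustrated with a minimal sketch. It assumes each registered source maps its local column names to shared primitives. The registry layout, the source names, and the function query_primitive are all hypothetical, chosen for illustration only; they are not SODA's actual schema or API.

```python
# Hypothetical registry: every name below is illustrative, not SODA's schema.
# Each source declares which shared primitive each of its local columns carries.
REGISTRY = {
    "lake_station_db": {
        "columns": {"wtemp_c": "water_temperature", "ph": "ph"},
        "rows": [{"wtemp_c": 18.2, "ph": 7.1}, {"wtemp_c": 18.9, "ph": 7.0}],
    },
    "river_gauge_db": {
        "columns": {"temp_water": "water_temperature"},
        "rows": [{"temp_water": 15.4}],
    },
}


def query_primitive(registry, primitive):
    """Collect values for one primitive from every source that stores it."""
    combined = []
    for source_name, source in registry.items():
        # Local columns in this source that are tagged with the requested primitive.
        local_cols = [
            col for col, prim in source["columns"].items() if prim == primitive
        ]
        for row in source["rows"]:
            for col in local_cols:
                if col in row:
                    combined.append(
                        {
                            "source": source_name,
                            "primitive": primitive,
                            "value": row[col],
                        }
                    )
    return combined


if __name__ == "__main__":
    for record in query_primitive(REGISTRY, "water_temperature"):
        print(record)
```

In this sketch the two sources use different column names, yet both are returned by a single query because they share the same primitive. A source that does not store the requested primitive simply contributes nothing.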

By default, SODA is accessible by everyone. This module does not keep its own list of users or manage user security. It is the responsibility of the content management system (CMS) to restrict access to certain data, if that is desirable. Most CMSs have a mechanism to grant permission to certain pages based on roles and login credentials, and one can assign restrictions to the pages that SODA uses to view and download the data. This document does not go into detail on these mechanisms since that is the job of the CMS, and out of SODA's scope.

Application Scope and Scale

In some ways, the biggest experiment with this project is seeing if this data filtering model is flexible enough to produce interesting results while still providing some rudimentary operations as part of the data-storage base (like change tracking). Admittedly, one can select data from a database in numerous ways. This program creates a domain around each of the input parameters and can then work within its "realm" of knowledge. It has a certain, quantifiable knowledge about its domain, and it can publish that. This can be very valuable when searching millions of such datasets and applying more ontological processes on top.

I believe that SODA has enough flexibility to handle many scales, but only the presence of lots of data will determine its limitations in scale, speed, size, and searchability.

Events

An "event" is a calculable occurrence that can be measured in time and space (2-d or 3-d); it sits at the crossroads of the measurements of time, place, and matter. Time is controlled by creating categorical frequencies, removing its continuous nature (sad, I know). Place is represented by a "station", which is a point of data collection somewhere. Matter is what is being measured, such as temperature or growth rate. The combination of these three things is what SODA calls an event, and it stores scientific data accordingly.
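As a rough illustration of this model, the sketch below represents an event as the combination of a categorical time bin, a station, and a measured parameter. The class names, field names, and the three frequency categories are assumptions made for this example; the document does not list SODA's actual categories or storage layout.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical frequency categories; SODA's real list is not given here.
FREQUENCIES = ("hourly", "daily", "monthly")


def bin_time(ts: datetime, frequency: str) -> str:
    """Collapse a continuous timestamp into a categorical bin label."""
    if frequency == "hourly":
        return ts.strftime("%Y-%m-%dT%H")
    if frequency == "daily":
        return ts.strftime("%Y-%m-%d")
    if frequency == "monthly":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unknown frequency: {frequency}")


@dataclass(frozen=True)
class Station:
    """Place: a point of data collection (2-d, or 3-d with elevation)."""
    name: str
    latitude: float
    longitude: float
    elevation_m: Optional[float] = None


@dataclass(frozen=True)
class Event:
    """Time, place, and matter combined, plus the measured value."""
    time_bin: str    # time, reduced to a categorical frequency bin
    station: Station  # place
    parameter: str   # matter, e.g. "temperature" or "growth_rate"
    value: float


def make_event(ts, frequency, station, parameter, value):
    """Build an Event, binning the timestamp at the given frequency."""
    return Event(bin_time(ts, frequency), station, parameter, value)


if __name__ == "__main__":
    pier = Station("Lake Pier", 44.5, -89.5)
    event = make_event(datetime(2009, 7, 14, 13, 45), "daily", pier, "temperature", 21.3)
    print(event)
```

Binning the timestamp is what removes the continuous nature of time: two readings taken at different minutes of the same day fall into the same daily bin and can be compared or aggregated as the same kind of event.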