NCSA Brown Dog
Brown Dog's goal is to prototype a highly distributed and extensible, science-driven Data Transformation Service (DTS). As a component of a national research cyberinfrastructure, Brown Dog aims to make past and present research data more accessible and more useful to scientists, while also enabling novel science and scholarship on top of such data.
Rather than attempting to construct a single piece of software that magically understands all data, Brown Dog leverages and coordinates every available source of automatable help (e.g. software, tools, libraries, and even other services) in a robust and provenance-preserving manner, creating a service with the union of their capabilities that can handle as much of this data as possible. Brown Dog, a “super mutt” of software, serves as low-level data infrastructure that matches software capabilities to a user's data needs, facilitating data reuse and thereby enabling a new era of science and applications at large. The broader impact of this work lies in its potential to serve not just the scientific community but the general public as a “DNS for data”: transforming data on the fly into more accessible forms through a distributed and extensible collection of data manipulation tools, and moving toward an era in which a user's access to data is not limited by a file's format or by un-curated collections.
Brown Dog is part of the DataNet/DIBBs program funded by NSF beginning in 2008. DataNet was conceived to address the increasingly digital and data-intensive nature of science and engineering research and education. Digital data are not only the output of research but provide input to new hypotheses, enabling new scientific insights and driving innovation. Therein lies one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams. DataNet addresses that challenge by creating a set of exemplar national and global data research infrastructure organizations (dubbed DataNet Partners) that provide unique opportunities to communities of researchers to advance science and/or engineering research and learning.
Brown Dog is, more specifically, part of a follow-on effort called DIBBs (Data Infrastructure Building Blocks), focused on building software cyberinfrastructure that supports current and foreseen scientific data needs and that a broad range of researchers can use. All of the DIBBs projects are meant to provide complementary services, each building on the others' capabilities.
Related DataNet and DIBBs projects:
NSF Program: DIBBs; Software: DTS, Clowder, Polyglot
NSF Program: DIBBs; Software: SkyServer; Data: Sloan Digital Sky Survey
NSF Program: DIBBs; Software: HUBzero
NSF Program: DIBBs; Software: SLASH2
NSF Program: DataNet; Software: DataONE; Data: Biology and Environmental
NSF Program: DataNet; Software: ACR, Virtual Archive; Data: Social and Environmental
NSF Program: DataNet; Software: iRODS; Data: Ocean Observatory, Hydrology, Genome, Social Science, Education
NSF Program: DataNet; Data: Census/Survey, Remote Sensing, Climate
Richard Marciano, Ph.D.
Professor of Information Studies, Director Digital Curation Innovation Center, UMD
Pongsakorn (Tum) Suppakittpaisarn
Graduate Student, Landscape Architecture, UIUC
Norma Kenyon, Ph.D.
Professor, Surgery, Medicine, Microbiology and Immunology and Biomedical Engineering, University of Miami
Tschangho Kim, Ph.D.
Professor of Civil, Environmental, and Infrastructure Engineering, George Mason University
Bringing Long-Tail Data Into the Light
Much of the data generated by science, social science, and the humanities is smaller, unstructured, un-curated, and thus not easily shared. Taken together, however, this “long-tail” data, both past and present, represents a vast amount of research data with the potential to greatly impact future research in many areas of study.
Research Data Management and the Clowder Supported Communities
Long-Tail Data in Ecology and Global Change Biology
Data on the abundance, species composition, and size structure of vegetation are critically important for a wide array of sub-disciplines in ecology, conservation, natural resource management, and global change biology. However, addressing many of the pressing questions in these disciplines will require that terrestrial biosphere and hydrologic models be able to assimilate the large amount of long-tail data that exists but is largely inaccessible. The Brown Dog team, in cooperation with these researchers, will facilitate the capture of a huge body of smaller, research-oriented data sets collected over many decades, such as historical vegetation data embedded in Public Land Survey data dating back to 1785. Data such as this will be used as initial conditions for models, to make sense of other large data sets, and for model calibration and validation. Overall, Brown Dog supports the PEcAn ecological modeling community in its data transformation needs, linking needed datasets to community-based ecological models.
Designing Green Infrastructure Considering Storm Water and Human Requirements
This case study involves developing novel green infrastructure design criteria and models that integrate requirements for storm water management with ecosystem and human health and wellbeing. In addressing the scientific and social problems associated with the design of green spaces, data accessibility and availability are a major challenge. This study will focus on identified areas of the Green Healthy Neighborhood Planning region within the City of Chicago where existing local sewer performance is most deficient and where changes in impervious area through green infrastructure would be beneficial to underserved neighborhoods. Brown Dog will be used to extract long-tail experimental data on human landscape preferences and health impacts. These data will be used to develop a human health impacts model that will then be linked with a terrestrial biosphere model and a storm water model using Brown Dog technology.
Development and Application for Critical Zone Studies
The Critical Zone (CZ) is the “skin” of the earth, extending from the treetops to the bedrock. Created by life processes working at scales from microbes to biomes, it supports all terrestrial living systems. Its upper part is the biomantle: this is where terrestrial biota live, reproduce, use and expend energy, and where their wastes and remains accumulate and decompose. The CZ encompasses the soil, which acts as a geomembrane through which water and solutes, energy, gases, solids, and organisms interact with the atmosphere, biosphere, hydrosphere, and lithosphere. A variety of drivers affect this biodynamic zone, ranging from climate and deforestation to agriculture, grazing, and human development. Understanding and predicting these effects is central to managing and sustaining vital ecosystem services such as soil fertility, water purification, and the production of food resources, and, at larger scales, global carbon cycling and carbon sequestration.
The CZ provides a unifying framework for integrating terrestrial surface and near-surface environments, and it reflects an intricate web of biological and chemical processes and human impacts occurring at vastly different temporal and spatial scales. The nature of these data creates significant challenges for inter-disciplinary studies of the CZ, because integrating the variety and number of data products and models has been a barrier. On the other hand, CZ data provide an excellent opportunity for defining, testing, and implementing Brown Dog technologies through support for the Critical Zone Observatory community. In this context, “unstructured” data is viewed broadly as comprising a collection of heterogeneous data: data with formats that reflect temporal and disciplinary legacies, data from emerging low-cost, open-hardware-based sensors and embedded sensor networks that lack well-defined metadata and sensor characteristics, and data that are available as maps, images, and text.
General Public Use Case
In the same way the Internet has opened up information sharing for people around the world, the broader impact of Brown Dog will be to make the ever-growing stores of data on the web as easy to search and access as a webpage is now.
Brown Dog’s DTS will allow users to seamlessly sift through and access data that would otherwise be difficult to navigate and/or unreadable on their client devices. Similar to an Internet gateway or Domain Name Service (DNS), the DTS configuration would be entered into a user’s machine settings and forgotten thereafter. From then on, with support from a variety of clients as well as browsers, data requests over HTTP would first be examined by the DTS to determine if the native file format is readable on the client device. If not, the DTS would be called in the background to convert the file into the best possible format readable by the client machine. Alternatively, the user would have the option of specifying the desired format themselves, instead of the DTS doing it automatically.
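The pass-through-or-convert decision described above can be sketched in a few lines. This is an illustrative model only, assuming hypothetical format names and a quality-ranked preference list; it is not Brown Dog's actual negotiation logic.

```python
# Minimal sketch of the decision the DTS makes for each HTTP data request:
# serve the file as-is if the client reads its native format, otherwise
# pick the best format the client can read and convert to it.

def choose_target_format(native_format, client_formats, preference_order):
    """Return None if no conversion is needed, else the best target format."""
    if native_format in client_formats:
        return None  # client reads the native format; pass the file through
    # "Best" here means earliest in a quality-ranked preference list.
    for fmt in preference_order:
        if fmt in client_formats:
            return fmt
    raise ValueError("no readable target format for this client")
```

For example, a client that reads only CSV and JSON requesting a NetCDF file would be served a conversion, with the preferred readable format chosen first.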
Further, the DTS will allow users to search collections of data using an existing file as an example, to discover other, similar files. Again, once the machine and browser settings are configured, a search field can be appended to the browser into which example files can be dropped by the user. Doing so triggers the DTS to search the contents of all the files under a given URL for files similar to the one provided. For example, while browsing an online image collection, a user could drop an image of three people into the search field, and the DTS would return all images in the collection that also contain three people. The DTS will also perform general indexing of the data and extract and append metadata to files and collections, enabling users to gain some sense of the type of data they are encountering.
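Search-by-example of this kind typically reduces every file to a numerical signature and ranks files by distance to the query's signature. The toy sketch below assumes signatures are plain feature vectors and uses an arbitrary distance threshold; the actual signatures Brown Dog extracts are produced by its analysis tools.

```python
import math

# Toy search-by-example: each file in a collection is represented by a
# numerical signature (a feature vector); files whose signature lies
# within a distance threshold of the query's signature are "similar".

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_similar(query_sig, collection, threshold=1.0):
    """Return names of files whose signature is within threshold of the query."""
    return [name for name, sig in collection.items()
            if euclidean(query_sig, sig) <= threshold]
```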
Overall, the DTS will greatly expand general access and understanding of data on the web.
Data Transformation Service
Clowder
A web-based research data management framework, Clowder supports data curation, analysis, and publication on top of more traditional data sharing capabilities. Within Brown Dog, Clowder serves as the framework for storing and running analysis tools, which examine the contents of a file to automatically create metadata or numerical signatures capturing some aspect of those contents.
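Conceptually, each such analysis tool is a function from a file's contents to a piece of metadata. The toy extractor below illustrates the idea on plain text; real Brown Dog extractors are deployed through Clowder's extractor framework, and this stand-in uses none of that framework's actual API.

```python
# Toy "extractor" in the spirit of the analysis tools Clowder runs over
# uploaded files: inspect a file's raw contents and emit metadata that
# can later be indexed or searched.

def word_count_extractor(file_bytes):
    """Derive simple descriptive metadata from a text file's contents."""
    text = file_bytes.decode("utf-8", errors="replace")
    words = text.split()
    return {
        "word_count": len(words),
        "unique_words": len(set(w.lower() for w in words)),
    }
```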
Repeatable Workflow
Because reproducibility is critical to scientific study, Brown Dog maintains a digital signature for each request for data, which can be re-run to produce the exact same results.
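One common way to realize such a signature is a deterministic digest over everything that determines the result: the input data, the tool chain, and the parameters. The sketch below is an assumption about how this could work, not Brown Dog's actual signature scheme; all field names are illustrative.

```python
import hashlib
import json

# Hedged sketch of a per-request signature: two requests with the same
# input checksum, tool chain, and parameters yield the same digest, so
# re-running a signed request reproduces the same transformation.

def request_signature(input_checksum, tool_chain, parameters):
    """Deterministic digest identifying a data-transformation request."""
    record = json.dumps(
        {"input": input_checksum,
         "tools": tool_chain,               # e.g. [["ImageMagick", "6.9.1"]]
         "params": sorted(parameters.items())},
        sort_keys=True)
    return hashlib.sha256(record.encode()).hexdigest()
```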
Super Mutt of Software
Based on the source format and the target format that the user specifies, Polyglot routes the request to a Software Server, which runs the conversion through a chain of applications and seamlessly presents the converted data back to the user.
Polyglot
Polyglot aims to make the Internet agnostic to file formats by automating the input/output and open/save capabilities within arbitrary applications and chaining them together to carry out the wide breadth of conversions needed by users.
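This chaining can be modeled as a graph search: formats are nodes, and each application that opens format A and saves format B contributes an edge from A to B; a shortest chain of applications then performs the conversion. The sketch below, with an invented edge list, shows the idea; Polyglot's actual I/O-graph machinery is more involved.

```python
from collections import deque

# Model of Polyglot-style format chaining as breadth-first search over a
# conversion graph. Each edge (app, src, dst) means application `app`
# can open format `src` and save format `dst`.

def conversion_chain(edges, source, target):
    """Return the shortest list of (app, from, to) hops, or None if impossible."""
    graph = {}
    for app, src, dst in edges:
        graph.setdefault(src, []).append((app, dst))
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for app, nxt in graph.get(fmt, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(app, fmt, nxt)]))
    return None  # no chain of applications converts source to target
```

With edges such as ("Word", "doc", "pdf") and ("Ghostscript", "pdf", "txt"), a doc-to-txt request is satisfied by chaining the two applications.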
Metadata Enables Searching
Think “keywords” or “tags” on an image. The purpose of this tool is to automatically generate some of that for data of all types. This metadata can then be used to index or search through the data, or enable some other form of analysis over the data.
Partners
University of Illinois at Urbana-Champaign
Boston University
University of Maryland
Southern Methodist University
This material is based upon work supported by the National Science Foundation under Grant No. ACI-1261582.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.