Data Webs FAQ
Q. What is a data web?
A. A data web is a web-based infrastructure for accessing, analyzing, and mining remote and distributed attribute-based data. Data webs can be implemented using general web services and protocols such as XML and SOAP, or using more specialized protocols and services designed for working with large, remote data sets. Conceptually, data webs are designed to make accessing, integrating, analyzing, and mining remote and distributed data as simple as the web makes browsing remote documents.
Q. How can I find out more?
A. Some technical articles can be found later on this web page. The DataSpace Project develops open source clients and servers to create data webs and additional information can be found on its home page www.dataspaceweb.org. The rest of this web page answers some basic questions about data webs.
Q. What are web services?
A. Web services based upon SOAP and XML are a rapidly maturing infrastructure which can be used for accessing XML-based data. SOAP enables the serialization of XML data so that it may be transported using TCP or HTTP. SOAP-based services can be described using the Web Services Description Language (WSDL), while the Universal Description, Discovery and Integration (UDDI) standard provides a simple mechanism for the discovery of web services. SOAP/XML-based web services are designed to deal with general XML-based data.
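As a rough illustration of how SOAP wraps XML data for transport, the following Python sketch builds a SOAP 1.1 envelope using only the standard library. The `GetRecord` operation, the `urn:example:data` namespace, and the `key` parameter are hypothetical names chosen for this example; they are not part of any DataSpace or DSTP interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:data"  # hypothetical application namespace


def build_envelope(operation, params):
    """Wrap a request (operation name plus parameters) in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(request, f"{{{APP_NS}}}{name}")
        child.text = str(value)
    # The serialized string is what would be sent as the body of an HTTP POST.
    return ET.tostring(envelope, encoding="unicode")


message = build_envelope("GetRecord", {"key": "1/1"})
print(message)
```

The resulting string is plain XML, which is why SOAP messages can travel over HTTP like any other text payload.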
Q. What is a data grid?
A. Data grids combine authentication, authorization and access (AAA) controls with resource managers so that arbitrary computations can be done using distributed computational and data resources belonging to a virtual organization. Good examples are the data grids developed by physicists and astronomers to process the data collected by their collaborations' instruments. Data in data grids is stored in files and transported using GridFTP.
Q. What is a semantic web?
A. The Semantic Web extends the web's HTML infrastructure to include semantic information defined by XML and the Resource Description Framework (RDF). RDF views information as a directed labeled graph and serializes it in XML. Less formally, RDF codes information using subject-verb-object triples. As a very simple hypothetical example, the triple (www.ncar.ucar.edu/ccm/1/1, Temperature, 45.5) is a subject-verb-object triple giving the Temperature for a particular data record specified by the URL. RDF can be used to encode much more complicated assertions about data, metadata and relationships defined from them. The semantic web also supports ontologies so that data taxonomies can be used, which is very important for many data analysis and data mining applications.
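The triple idea can be sketched in a few lines of Python. This sketch uses plain tuples and dictionaries rather than an RDF library, so it only illustrates the directed labeled graph view. The first triple is the hypothetical example from the answer above; the second record and the helper names `build_graph` and `objects` are made up for illustration.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object). The first comes from the text
# above; the second is an additional made-up record for illustration.
triples = [
    ("www.ncar.ucar.edu/ccm/1/1", "Temperature", 45.5),
    ("www.ncar.ucar.edu/ccm/1/2", "Temperature", 46.1),
]


def build_graph(triples):
    """Index triples as a directed labeled graph: subject -> predicate -> [objects]."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    return graph


def objects(graph, subject, predicate):
    """Return all objects reached from `subject` along edges labeled `predicate`."""
    if subject not in graph:
        return []
    return list(graph[subject].get(predicate, []))


graph = build_graph(triples)
print(objects(graph, "www.ncar.ucar.edu/ccm/1/1", "Temperature"))  # [45.5]
```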
Q. I'm confused. Why are there so many distributed infrastructures for working with remote and distributed data?
A. Working with remote data is more complex than simply browsing remote documents, and there are several different philosophies of how to proceed. Roughly speaking, data webs are designed for the exploration and integration of remote and distributed data, just as the web today is designed to browse remote documents. Data grids are designed so that virtual organizations can access specialized distributed computational and data resources. The semantic web is designed to enable working with knowledge defined using W3C's RDF and related standards. From another point of view, data webs are designed to work with distributed attribute-based data, while data grids are designed to work with remote file-based data.
| | View | Mine/Discover | Compute |
|---|---|---|---|
| Knowledge | Digital Libraries | Knowledge Grids | Semantic Webs |
| Attribute-based data | OGSA/DAI | Data Webs | Data Grids |
| Files | Persistent Archives | Distributed Data Mining | Grids |
Table 1. Data webs, data grids, and semantic webs can all be used to provide access to remote numerical data. Data webs provide direct access to distributed attribute-based data. Data grids enable large-scale sharing of computational and data resources. Semantic webs provide knowledge-based access to data using ontologies, RDF, and agent-based architectures.
Q. What is DataSpace?
A. DataSpace is an open source implementation of a data web. Just as the web today enables easy access to remote multimedia documents, a data web enables easy access to remote and distributed data.
Q. What is the DataSpace Transfer Protocol (DSTP)?
A. DSTP is a protocol for moving data over the web, similar to HTTP. It can be used directly, but these days it is generally deployed as a web service using SOAP/XML. DSTP is specifically designed for working with data: it knows about data, metadata, keys, and data attributes. DSTP also runs over specialized network protocols such as SABUL/UDT, which are designed for connecting clients and servers over high performance SONET and optical networks.
Q. What are the advantages of using DSTP?
A. DSTP provides a simple way to publish data on the web and allow others to access it, analyze it, and mine it easily. Working with remote and distributed data will become much easier as DSTP or similar protocols become accepted, just as HTTP made working with remote documents easier. DSTP is unique in that it supports a simple way, based upon universal correlation keys, to merge distributed data and to overlay remote data on local data.
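Merging distributed data on a shared key can be sketched in Python as follows. This is only an illustration of a key-based merge, not the DSTP mechanism itself: the station identifiers, attribute names, and values are made up, and `merge_on_key` is a hypothetical helper.

```python
def merge_on_key(*sources):
    """Merge attribute records from several sources that share a correlation key.

    Each source maps key -> {attribute: value}. Later sources add attributes to
    (and, on a name clash, override) earlier ones for the same key.
    """
    merged = {}
    for source in sources:
        for key, attributes in source.items():
            merged.setdefault(key, {}).update(attributes)
    return merged


# Hypothetical data: a local table and a remote table keyed by the same station ID.
local = {"station-1": {"temperature": 45.5}, "station-2": {"temperature": 47.0}}
remote = {"station-1": {"rainfall": 1.2}, "station-3": {"rainfall": 0.4}}

combined = merge_on_key(local, remote)
print(combined["station-1"])  # {'temperature': 45.5, 'rainfall': 1.2}
```

Because both tables use the same key for the same station, the remote attributes can be overlaid on the local ones without either site knowing about the other in advance.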
Q. How do data webs support data mining?
A. A DSTP client can easily access other people's data and metadata. Once data is retrieved from one or more sites, data mining algorithms and exploratory data analysis can be done as usual. DataSpace is designed to interoperate with proprietary and open source data mining tools. In particular, the open source statistical package R has been integrated into the current version of DataSpace. DataSpace also works with predictive models in PMML, the XML markup language for statistical and data mining models.
Q. What is the Tera Wide Data Mining (TWDM) Grid Testbed?
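To give a feel for what a PMML model looks like, the following Python sketch scores a record against a tiny linear regression model. The PMML document is a hand-written toy model, not one taken from DataSpace, and the `score` function is a hypothetical helper that handles only a single `RegressionTable` with `NumericPredictor` terms. Real PMML consumers support far more of the standard.

```python
import xml.etree.ElementTree as ET

# A hand-written toy PMML document (hypothetical model: y = 1.0 + 2.0 * x).
PMML_DOC = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def score(pmml_text, record):
    """Evaluate intercept + sum(coefficient * value) for one input record."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    result = float(table.get("intercept"))
    for predictor in table.findall("pmml:NumericPredictor", NS):
        result += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return result


print(score(PMML_DOC, {"x": 3.0}))  # 7.0
```

Because the model is ordinary XML, it can be produced by one tool and scored by another, which is the interoperability that PMML is meant to provide.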
A. The TWDM is a testbed for data webs over high performance routed networks and optical-switched networks. The TWDM Grid testbed links clusters at StarLight in Chicago, the University of Illinois at Chicago, and SARA in Amsterdam. The clusters are connected with 10 GE links.
Q. What standards are being used?
A. DataSpace is built on open protocols and standards. Queries are done using SOAP/XML. The metadata is generally in XML. Data mining is done using the Data Mining Group's (DMG) Predictive Model Markup Language (PMML).
Q. How is DataSpace being commercialized?
A. The commercialization model is an open source one. The standards are all open. The DSTP clients and servers are open source. Project DataSpace is encouraging companies to use its open source clients and servers.
Q. What scientific applications are running in DataSpace?
A. There are several data sets on DataSpace today, including earth science data from NCAR, protein data from the Protein Data Bank, and astronomical data from the Sloan Digital Sky Survey. Any DSTP client can view, retrieve, visualize, and explore this data.
Q. What business applications are running in DataSpace?
A. DSTP clients can be used to build virtual data warehouses. Virtual data warehouses leave the data in place and use high-speed networks and the DSTP protocol to create data warehouses on the fly, view by view. In addition, SABUL/UDT are currently being tested to provide business continuity and disaster recovery services for data centers using high performance SONET and optical networks. This allows businesses to replicate crucial data in real time in distant locations and switch over to alternate sites within seconds.
Q. Where can I get a DSTP client or server?
A. DSTP clients and servers are open source and can be found at SourceForge (sourceforge.net/projects/dataspace).
Q. Who is supporting the project?
A. This project is supported in part by the National Science Foundation under Grants Number 9977868 and 0129609. Any opinions, findings, and conclusions or recommendations expressed here are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Q. Who is the Project Director?
A. Robert Grossman is the Project Director. He holds two positions. He is the Director of the National Center for Data Mining (NCDM) at the University of Illinois at Chicago. He is also the President of Open Data Partners.
Q. How do I find out more?
A. The project web site is www.datspaceweb.net
Technical information about DataSpace and DSTP clients and servers can be found in the following technical reports:
- Robert Grossman and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining of Data, IEEE Computing in Science and Engineering, July/August 2002, pages 44-51.
- Ian Foster and Robert L. Grossman, Data Integration in a Bandwidth Rich World, Communications of the ACM, Volume 46, Issue 11, November 2003, pages 50-57.
- Asvin Ananthanarayan, Rajiv Balachandran, Yunhong Gu, Robert Grossman, Xinwei Hong, Jorge Levera, and Marco Mazzucco, Data Webs for Earth Science Data, Parallel Computing, Volume 29, 2003, pages 1363-1379.
- Donald Hamelberg, Pavan K. Kasturi, and Robert L. Grossman, Data Webs for Bioinformatics Data, Information Sciences, to appear.
- Robert Grossman, Donald Hamelberg, Pavan Kasturi, and Bing Liu, Experimental Studies of the Universal Chemical Key (UCK) Algorithm on the NCI Database of Chemical Compounds, Proceedings of the 2003 IEEE Computer Society Bioinformatics Conference (CSB 2003), IEEE Computer Society, Los Alamitos, California, pages 244-250.
- Robert Grossman, Emory Creel, Marco Mazzucco, and Roy Williams, A DataSpace Infrastructure for Astronomical Data, in R. L. Grossman, C. Kamath, W. Philip Kegelmeyer, V. Kumar, and R. Namburu (editors), Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001, pages 115-123.
Copyright Robert L. Grossman, 2002-2003, revised December 24, 2003.