Robert Grossman
News
-
Sector vs Hadoop. In a recent paper, we describe the
design and architecture of Sector.
The paper also describes some preliminary experimental studies
comparing the performance of Sector
and Hadoop. On the clusters
and distributed clusters used, Sector was about twice as fast as
Hadoop on the Terasort Benchmark. Sector is designed to be used on
clusters within a data center, as well as on distributed clusters
across data centers that are connected by wide area high
performance 10 Gbps networks. The paper can be
found here.
-
Sector Version 1.5 Released.
Version 1.5 of Sector was released on March 18, 2008.
It can be obtained from Source Forge at the project
site sector.sf.net.
Sector is a wide area high performance storage and compute
cloud. For the past couple of years, Sector has been
used to distribute the Sloan Digital Sky Survey
(SDSS) via the
web site sdss.ncdm.uic.edu.
The current version of Sector also includes high performance
distributed computing services.
-
UDT, Version 4 released.
UDT is an application layer high performance network transport protocol
that is available from Source Forge at
udt.sf.net. Version 4 of UDT was
recently released.
-
UDT will be part of Globus. Beginning
with Globus Version 4.2, one can
choose an option in GridFTP so that TCP is replaced with UDT, which
will speed up large data transfers.
-
November, 2007, Award:
On November 15, 2007, The Angle Project
won First Place in the 2007 Analytics Challenge at the ACM/IEEE
International Conference for High Performance Computing and
Communications 2007 (SC07).
The title of the project was "Angle: Detecting Anomalies and Emergent
Behavior from Distributed Data in Near Real Time."
-
November, 2007, Announcement: On November 12, a consortium led
by the National Center for Data Mining (NCDM) at the University of
Illinois at Chicago announced a second generation cloud computing
platform called Sector at the SC 2007 conference in Reno, NV. Until
now, cloud computing platforms have all used the standard Internet to
link distributed computing resources. In contrast, Sector uses high
performance, wide area 10 Gbps networks. The foundation for Sector
cloud services is the 10 Gbps Teraflow Network, a joint project of the
NCDM and the International Center for Advanced Internet Research
(iCAIR) at Northwestern University that connects distributed clusters
on three continents using dedicated and shared 10 Gbps networks.
Sector is currently used for distributing the Sloan Digital Sky Survey
(SDSS).
-
November, 2007, Tutorial: On November 10, I gave a tutorial at
SC 07 with Mike Wilde and Michal Sabala titled "A Tutorial
Introduction to High Performance Analytics and Workflow on Grids."
The tutorial included a hands-on lab.
-
October, 2007, Talk: On October 12, I gave a talk at the
National Science Foundation Symposium on Next Generation
of Data Mining and Cyber-Enabled Discovery for Innovation
(NGDM '07). The talk was titled:
Distributed Discovery in E-Science: Lessons from the Angle Project.
-
August, 2007, Award: The paper "Data Quality Models for High
Volume Transaction Streams: A Case Study" by Joseph Bugajski, Robert
Grossman, Chris Curry, David Locke and Steve Vejcik won the second
annual Data Mining Practice Prize at KDD 2007. The prize is awarded
each year "for work that has had a significant and quantitative impact
in the application in which it was applied."
-
July, 2007, Award: I was awarded the ACM Special Interest Group
on Knowledge Discovery and Data Mining (SIGKDD)
Service Award for my "role in the development of open and scalable
architectures and standards for the SIGKDD and Global KDD Communities."
-
June, 2007, Recent paper: It is known that there is a Hopf
algebra structure on the vector space with basis all heap-ordered
trees. We give a new bialgebra structure on the space with basis all
permutations and show that there is a bialgebra isomorphism between
the Hopf algebra of heap-ordered trees and the bialgebra of
permutations. The paper will appear in Communications in Algebra
during 2008.
See arXiv:0706.1327.
-
March, 2007, Improving predictive analytics using large numbers of
predictive models: A very practical mechanism for improving
predictive analytics as the amount of data increases is to build an
analytic infrastructure that automatically builds many predictive
models, instead of the more traditional approach of building one (or a few)
manually. I gave a
lecture on
this recently: Modeling Highly Large, Heterogeneous Data Sets: Towards
a Billion Models, DIMACS Workshop on Recent Advances in Mathematics
and Information Sciences for Analysis and Understanding of Massive and
Diverse Sources of Data, Rutgers University, New Brunswick, May 15,
2007. How this idea was applied to analyze transactional data from
Visa is described in two papers at
KDD 2007: Robert Grossman, Joseph
Bugajski, Chris Curry, David Locke, and Steve Vejcik, Detecting
Changes in Large Data Sets of Payment Card Data: A Case Study, and
Joseph Bugajski, Chris Curry, Robert Grossman, David Locke, Steve
Vejcik, Data Quality Models for High Volume Transaction Streams at the
KDD Data Mining Case Studies Workshop.
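
As a rough illustration of the "many models" idea (not the system described in the papers above), the sketch below fits one very simple baseline per segment of the data instead of a single global model. The segment keys, the mean/standard-deviation baseline, and the function names `build_models` and `is_anomalous` are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_models(records):
    """Fit one (mean, std) baseline per segment.

    `records` is an iterable of (segment_key, value) pairs. The segment key
    is hypothetical: it could be a merchant category, a region, an hour of
    the day, or any combination of these.
    """
    by_segment = defaultdict(list)
    for segment, value in records:
        by_segment[segment].append(value)
    # One tiny model per segment, built automatically rather than by hand.
    return {seg: (mean(vals), pstdev(vals)) for seg, vals in by_segment.items()}


def is_anomalous(models, segment, value, threshold=3.0):
    """Flag a value that is far from its own segment's baseline."""
    mu, sigma = models[segment]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


records = [("a", 10), ("a", 12), ("a", 11), ("b", 100), ("b", 110), ("b", 105)]
models = build_models(records)
print(is_anomalous(models, "a", 100))  # far from segment a's baseline
print(is_anomalous(models, "b", 100))  # ordinary for segment b
```

The point of the design is that the same value can be ordinary in one segment and unusual in another, which a single global model would tend to miss.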
-
March, 2007, Tutorial:
I gave a
tutorial
called "Introduction to Data Mining on Grids", at the Midwest Grid
Workshop in Chicago on March 25, 2007. The tutorial will be repeated
at SC 07 in November, 2007.
-
November, 2006 - SC 06 Bandwidth Challenge, First Place. Our
team won first place in the Bandwidth Challenge at the ACM/IEEE
International Conference for High Performance Computing and
Communications 2006 (SC 06) with the project: Distributing the Sloan
Digital Sky Survey Using Sector. To win the challenge, we transported
over 1.2 TB of SDSS data from Chicago to Tampa disk to disk at a
sustained bandwidth of over 8.1 Gbps and a peak bandwidth of 9.18
Gbps. For a trace of the transfer, see the SC 06 Bandwidth Challenge web
site.
-
November, 2006, Recent talk:
I gave a talk at the 11th International Conference on Information Quality
at MIT on November 11, 2006 with the title
Monitoring Data Quality for Very High Volume Transaction
Systems. In the talk, I described a production system we have
developed that uses over 20,000 separate predictive models to monitor
data quality for a high volume transaction system.
-
August, 2006, Recent Workshop: I organized the
Workshop on Data Mining Standards, Services and Platforms (DM-SSP 06),
at KDD-2006 in Philadelphia on August 20, 2006.
The workshop highlighted recent progress on developing standard-based
services for data mining and data intensive computing.
-
July, 2006, Recent talk: I gave the first annual Vyborny
Memorial Lecture at the University of Chicago on July 17, 2006. The
title was The Age of Data-Driven Discovery and Decision Support:
The New Rules.
-
July, 2006, Recent talk: I gave a talk at the 3rd International
Workshop on Data Integration in the Life Sciences 2006 (DILS'06) with
the title: Using Term Lists and Inverted Files to Improve Search
Speed for Metabolic Pathway Databases. In the talk, I described
a system we are working on to improve the ability to discover
information from distributed databases containing information
about metabolic pathways and networks.
-
April, 2006 - Salishan Conference on High Speed Computing. On
April 26, 2006 I gave a talk at the Salishan Conference on High Speed
Computing with the title: Other People's Petabytes: The Challenge of
Distributed Data Mining and Distributed Data Integration.
-
April, 2006, Recent paper: We completed a paper summarizing
some of the work over the past few years on UDT: Yunhong Gu and Robert
L. Grossman, UDT: UDP-based Data Transfer for High-Speed Wide Area
Networks, Computer Networks, 2007, to appear.
-
December, 2005 - Augustus Version 0.2 released on Source Forge.
Augustus is an open source infrastructure
for building and deploying data mining and statistical models for
large data sets and high volume data streams. Augustus is compliant with the
Predictive Model Markup Language (PMML). Augustus supports vectorized
operations and is designed for data sets that are too
large for existing open source data mining systems.
-
December, 2005 - The Data Mining Group Released Version 3.1 of the
Predictive Model Markup Language (PMML). PMML
is the most widely deployed standard for expressing statistical
and data mining models in an application and platform independent
manner.
-
November, 2005 - SC 2005 Tri-Challenge Award Winner
We won the Tri-Challenge at SC 05, which went to the best overall
entry across the HPC Analytics Challenge, the Bandwidth Challenge
and the Storage Challenge. The HPC Analytics Challenge is described
in more detail below. Our entry for the Bandwidth Challenge used
our UDT high performance data transport middleware to set a new milestone
for wide area disk-to-disk data transport.
We transferred the entire Release 3 of the Sloan Digital Sky Survey
(SDSS) data set (785 Gigabytes compressed) from the conference floor to
nodes at KISTI in Korea in less than 3.5 hours. The average transfer
speed was over 650 Mb/sec and the peak speed was over 1000
Mb/sec. This was the first time that an astronomy data set of this
size was transferred this fast across the Pacific. With conventional
networks and network protocols this transfer would not have been
practical.
-
November, 2005 - SC 2005 High Performance Analytics First Prize
We tied for first place in the first annual
HPC Analytics Challenge, which was part of the SC 05 Conference that took place
in November, 2005 in Seattle. The entry was called Real Time Change
Detection and Alerts from Highway Traffic
Data. We developed a test-bed containing real-time data from
over 830 highway traffic sensors in the Chicago region, weather data, and
text data about events that might affect traffic. The goal was to
detect interesting changes in traffic conditions in real-time.
-
August, 2005 - KDD 2005
I was the General Chair of the Eleventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, which took place in
Chicago on August 21-24, 2005. Over 800 researchers and industrial
experts attended the event. For more information, visit
the KDD 2005 web site.
-
August, 2005 - Explaining Data Mining
I answered some questions on Chicago Public Radio's WBEZ
about data mining. Here is a link to the
interview.
-
June, 2005 - ACM SIGKDD
I have been elected to the ACM Special Interest Group
on Knowledge Discovery and Data Mining
(SIGKDD)
Board of Directors for the period June 30, 2005 - July 1, 2007.
The SIGKDD Executive Committee consists of a chair, a secretary
and a five-member board of directors.
-
May, 2005 - Site Update.
The site has been updated to
use xslt transformations. I
use xml to store basic metadata about the publications and
related items; and, previously, I had been
using Python scripts to produce the various html pages. The web
pages are now generated using xslt transformations.
-
May, 2005 - Magnify.
ChoicePoint acquired Magnify.
The press release can be found here. See the news for May 2, 2005.
-
April, 2005. Real Time Change Detection.
You can see a prototype of real time change detection for highway traffic data
on the Gateway Testbed,
which is part of the National Center for Data Mining's (NCDM) Teraflow Testbed.
The NCDM is developing and operating the testbed
so that researchers can have large and interesting data
sets to encourage the creation of new algorithms for exploring
and analyzing multi-modal data sets, especially from the context of change
detection.
The prototype uses
an open source high performance scoring engine being developed by Open Data.
-
April, 2005. Recent paper. An obstacle to building good
predictive models is the poor quality of the data available to most
projects. In the paper: Joseph Bugajski, Robert L. Grossman, Eric
Sumner and Zhao Tang, An Event Based Framework for Improving
Information Quality That Integrates Baseline Models, Causal Models and
Formal Reference Models, Second International ACM SIGMOD Workshop on
Information Quality in Information Systems (IQIS 2005), June 17th,
Baltimore, Maryland, co-located with ACM SIGMOD/PODS 2005, we
introduce a framework for improving data quality and show some
preliminary results when a large provider of electronic payments
uses this framework.
-
February, 2005 - Open Data Web Site.
The Open Data web site has been updated with some success
stories from 2004.
Please see
www.opendatagroup.com.
-
February, 2005 - Teraflow Testbed Web Site.
The Teraflow Testbed web site has been updated.
Please see
www.teraflowtesbed.net.
-
January, 2005 - Ensembles of Change Detection Models. Change
detection is a very useful practical method for determining whether a
system or process has undergone a statistically significant change. In
many practical cases, there are temporal, geo-spatial, or logical
variations in the underlying processes. Recently, we developed and
prototyped a mechanism for combining many such change detection
algorithms to produce what is called an ensemble and showed how
ensembles of change detection models are more effective than
traditional methods for several practical problems. This has been
implemented using a real time scoring engine for highway traffic data
on the
Gateway Testbed and a publication will be appearing on this web
site shortly.
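
The item above does not say which change detection algorithm was used, so the following is only a minimal sketch under stated assumptions: it uses a one-sided CUSUM detector (my choice for illustration) and keys one detector per (sensor, hour of day) pair to reflect the temporal and geo-spatial variation mentioned above. The class names `Cusum` and `Ensemble` and all parameter values are hypothetical.

```python
class Cusum:
    """One-sided CUSUM detector for an upward shift from a baseline mean."""

    def __init__(self, baseline, slack, threshold):
        self.baseline = baseline    # expected value under normal conditions
        self.slack = slack          # deviations smaller than this are ignored
        self.threshold = threshold  # alarm once the cumulative score exceeds this
        self.score = 0.0

    def update(self, x):
        """Add one observation; return True if a change is signalled."""
        self.score = max(0.0, self.score + (x - self.baseline - self.slack))
        return self.score > self.threshold


class Ensemble:
    """A collection of detectors, one per segment of the data."""

    def __init__(self):
        self.detectors = {}

    def add(self, key, detector):
        self.detectors[key] = detector

    def update(self, key, x):
        # Route each observation to the detector for its own segment.
        return self.detectors[key].update(x)


ens = Ensemble()
ens.add(("sensor-1", 8), Cusum(baseline=50, slack=2, threshold=10))  # rush hour
ens.add(("sensor-1", 2), Cusum(baseline=20, slack=2, threshold=10))  # overnight
print(ens.update(("sensor-1", 8), 60))  # modest excess at rush hour
print(ens.update(("sensor-1", 2), 60))  # same reading overnight
```

Each detector has its own baseline, so the same reading is judged against what is normal for that sensor at that time of day rather than against a single global baseline.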
-
November, 2004 - Bulk Data Transport of the Sloan Digital Sky
Survey Data (SDSS):
Up until now, the Sloan Digital Sky Survey (SDSS) data sets
have been shared with European and Asia Pacific
collaborators by shipping disks around the world because the
data sets were too large to be transported easily using networks.
At the SC 04 conference in Pittsburgh,
we demonstrated how new transport
protocols are enabling these data sets to be shared for
the first time over long distances using high performance
international networks. Specifically, we
transported the SDSS DR3 multiple terabyte dataset between Chicago, USA,
and various sites in Europe and Asia using NCDM's high performance
data transport protocol UDT (UDP-based Data Transfer Protocol).
We did this thousands of times faster than could be achieved by
standard protocols, such as TCP, as it is usually deployed.
For more information, see the Teraflow Testbed
web site.
-
October, 2004 - The Data Mining Group Releases Version 3.0 of the
Predictive Model Markup Language (PMML). PMML Version 3.0 adds
several new models, including models for rule sets and text mining.
It also adds the ability to compose certain data mining operations.
For example, in PMML Version 3.0 the outputs of regression models can
be used as the inputs to other models (model sequencing) and a
decision tree or regression model can be used to combine the outputs
of several embedded models (model selection).
For more information, see
the DMG web site.
-
October, 2004 - Teraflow Testbed:
The National Center for Data Mining at the University of Illinois
at Chicago recently launched the Teraflow Testbed, a high performance
testbed for mining remote and distributed data. The testbed consists
of computer clusters in Chicago, Amsterdam, Geneva, Tokyo, and Kingston
connected by 1 Gbps and 10 Gbps networks. Several different data sets
are available, including astronomical data from the Sloan Digital Sky Survey,
highway sensor data from the Gateway Project, and mass spec data from the
Chicago Bioinformatics Consortium. For more information, see
the testbed's web site.
-
August, 2004 - Recent Paper: Robert L. Grossman and Richard
Larson, Differential Algebra Structures on Trees. In this paper, we
show how there are natural differential algebra structures on families
of rooted trees labeled with derivations once a connection is
specified. In 1989, Larson and I showed that there is a natural
multiplication on the vector space whose basis is the set of rooted
trees. There is also a natural comultiplication so that rooted trees
form a Hopf algebra. If rooted trees are labeled with derivations of
a ring, there is also a natural module structure, which is the subject
of this paper. The paper can be found on arXiv.org (reference number
math.QA/0409006) and here. A draft of
the paper was completed about ten years ago, so I'm happy it is
finally finished. The paper will be published in Advances
in Applied Mathematics.
-
August, 2004 - Second KDD Workshop on Data Mining Standards,
Services, and Platforms. On August 22, 2004, the Second KDD
Workshop on Data Mining Standards, Services, and Platforms took place
in Seattle. This was the fourth year that there has been a KDD
workshop on the Predictive Model Markup Language (PMML) and related
areas and the second year of a broader conference with the theme of
Data Mining Standards, Services and Platforms. A theme of the prior
year's workshop was the maturing of the PMML standard and the
opportunity this created for scoring engines. A theme of this
year's workshop was the maturing of infrastructure for data
preparation. The workshop proceedings are available
on line.
-
June, 2004 - High performance network protocols. For many
applications, exploring remote and distributed data is an
important component. Unfortunately, network protocols such as
TCP, as commonly deployed, do not provide the performance
required for exploring large data sets. The publication, which will appear in a
special issue of the journal Future Generation Computer Systems (FGCS) on
high performance networking, contains an experimental study of
teraflows over wide area experimental 10 Gb/s networks linking
Chicago and Amsterdam and Chicago and Geneva. Teraflow
technology has been shown to be an effective infrastructure for
exploring large remote and distributed data sets.
-
May, 2004 - High performance web services for data mining.
It is well known that web services as currently defined do not
scale well for distributed data intensive applications such
as data mining. We have recently introduced scalable web services
for data mining called Open DMIX. The basic idea is to use a
TCP/XML based control channel and a separate data channel which
can use high performance network protocols and more efficient
packaging in order to improve performance. Open DMIX
is described in the
publication, which was presented
at the recent SIAM Workshop on high performance and distributed
data mining.
-
April, 2004 - Site Update: Some additional articles have
been put on line. The lists of talks and publications have
been updated.
-
March, 2004 - Experimental study of teraflows.
A teraflow is a high volume data flow. Today, we can create
data flows ranging from 1 Gb/s to 10+ Gb/s.
It is an open problem to develop protocols for teraflows which
are fair to other teraflows and friendly to commodity TCP flows.
We completed a recent experimental study
of teraflows, and showed that UDT-based teraflows are fast, fair, and friendly.
The most current version of this work can be found
here.
-
February, 2004 - Software Release: Version 1.2 of UDT was
released and is available on Source Forge:
sourceforge.net/projects/dataspace.
UDT is an open source application level library which provides fast,
fair and friendly transport of high volume data streams. A one
page summary about UDT can be found here.
-
December, 2003 - Site Update: The FAQs have been updated.
-
December, 2003 - Recent Paper:
Robert L. Grossman, Pavan Kasturi,
Donald Hamelberg, Bing Liu, An Empirical Study of the Universal
Chemical Key Algorithm for Assigning Unique Keys to Chemical
Compounds, Journal of Bioinformatics and Computational Biology,
2004, to appear. draft
-
November, 2003 - Site Update: There are now several lists of papers
organized by topics. These are
generated automatically from a master list of publications in xml.
Please let me know if you find any errors.
There are now about 125 different publications listed on this site
and about 30 percent are available on line. I hope to increase
this percentage over the next few months.
-
November, 2003 - Recent Talk:
Robert Grossman,
Beyond Data Grids: Data Webs, Lambda Grids, and All That,
NASA Goddard Information Science and Technology
Colloquium Series,
NASA Goddard, November 12, 2003.
Here is the abstract.
-
November, 2003 - Recent Talk:
Robert Grossman, A Tutorial Introduction to High Performance
Data Transport, SC 03, Phoenix on November 15, 2003.
slides (size - 5 MB)
-
November, 2003 - Recent Paper:
Ian Foster and Robert L. Grossman, Data Integration
in a Bandwidth Rich World, Communications ACM,
Volume 46, Issue 11, November, 2003, pages 50-57.
Here is an early draft.
This is from www.rgrossman.com