What is Diamond?


Diamond is a project that began in 2002 as a collaboration between Intel and Carnegie Mellon University. Over time, the research has grown to span many other institutions, including the University of Pittsburgh, UPMC, Merck Research Labs, IBM Research, Georgia Institute of Technology, Rice University, and Duke University. For an account of the early phases of this evolution and the emergence of key Diamond concepts and mechanisms, see the paper "Searching Complex Data Without an Index".

Diamond's goal is to enable interactive search of Internet data repositories that store vast amounts of complex, unindexed data such as digital photographs, video streams, and medical images. This research is centered on a storage architecture and open-source implementation called the OpenDiamond® platform.

At the heart of this platform is the concept of early discard: the ability to reject irrelevant data items very close to their point of storage. Since the knowledge needed to recognize irrelevant data is domain-specific, early discard requires application code, called a searchlet, to be executed close to storage. The OpenDiamond platform also embodies the concepts of result caching and self-tuning. Result caching lets it reuse work done in previous searches, and self-tuning lets it adapt dynamically to different hardware configurations, workloads, and data content, in a manner that is completely transparent to users and applications.

A mechanism called scoping enables Diamond searches to span structured data sources (such as a relational database) as well as unstructured data (such as images). Modules called data retrievers enable searches over a wide range of data sources, including live data from webcams and dynamic content sources such as GigaPan. Layered on top of the OpenDiamond platform are a number of open-source applications that are customized for specific domains and data types.
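To make early discard and result caching more concrete, here is a minimal Python sketch. It is an illustration only and does not use the OpenDiamond API: the names `Filter`, `run_searchlet`, and the dictionary-based cache are hypothetical, as is the assumption that a searchlet can be modeled as an ordered list of scoring filters with thresholds. Each object is dropped as soon as one filter rejects it, and filter scores are remembered so that a repeated search does not recompute them.

```python
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Filter(NamedTuple):
    """Hypothetical filter: a named scoring function plus a pass threshold."""
    name: str
    score: Callable[[dict], float]
    threshold: float


# Cache keyed by (object id, filter name); a stand-in for result caching.
Cache = Dict[Tuple[str, str], float]


def run_searchlet(objects: Iterable[dict],
                  filters: List[Filter],
                  cache: Cache) -> Iterator[dict]:
    """Yield only the objects that pass every filter.

    Early discard: evaluation of an object stops at the first filter
    whose score falls below its threshold.
    Result caching: a score already in `cache` is reused, not recomputed.
    """
    for obj in objects:
        passed = True
        for flt in filters:
            key = (obj["id"], flt.name)
            if key not in cache:
                cache[key] = flt.score(obj)
            if cache[key] < flt.threshold:
                passed = False
                break  # discard now; later filters never run on this object
        if passed:
            yield obj


if __name__ == "__main__":
    # Toy objects with precomputed attributes standing in for image content.
    images = [
        {"id": "a", "redness": 0.9, "sharpness": 0.8},
        {"id": "b", "redness": 0.1, "sharpness": 0.9},
        {"id": "c", "redness": 0.7, "sharpness": 0.2},
    ]
    searchlet = [
        Filter("red", lambda o: o["redness"], 0.5),
        Filter("sharp", lambda o: o["sharpness"], 0.5),
    ]
    cache: Cache = {}
    for hit in run_searchlet(images, searchlet, cache):
        print(hit["id"])
```

In this sketch the filters run in the order given, so placing cheap, highly selective filters first reduces the work done per object. Choosing such an ordering automatically would be one plausible form of the self-tuning mentioned above, though the sketch does not attempt it.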

More broadly, Diamond's goal is to help domain experts discover something relevant to a task in a large distributed repository of complex, unindexed, and loosely structured data. Suppose, for example, that a pharmaceutical researcher wishes to identify adverse effects of a drug in a large collection of automated cell microscopy images. The term "adverse effects" refers to a vague concept; a more precise definition can only be given after examining the data in some depth. In other words, hypothesis formation and hypothesis validation proceed hand in hand in a tightly coupled, iterative sequence. We refer to this inherently human-centric activity as interactive data exploration. To the best of our knowledge, Diamond was the first system (and is currently the only system) to provide this capability.

OpenDiamond is a registered trademark of Carnegie Mellon University.