Modi, A., Tikmany, R., Malik, T., Komondoor, R., Gehani, A. and D'Souza, D., "Kondo: Efficient Provenance-driven Data Debloating", 40th IEEE International Conference on Data Engineering (ICDE), 2024
Isolation increases the upfront costs of provisioning containers. This is due to unnecessary software and data in container images. While several static and dynamic analysis methods for pruning unnecessary software are known, less attention has been paid to pruning unnecessary data. In this paper, we address the problem of determining and reducing unused data within a containerized application. Current data lineage methods can detect data files that are never accessed in any of the observed runs, but this leads to a pessimistic amount of debloating. Our observation is that while an application may access a data file, it often accesses only a small portion of it over all its runs. Based on this observation, we present an approach and a tool, Kondo, which aims to identify the set of all possible offsets that could be accessed within the data files over all executions of the application. Kondo works by fuzzing the parameter inputs to the application and running it on the fuzzed inputs, requiring vastly fewer runs than brute-force execution over all possible parameter valuations. Our evaluation on realistic benchmarks shows that Kondo achieves a 63% reduction in data file sizes and 98% recall against the set of all required offsets, on average.
@inproceedings{Malik-ICDE-C36,
title = {Kondo: Efficient Provenance-driven Data Debloating},
author = {Modi, A. and Tikmany, R. and Malik, T. and Komondoor, R. and Gehani, A. and D'Souza, D.},
booktitle = {40th IEEE International Conference on Data Engineering (ICDE)},
publisher = {IEEE},
year = 2024,
}
2023
Malik, Tanu, "Reproducible eScience: The Data Containerization Challenge", IEEE eScience, 2023
Computational reproducibility is the cornerstone of the scientific method. We are witnessing a surge of reproducible practices emerging in scientific disciplines. The use of containers is a common theme across several of these practices. Containers isolate applications and make it simple to share code, data, and experiment settings. This paper presents current eScience infrastructures that use containers for reproducible research. We then present our view of the important research issues in data containerization for reproducible eScience. Containerization requires considering packaging, sharing, security, and language choices to ensure reproducibility and preservation. Despite its benefits, containerization can be inefficient and can impact experiment reproducibility when users stop using it or transition the experiment to different environments. We describe reproducible containers, which, in addition to containerization, include services to ensure and maintain long-term reproducibility, and we lay out a vision based on reproducible containers.
@inproceedings{Malik-eScience-C35,
title = {Reproducible eScience: The Data Containerization Challenge},
author = {Malik, Tanu},
booktitle = {IEEE eScience},
publisher = {IEEE},
year = 2023,
}
Nakamura, Y., Kanj, I. and Malik, T., "Efficient Differencing of System-level Provenance Graphs", 32nd ACM International Conference on Information and Knowledge Management (CIKM), 2023
Data provenance, when audited at the operating system level, generates a large volume of low-level events. Current provenance systems infer causal flow from these event traces, but do not infer application structure, such as loops and branches. The absence of these inferred structures decreases accuracy when comparing two event traces, leading to low-quality answers from a provenance system. In this paper, we infer nested natural and unnatural loop structures over a collection of provenance event traces. We describe an 'unrolling method' that uses the inferred nested loop structure to systematically mark loop iterations, i.e., their start and end, and thus to easily compare two event traces audited for the same application. Our loop-based unrolling improves the accuracy of trace comparison by 20-70% over trace comparisons that do not rely on inferred structures.
@inproceedings{Malik-CIKM-C34,
title = {Efficient Differencing of System-level Provenance Graphs},
author = {Nakamura, Yuta and Kanj, Iyad and Malik, Tanu},
booktitle = {32nd ACM International Conference on Information and Knowledge Management (CIKM)},
publisher = {ACM},
year = 2023,
}
Malik, T. and Khan, S., "Towards Shareable and Reproducible Cloud Computing Experiments", IEEE CloudSummit, 2023
Containerization has emerged as a systematic way of sharing experiments comprising code, data, and environment. Containerization isolates the dependencies of an experiment and allows the computational results to be regenerated. Several new advancements within containerization make it even easier to encapsulate applications and share lighter-weight containers. However, using containerization for cloud computing experiments requires further improvements at both the container runtime and infrastructure levels. In this paper, we lay out a vision for using containers as a dominant method for efficient sharing and improved reproducibility of cloud computing experiments. We advocate the use of container-compliant cloud infrastructures, the inclusion of performance profiles of the application or system architecture on which experiments were performed, and methods for statistical comparison across different container executions. We also outline challenges in achieving this vision, identify existing solutions that can be adapted, and propose new methods that can help with automation.
@inproceedings{Malik-CloudSummit-C33,
title = {Towards Shareable and Reproducible Cloud Computing Experiments},
author = {Malik, Tanu and Khan, S.},
booktitle = {IEEE CloudSummit},
publisher = {IEEE},
year = 2023,
}
Modi, A., Reyad, M., Gehani, A. and Malik, T., "Querying Container Provenance", WWW '23 Companion: Companion Proceedings of the ACM Web Conference, 2023
Containers are lightweight mechanisms for the isolation of operating system resources. They are realized by activating a set of namespaces. Given the use of containers in scientific computing, tracking and managing provenance within and across containers is becoming essential for debugging and reproducibility. In this work, we examine the properties of container provenance graphs that result from auditing containerized applications. We observe that the generated container provenance graphs are hypergraphs because one resource may belong to one or more namespaces. We examine the hierarchical behavior of the PID, mount, and user namespaces, which are more commonly activated, and show that, even when represented as hypergraphs, the resulting container provenance graphs are acyclic. We experiment with recently published container logs and identify hypergraph properties.
@inproceedings{Malik-TAPP-W19,
title = {Querying Container Provenance},
author = {Modi, Aniket and Reyad, Moaz and Malik, Tanu and Gehani, Ashish},
booktitle = {Companion Proceedings of the ACM Web Conference 2023 (WWW '23 Companion)},
publisher = {ACM},
year = 2023,
}
Niddodi, C., Gehani, A., Malik, T., Mohan, S., and Rilee, M., "IOSPReD: I/O Specialized Packaging of Reduced Datasets and Data-Intensive Applications for Efficient Reproducibility", IEEE Access, 2023
The data generated by large-scale scientific systems such as NASA's Earth Observing System Data and Information System is expected to increase substantially. Consequently, applications processing these huge volumes of data suffer from a lack of storage space at the execution site. This poses a critical challenge for sharing data and reproducing application executions with respect to specific user inputs in data-intensive applications. To address this issue, we propose IOSPReD (I/O Specialized Packaging of Reduced Datasets), a data-based debloating framework designed to automatically track and package only the necessary chunks of data (along with the application) in a container. IOSPReD uses the specific inputs provided by the user to identify the necessary data chunks. To do so, the high-level user inputs are mapped down to low-level data file offsets. We evaluate IOSPReD on different realistic NASA datasets to assess (i) the amount of data reduction, (ii) the reproducibility of results across multiple application executions, and (iii) the impact on performance.
@article{Malik-Access-J7,
title = {IOSPReD: I/O Specialized Packaging of Reduced Datasets and Data-Intensive Applications for Efficient Reproducibility},
author = {Niddodi, Chaitra and Gehani, Ashish and Malik, Tanu and Mohan, Sibin and Rilee, Michael},
journal = {IEEE Access},
publisher = {IEEE},
volume = {11},
pages = {1718-1731},
year = 2023,
}
2022
Nakamura, Y. Malik, T. Kanj, I. Gehani, A., "Provenance-based Workflow Diagnostics Using Program Specification", 29th IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 21-31, 12, 2022
Workflow management systems (WMSes) help automate and coordinate scientific modules and monitor their execution. WMSes are also used to repeat a workflow application with different inputs to test the sensitivity and reproducibility of runs. However, when differences arise in outputs across runs, current WMSes do not audit sufficient provenance metadata to determine where the execution first differed. This increases diagnostic time and leads to poor-quality diagnostic results. In this paper, we use program specification to precisely determine locations where workflow execution differs. We use existing audited provenance to isolate modules where execution differs. We show that using program specification incurs some increased storage overhead, due to mapping provenance data flows onto the program specification, but leads to better-quality diagnostics, in terms of the number of differences found and their location, relative to comparing provenance metadata audited within current WMSes.
@inproceedings{Nakamura-HIPC22-C32,
title = {Provenance-based Workflow Diagnostics Using Program Specification},
author = {Nakamura, Yuta and Malik, Tanu and Kanj, Iyad and Gehani, Ashish},
booktitle = {29th IEEE International Conference on High Performance Computing, Data, and Analytics},
publisher = {IEEE},
year = 2022,
pages = {21-31},
}
Ahmad, R. Manne, N. Malik, T., "Reproducible Notebook Containers using Application Virtualization", 18th IEEE International Conference on eScience, pp. 1-10, 10, 2022
Notebooks have gained wide popularity in scientific computing. A notebook is both a web-based interactive front-end to program workflows and a lightweight container for sharing code and its output. Reproducing notebooks in different target environments, however, is a challenge. Notebooks do not share the computational environment in which they are executed. Consequently, despite being shareable they are often not reproducible. The application virtualization (AV) method enables shareability and reproducibility of applications in heterogeneous environments. AV-based tools, however, encapsulate non-interactive, batch applications. In this paper, we present FLINC, a user-space method and tool for creating reproducible notebook containers. FLINC virtualizes the notebook process that enables interactive computation and creates notebook containers, which include the environment and all data dependencies accessed by the notebook file. It relies on provenance collected during virtualization to ensure the correct behavior of a notebook when run repeatedly in different environments. We demonstrate how FLINC exports notebook containers seamlessly to non-notebook environments. Our experiments show that FLINC creates lighter weight containers as compared to equivalent non-interactive, batch containers, and preserves the same interactive workflow for the user as in current notebook platforms.
@inproceedings{Ahmad-eScience-C31,
title = {Reproducible Notebook Containers using Application Virtualization},
author = {Ahmad, Raza and Manne, Nithin and Malik, Tanu},
booktitle = {18th IEEE International Conference on eScience},
publisher = {IEEE},
year = 2022,
pages = {1-10},
}
Manne, N. N. Satpati, S. Malik, T. Bagchi, A. Gehani, A. Chaudhary, A., "CHEX: Multiversion Replay with Ordered Checkpoints", Proceedings of the VLDB Endowment (PVLDB), vol. 15, pp. 1297-1310, 2, 2022
In scientific computing and data science disciplines, it is often necessary to share application workflows and repeat results. Current tools containerize application workflows, and share the resulting container for repeating results. These tools, due to containerization, do improve sharing of results. However, they do not improve the efficiency of replay. In this paper, we present the multiversion replay problem which arises when multiple versions of an application are containerized, and each version must be replayed to repeat results. To avoid executing each version separately, we develop CHEX, which checkpoints program state and determines when it is permissible to reuse program state across versions. It does so using system call-based execution lineage. Our capability to identify common computations across versions enables us to consider optimizing replay using an in-memory cache, based on a checkpoint …
@article{Malik-arxiv22-C30,
title = {CHEX: Multiversion Replay with Ordered Checkpoints},
author = {Manne, Naga N. and Satpati, Shilvi and Malik, Tanu and Bagchi, Amitabha and Gehani, Ashish and Chaudhary, Amitabh},
journal = {Proceedings of the VLDB Endowment},
volume = {15},
year = 2022,
pages = {1297-1310},
}
2021
That, D. T. Gharehdaghi, M. Rasin, A. Malik, T. , "LDI: Learned Distribution Index for Column Stores", 2021 IEEE International Conference on Big Data (Big Data), pp. 376-387, 12, 2021
In column stores, which ingest large amounts of data into multiple column groups, query performance deteriorates. Commercial column stores use a log-structured merge (LSM) tree on projections to ingest data rapidly. LSM improves ingestion performance, but in column stores the sort-merge phase is I/O-intensive, which slows concurrent queries and reduces overall throughput. In this paper, we aim to reduce the sorting and merging costs that arise when data is ingested in column stores. We present LDI, a learned distribution index for column stores. LDI learns a frequency-based data distribution and constructs a bucket worth of data based on the learned distribution. Filled buckets that conform to the distribution are written out to disk; unfilled buckets are retained to achieve the desired level of sortedness, thus avoiding the expensive sort-merge phase. We present an algorithm to learn and adapt to distributions, and a …
@inproceedings{Hai-BigData-C29,
title = {LDI: Learned Distribution Index for Column Stores},
author = {That, Dai-Hai T T and Gharehdaghi, Mohammadsaleh and Rasin, Alexander and Malik, Tanu},
booktitle = {2021 IEEE International Conference on Big Data (Big Data)},
publisher = {IEEE},
year = 2021,
pages = {376-387},
}
Plale, B. A. Malik, T. Pouchard, L. C. , "Reproducibility Practice in High-Performance Computing: Community Survey Results", Computing in Science & Engineering, vol. 23, pp. 55-60, 9, 2021
The integrity of science and engineering research is grounded in assumptions of rigor and transparency on the part of those engaging in such research. HPC community efforts to strengthen rigor and transparency take the form of reproducibility initiatives. In a recent survey of the SC conference community, we collected information about the SC reproducibility initiative activities. We present the survey results in this article. Results show that the reproducibility initiative activities have contributed to higher levels of awareness on the part of SC conference technical program participants, and hint at contributing to greater scientific impact for the published papers of the SC conference series. Stringent point-of-manuscript-submission verification is problematic for reasons we point out, as are the inherent difficulties of computational reproducibility in HPC. Future efforts should better decouple the community educational goals from …
@article{Malik-IEEE21-J6,
title = {Reproducibility Practice in High-Performance Computing: Community Survey Results},
author = {Plale, Beth A. and Malik, Tanu and Pouchard, Line C.},
journal = {Computing in Science & Engineering},
publisher = {IEEE},
year = 2021,
pages = {55-60},
}
That, D. H. T. Gharehdaghi, M. Rasin, A. Malik, T. , "On Lowering Merge Costs of an LSM Tree", Proceedings of the 33rd International Conference on Scientific and Statistical Database Management, 7, 2021
In column stores, which ingest large amounts of data into multiple column groups, query performance deteriorates. Commercial column stores use a log-structured merge (LSM) tree on projections to ingest data rapidly. The LSM tree improves ingestion performance, but for column stores the sort-merge maintenance phase in an LSM tree is I/O-intensive, which slows concurrent queries and reduces overall throughput. In this paper, we present a simple heuristic approach to reduce the sorting and merging costs that arise when data is ingested in column stores. We demonstrate how a Min-Max heuristic can construct buckets and identify the level of sortedness in each range of data. Filled and relatively sorted buckets are written out to disk; unfilled buckets are retained to achieve a better level of sortedness, thus avoiding the expensive sort-merge phase. We compare our Min-Max approach with the LSM tree and production columnar stores using real and synthetic datasets.
@inproceedings{Hai-SSDBM21-W18,
title = {On Lowering Merge Costs of an LSM Tree},
author = {That, Dai Hai Ton and Gharehdaghi, Mohammad and Rasin, Alexander and Malik, Tanu},
booktitle = {Proceedings of the 33rd International Conference on Scientific and Statistical Database Management},
year = 2021,
}
Malik, T. , "Artifact Description/Artifact Evaluation: A Reproducibility Bane or a Boon", Proceedings of the 4th International Workshop on Practical Reproducible Evaluation of Computer Systems, pp. 1-1, 6, 2021
Several systems research conferences now incorporate an artifact description and artifact evaluation (AD/AE) process as part of the paper submission. Authors of accepted papers optionally submit a plethora of artifacts: documentation, links, tools, code, data, and scripts for independent validation of the claims in their paper. An artifact evaluation committee (AEC) evaluates the artifacts and stamps papers with accepted artifacts, which then receive publisher badges. Does this AD/AE process serve authors and reviewers? Is it scalable for large conferences such as SCxy? Using the last three SCxy Reproducibility Initiatives as the basis, this talk will analyze the benefits and the miseries of the AD/AE process.
@inproceedings{Malik-PRECS21-K1,
title = {Artifact Description/Artifact Evaluation: A Reproducibility Bane or a Boon},
author = {Malik, Tanu},
booktitle = {Proceedings of the 4th International Workshop on Practical Reproducible Evaluation of Computer Systems},
year = 2021,
pages = {1-1},
}
Choi, YoungDon and Goodall, Jonathan and Ahmad, Raza and Malik, Tanu and Tarboton, David, "An Approach for Open and Reproducible Hydrological Modeling using Sciunit and HydroShare", EGU General Assembly Conference Abstracts, 4, 2021
@article{Choi-EGU21,
title = {An Approach for Open and Reproducible Hydrological Modeling using Sciunit and HydroShare},
author = {Choi, YoungDon and Goodall, Jonathan and Ahmad, Raza and Malik, Tanu and Tarboton, David},
journal = {EGU General Assembly Conference Abstracts},
year = 2021,
}
2020
Essawy, B. T. Goodall, J. L. Voce, D. Morsy, M. M. Sadler, J. M. Choi, Y. D. Tarboton, D. G. Malik, T. , "A taxonomy for reproducible and replicable research in environmental modelling", Environmental Modelling & Software, vol. 134, pp. 104753, 12, 2020
Despite the growing acknowledgment of the reproducibility crisis in computational science, there is still a lack of clarity around what exactly constitutes a reproducible or replicable study in many computational fields, including environmental modelling. To this end, we put forth a taxonomy that defines an environmental modelling study as being either 1) repeatable, 2) runnable, 3) reproducible, or 4) replicable. We introduce these terms with illustrative examples from hydrology using a hydrologic modelling framework along with cyberinfrastructure aimed at fostering reproducibility. Using this taxonomy as a guide, we argue that containerization is an important but lacking component needed to achieve the goal of computational reproducibility in hydrology and environmental modelling. Examples from hydrology are provided to demonstrate how new tools, including a user-friendly tool for containerization of computational …
@article{Essawy-EMS20-O5,
title = {A taxonomy for reproducible and replicable research in environmental modelling},
author = {Essawy, Bakinam T. and Goodall, Jonathan L. and Voce, Daniel and Morsy, Mohamed M. and Sadler, Jeffrey M. and Choi, Young D. and Tarboton, David G. and Malik, Tanu},
journal = {Environmental Modelling & Software},
publisher = {Elsevier},
year = 2020,
pages = {104753},
}
Wagner, J. Rasin, A. Malik, T. Grier, J. , "ODSA: Open Database Storage Access", Extending Database Technology (EDBT), 8, 2020
Applications in several areas, such as privacy, security, and integrity validation, require direct access to database management system (DBMS) storage. However, relational DBMSes are designed for physical data independence, and thus limit internal storage exposure. Consequently, applications either cannot be enabled or access storage with ad-hoc solutions, such as querying the ROWID (thereby exposing physical record location within DBMS storage but not OS storage) or using DBMS “page repair” tools that read and write DBMS data pages directly. These ad-hoc methods are difficult to program, maintain, and port across various DBMSes.
In this paper, we present a specification of programmable access to relational DBMS storage. Open Database Storage Access (ODSA) is a simple, DBMS-agnostic, easy-to-program storage interface for DBMSes. We formulate novel operations using ODSA, such as comparing page-level metadata. We present three compelling use cases that are enabled by ODSA and demonstrate how to implement them with ODSA.
@inproceedings{Wagner-EDBT20-W15,
title = {ODSA: Open Database Storage Access},
author = {Wagner, James and Rasin, Alexander and Malik, Tanu and Grier, Jonathan},
booktitle = {Extending Database Technology (EDBT)},
year = 2020,
}
Wagner, J. Rasin, A. Heart, K. Malik, T. Grier, J. , "DF-toolkit: interacting with low-level database storage", Proceedings of the VLDB Endowment, vol. 13, 8, 2020
Applications in several areas, such as privacy, security, and integrity validation, require direct access to database management system (DBMS) storage. However, relational DBMSes are designed for physical data independence, and thus limit internal storage exposure. Consequently, applications either cannot be enabled or access storage with ad-hoc solutions, such as querying the ROWID (which can expose physical record location within DBMS storage but not within OS storage) or using DBMS “page repair” tools that read and write DBMS data pages directly. Such ad-hoc methods are limited in their capabilities and difficult to program, maintain, and port across various DBMSes. In this demonstration, we showcase DF-Toolkit, a set of tools that provides abstracted access to the DBMS storage layer. Users will be able to view DBMS storage not accessible through other applications. Examples include unallocated (e.g., deleted) data, index value-pointer pairs, and cached DBMS pages in RAM. Users will also be able to interact with several special-purpose security applications that audit DBMS storage beyond what DBMS vendors support.
@article{Wagner-VLDB20-W17,
title = {DF-toolkit: interacting with low-level database storage},
author = {Wagner, James and Rasin, Alexander and Heart, Karen and Malik, Tanu and Grier, Jonathan},
journal = {Proceedings of the VLDB Endowment},
year = 2020,
}
Niddodi, C. Gehani, A. Malik, T. Navas, J. A. Mohan, S. , "MiDas: Containerizing Data-Intensive Applications with I/O Specialization", Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems, pp. 21-25, 6, 2020
Scientific applications often depend on data produced from computational models. Model-generated data can be prohibitively large. Current mechanisms for sharing and distributing reproducible applications, such as containers, assume all model data is saved and included with a program to support its successful re-execution. However, including model data increases the sizes of containers. This increases the cost and time required for deployment and further reuse. We present a framework named MiDas (Minimizing Datasets) for specializing I/O libraries which, given an application, automates the process of identifying and including only a subset of the data accessed by the program. To do this, MiDas combines static and dynamic analysis techniques to map high-level user inputs to low-level file offsets. We show several orders of magnitude reduction in data size via specialization of I/O libraries associated with …
@inproceedings{Niddodi-PRECS20-W16,
title = {MiDas: Containerizing Data-Intensive Applications with I/O Specialization},
author = {Niddodi, Chaitra and Gehani, Ashish and Malik, Tanu and Navas, Jorge A. and Mohan, Sibin},
booktitle = {Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems},
year = 2020,
pages = {21-25},
}
Chuah, J. Deeds, M. Malik, T. Choi, Y. Goodall, J. L. , "Documenting computing environments for reproducible experiments", Parallel Computing: Technology Trends, pp. 756-765, 2020
Establishing the reproducibility of an experiment often requires repeating the experiment in its native computing environment. Containerization tools provide declarative interfaces for documenting native computing environments. Declarative documentation, however, may not precisely recreate the native computing environment because of human errors or dependency conflicts. An alternative is to trace the native computing environment during application execution. Tracing, however, does not generate declarative documentation.
@inproceedings{Malik-PARCO20-C26,
title = {Documenting computing environments for reproducible experiments},
author = {Chuah, Jason and Deeds, Madeline and Malik, Tanu and Choi, Youngdon and Goodall, Jonathan L.},
booktitle = {Parallel Computing: Technology Trends},
publisher = {IOS Press},
year = 2020,
pages = {756-765},
}
Ahmad, R. Nakamura, Y. Manne, N. N. Malik, T., "PROV-CRT: Provenance Support for Container Runtimes", 12th International Workshop on Theory and Practice of Provenance (TaPP 2020), 2020
A container runtime isolates computations and their associated data dependencies and is thus useful for porting applications to new machines. Current container runtimes, such as LXC and Docker, however, do not automatically track provenance, which is essential for verifying computations. We demonstrate PROV-CRT, a provenance module in a container runtime that tracks the provenance of computations during container creation and uses audited provenance to compare computations during container replay. We show how this module simplifies and improves the efficiency of complex container management tasks, such as classifying container contents and incrementally replaying containerized applications.
@inproceedings{Malik-IPAW20-P5,
title = {{PROV-CRT}: Provenance Support for Container Runtimes},
author = {Ahmad, Raza and Nakamura, Yuta and Manne, Naga N M and Malik, Tanu},
booktitle = {12th International Workshop on Theory and Practice of Provenance (TaPP 2020)},
year = 2020,
}
Nakamura, Y. Ahmad, R. Malik, T. , "Content-defined Merkle Trees for Efficient Container Delivery", 28th IEEE International Conference on High Performance Computing, Data, & Analytics, 2020
Containerization simplifies the sharing and deployment of applications when environments change in the software delivery chain. To deploy an application, container delivery methods push and pull container images. These methods operate at file and layer (set of files) granularity, and introduce redundant data within a container. Several container operations, such as upgrading, installing, and maintaining, become inefficient because of the copying and provisioning of redundant data. In this paper, we reestablish recent results that block-level deduplication reduces the size of individual containers, by verifying the result using content-defined chunking. Block-level deduplication, however, does not improve the efficiency of push/pull operations, which must determine the specific blocks to transfer. We introduce a content-defined Merkle Tree (CDMT) over deduplicated storage in a container. CDMT indexes deduplicated …
@inproceedings{Malik-HiPC20-C28,
title = {Content-defined Merkle Trees for Efficient Container Delivery},
author = {Nakamura, Yuta and Ahmad, Raza and Malik, Tanu},
booktitle = {28th IEEE International Conference on High Performance Computing, Data, & Analytics},
year = 2020,
}
Nakamura, Y. Malik, T. Gehani, A. , "Efficient provenance alignment in reproduced executions", 12th International Workshop on Theory and Practice of Provenance (TaPP 2020), 2020
Reproducing experiments entails repeating experiments with changes. Changes, such as a change in input arguments, a change in the invoking environment, or a change due to nondeterminism in the runtime, may alter results. If results differ significantly, perusing them is not sufficient; users must analyze the impact of a change and determine whether the experiment computed the same steps. Making fine-grained, stepwise comparisons can be both challenging and time-consuming. In this paper, we compare a reproduced execution with the recorded system provenance of the original execution, and determine a provenance alignment. The alignment is based on comparing the specific location in the program, the control flow of the execution, and data inputs. Experiments show that the alignment method has a low overhead to compute a match and realigns with a small look-ahead buffer.
@inproceedings{Malik-TaPP20-C27,
title = {Efficient provenance alignment in reproduced executions},
author = {Nakamura, Yuta and Malik, Tanu and Gehani, Ashish},
booktitle = {12th International Workshop on Theory and Practice of Provenance (TaPP 2020)},
year = 2020,
}
2019
Youngdahl, A. Ton-That, D. Malik, T. , "SciInc: A Container Runtime for Incremental Recomputation", 2019 15th International Conference on eScience (eScience), pp. 291-300, 9, 2019
The conduct of reproducible science improves when computations are portable and verifiable. A container runtime provides an isolated environment for running computations and thus is useful for porting applications to new machines. Current container engines, such as LXC and Docker, however, do not track provenance, which is essential for verifying computations. In this paper, we present SciInc, a container runtime that tracks the provenance of computations during container creation. We show how container engines can use audited provenance data for efficient container replay. SciInc observes inputs to computations and, if they change, propagates the changes, re-using partially memoized computations and data that are identical across the replay and the original run. We chose light-weight data structures for storing the provenance trace to maintain the invariant of a shareable and portable container runtime. To …
@inproceedings{Malik-eScience19-C25,
title = {SciInc: A Container Runtime for Incremental Recomputation},
author = {Youngdahl, Andrew and Ton-That, Dai-Hai and Malik, Tanu},
booktitle = {2019 15th International Conference on eScience (eScience)},
publisher = {IEEE},
year = 2019,
pages = {291-300},
}
Missier, P. Malik, T. Cala, J. , "Report on the first international workshop on incremental re-computation: Provenance and beyond", ACM SIGMOD Record, vol. 47, pp. 35-38, 5, 2019
In the last decade, advances in computing have deeply transformed data processing. Increasingly systems aim to process massive amounts of data efficiently, often with fast response times that are typically characterised by the 4V's, i.e., Volume, Variety, Velocity, and Veracity. While fast data processing is desirable, it is also often the case that the outcomes of computationally expensive processes become obsolete over time, due to changes in inputs, reference datasets, tools, libraries, and deployment environment. Given massive data processing, such changes must be carefully accounted for, and their impact on original computation assessed, to determine how much re-computation is needed in response to changes.
@article{Malik-IPRB19-R4,
title = {Report on the first international workshop on incremental re-computation: Provenance and beyond},
author = {Missier, Paolo and Malik, Tanu and Cala, Jacek},
journal = {ACM SIGMOD Record},
publisher = {ACM},
year = 2019,
pages = {35-38},
}
That, D. H. T. Wagner, J. Rasin, A. Malik, T. , "PLI+: Efficient Clustering of Cloud Databases", Distributed and Parallel Databases, vol. 37, pp. 177-208, 3, 2019
Commercial cloud database services increase availability of data and provide reliable access to data. Routine database maintenance tasks such as clustering, however, increase the costs of hosting data on commercial cloud instances. Clustering causes an I/O burst; clustering in one shot depletes the I/O credit accumulated by an instance and increases the cost of hosting data. An unclustered database decreases query performance by scanning large amounts of data, gradually depleting I/O credits. In this paper, we introduce Physical Location Index Plus (PLI+), an indexing method for databases hosted on commercial clouds. PLI+ relies on internal knowledge of data layout, building a physical location index, which maps a range of physical co-locations with a range of attribute values to create approximately sorted buckets. As new data is inserted, writes are partitioned in memory based on incoming data …
@article{Hai-DAPD19-J5,
title = {PLI+: Efficient Clustering of Cloud Databases},
author = {That, Dai H T T and Wagner, James and Rasin, Alexander and Malik, Tanu},
journal = {Distributed and Parallel Databases},
publisher = {Springer US},
year = 2019,
pages = {177-208},
}
2018
Sadler, J. Essawy, B. Goodall, J. Voce, D. Choi, Y. Morsy, M. Yuan, Z. Malik, T., "Leveraging Scientific Cyberinfrastructures to Achieve Computational Hydrologic Model Reproducibility", AGU Fall Meeting Abstracts, vol. 2018, pp. C13J-1252, 12, 2018
Achieving reproducibility of computational models and workflows is an important challenge that calls for open and reusable code and data, well-documented workflows, and controlled environments that allow others to verify published findings. HydroShare (http://www.hydroshare.org) and GeoTrust (http://geotrusthub.org/), two new cyberinfrastructure tools under active development, have the potential to address this challenge in the field of computational hydrology. HydroShare is a web-based system for sharing hydrologic data and models as digital resources. HydroShare allows hydrologists to upload model input data resources, add detailed hydrologic-specific metadata to these resources, and interact with the data directly within HydroShare for collaborative modeling using tools like JupyterHub. GeoTrust provides tools for scientists to efficiently reproduce, track and share geoscience applications by building …
@article{Essawy-AGU18,
title = {Leveraging Scientific Cyberinfrastructures to Achieve Computational Hydrologic Model Reproducibility},
author = {Sadler, J. M. and Essawy, B. T. and Goodall, J. L. and Voce, D. and Choi, Y. and Morsy, M. M. and Yuan, Z. and Malik, T.},
journal = {AGU Fall Meeting Abstracts},
year = 2018,
pages = {C13J-1252},
}
Rasin, A. Malik, T. Wagner, J. Kim, C. , "Where Provenance in Database Storage", International Provenance and Annotation Workshop, pp. 231-235, 7, 2018
Where provenance is a relationship between a data item and the location from which this data was copied. In a DBMS, a typical use of where provenance is in establishing a copy-by-address relationship between the output of a query and the particular data value(s) that originated it. Normal DBMS operations create a variety of auxiliary copies of the data (e.g., indexes, MVs, cached copies). These copies exist over time with relationships that evolve continuously – (A) indexes maintain the copy with a reference to the origin value, (B) MVs maintain the copy without a reference to the source table, (C) cached copies are created once and are never maintained. A query may be answered from any of these auxiliary copies; however, this where provenance is not computed or maintained. In this paper, we describe sources from which forensic analysis of storage can derive where provenance of table data.
@inproceedings{Rasin-TaPP18-P4,
title = {Where Provenance in Database Storage},
author = {Rasin, Alexander and Malik, Tanu and Wagner, James and Kim, Caleb},
booktitle = {International Provenance and Annotation Workshop},
publisher = {Springer, Cham},
year = 2018,
pages = {231-235},
}
Essawy, B. T. Goodall, J. L. Zell, W. Voce, D. Morsy, M. M. Sadler, J. Yuan, Z. Malik, T. , "Integrating scientific cyberinfrastructures to improve reproducibility in computational hydrology: Example for HydroShare and GeoTrust", Environmental Modelling & Software, vol. 105, pp. 217-229, 7, 2018
The reproducibility of computational environmental models is an important challenge that calls for open and reusable code and data, well-documented workflows, and controlled environments that allow others to verify published findings. This requires an ability to document and share raw datasets, data preprocessing scripts, model inputs, outputs, and the specific model code with all associated dependencies. HydroShare and GeoTrust, two scientific cyberinfrastructures under development, can be used to improve reproducibility in computational hydrology. HydroShare is a web-based system for sharing hydrologic data and models as digital resources including detailed, hydrologic-specific resource metadata. GeoTrust provides tools for scientists to efficiently reproduce and share geoscience applications. This paper outlines a use case example, which focuses on a workflow that uses the MODFLOW model, to …
@article{Essawy-EMS18-O4,
title = {Integrating scientific cyberinfrastructures to improve reproducibility in computational hydrology: Example for HydroShare and GeoTrust},
author = {Essawy, Bakinam T. and Goodall, Jonathan L. and Zell, Wesley and Voce, Daniel and Morsy, Mohamed M. and Sadler, Jeffrey and Yuan, Zhihao and Malik, Tanu},
journal = {Environmental Modelling & Software},
publisher = {Elsevier},
year = 2018,
pages = {217-229},
}
Pham, Q. Malik, T. That, D. H. T. Youngdahl, A. , "Improving Reproducibility of Distributed Computational Experiments", Proceedings of the First International Workshop on Practical Reproducible Evaluation of Computer Systems, pp. 1-6, 6, 2018
Conference and journal publications increasingly require experiments associated with a submitted article to be repeatable. Authors comply with this requirement by sharing all associated digital artifacts, i.e., code, data, and environment configuration scripts. To ease the aggregation of these digital artifacts, several tools have recently emerged that automate the process by auditing an experiment's execution and building a portable container of code, data, and environment. However, current tools only package non-distributed computational experiments. Distributed computational experiments must either be packaged manually or supplemented with sufficient documentation.
@inproceedings{Malik-PRECS18-W13,
title = {Improving Reproducibility of Distributed Computational Experiments},
author = {Pham, Quan and Malik, Tanu and That, Dai H T T and Youngdahl, Andrew},
booktitle = {Proceedings of the First International Workshop on Practical Reproducible Evaluation of Computer Systems},
year = 2018,
pages = {1-6},
}
Yuan, Z. That, D. H. T. Kothari, S. Fils, G. Malik, T. , "Utilizing provenance in reusable research objects", Informatics, vol. 5, pp. 14, 3, 2018
Science is conducted collaboratively, often requiring the sharing of knowledge about computational experiments. When experiments include only datasets, they can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom includes only datasets, but more often includes software, its past execution, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While a necessary method, mere aggregation is not sufficient for the sharing of computational experiments. Other users must be able to easily recompute on these shared research objects. Computational provenance is often the key to enable such reuse. In this paper, we show how reusable research objects can utilize provenance to correctly repeat a previous reference execution, to construct a subset of a research object for partial reuse, and to reuse existing contents of a research object for modified reuse. We describe two methods to summarize provenance that aid in understanding the contents and past executions of a research object. The first method obtains a process-view by collapsing low-level system information, and the second method obtains a summary graph by grouping related nodes and edges, with the goal of obtaining a graph view similar to the application workflow. Through detailed experiments, we show the efficacy and efficiency of our algorithms.
@article{Malik-Informatics18-J4,
title = {Utilizing provenance in reusable research objects},
author = {Yuan, Zhihao and That, Dai H T T and Kothari, Siddhant and Fils, Gabriel and Malik, Tanu},
journal = {Informatics},
publisher = {Multidisciplinary Digital Publishing Institute},
year = 2018,
pages = {14},
}
Wagner, J. Rasin, A. Heart, K. Malik, T. Furst, J. Grier, J. , "Detecting database file tampering through page carving", 21st International Conference on Extending Database Technology, 3, 2018
Database Management Systems (DBMSes) secure data against regular users through defensive mechanisms such as access control, and against privileged users with detection mechanisms such as audit logging. Interestingly, these security mechanisms are built into the DBMS and are thus only useful for monitoring or stopping operations that are executed through the DBMS API. Any access that involves directly modifying database files (at file system level) would, by definition, bypass any and all security layers built into the DBMS itself. In this paper, we propose and evaluate an approach that detects direct modifications to database files that have already bypassed the DBMS and its internal security mechanisms. Our approach applies forensic analysis to first validate database indexes and then compares index state with data in the DBMS tables. We show that indexes are much more difficult to modify and can be further fortified with hashing. Our approach supports most relational DBMSes by leveraging index structures that are already built into the system to detect database storage tampering that would currently remain undetectable.
@inproceedings{Wagner-EDBT18-C24,
title = {Detecting database file tampering through page carving},
author = {Wagner, James and Rasin, Alexander and Heart, Karen and Malik, Tanu and Furst, Jacob and Grier, Jonathan},
booktitle = {21st International Conference on Extending Database Technology},
year = 2018,
}
Malik, T. Rasin, A. Youngdahl, A. , "Using Provenance for Generating Automatic Citations", 10th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2018), 2018
When computational experiments include only datasets, they can be shared through Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs) that point to these resources. However, experiments seldom include only datasets, but most often also include software, execution results, provenance, and other associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While an entire Research Object may be citable using a URI or a DOI, it is often desirable to cite specific sub-components of a research object to help identify, authorize, date, and retrieve the published sub-components of these objects. In this paper, we present an approach to automatically generate citations for sub-components of research objects by using the object's recorded provenance traces. The generated citations can be used as is or taken as suggestions that can be grouped and combined to produce higher-level citations.
@inproceedings{Malik-TaPP18-W14,
title = {Using Provenance for Generating Automatic Citations},
author = {Malik, Tanu and Rasin, Alexander and Youngdahl, Andrew},
booktitle = {10th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2018)},
year = 2018,
}
Essawy, B. T. Goodall, J. L. Morsy, M. M. Zell, W. Sadler, J. Malik, T. Yuan, Z. Voce, D. , "Achieving Reproducible Computational Hydrologic Models by Integrating Scientific Cyberinfrastructures", 9th International Congress on Environmental Modelling and Software, 2018
Reproducibility of computational workflows is an important challenge that calls for open and reusable code and data, well-documented workflows, and controlled environments that allow others to verify published findings. HydroShare (http://www.hydroshare.org) and GeoTrust (http://geotrusthub.org/), two new cyberinfrastructure tools under active development, can be used to improve reproducibility in computational hydrology. HydroShare is a web-based system for sharing hydrologic data and model resources. HydroShare offers hydrologists the capability to upload model input data as resources, add hydrologic-specific metadata to these resources, and use the data directly within HydroShare for collaborative modeling using tools like JupyterHub. GeoTrust provides tools for scientists to efficiently reproduce, track and share geoscience applications by building 'sciunits', which are efficient, lightweight, self-contained packages of computational experiments that can be guaranteed to repeat or reproduce regardless of deployment challenges. We will present a use case example focusing on a workflow that uses the MODFLOW model to demonstrate how HydroShare and GeoTrust can be integrated to easily and efficiently reproduce computational workflows. This use case example automates pre-processing of model inputs, model execution, and post-processing of model output. This work demonstrates how the integration of HydroShare and GeoTrust ensures the logical and physical preservation of computation workflows and that reproducibility can be achieved by replicating the original sciunit, modifying it to produce a new sciunit and finally …
@inproceedings{Essawy-IEMS18,
title = {Achieving Reproducible Computational Hydrologic Models by Integrating Scientific Cyberinfrastructures},
author = {Essawy, Bakinam T. and Goodall, Jonathan L. and Morsy, Mohamed M. and Zell, Wesley and Sadler, Jeffrey and Malik, Tanu and Yuan, Zhihao and Voce, Daniel},
booktitle = {9th International Congress on Environmental Modelling and Software},
year = 2018,
}
2017
Malik, T. Tarboton, D. G. Goodall, J. L. Choi, E. Bhatt, A. Peckham, S. D. Foster, I. That, D. T. Essawy, B. Yuan, Z. Dash, P. Fils, G. Gan, T. Fadugba, O. I. Saxena, A. Valentic, T. A. , "GeoTrust Hub: A Platform For Sharing And Reproducing Geoscience Applications", AGU Fall Meeting Abstracts, vol. 2017, pp. IN43A-0068, 12, 2017
Recent requirements of scholarly communication emphasize the reproducibility of scientific claims. Text-based research papers are considered poor mediums to establish reproducibility. Papers must be accompanied by research objects, aggregations of digital artifacts that together with the paper provide an authoritative record of a piece of research. We will present GeoTrust Hub (http://geotrusthub.org), a platform for creating, sharing, and reproducing reusable research objects. GeoTrust Hub provides tools for scientists to create 'geounits', reusable research objects. Geounits are self-contained, annotated, and versioned containers that describe and package computational experiments in an efficient and light-weight manner. Geounits can be shared on public repositories such as HydroShare and FigShare, and also, using their respective APIs, reproduced on provisioned clouds. The latter feature enables science …
@article{Malik-AGU17,
title = {GeoTrust Hub: A Platform For Sharing And Reproducing Geoscience Applications},
author = {Malik, Tanu and Tarboton, David G. and Goodall, Jonathan L. and Choi, Eunseo and Bhatt, Asti and Peckham, Scott D. and Foster, Ian and That, D. H. T. and Essawy, B. and Yuan, Z. and Dash, P. K. and Fils, Gabriel and Gan, Tian and Fadugba, Oluwaseun I. and Saxena, Arushi and Valentic, Todd A.},
journal = {AGU Fall Meeting Abstracts},
year = 2017,
pages = {IN43A-0068},
}
Goodall, J. L. Castronova, A. M. Bandaragoda, C. Morsy, M. M. Sadler, J. M. Essawy, B. Tarboton, D. G. Malik, T. Nijssen, B. Clark, M. P. Liu, Y. Wang, S. , "Cyberinfrastructure to Support Collaborative and Reproducible Computational Hydrologic Modeling", AGU Fall Meeting Abstracts, vol. 2017, pp. H14H-05, 12, 2017
Creating cyberinfrastructure to support reproducibility of computational hydrologic models is an important research challenge. Addressing this challenge requires open and reusable code and data with machine- and human-readable metadata, organized in ways that allow others to replicate results and verify published findings. Specific digital objects that must be tracked for reproducible computational hydrologic modeling include (1) raw initial datasets, (2) data processing scripts used to clean and organize the data, (3) processed model inputs, (4) model results, and (5) the model code with an itemization of all software dependencies and computational requirements. HydroShare is a cyberinfrastructure under active development designed to help users store, share, and publish digital research products in order to improve reproducibility in computational hydrology, with an architecture supporting hydrologic-specific …
@article{Essawy-AGU17,
title = {Cyberinfrastructure to Support Collaborative and Reproducible Computational Hydrologic Modeling},
author = {Goodall, Jonathan L. and Castronova, Anthony M. and Bandaragoda, Christina and Morsy, Mohamed M. and Sadler, Jeffrey M. and Essawy, Bakinam and Tarboton, David G. and Malik, Tanu and Nijssen, Bart and Clark, Martyn P. and Liu, Yan and Wang, Shao-Wen},
journal = {AGU Fall Meeting Abstracts},
year = 2017,
pages = {H14H-05},
}
Ton That, D. H. Fils, G. Yuan, Z. Malik, T., "Sciunits: Reusable Research Objects", 2017 IEEE 13th International Conference on e-Science (e-Science), pp. 374-383, 10, 2017
Science is conducted collaboratively, often requiring knowledge sharing about computational experiments. When experiments include only datasets, they can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom includes only datasets, but more often includes software, its past execution, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While a necessary method, mere aggregation is not sufficient for the sharing of computational experiments. Other users must be able to easily recompute on these shared research objects. In this paper, we present the sciunit, a reusable research object in which aggregated content is recomputable. We describe a Git-like client that efficiently creates …
@inproceedings{Malik-eScience17-C23,
title = {Sciunits: Reusable Research Objects},
author = {Ton That, Dai Hai and Fils, Gabriel and Yuan, Zhihao and Malik, Tanu},
booktitle = {2017 IEEE 13th International Conference on e-Science (e-Science)},
publisher = {IEEE},
year = 2017,
pages = {374-383},
}
Wagner, J. Rasin, A. That, D. H. T. Malik, T. , "PLI: Augmenting live databases with custom clustered indexes", Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1-6, 6, 2017
RDBMSes support only one clustered index per database table that can speed up query processing. Database applications that continually ingest large amounts of data experience slow query response times or even long downtimes, as the clustered index ordering must be strictly maintained. In this paper, we show that such application slowdown or downtime can often be avoided if database systems expose the physical location of attributes that are completely or approximately clustered.
@inproceedings{Hai-SSDBM17-W12,
title = {PLI: Augmenting live databases with custom clustered indexes},
author = {Wagner, James and Rasin, Alexander and Ton That, Dai Hai and Malik, Tanu},
booktitle = {Proceedings of the 29th International Conference on Scientific and Statistical Database Management},
year = 2017,
pages = {1-6},
}
Wagner, J. Rasin, A. Malik, T. Heart, K. Jehle, H. Grier, J. , "Database forensic analysis with DBCarver", CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, 1, 2017
The increasing use of databases in the storage of critical and sensitive information in many organizations has led to an increase in the rate at which databases are exploited in computer crimes. While there are several techniques and tools available for database forensics, they mostly assume a priori database preparation, such as relying on tamper-detection software to be in place or the use of detailed logging. Investigators, alternatively, need forensic tools and techniques that work on poorly-configured databases and make no assumptions about the extent of damage in a database. In this paper, we present DBCarver, a tool for reconstructing database content from a database image without using any log or system metadata. The tool uses page carving to reconstruct both queryable data and non-queryable data (deleted data). We describe how the two kinds of data can be combined to enable a variety of forensic analysis questions hitherto unavailable to forensic investigators. We show the generality and efficiency of our tool across several databases through a set of robust experiments.
@inproceedings{Wagner-CIDR17-C22,
title = {Database forensic analysis with DBCarver},
author = {Wagner, James and Rasin, Alexander and Malik, Tanu and Heart, Karen and Jehle, Hugo and Grier, Jonathan},
booktitle = {CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research},
year = 2017,
}
2016
Balasubramani, B. S. Shivaprabhu, V. R. Krishnamurthy, S. Cruz, I. F. Malik, T. , "Ontology-based urban data exploration", Proceedings of the 2nd ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics, pp. 1-8, 10, 2016
Cities are actively creating open data portals to enable predictive analytics of urban data. However, the large number of observable patterns that can be extracted as rules by techniques such as Association Rule Mining (ARM) makes sifting through patterns a tedious and time-consuming task. In this paper, we explore the use of domain ontologies to: (i) filter and prune rules that are variations of a more general concept in the ontology, and (ii) replace groups of rules by a single general rule, with the intent of reducing the number of initial rules while preserving the semantics. We show how the combination of several methods significantly reduces the number of rules, effectively allowing city administrators to use open data to generate patterns, use them for decision making, and better direct limited government resources.
@inproceedings{Malik-SMC16-W10,
title = {Ontology-based urban data exploration},
author = {Balasubramani, Booma S. and Shivaprabhu, Vivek R. and Krishnamurthy, Smitha and Cruz, Isabel F. and Malik, Tanu},
booktitle = {Proceedings of the 2nd ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics},
year = 2016,
pages = {1-8},
}
Li, X. Xu, X. Malik, T. , "Interactive provenance summaries for reproducible science", 2016 IEEE 12th International Conference on e-Science (e-Science), pp. 355-360, 10, 2016
Recorded provenance facilitates reproducible science. Provenance metadata can help determine how data were possibly transformed, processed, and derived from original sources. While provenance is crucial for verification and validation, there remains the issue of granularity, i.e., the level of detail at which provenance data must be provided to a user, especially for conducting reproducible science. When data are reproduced successfully, the need for detailed provenance is minimal and a summary of the recorded provenance suffices. However, when data are not reproduced correctly, users want to quickly drill down into fine-grained provenance to understand the causes of failure. In this paper, we describe a drill-up/drill-down method for exploring provenance traces. The drill-up method summarizes the trace by grouping nodes and edges of the trace that have the same derivation histories. The method preserves provenance data …
@inproceedings{Malik-eScience16-W11,
title = {Interactive provenance summaries for reproducible science},
author = {Li, Xiang and Xu, Xiaoyang and Malik, Tanu},
booktitle = {2016 IEEE 12th International Conference on e-Science (e-Science)},
publisher = {IEEE},
year = 2016,
pages = {355-360},
}
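A minimal sketch of the drill-up grouping described above, assuming a provenance DAG encoded as two hypothetical dictionaries (node types and derived-from edges); nodes are grouped by a recursive derivation signature. The paper's summarization preserves more of the trace than this toy version.

```python
# Sketch of a drill-up summary: nodes of a provenance DAG are grouped when
# they share the same derivation "signature" (node type plus the signatures
# of everything they were derived from). Graph encoding is hypothetical.
from functools import lru_cache
from collections import defaultdict

node_type = {"a1": "download", "a2": "download", "b1": "clean", "b2": "clean", "c": "merge"}
derived_from = {"b1": ["a1"], "b2": ["a2"], "c": ["b1", "b2"]}   # edges: node -> sources

@lru_cache(maxsize=None)
def signature(node):
    parents = tuple(sorted(signature(p) for p in derived_from.get(node, [])))
    return (node_type[node], parents)

def drill_up(nodes):
    groups = defaultdict(list)
    for n in nodes:
        groups[signature(n)].append(n)
    return list(groups.values())

print(drill_up(node_type))    # [['a1', 'a2'], ['b1', 'b2'], ['c']]
```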
Essawy, B. T. Goodall, J. L. Malik, T. Xu, H. Conway, M. Gil, Y. , "Challenges with Maintaining Legacy Software to Achieve Reproducible Computational Analyses: An Example for Hydrologic Modeling Data Processing Pipelines", iEMSs Conference, 2016
In hydrology, like many other scientific disciplines with large computational demands, scientists have created a significant and growing collection of software tools for data manipulation, analysis, and simulation. While core computational model software is likely to be well maintained by the groups that develop these codes, other software, such as data pre- and post-processing tools, used less often but still critical to scientists, may receive less attention. These codes will become “legacy” software, simply meaning that the software is out of date by modern standards. A challenge facing the scientific community is how to maintain this legacy software so that it achieves reproducible results now and in the future, with minimal investment of resources. This talk will present an example of this problem in hydrology with the pre-processing tools used to create a Variable Infiltration Capacity (VIC) model simulation. The data processing pipeline for creating the input files for VIC is complex, requiring code written over the years by various student researchers and sometimes requiring out-of-date compilers (e.g., FORTRAN 77) to compile portions of the code. We are confident that the use of legacy software is not a unique problem for VIC, but rather a wider problem common with other hydrologic models and scientific modeling in general. Through prior work, we have automated a VIC data processing pipeline, but moving these pipelines to new machines remains a significant challenge due in large part to the need to install legacy software dependencies. This work takes the following steps to address these challenges. The first step is to create containers using …
@inproceedings{Essawy-IEMS16,
title = {Challenges with Maintaining Legacy Software to Achieve Reproducible Computational Analyses: An Example for Hydrologic Modeling Data Processing Pipelines},
author = {Essawy, Bakinam T and Goodall, Jonathan L and Malik, Tanu and Xu, Hao and Conway, Michael and Gil, Yolanda},
booktitle = {iEMSs Conference},
year = 2016,
}
2015
Malik, T. Foster, I. Goodall, J. L. Peckham, S. D. Baker, J. B. Gurnis, M. , "Personalized, Shareable Geoscience Dataspaces For Simplifying Data Management and Improving Reproducibility", AGU Fall Meeting Abstracts, vol. 2015, pp. IN21E-01, 12, 2015
Research activities are iterative, collaborative, and now data- and compute-intensive. Such research activities mean that even the many researchers who work in small laboratories must often create, acquire, manage, and manipulate much diverse data and keep track of complex software. They face difficult data and software management challenges, and data sharing and reproducibility are neglected. There is significant federal investment in powerful cyberinfrastructure, in part to lessen the burden associated with modern data- and compute-intensive research. Similarly, geoscience communities are establishing research repositories to facilitate data preservation. Yet we observe a large fraction of the geoscience community continues to struggle with data and software management. The reason, studies suggest, is not lack of awareness but rather that tools do not adequately support time-consuming data life cycle …
@article{Malik-AGU15-A8,
title = {Personalized, Shareable Geoscience Dataspaces For Simplifying Data Management and Improving Reproducibility},
author = {Malik, Tanu and Foster, Ian and Goodall, Jonathan L and Peckham, Scott D and Baker, Joseph B and Gurnis, Michael},
journal = {AGU Fall Meeting Abstracts},
year = 2015,
pages = {IN21E-01},
}
Pham, Q. Thaler, S. Malik, T. Foster, I. Glavic, B. , "Sharing and reproducing database applications", Proceedings of the VLDB Endowment, vol. 8, pp. 1988-1991, 8, 2015
Sharing and repeating scientific applications is crucial for verifying claims, reproducing experimental results (e.g., to repeat a computational experiment described in a publication), and promoting reuse of complex applications. The predominant methods of sharing and making applications repeatable are building a companion web site and/or provisioning a virtual machine image (VMI). Recently, application virtualization (AV) has emerged as a lightweight alternative for sharing and efficient repeatability. AV approaches such as Linux Containers create a chroot-like environment [4], while approaches such as CDE [1] trace system calls during application execution to copy all binaries, data, and software dependencies into a self-contained package.
@article{Malik-VLDB15-C20,
title = {Sharing and reproducing database applications},
author = {Pham, Quan and Thaler, Severin and Malik, Tanu and Foster, Ian and Glavic, Boris},
journal = {Proceedings of the VLDB Endowment},
publisher = {VLDB Endowment},
year = 2015,
pages = {1988-1991},
}
Madduri, R. Rodriguez, A. Uram, T. Heitmann, K. Malik, T. Sehrish, S. Chard, R. Cholia, S. Paterno, M. Kowalkowski, J. Habib, S. , "PDACS: a portal for data analysis services for cosmological simulations", Computing in Science & Engineering, vol. 17, pp. 18-26, 7, 2015
PDACS (Portal for Data Analysis Services for Cosmological Simulations) is a Web-based analysis portal that provides access to large simulations and large-scale parallel analysis tools to the research community. It provides opportunities to access, transfer, manipulate, search, and record simulation data, as well as to contribute applications and carry out (possibly complex) computational analyses of the data. PDACS also enables wrapping of analysis tools written in a large number of languages within its workflow system, providing a powerful way to carry out multilevel/multistep analyses. The system allows for cross-layer provenance tracking, implementing a transparent method for sharing workflow specifications, as well as a convenient mechanism for checking reproducibility of results generated by the workflows. Users are able to submit their own tools to the system and to share tools with the rest of the community.
@article{Malik-CISE15-J3,
title = {PDACS: a portal for data analysis services for cosmological simulations},
author = {Madduri, Ravi and Rodriguez, Alex and Uram, Thomas and Heitmann, Katrin and Malik, Tanu and Sehrish, Saba and Chard, Ryan and Cholia, Shreyas and Paterno, Marc and Kowalkowski, Jim and Habib, Salman},
journal = {Computing in Science & Engineering},
publisher = {IEEE},
year = 2015,
pages = {18-26},
}
Meng, H. Kommineni, R. Pham, Q. Gardner, R. Malik, T. Thain, D. , "An invariant framework for conducting reproducible computational science", Journal of Computational Science, vol. 9, pp. 137-142, 7, 2015
Computational reproducibility depends on the ability to not only isolate necessary and sufficient computational artifacts but also to preserve those artifacts for later re-execution. Both isolation and preservation present challenges in large part due to the complexity of existing software and systems as well as the implicit dependencies, resource distribution, and shifting compatibility of systems that result over time—all of which conspire to break the reproducibility of an application. Sandboxing is a technique that has been used extensively in OS environments in order to isolate computational artifacts. Several tools were proposed recently that employ sandboxing as a mechanism to ensure reproducibility. However, none of these tools preserve the sandboxed application for re-distribution to a larger scientific community, an aspect that is as crucial for ensuring reproducibility as sandboxing itself. In this paper, we …
@article{Malik-JCCS15-J2,
title = {An invariant framework for conducting reproducible computational science},
author = {Meng, Haiyan and Kommineni, Rupa and Pham, Quan and Gardner, Robert and Malik, Tanu and Thain, Douglas},
journal = {Journal of Computational Science},
publisher = {Elsevier},
year = 2015,
pages = {137-142},
}
Pham, Q. Malik, T. , "GEN: a database interface generator for HPC programs", Proceedings of the 27th International Conference on Scientific and Statistical Database Management, pp. 1-5, 6, 2015
In this paper, we present GEN, an interface generator that takes user-supplied C declarations and provides the necessary interface needed to load and access data from common scientific array databases such as SciDB and Rasdaman. GEN can be used for storing the output of parallel computations directly into the database and automates the previously used, inefficient ingestion process, which requires developing special database schemas for each computation. Further, GEN requires no modifications to existing C code and can build a working interface in minutes. We show how GEN can be used by cosmology analysis programs to output data sets to a database in real time for subsequent analysis. We show that GEN introduces modest overhead in program execution but is more efficient than writing to files and then loading. More importantly, it significantly reduces the programmatic overhead of …
@book{Malik-SSDBM15-W9,
title = {GEN: a database interface generator for HPC programs},
author = {Pham, Quan and Malik, Tanu},
year = 2015,
pages = {1-5},
}
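To illustrate the kind of translation GEN automates, here is a minimal sketch that derives an array-database ingest schema from a C struct declaration; the regex-based parsing is deliberately simplistic and the emitted DDL only loosely imitates SciDB-style syntax, so treat it as illustrative rather than as GEN's actual output.

```python
# Sketch of the GEN idea: derive an array-database ingest schema from a C
# struct declaration. The parsing is simplistic and the emitted DDL only
# loosely imitates SciDB-style syntax.
import re

C_TO_DB = {"double": "double", "float": "float", "int": "int32", "long": "int64"}

def schema_from_struct(c_decl, array_name, chunk=1000):
    fields = re.findall(r"\b(double|float|int|long)\s+(\w+)\s*;", c_decl)
    attrs = ", ".join(f"{name}:{C_TO_DB[ctype]}" for ctype, name in fields)
    return f"CREATE ARRAY {array_name} <{attrs}> [i=0:*,{chunk},0]"

decl = """
struct particle {
    double x; double y; double z;
    long   id;
};
"""
print(schema_from_struct(decl, "particles"))
# CREATE ARRAY particles <x:double, y:double, z:double, id:int64> [i=0:*,1000,0]
```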
Pham, Q. Malik, T. Glavic, B. Foster, I. , "LDV: Light-weight database virtualization", 2015 IEEE 31st International Conference on Data Engineering, pp. 1179-1190, 4, 2015
We present a light-weight database virtualization (LDV) system that allows users to share and re-execute applications that operate on a relational database (DB). Previous methods for sharing DB applications, such as companion websites and virtual machine images (VMIs), support neither easy and efficient re-execution nor the sharing of only a relevant DB subset. LDV addresses these issues by monitoring application execution, including DB operations, and using the resulting execution trace to create a lightweight re-executable package. An LDV package includes, in addition to the application, either the DB management system (DBMS) and relevant data or, if the DBMS and/or data cannot be shared, just the application-DBMS communications for replay during re-execution. We introduce a linked DB-operating system provenance model and show how to infer data dependencies based on temporal information …
@inproceedings{Malik-ICDE15-C19,
title = {LDV: Light-weight database virtualization},
author = {Pham, Quan and Malik, Tanu and Glavic, Boris and Foster, Ian},
booktitle = {2015 IEEE 31st International Conference on Data Engineering},
publisher = {IEEE},
year = 2015,
pages = {1179-1190},
}
2014
Catlett, C. Malik, T. Goldstein, B. Giuffrida, J. Shao, Y. Panella, A. Eder, D. Zanten, E. v. Mitchum, R. Thaler, S. Foster, I. T. , "Plenario: An Open Data Discovery and Exploration Platform for Urban Science.", IEEE Data Eng. Bull., vol. 37, pp. 27-42, 12, 2014
The past decade has seen the widespread release of open data concerning city services, conditions, and activities by government bodies and public institutions of all sizes. Hundreds of open data portals now host thousands of datasets of many different types. These new data sources represent enormous potential for improved understanding of urban dynamics and processes—and, ultimately, for more livable, efficient, and prosperous communities. However, those who seek to realize this potential quickly discover that finding and applying the data relevant to any particular question can be extraordinarily difficult, due to decentralized storage, heterogeneous formats, and poor documentation. In this context, we introduce Plenario, a platform designed to automate time-consuming tasks associated with the discovery, exploration, and application of open city data—and, in so doing, reduce barriers to data use for researchers, policymakers, service providers, journalists, and members of the general public. Key innovations include a geospatial data warehouse that allows data from many sources to be registered into a common spatial and temporal frame; simple and intuitive interfaces that permit rapid discovery and exploration of data subsets pertaining to a particular area and time, regardless of type and source; easy export of such data subsets for further analysis; a user-configurable data ingest framework for automated importing and periodic updating of new datasets into the data warehouse; cloud hosting for elastic scaling and rapid creation of new Plenario instances; and an open source implementation to enable community contributions …
@article{Malik-IEEE14-J1,
title = {Plenario: An Open Data Discovery and Exploration Platform for Urban Science.},
author = {Catlett, Charlie and Malik, Tanu and Goldstein, Brett and Giuffrida, Jonathan and Shao, Yetong and Panella, Alessandro and Eder, Derek and van Zanten, Eric and Mitchum, Robert and Thaler, Severin and Foster, Ian T},
journal = {IEEE Data Eng. Bull.},
year = 2014,
pages = {27-42},
}
Malik, T. Chard, K. Tchoua, R. B. Foster, I. , "GeoDataspaces: Simplifying Data Management Tasks with Globus", AGU Fall Meeting Abstracts, vol. 2014, pp. IN34B-08, 12, 2014
Data and its management are central to the modern scientific enterprise. Typically, geoscientists rely on observations and model output data from several disparate sources (file systems, RDBMS, spreadsheets, remote data sources). Integrated data management solutions that provide intuitive semantics and uniform interfaces, irrespective of the kind of data source, are, however, lacking. Consequently, geoscientists are left to conduct low-level, time-consuming data management tasks, individually and repeatedly for each data source, often resulting in handling errors. In this talk we will describe how the EarthCube GeoDataspace project is improving this situation for seismologists, hydrologists, and space scientists by simplifying some of the existing data management tasks that arise when developing computational models. We will demonstrate a GeoDataspace, bootstrapped with geounits, which are …
@article{Malik-AGU14-A7,
title = {GeoDataspaces: Simplifying Data Management Tasks with Globus},
author = {Malik, Tanu and Chard, Kyle and Tchoua, Roselyne B and Foster, Ian},
journal = {AGU Fall Meeting Abstracts},
year = 2014,
pages = {IN34B-08},
}
Pham, Q. Malik, T. Foster, I. , "Auditing and maintaining provenance in software packages", International Provenance and Annotation Workshop, pp. 97-109, 6, 2014
Science projects are increasingly investing in computational reproducibility. Constructing software pipelines to demonstrate reproducibility is also becoming increasingly common. To aid the process of constructing pipelines, science project members often adopt reproducible methods and tools. One such tool is CDE, which is a software packaging tool that encapsulates source code, datasets and environments. However, CDE does not include information about origins of dependencies. Consequently when multiple CDE packages are combined and merged to create a software pipeline, several issues arise requiring an author to manually verify compatibility of distributions, environment variables, software dependencies and compiler options. In this work, we propose software provenance to be included as part of CDE so that resulting provenance-included CDE packages can be easily used for creating …
@inproceedings{Malik-IPAW14-C17,
title = {Auditing and maintaining provenance in software packages},
author = {Pham, Quan and Malik, Tanu and Foster, Ian},
booktitle = {International Provenance and Annotation Workshop},
publisher = {Springer, Cham},
year = 2014,
pages = {97-109},
}
Malik, T. Chard, K. Foster, I. , "Benchmarking cloud-based tagging services", 2014 IEEE 30th International Conference on Data Engineering Workshops, pp. 231-238, 3, 2014
Tagging services have emerged as a useful and popular way to organize data resources. Despite popular interest, an efficient implementation of tagging services is a challenge since highly dynamic schemas and sparse, heterogeneous attributes must be supported within a shared, openly writable database. NoSQL databases support dynamic schemas and sparse data but lack efficient native support for joins that are inherent to query and search functionality in tagging services. Relational databases provide sufficient support for joins, but offer a multitude of options to manifest dynamic schemas and tune sparse data models, making evaluation of a tagging service time consuming and painful. In this case-study paper, we describe a benchmark for tagging services, and propose benchmarking modules that can be used to evaluate the suitability of a database for workloads generated from tagging services. We have …
@inproceedings{Malik-CloudDB14-C18,
title = {Benchmarking cloud-based tagging services},
author = {Malik, Tanu and Chard, Kyle and Foster, Ian},
booktitle = {2014 IEEE 30th International Conference on Data Engineering Workshops},
publisher = {IEEE},
year = 2014,
pages = {231-238},
}
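A minimal sketch of the workload shape discussed above, assuming a hypothetical key/value tags table in SQLite: tag search reduces to a self-join over a sparse, dynamically-schemaed table, which is exactly the operation the benchmark stresses across backends.

```python
# Sketch of a relational layout for a tagging workload and the tag-to-tag
# join that search queries generate. Schema and data are hypothetical;
# sqlite3 stands in for the benchmarked backends.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tags (resource TEXT, key TEXT, value TEXT);
    CREATE INDEX idx_kv ON tags(key, value);
""")
db.executemany("INSERT INTO tags VALUES (?, ?, ?)", [
    ("file1", "project", "climate"), ("file1", "format", "netcdf"),
    ("file2", "project", "climate"), ("file2", "format", "csv"),
])

# Resources tagged project=climate AND format=netcdf: a self-join on the
# sparse key/value table -- the operation NoSQL stores handle poorly natively.
rows = db.execute("""
    SELECT a.resource FROM tags a JOIN tags b ON a.resource = b.resource
    WHERE a.key='project' AND a.value='climate'
      AND b.key='format'  AND b.value='netcdf'
""").fetchall()
print(rows)   # [('file1',)]
```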
Malik, T. , "GeoBase: indexing NetCDF files for large-scale data analysis", Big data management, technologies, and applications, pp. 295-313, 2014
Data-rich scientific disciplines increasingly need end-to-end systems that ingest large volumes of data, make it quickly available, and enable processing and exploratory data analysis in a scalable manner. Key-value stores have attracted attention, since they offer highly available data storage, but must be engineered further for end-to-end support. In particular, key-value stores have minimal support for scientific data that resides in self-describing, array-based binary file formats and do not natively support scientific queries on multi-dimensional data. In this chapter, the authors describe GeoBase, which enables querying over scientific data by improving end-to-end support through two integrated, native components: a linearization-based index to enable rich scientific querying on multi-dimensional data and a plugin that interfaces key-value stores with array-based binary file formats. Experiments show that this end-to …
@book{Malik-BigData14-B2,
title = {GeoBase: indexing NetCDF files for large-scale data analysis},
author = {Malik, Tanu},
publisher = {IGI Global},
year = 2014,
pages = {295-313},
}
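The linearization idea above can be sketched with a Z-order (Morton) key, used here as a stand-in for GeoBase's index; the key-value store is just a Python dict and the grid is hypothetical.

```python
# Sketch of linearization for multi-dimensional scientific data: grid
# coordinates are interleaved into a single Z-order (Morton) key, so that
# a plain key-value store can serve range-style queries over nearby cells.
# This illustrates the idea only, not GeoBase's implementation.

def morton2d(x, y, bits=16):
    """Interleave the bits of x and y into one integer key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# "Ingest" a toy 2-D variable into a key-value store keyed by Morton code.
store = {}
for lat in range(4):
    for lon in range(4):
        store[morton2d(lon, lat)] = f"value({lat},{lon})"

# Cells that are close in space get nearby keys, so a subset query becomes
# a small number of contiguous key-range scans.
print(sorted(store)[:4])                # [0, 1, 2, 3] -> the 2x2 corner block
print(morton2d(1, 1), morton2d(2, 2))   # 3 12
```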
Malik, T. Pham, Q. Foster, I. T. Leisch, F. Peng, R. , "SOLE: towards descriptive and interactive publications", Implementing reproducible research, 2014
While authors corroborate descriptions through indirect means, such as by building companion websites that share data and software packages, these external websites remain disconnected from the content within the paper, making it difficult to verify claims and reproduce results. There is a critical need for systems that minimize this disconnect. We describe Science Object Linking and Embedding (SOLE), a framework for creating descriptive and interactive publications by linking them with associated science objects, such as source codes, datasets, annotations, workflows, re-playable packages, and virtual machine images. SOLE provides a suite of tools that assist the author to create and host science objects that can then be linked with research papers for the purpose of assessment, repeatability, and verification of research. The framework also creates a linkable representation of the science object with the publication and manages a bibliography-like specification of science objects. In this chapter, we introduce SOLE and describe its use for augmenting the content of computation-based scientific publications. We present examples from climate science, chemistry, biology, and computer science.
@article{Malik-SOLE14-B3,
title = {SOLE: towards descriptive and interactive publications},
author = {Malik, Tanu and Pham, Quan and Foster, Ian T and Leisch, F and Peng, RD},
journal = {Implementing reproducible research},
publisher = {CRC Press},
year = 2014,
}
2013
Whaling, R. Malik, T. Foster, I. , "Lens: a faceted browser for research networking platforms", 2013 IEEE 9th International Conference on e-Science, pp. 196-203, 10, 2013
Research networking platforms, such as VIVO and Profiles Networking, provide an information infrastructure for scholarship, representing information about research and researchers: their scholarly works, research interests, and organizational relationships. These platforms are open information infrastructures for scholarship, consisting of linked open data and open-source software tools for managing and visualizing scholarly information. Because such platforms are RDF-based, faceted browsing is a natural technique for navigating their data, partitioning the scholarly information space into orthogonal conceptual dimensions. However, this technique has so far been explored through limited queries in research networking platforms, not allowing, for instance, full graph-based navigation on RDF data. In this paper we present Lens, a client-side user interface for faceted navigation of scholarly RDF data. Lens is based on Exhibit, which is a …
@inproceedings{Malik-eScience13-C16,
title = {Lens: a faceted browser for research networking platforms},
author = {Whaling, Richard and Malik, Tanu and Foster, Ian},
booktitle = {2013 IEEE 9th International Conference on e-Science},
publisher = {IEEE},
year = 2013,
pages = {196-203},
}
Zhao, D. Shou, C. Malik, T. Raicu, I. , "Distributed data provenance for large-scale data-intensive computing", 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1-8, 9, 2013
It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overheads, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate design optimality: whether provenance metadata should be loosely coupled with or tightly integrated into a file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, but each focusing on one kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and …
@inproceedings{Malik-CLUSTER13-C15,
title = {Distributed data provenance for large-scale data-intensive computing},
author = {Zhao, Dongfang and Shou, Chen and Malik, Tanu and Raicu, Ioan},
booktitle = {2013 IEEE International Conference on Cluster Computing (CLUSTER)},
publisher = {IEEE},
year = 2013,
pages = {1-8},
}
Hereld, M. Malik, T. Vishwanath, V. , "Proactive Support for Large-Scale Data Exploration", 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp. 2025-2034, 5, 2013
Computational science is generating increasingly unwieldy datasets created by complex and high-resolution simulations of physical, social, and economic systems. Traditional post processing of such large datasets requires high bandwidth to large storage resources. In situ processing approaches can reduce I/O requirements but steal processing cycles from the simulation and forsake interactive data exploration. The Fusion project aims to develop a new approach for exploring large-scale scientific datasets wherein the system actively assists the user in the data exploration process. A key component of the system is a software assistant that evaluates the stated and implied analysis goals of the scientist, observes the environment, models and proposes actions to be taken, and orchestrates the generation of analysis and visualization products for the user. These products are managed and made available to the …
@inproceedings{Malik-IPDPSW13-W7,
title = {Proactive Support for Large-Scale Data Exploration},
author = {Hereld, Mark and Malik, Tanu and Vishwanath, Venkatram},
booktitle = {2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum},
publisher = {IEEE},
year = 2013,
pages = {2025-2034},
}
Shou, C. Zhao, D. Malik, T. Raicu, I. , "Towards a provenance-aware distributed filesystem", 5th Workshop on the Theory and Practice of Provenance (TaPP), 2013
It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overheads, and scalability. This paper presents a provenance-aware distributed filesystem that offers excellent scalability while keeping the provenance overhead negligible under certain conditions. This work integrates two recent research projects, SPADE (Support for Provenance Auditing in Distributed Environments) and FusionFS (Fusion distributed File System), with simple and efficient communication protocols. The preliminary results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and is expected to scale further, as FusionFS itself has been shown to scale.
@article{Malik-TaPP13-P3,
title = {Towards a provenance-aware distributed filesystem},
author = {Shou, Chen and Zhao, Dongfang and Malik, Tanu and Raicu, Ioan},
journal = {5th Workshop on the Theory and Practice of Provenance (TaPP)},
year = 2013,
}
Pham, Q. Malik, T. Foster, I. , "Using provenance for repeatability", 5th USENIX Workshop on the Theory and Practice of Provenance (TaPP 13), 2013
We present Provenance-To-Use (PTU), a tool that minimizes computation time during repeatability testing. Authors can use PTU to build a package that includes their software program and a provenance trace of an initial reference execution. Testers can select a subset of the package’s processes for a partial deterministic replay—based, for example, on their compute, memory and I/O utilization as measured during the reference execution. Using the provenance trace, PTU guarantees that events are processed in the same order using the same data from one execution to the next. We show the efficiency of PTU for conducting repeatability testing of workflow-based scientific programs.
@inproceedings{Malik-TaPP13-W8,
title = {Using provenance for repeatability},
author = {Pham, Quan and Malik, Tanu and Foster, Ian},
booktitle = {5th USENIX Workshop on the Theory and Practice of Provenance (TaPP 13)},
year = 2013,
}
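One plausible selection policy for partial replay, sketched below under assumptions: a hypothetical trace format records per-process CPU time and read-from edges, expensive processes are selected, and the set is closed under ancestry so replayed events keep their original order. PTU's actual selection and replay machinery is richer than this.

```python
# Sketch of choosing a subset of processes for partial replay from a
# provenance trace: pick the expensive processes a tester cares about,
# then add their ancestors so data dependencies are satisfied and events
# replay in their original order. The trace format below is hypothetical.

trace = {
    "fetch":    {"cpu_s": 2,   "reads_from": []},
    "convert":  {"cpu_s": 40,  "reads_from": ["fetch"]},
    "simulate": {"cpu_s": 900, "reads_from": ["convert"]},
    "plot":     {"cpu_s": 5,   "reads_from": ["simulate"]},
}

def replay_set(trace, cpu_threshold):
    selected = {p for p, info in trace.items() if info["cpu_s"] >= cpu_threshold}
    frontier = list(selected)
    while frontier:                      # close the set under ancestry
        p = frontier.pop()
        for parent in trace[p]["reads_from"]:
            if parent not in selected:
                selected.add(parent)
                frontier.append(parent)
    return selected

print(sorted(replay_set(trace, cpu_threshold=30)))
# ['convert', 'fetch', 'simulate']
```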
Malik, T. Gehani, A. Tariq, D. Zaffar, F. , "Sketching distributed data provenance", Data Provenance and Data Management in eScience, pp. 85-107, 2013
Users can determine the precise origins of their data by collecting detailed provenance records. However, auditing at a finer grain produces large amounts of metadata. To efficiently manage the collected provenance, several provenance management systems, including SPADE, record provenance on the hosts where it is generated. Distributed provenance raises the issue of efficient reconstruction during the query phase. Recursively querying provenance metadata or computing its transitive closure is known to have limited scalability and cannot be used for large provenance graphs. We present matrix filters, which are novel data structures for representing graph information, and demonstrate their utility for improving query efficiency with experiments on provenance metadata gathered while executing distributed workflow applications.
@book{Malik-eScience-B1,
title = {Sketching distributed data provenance},
author = {Malik, Tanu and Gehani, Ashish and Tariq, Dawood and Zaffar, Fareed},
publisher = {Springer, Berlin, Heidelberg},
year = 2013,
pages = {85-107},
}
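To illustrate why per-host summaries help distributed provenance queries, here is a Bloom-filter-style ancestry sketch: a query only recurses into hosts whose summary may contain the node of interest. It conveys the pruning idea but is a different construction from the matrix filters introduced in the chapter.

```python
# Sketch of using a small per-host filter to prune distributed ancestry
# queries: each host summarizes the provenance nodes reachable through it,
# and a query only recurses into hosts whose filter may contain the target.
import hashlib

class AncestrySketch:
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, item):
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def may_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

host_a = AncestrySketch()
for ancestor in ["raw.csv", "clean.py", "model.bin"]:
    host_a.add(ancestor)

print(host_a.may_contain("model.bin"))   # True
print(host_a.may_contain("other.dat"))   # False, with high probability
```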
2012
Malik, T. Foster, I. , "Addressing data access needs of the long-tail distribution of geoscientists", 2012 IEEE International Geoscience and Remote Sensing Symposium, pp. 5348-5351, 7, 2012
Data and computation are fundamental to advances in geoscience research and discovery. However, geoscientists currently spend too much time looking for the “right” data, subsequently accessing these data, and then transforming them into a form suitable for analysis. This data management overhead affects a scientist's competitive advantage in making useful contributions. Several cyber-infrastructure (CI) efforts are being undertaken to meet the data management needs of long-tail geoscientists. In this paper, we highlight characteristics of CI solutions that will form the basis for successful and widely adopted solutions.
@inproceedings{Malik-IGRASS12-O3,
title = {Addressing data access needs of the long-tail distribution of geoscientists},
author = {Malik, Tanu and Foster, Ian},
booktitle = {2012 IEEE International Geoscience and Remote Sensing Symposium},
publisher = {IEEE},
year = 2012,
pages = {5348-5351},
}
Pham, Q. Malik, T. Foster, I. Lauro, R. D. Montella, R. , "SOLE: linking research papers with science objects", International Provenance and Annotation Workshop, pp. 203-208, 6, 2012
We introduce Science Object Linking and Embedding (SOLE), a tool for linking research papers with associated science objects, such as source codes, datasets, annotations, workflows, packages, and virtual machine images. The objective of SOLE is to reduce the cost to an author of linking research papers with such science objects for the purpose of reproducible research. To this end, SOLE allows an author to use simple tags to delimit a science object to be associated with a research paper. It creates an adequate representation of the science object and manages a bibliography-like specification of science objects. Authors and readers can reference elements of this bibliography and associate them with phrases in the text of the research paper through a Web interface, in a similar manner to a traditional bibliography tool.
@inproceedings{Malik-IPAW12-W6,
title = {SOLE: linking research papers with science objects},
author = {Pham, Quan and Malik, Tanu and Foster, Ian and Di Lauro, Roberto and Montella, Raffaele},
booktitle = {International Provenance and Annotation Workshop},
publisher = {Springer, Berlin, Heidelberg},
year = 2012,
pages = {203-208},
}
Foster, I. Katz, D. S. Malik, T. Fox, P. , "Wagging the long tail of earth science: Why we need an earth science data web, and how to build it", 2012
Consider Alice, a geoscientist, who wants to investigate the role of sea surface temperatures (SSTs) on anomalous atmospheric circulations and associated precipitation in the tropics. She hypothesizes that nonlinear dynamics can help her model transport processes propagated long distances through the atmosphere or ocean, and asks a graduate student to obtain daily weather, land-cover, and other environmental data products that may be used to validate her hypothesis. Like the vast majority of NSF-funded researchers (see Table 1), Alice works with limited resources. Indeed, her laboratory comprises just herself, a couple of graduate students, an undergraduate, and a technician. In the absence of suitable expertise and infrastructure, the apparently simple task that she assigns to her graduate student becomes an information discovery and management nightmare. Data are either not available or are of poor quality. Downloading and transforming datasets takes weeks. Alice then faces new challenges. Will these new data enrich her compute-intensive model, or simply propagate errors? Or should they seek other, higher-resolution datasets? What software can she use to help answer these questions? We cannot blame Alice if she ultimately abandons this promising avenue of research.
@misc{Malik-EarthCube12-R2,
title = {Wagging the long tail of earth science: Why we need an earth science data web, and how to build it},
author = {Foster, Ian and Katz, Daniel S and Malik, Tanu and Fox, Peter},
year = 2012,
}
2011
Malik, T. Best, N. Elliott, J. Madduri, R. Foster, I. , "Improving the efficiency of subset queries on raster images", Proceedings of the ACM SIGSPATIAL Second International Workshop on High Performance and Distributed Geographic Information Systems, pp. 34-37, 11, 2011
We propose a parallel method to accelerate the performance of subset queries on raster images. The method, based on the map-reduce paradigm, incorporates two principles from database management systems to improve the performance of subset queries. First, we employ a column-oriented storage format for storing location and weather variables. Second, we improve data locality by storing multidimensional attributes such as space and time in a Hilbert order instead of a serial, row-wise order. We implement the principles in a map-reduce environment, maintaining compatibility with the replication and scheduling constraints. We show through experiments that the techniques improve data locality and increase performance of subset queries, respectively, by 5x and 2x.
@book{Malik-HPDGIS11-W5,
title = {Improving the efficiency of subset queries on raster images},
author = {Malik, Tanu and Best, Neil and Elliott, Joshua and Madduri, Ravi and Foster, Ian},
year = 2011,
pages = {34-37},
}
Gehani, A. Tariq, D. Baig, B. Malik, T. , "Policy-based integration of provenance metadata", 2011 IEEE International Symposium on Policies for Distributed Systems and Networks, pp. 149-152, 6, 2011
Reproducibility has been a cornerstone of the scientific method for hundreds of years. The range of sources from which data now originates, the diversity of the individual manipulations performed, and the complexity of the orchestrations of these operations all limit the reproducibility that a scientist can ensure solely by manually recording their actions. We use an architecture where aggregation, fusion, and composition policies define how provenance records can be automatically merged to facilitate the analysis and reproducibility of experiments. We show that the overhead of collecting and storing provenance metadata can vary dramatically depending on the policy used to integrate it.
@inproceedings{Malik-POLICY11-C14,
title = {Policy-based integration of provenance metadata},
author = {Gehani, Ashish and Tariq, Dawood and Baig, Basim and Malik, Tanu},
booktitle = {2011 IEEE International Symposium on Policies for Distributed Systems and Networks},
publisher = {IEEE},
year = 2011,
pages = {149-152},
}
2010
Malik, T. Nistor, L. Gehani, A. , "Tracking and sketching distributed data provenance", 2010 IEEE Sixth International Conference on e-Science, pp. 190-197, 12, 2010
Current provenance collection systems typically gather metadata on remote hosts and submit it to a central server. In contrast, several data-intensive scientific applications require a decentralized architecture in which each host maintains an authoritative local repository of the provenance metadata gathered on that host. The latter approach allows the system to handle the large amounts of metadata generated when auditing occurs at fine granularity, and allows users to retain control over their provenance records. The decentralized architecture, however, increases the complexity of auditing, tracking, and querying distributed provenance. We describe a system for capturing data provenance in distributed applications, and the use of provenance sketches to optimize subsequent data provenance queries. Experiments with data gathered from distributed workflow applications demonstrate the feasibility of a …
@inproceedings{Malik-eScience10-C13,
title = {Tracking and sketching distributed data provenance},
author = {Malik, Tanu and Nistor, Ligia and Gehani, Ashish},
booktitle = {2010 IEEE Sixth International Conference on e-Science},
publisher = {IEEE},
year = 2010,
pages = {190-197},
}
Malik, T. Wang, X. Little, P. Chaudhary, A. Thakar, A. , "A Dynamic Data Middleware cache for Rapidly-growing Scientific Repositories", ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, pp. 64-84, 11, 2010
Modern scientific repositories are growing rapidly in size. Scientists are increasingly interested in viewing the latest data as part of query results. Current scientific middleware systems, however, assume repositories are static. Thus, they cannot answer scientific queries with the latest data. The queries, instead, are routed to the repository until data at the middleware system is refreshed. In data-intensive scientific disciplines, such as astronomy, indiscriminate query routing or data refreshing often results in runaway network costs. This severely affects the performance and scalability of the repositories and makes poor use of the middleware system. We present Delta, a dynamic data middleware system for rapidly-growing scientific repositories. Delta’s key component is a decision framework that adaptively decouples data objects, keeping an object at the middleware when it is heavily queried and at the repository when it is heavily updated. Our algorithm profiles the incoming workload to search for an optimal data decoupling that reduces network costs. It leverages formal concepts from the network flow problem, and is robust to evolving scientific workloads. We evaluate the efficacy of Delta, through a prototype implementation, by running query traces collected from a real astronomy survey.
@inproceedings{Malik-Middleware10-C12,
title = {A Dynamic Data Middleware cache for Rapidly-growing Scientific Repositories},
author = {Malik, Tanu and Wang, Xiaodan and Little, Philip and Chaudhary, Amitabh and Thakar, Ani},
booktitle = {ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing},
publisher = {Springer, Berlin, Heidelberg},
year = 2010,
pages = {64-84},
}
Wang, X. Perlman, E. Burns, R. Malik, T. Budavári, T. Meneveau, C. Szalay, A. , "JAWS: Job-aware workload scheduling for the exploration of turbulence simulations", SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-11, 11, 2010
We present JAWS, a job-aware, data-driven batch scheduler that improves query throughput for data-intensive scientific database clusters. As datasets reach petabyte-scale, workloads that scan through vast amounts of data to extract features are gaining importance in the sciences. However, acute performance bottlenecks result when multiple queries execute simultaneously and compete for I/O resources. Our solution, JAWS, divides queries into I/O-friendly sub-queries for scheduling. It then identifies overlapping data requirements within the workload and executes sub-queries in batches to maximize data sharing and reduce redundant I/O. JAWS extends our previous work by supporting workflows in which queries exhibit data dependencies, exploiting workload knowledge to coordinate caching decisions, and combating starvation through adaptive and incremental trade-offs between query throughput and …
@inproceedings{Malik-SC10-C11,
title = {JAWS: Job-aware workload scheduling for the exploration of turbulence simulations},
author = {Wang, Xiaodan and Perlman, Eric and Burns, Randal and Malik, Tanu and Budavári, Tamas and Meneveau, Charles and Szalay, Alexander},
booktitle = {SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis},
publisher = {IEEE},
year = 2010,
pages = {1-11},
}
Venkatasubramanian, V. Malik, T. Giridhar, A. Villez, K. Prasad, R. Shukla, A. Rieger, C. Daum, K. McQueen, M. , "RNEDE: Resilient network design environment", 2010 3rd International Symposium on Resilient Control Systems, pp. 72-75, 8, 2010
Modern living is more and more dependent on the intricate web of critical infrastructure systems. The failure or damage of such systems can cause huge disruptions. Traditional design of this web of critical infrastructure systems was based on the principles of functionality and reliability. However, it is increasingly being realized that such design objectives are not sufficient. Threats, disruptions and faults often compromise the network, taking away the benefits of an efficient and reliable design. Thus, traditional network design parameters must be combined with self-healing mechanisms to obtain a resilient design of the network. In this paper, we present RNEDE, a resilient network design environment that not only optimizes the network for performance but also tolerates fluctuations in its structure resulting from external threats and disruptions. The environment evaluates a set of remedial actions to bring a compromised …
@inproceedings{Malik-ISCRS10-C10,
title = {RNEDE: Resilient network design environment},
author = {Venkatasubramanian, Venkat and Malik, Tanu and Giridhar, Arun and Villez, Kris and Prasad, Raghvendra and Shukla, Aviral and Rieger, Craig and Daum, Keith and McQueen, Miles},
booktitle = {2010 3rd International Symposium on Resilient Control Systems},
publisher = {IEEE},
year = 2010,
pages = {72-75},
}
Gehani, A. Kim, M. Malik, T. , "Efficient querying of distributed provenance stores", Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 613-621, 6, 2010
Current projects that automate the collection of provenance information use a centralized architecture for managing the resulting metadata; that is, provenance is gathered at remote hosts and submitted to a central provenance management service. In contrast, we are developing a completely decentralized system with each computer maintaining the authoritative repository of the provenance gathered on it. Our model has several advantages, such as scaling to large amounts of metadata generation, providing low-latency access to provenance metadata about local data, avoiding the need for synchronization with a central service after operating while disconnected from the network, and letting users retain control over their data provenance records. We describe the SPADE project's support for tracking data provenance in distributed environments, including how queries can be optimized with provenance sketches …
@book{Malik-CLADE10-W4,
title = {Efficient querying of distributed provenance stores},
author = {Gehani, Ashish and Kim, Minyoung and Malik, Tanu},
year = 2010,
pages = {613-621},
}
Malik, T. Prasad, R. Patil, S. Chaudhary, A. Venkatasubramanian, V. , "Providing scalable data services in ubiquitous networks", International Conference on Database Systems for Advanced Applications, pp. 445-457, 4, 2010
Topology is a fundamental part of a network that governs connectivity between nodes, the amount of data flow and the efficiency of data flow between nodes. In traditional networks, due to physical limitations, topology remains static for the course of the network operation. Ubiquitous data networks (UDNs), alternatively, are more adaptive and can be configured for changes in their topology. This flexibility in controlling their topology makes them very appealing and an attractive medium for supporting “anywhere, any place” communication. However, it raises the problem of designing a dynamic topology. The dynamic topology design problem is of particular interest to application service providers who need to provide cost-effective data services on a ubiquitous network. In this paper we describe algorithms that decide when and how the topology should be reconfigured in response to a change in the data …
@inproceedings{Malik-UDM10-W3,
title = {Providing scalable data services in ubiquitous networks},
author = {Malik, Tanu and Prasad, Raghvendra and Patil, Sanket and Chaudhary, Amitabh and Venkatasubramanian, Venkat},
booktitle = {International Conference on Database Systems for Advanced Applications},
publisher = {Springer, Berlin, Heidelberg},
year = 2010,
pages = {445-457},
}
2009
Wang, X. Burns, R. Malik, T. , "Liferaft: Data-driven, batch processing for the exploration of scientific databases", Conference on Innovative Database Research (CIDR), 9, 2009
Workloads that comb through vast amounts of data are gaining importance in the sciences. These workloads consist of needle-in-a-haystack queries that are long running and data intensive, so that query throughput limits performance. To maximize throughput for data-intensive queries, we put forth LifeRaft: a query processing system that batches queries with overlapping data requirements. Rather than scheduling queries in arrival order, LifeRaft executes queries concurrently against an ordering of the data that maximizes data sharing among queries. This decreases I/O and increases cache utility. However, such batch processing can increase query response time by starving interactive workloads. LifeRaft addresses starvation using techniques inspired by head scheduling in disk drives. Depending upon the workload saturation and queuing times, the system adaptively and incrementally trades off between processing queries in arrival order and data-driven batch processing. Evaluating LifeRaft in the SkyQuery federation of astronomy databases reveals a two-fold improvement in query throughput.
@article{Malik-CIDR09-C8,
title = {Liferaft: Data-driven, batch processing for the exploration of scientific databases},
author = {Wang, Xiaodan and Burns, Randal and Malik, Tanu},
journal = {Conference on Innovative Database Research (CIDR)},
year = 2009,
}
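A minimal sketch of the data-driven batching described above, assuming a hypothetical mapping from pending queries to the data regions they scan; regions are processed in order of demand and each scan feeds every interested query. LifeRaft's starvation controls and disk-head-style scheduling are omitted.

```python
# Sketch of data-driven batching: instead of running queries in arrival
# order, scan data regions in order of how many pending queries need them,
# and serve every interested query from the same scan.
from collections import Counter

# each query declares the data regions (e.g., chunks of a sky survey) it scans
pending = {
    "q1": {"r1", "r2"},
    "q2": {"r2", "r3"},
    "q3": {"r2"},
}

demand = Counter(r for regions in pending.values() for r in regions)
scan_order = [r for r, _ in demand.most_common()]   # most-wanted regions first
print(scan_order)           # 'r2' comes first: it is needed by all three queries

for region in scan_order:
    consumers = [q for q, regions in pending.items() if region in regions]
    print(f"scan {region} once, feed {consumers}")
```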
Malik, T. Wang, X. Dash, D. Chaudhary, A. Ailamaki, A. Burns, R. , "Adaptive physical design for curated archives", International Conference on Scientific and Statistical Database Management, pp. 148-166, 6, 2009
We introduce AdaptPD, an automated physical design tool that improves database performance by continuously monitoring changes in the workload and adapting the physical design to suit the incoming workload. Current physical design tools are offline and require specification of a representative workload. AdaptPD is “always on” and incorporates online algorithms which profile the incoming workload to calculate the relative benefit of transitioning to an alternative design. Efficient query and transition cost estimation modules allow AdaptPD to quickly decide between various design configurations. We evaluate AdaptPD with the SkyServer Astronomy database using queries submitted by SkyServer’s users. Experiments show that AdaptPD adapts to changes in the workload, improves query performance substantially over offline tools, and introduces minor computational overhead.
@inproceedings{Malik-SSDBM09-C9,
title = {Adaptive physical design for curated archives},
author = {Malik, Tanu and Wang, Xiaodan and Dash, Debabrata and Chaudhary, Amitabh and Ailamaki, Anastasia and Burns, Randal},
booktitle = {International Conference on Scientific and Statistical Database Management},
publisher = {Springer, Berlin, Heidelberg},
year = 2009,
pages = {148-166},
}
2008
Krishnamurthy, B. Malik, T. Stamatis, S. Venkatasubramanian, V. Caruthers, J. , "Rule-based classification systems for informatics", 2008 IEEE Fourth International Conference on eScience, pp. 420-421, 12, 2008
Classification of data is an important step in the knowledge evolution of sciences. Traditionally, in the sciences, classification of data was performed by human experts. Human knowledge can recognize unique functional properties that are necessary and sufficient to place complex structures and phenomena into a particular class or group. However, with the growth in scientific data and rapid changes in knowledge, it is no longer feasible for humans to classify objects. Automation of the classification process is necessary to cope with the growing amount of data; otherwise, classification will become the rate-limiting step for scientific data analysis. In this paper, we address the needs of such automation in the SciAEther project and develop ChES, a fast and reproducible framework for classifying molecules in chemical data. Our framework captures human understanding through an ontology and the diversity in classification …
@inproceedings{Malik-eScience08-P1,
title = {Rule-based classification systems for informatics},
author = {Krishnamurthy, Balachander and Malik, Tanu and Stamatis, Stephen and Venkatasubramanian, Venkat and Caruthers, J},
booktitle = {2008 IEEE Fourth International Conference on eScience},
publisher = {IEEE},
year = 2008,
pages = {420-421},
}
Malik, T. Burns, R. , "Workload-Aware histograms for remote applications", International Conference on Data Warehousing and Knowledge Discovery, pp. 402-412, 9, 2008
Recently, several database-backed applications have emerged that are remote from data sources and need accurate histograms for query cardinality estimation. Traditional approaches for constructing histograms require complete access to data and are I/O and network intensive, and therefore no longer apply to these applications. Recent approaches use queries and their feedback to construct and maintain “workload aware” histograms. However, these approaches either employ heuristics, thereby providing no guarantees on the overall histogram accuracy, or rely on detailed query feedback, thus making them too expensive to use. In this paper, we propose a novel, incremental method for constructing histograms that uses minimum feedback and guarantees minimum overall residual error. Experiments on real, high-dimensional data show 30-40% higher estimation accuracy over currently known …
@inproceedings{Malik-DaWAK08-C7,
title = {Workload-Aware histograms for remote applications},
author = {Malik, Tanu and Burns, Randal},
booktitle = {International Conference on Data Warehousing and Knowledge Discovery},
publisher = {Springer, Berlin, Heidelberg},
year = 2008,
pages = {402-412},
}
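The feedback-driven idea can be sketched as follows, assuming equi-width buckets and a proportional correction of the buckets a query touched; this is a simplification of the paper's minimum-residual-error method, not its algorithm.

```python
# Sketch of feedback-driven histogram maintenance: after each query, the
# buckets that the query's range touched are scaled so the histogram's
# estimate moves toward the observed cardinality.

buckets = [{"lo": i * 10, "hi": (i + 1) * 10, "count": 100.0} for i in range(10)]

def estimate(lo, hi):
    total = 0.0
    for b in buckets:
        overlap = max(0, min(hi, b["hi"]) - max(lo, b["lo"]))
        total += b["count"] * overlap / (b["hi"] - b["lo"])
    return total

def feedback(lo, hi, actual):
    est = estimate(lo, hi)
    if est == 0:
        return
    ratio = actual / est
    for b in buckets:
        if min(hi, b["hi"]) > max(lo, b["lo"]):      # bucket overlaps the query range
            b["count"] *= ratio

print(estimate(0, 20))        # 200.0 before feedback
feedback(0, 20, actual=500)   # query feedback says the true count was 500
print(estimate(0, 20))        # 500.0 after the update
```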
Malik, T. Wang, X. Burns, R. Dash, D. Ailamaki, A. , "Automated physical design in database caches", 2008 IEEE 24th International Conference on Data Engineering Workshop, pp. 27-34, 4, 2008
The performance of proxy caches for database federations that serve a large number of users is crucially dependent on their physical design. Current techniques, automated or otherwise, for physical design depend on the identification of a representative workload. In proxy caches, however, such techniques are inadequate since workload characteristics change rapidly. This is remarkably shown at the proxy cache of SkyQuery, an Astronomy federation, which receives a continuously evolving workload. We present novel techniques for automated physical design that adapt with the workload and balance the performance benefits of physical design decisions with the cost of implementing these decisions. These include both competitive and incremental algorithms that optimize the combined cost of query evaluation and making physical design changes. Our techniques are general in that they do not make assumptions …
@inproceedings{Malik-SMDB08-W2,
title = {Automated physical design in database caches},
author = {Malik, Tanu and Wang, Xiaodan and Burns, Randal and Dash, Debabrata and Ailamaki, Anastasia},
booktitle = {2008 IEEE 24th International Conference on Data Engineering Workshop},
publisher = {IEEE},
year = 2008,
pages = {27-34},
}
Malik, T. , "Large scale data management for the sciences", 2008
Traditional enterprises and novel scientific applications are accumulating petabyte-scale datasets, which makes the need for large-scale data management more pressing than ever. Geographic distribution of the datasets accompanied by complex demands on data makes large-scale data management challenging. This is especially true for sciences that model complex physical and biological phenomena using data from multiple sources.
@misc{Malik-Thesis08-Th1,
title = {Large scale data management for the sciences},
author = {Malik, Tanu},
year = 2008,
}
2007
Wang, X. Malik, T. Burns, R. Papadomanolakis, S. Ailamaki, A. , "A workload-driven unit of cache replacement for mid-tier database caching", International Conference on Database Systems for Advanced Applications, pp. 374-385, 4, 2007
Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as Biology and Astronomy. However, contention at a centralized, backend database is a major performance bottleneck, limiting the scalability of Internet-based, database applications. Mid-tier caching reduces contention at the backend database by distributing database operations to the cache. To improve the performance of mid-tier caches, we propose the caching of query prototypes, a workload-driven unit of cache replacement in which the cache object is chosen from various classes of queries in the workload. In existing mid-tier caching systems, the storage organization in the cache is statically defined. Our approach adapts cache storage to workload changes, requires no prior knowledge about the workload, and is transparent to the application. Experiments over a one-month, 1 …
@inproceedings{Malik-DASFAA07-C6,
title = {A workload-driven unit of cache replacement for mid-tier database caching},
author = {Wang, Xiaodan and Malik, Tanu and Burns, Randal and Papadomanolakis, Stratos and Ailamaki, Anastassia},
booktitle = {International Conference on Database Systems for Advanced Applications},
publisher = {Springer, Berlin, Heidelberg},
year = 2007,
pages = {374-385},
}
Malik, T. Burns, R. C. Chawla, N. V. , "A Black-Box Approach to Query Cardinality Estimation.", CIDR, pp. 56-67, 1, 2007
We present a “black-box” approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points in a high-dimensional input space constructed from the query’s attributes, operators, function arguments, aggregates, and constants. We envision an increasing need for such an approach in applications in which query cardinality is required for resource optimization and decision-making at locations that are remote from the data sources. Our primary case study is the Open SkyQuery federation of Astronomy archives, which uses a scheduling and caching mechanism at the mediator for execution of federated queries at remote sources. Experiments using real workloads show that the black-box approach produces accurate estimates and is frugal in its use of space and in computation resources. Also, the black-box approach provides dramatic improvements in the performance of caching in Open SkyQuery.
@inproceedings{Malik-CIDR07-C5,
title = {A Black-Box Approach to Query Cardinality Estimation.},
author = {Malik, Tanu and Burns, Randal C and Chawla, Nitesh V},
booktitle = {CIDR},
year = 2007,
pages = {56-67},
}
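A minimal sketch of the black-box approach under assumptions: queries are keyed by a syntactic template, past (parameters, observed cardinality) feedback is stored per template, and a new query is estimated from its nearest recorded neighbor. CAROT's actual models are classification and regression learners, not the 1-NN stand-in used here.

```python
# Sketch of black-box cardinality estimation: group queries by syntactic
# template and estimate a new query from the closest past query of the
# same family, using only observed feedback (no data or plan access).
from collections import defaultdict

history = defaultdict(list)   # template -> list of (params, observed_cardinality)

def record(template, params, cardinality):
    history[template].append((params, cardinality))

def estimate(template, params):
    points = history[template]
    if not points:
        return None                       # no feedback for this family yet
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, params))
    nearest = min(points, key=lambda pc: dist(pc[0]))
    return nearest[1]

# template: "SELECT * FROM photo WHERE ra BETWEEN ? AND ? AND mag < ?"
record("photo_cone", (10.0, 11.0, 20.0), 4200)
record("photo_cone", (10.0, 10.1, 18.0), 350)
print(estimate("photo_cone", (10.0, 10.2, 18.5)))   # 350 (closest past query)
```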
2006
Malik, T. Burns, R. Chawla, N. V. Szalay, A. , "Estimating query result sizes for proxy caching in scientific database federations", SC'06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, pp. 36-36, 11, 2006
In a proxy cache for federations of scientific databases it is important to estimate the size of a query before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. On the other extreme, inaccurate estimates can render the cache totally ineffective. We present classification and regression over templates (CAROT), a general method for estimating query result sizes, which is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but from observing workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and …
@inproceedings{Malik-SC06-C4,
title = {Estimating query result sizes for proxy caching in scientific database federations},
author = {Malik, Tanu and Burns, Randal and Chawla, Nitesh V and Szalay, Alex},
booktitle = {SC'06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing},
publisher = {IEEE},
year = 2006,
pages = {36-36},
}
2005
Malik, T. Burns, R. Chaudhary, A. , "Bypass caching: Making scientific databases good network citizens", 21st International Conference on Data Engineering (ICDE'05), pp. 94-105, 4, 2005
Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We develop the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We present an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also develop on-line algorithms that make no assumptions about workload: a k-competitive …
@inproceedings{Malik-ICDE05-C3,
title = {Bypass caching: Making scientific databases good network citizens},
author = {Malik, Tanu and Burns, Randal and Chaudhary, Amitabh},
booktitle = {21st International Conference on Data Engineering (ICDE'05)},
publisher = {IEEE},
year = 2005,
pages = {94-105},
}
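The bypass-versus-cache decision described above can be illustrated with a simple rent-to-own style heuristic; this is an assumption for exposition, not the paper's adaptive or k-competitive algorithms.

```python
# Illustrative bypass-vs-cache decision: keep bypassing queries to the source
# until the bytes shipped for an object rival the cost of loading it into the cache.
class BypassYieldCache:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.cached = {}        # object id -> object size
        self.bypassed = {}      # object id -> bytes shipped while not cached

    def on_query(self, obj_id: str, obj_size: int, result_size: int) -> str:
        if obj_id in self.cached:
            return "serve-from-cache"            # no wide-area traffic for this object
        self.bypassed[obj_id] = self.bypassed.get(obj_id, 0) + result_size
        # Admit the object only once bypass traffic rivals the cost of loading it,
        # and only if it fits; otherwise keep bypassing and let the source answer.
        if self.bypassed[obj_id] >= obj_size and self._fits(obj_size):
            self.cached[obj_id] = obj_size
            return "load-and-cache"
        return "bypass-to-source"

    def _fits(self, size: int) -> bool:
        return sum(self.cached.values()) + size <= self.capacity
```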
Batsakis, A., Malik, T. and Terzis, A., "Practical passive lossy link inference", International Workshop on Passive and Active Network Measurement, pp. 362-367, 2005
We propose a practical technique for the identification of lossy network links. Our scheme is based on a function that computes the likelihood of each link to be lossy. This function mainly depends on the number of times a link appears in lossy paths and on the relative loss rates of these paths. Preliminary simulation results show that our solution achieves accuracy comparable to statistical methods (e.g. Bayesian) at significantly lower running time.
@inproceedings{Malik-PAN05-W1,
title = {Practical passive lossy link inference},
author = {Batsakis, Alexandros and Malik, Tanu and Terzis, Andreas},
booktitle = {International Workshop on Passive and Active Network Measurement},
publisher = {Springer, Berlin, Heidelberg},
year = 2005,
pages = {362-367},
}
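The likelihood function sketched in the abstract can be approximated, for illustration only, by a score that accumulates path loss rates over the links of each lossy path; the exact scoring in the paper differs.

```python
# Illustrative link scoring: a link is more suspect the more often it appears
# on lossy paths, weighted by how lossy those paths are.
from collections import defaultdict

def score_links(paths):
    """paths: iterable of (list_of_links, path_loss_rate)."""
    score = defaultdict(float)
    for links, loss_rate in paths:
        if loss_rate <= 0:
            continue                      # lossless paths contribute no evidence
        for link in links:
            score[link] += loss_rate      # appearances on lossier paths count more
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

# Example: link "b-c" is shared by both lossy paths and surfaces at the top.
observations = [
    (["a-b", "b-c"], 0.05),
    (["b-c", "c-d"], 0.08),
    (["a-b", "b-e"], 0.00),
]
print(score_links(observations))
```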
2002
Szalay, A. S., Budavári, T., Malik, T., Gray, J. and Thakar, A. R., "Web services for the virtual observatory", Virtual Observatories, vol. 4846, pp. 124-132, 2002
Web Services form a new, emerging paradigm to handle distributed access to resources over the Internet. There are platform independent standards (SOAP, WSDL), which make the developers' task considerably easier. This article discusses how web services could be used in the context of the Virtual Observatory. We envisage a multi-layer architecture, with interoperating services. A well-designed lower layer consisting of simple, standard services implemented by most data providers will go a long way towards establishing a modular architecture. More complex applications can be built upon this core layer. We present two prototype applications, the SdssCutout and the SkyQuery as examples of this layered architecture.
@inproceedings{Malik-SPIE02-O2,
title = {Web services for the virtual observatory},
author = {Szalay, Alexander S. and Budavári, Tamás and Malik, Tanu and Gray, Jim and Thakar, Ani R.},
booktitle = {Virtual Observatories},
publisher = {SPIE},
year = 2002,
pages = {124-132},
}
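A hedged sketch of the layered-service idea above: each archive exposes the same small core operation, and a higher-level service is composed purely from those calls. The endpoint URLs and the JSON-over-HTTP transport are hypothetical placeholders; the prototypes described in the paper used SOAP/WSDL services.

```python
# Layered services, illustrated: a core layer of identical per-archive operations,
# and a higher layer composed from core calls. Endpoints below are hypothetical.
import requests

CORE_SERVICES = {
    "archive_a": "https://example.org/archive_a/conesearch",   # placeholder URLs
    "archive_b": "https://example.org/archive_b/conesearch",
}

def cone_search(archive: str, ra: float, dec: float, radius: float):
    """Core-layer call: one archive, one standard operation."""
    resp = requests.get(CORE_SERVICES[archive],
                        params={"ra": ra, "dec": dec, "sr": radius}, timeout=30)
    resp.raise_for_status()
    return resp.json()

def federated_cone_search(ra: float, dec: float, radius: float) -> dict:
    """Higher-layer service built entirely from core-layer calls."""
    return {name: cone_search(name, ra, dec, radius) for name in CORE_SERVICES}
```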
Malik, T., Szalay, A. S., Budavari, T. and Thakar, A. R., "SkyQuery: A WebService approach to federate databases", arXiv preprint cs/0211023, 2002
Traditional science searched for new objects and phenomena that led to discoveries. Tomorrow's science will combine the large pool of information in scientific archives to make discoveries. Scientists are currently keen to federate the existing scientific databases. The major challenge in building a federation of these autonomous and heterogeneous databases is system integration. Ineffective integration will result in defunct federations and underutilized scientific data. Astronomy, in particular, has many autonomous archives spread over the Internet. It is now seeking to federate these, with minimal effort, into a Virtual Observatory that will solve complex distributed computing tasks such as answering federated spatial join queries. In this paper, we present SkyQuery, a successful prototype of an evolving federation of astronomy archives. It interoperates using the emerging Web services standard. We describe the SkyQuery architecture and show how it efficiently evaluates a probabilistic federated spatial join query.
@article{Malik-CIDR02-C2,
title = {SkyQuery: A WebService approach to federate databases},
author = {Malik, Tanu and Szalay, Alex S. and Budavari, Tamas and Thakar, Ani R.},
journal = {arXiv preprint cs/0211023},
year = 2002,
}
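A toy illustration of the spatial join at the heart of SkyQuery: match objects from two archives whose angular separation falls within a tolerance. The brute-force pairing and fixed tolerance below are simplifying assumptions; the real system distributes this work across the archives and computes probabilistic match scores.

```python
# Toy cross-match for a federated spatial join: pair objects from two archives
# whose great-circle separation is below a tolerance (1 arcsecond by default).
from math import radians, sin, cos, acos, degrees

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in degrees."""
    a = (sin(radians(dec1)) * sin(radians(dec2)) +
         cos(radians(dec1)) * cos(radians(dec2)) * cos(radians(ra1 - ra2)))
    return degrees(acos(min(1.0, max(-1.0, a))))   # clamp to guard rounding error

def cross_match(archive_a, archive_b, tolerance_deg=1.0 / 3600):
    """archive_*: lists of (object_id, ra_deg, dec_deg) tuples."""
    matches = []
    for id_a, ra_a, dec_a in archive_a:
        for id_b, ra_b, dec_b in archive_b:
            if angular_sep_deg(ra_a, dec_a, ra_b, dec_b) <= tolerance_deg:
                matches.append((id_a, id_b))
    return matches
```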
Szalay, A. S., Gray, J., Thakar, A. R., Kunszt, P. Z., Malik, T., Raddick, J. and Stoughton, C., "The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data", ACM Special Interest Group on Management of Data (SIGMOD), pp. 570-581, 2002
@article{Malik-SIGMOD02-C1,
title = {The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data},
author = {Szalay, Alexander S. and Gray, Jim and Thakar, Ani R. and Kunszt, Peter Z. and Malik, Tanu and Raddick, Jordan and Stoughton, Christopher},
journal = {ACM Special Interest Group on Management of Data (SIGMOD)},
year = 2002,
pages = {570-581},
}