Publication: A Unified Framework for Collaborative Knowledge Graph Construction, Editing, and Distribution
Abstract
Knowledge graphs (KGs) have emerged as a critical technology for grounding artificial intelligence systems in structured facts, offering a solution to the hallucination and reliability issues plaguing large language models (LLMs). Despite their utility, the infrastructure required to construct, store, version, and collaboratively edit large-scale KGs remains fragmented. Previous work has addressed individual aspects of graph management but has failed to provide a unified, version-controlled ecosystem that supports the property-rich graphs required by modern applications. To address this infrastructure gap, this thesis introduces a comprehensive framework comprising four integrated systems: Optimus, a reproducible pipeline for graph construction; Diamond, a novel lossless binary compression format; GitGraph, a semantic version control system; and GraphEnv, an environment for multi-agent collaboration. We implemented this framework to enable the end-to-end lifecycle of graph development, from initial data ingestion to downstream applications. We utilized Optimus to construct OptimusKG, a biomedical KG with 192,307 nodes, 21.5M edges, and 88.6M properties, demonstrating a 56.5% reduction in build time through parallel execution. To address storage bottlenecks, we developed the Diamond algorithm, which we benchmarked against standard formats, achieving a 34× compression ratio on the popular PrimeKG dataset while preserving all node and edge properties. Furthermore, we formalized the theory of graph versioning by developing a three-way merge algorithm that allows for semantic, structure-aware conflict resolution, enabling true distributed collaboration. Finally, we integrated these tools into GRENCE, a clinical decision support application that uses our infrastructure to ground LLM reasoning in verifiable medical data.
This work establishes a robust software engineering foundation for KGs, transforming them from static artifacts into dynamic, evolving knowledge stores that can be efficiently maintained by hybrid teams of human experts and autonomous agents.
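To make the idea of a structure-aware three-way merge concrete, the following is a minimal illustrative sketch, not the GitGraph algorithm itself, whose details are not given in this abstract. It assumes a hypothetical representation in which a graph is a dictionary mapping element identifiers (node IDs or edge triples) to property dictionaries. The names `three_way_merge` and `_pick` are invented for illustration. Each property is reconciled against the common ancestor (`base`), and a conflict is reported only when both sides changed the same property, or when one side deleted an element that the other modified.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

# Hypothetical representation (an assumption, not the thesis's data model):
# element id (node id or (src, relation, dst) edge triple) -> property dict.
Props = Dict[str, object]
Graph = Dict[Hashable, Props]

_MISSING = object()  # sentinel for "property absent on this side"


def _pick(base, ours, theirs):
    """Classic three-way rule for one value. Returns (value, is_conflict)."""
    if ours == theirs:
        return ours, False      # both sides agree
    if ours == base:
        return theirs, False    # only theirs changed
    if theirs == base:
        return ours, False      # only ours changed
    return ours, True           # both changed differently


def three_way_merge(
    base: Graph, ours: Graph, theirs: Graph
) -> Tuple[Graph, Set[Tuple[Hashable, Optional[str]]]]:
    """Merge two edited graphs against their common ancestor.

    Returns the merged graph and a set of (element, property) conflicts;
    property is None when the conflict concerns the whole element.
    """
    merged: Graph = {}
    conflicts: Set[Tuple[Hashable, Optional[str]]] = set()
    for elem in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(elem), ours.get(elem), theirs.get(elem)
        if o is None or t is None:
            # Added or deleted on at least one side: compare whole elements.
            value, clash = _pick(b, o, t)
            if clash:  # deleted on one side, modified on the other
                conflicts.add((elem, None))
                value = o if o is not None else t  # keep the surviving copy
            if value is not None:
                merged[elem] = dict(value)
            continue
        # Present on both sides: reconcile property by property.
        b = b or {}
        props: Props = {}
        for key in set(b) | set(o) | set(t):
            value, clash = _pick(
                b.get(key, _MISSING), o.get(key, _MISSING), t.get(key, _MISSING)
            )
            if clash:
                conflicts.add((elem, key))  # ours is kept as a placeholder
            if value is not _MISSING:
                props[key] = value
        merged[elem] = props
    return merged, conflicts


# Usage: two collaborators edit different properties of the same node.
base = {
    "n1": {"name": "aspirin", "class": "NSAID"},
    "n2": {"name": "pain"},
    ("n1", "treats", "n2"): {"source": "db1"},
}
ours = {"n1": {"name": "Aspirin", "class": "NSAID"}, "n2": {"name": "pain"}}
theirs = {
    "n1": {"name": "aspirin", "class": "salicylate"},
    "n2": {"name": "pain"},
    "n3": {"name": "fever"},
    ("n1", "treats", "n2"): {"source": "db1"},
}
merged, conflicts = three_way_merge(base, ours, theirs)
```

In this example the two edits to `n1` touch different properties, so they combine without conflict, the edge deleted by one side stays deleted, and the node added by the other side is kept. A line-based text merge of a serialized graph would not generally give this result, which is the motivation for merging at the level of graph elements and properties.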