An automated data extraction pipeline built for Solidigm's AI debugging tool, processing unstructured engineering documents and service tickets to feed RAG models, with over 99% metadata-extraction accuracy.
During Spring 2025, our CodeLab UC Davis team partnered with Solidigm to build a data extraction pipeline as the foundation for their AI debugging tool. Working with lead engineer Ali Hashim and support from Jasmin Vora and Charles Anyimi, we developed an automated system to streamline data querying and improve service ticket metadata storage efficiency.
Solidigm needed to develop an AI-powered diagnostic tool using Retrieval-Augmented Generation (RAG) architecture. The challenge was building a pipeline to accurately extract, clean, classify, and vectorize raw, unstructured engineering data from PDFs and service tickets.
We built a comprehensive pipeline using Dataiku as the platform, LlamaIndex for PDF processing, and custom JavaScript interfaces. The system processes both structured CSV data from Jira and unstructured PDF documents, achieving over 99% accuracy in metadata extraction.
Successfully delivered a scalable, automated pipeline with a dual-view plugin interface: Settings/Dashboard for configuration and Extraction View for quality control. The system provides full transparency and manual correction capabilities for long-term maintainability.
The Settings/Dashboard interface allows engineers to control metadata categories and customize keyword lists for improved extraction accuracy.
The Extraction View provides side-by-side comparison of raw documents and extracted metadata for quality control and manual corrections.
The CodeLab UC Davis team collaborating on the Solidigm AI data extraction pipeline project.
Conducted direct interviews with Ali Hashim to identify workflows and pain points in service ticket management. Key insights included the need for customizable extraction criteria, visibility into accuracy, and streamlined review processes.
Created wireframes for dual-view structure: Settings/Dashboard for configuration and Extraction Review for quality control. Developed lo-fi to hi-fi prototypes matching Solidigm's visual identity, focusing on minimizing friction for technical users.
Built the pipeline on Dataiku's existing platform, implemented PDF processing with LlamaIndex and PDFPlumber, and created custom JavaScript interfaces. Achieved over 99% accuracy through rigorous experimentation and a modular pipeline architecture that handles diverse document formats.
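The modular architecture can be illustrated as a chain of independent stages, each taking and returning the same record type so stages can be reordered or swapped per document format. This is a minimal sketch with hypothetical stage logic and field names, not the team's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """A document flowing through the pipeline stages."""
    raw_text: str
    metadata: dict = field(default_factory=dict)

def extract(record: Record) -> Record:
    # Stage 1: pull the first line out as a hypothetical title field.
    record.metadata["title"] = record.raw_text.splitlines()[0].strip()
    return record

def clean(record: Record) -> Record:
    # Stage 2: normalize whitespace in the body text.
    record.raw_text = " ".join(record.raw_text.split())
    return record

def classify(record: Record) -> Record:
    # Stage 3: tag the record with a coarse category via a keyword rule.
    record.metadata["category"] = (
        "firmware" if "firmware" in record.raw_text.lower() else "general"
    )
    return record

PIPELINE = [extract, clean, classify]

def run_pipeline(record: Record) -> Record:
    """Apply each stage in order; stages stay testable in isolation."""
    for stage in PIPELINE:
        record = stage(record)
    return record
```

Because each stage is a plain function over a shared record type, experimenting with a new extraction heuristic means swapping one entry in the stage list rather than rewriting the pipeline.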
Handles both structured CSV data from Jira service tickets and unstructured PDF engineering documents using specialized processing pipelines for each format type.
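Dual-format handling comes down to routing each input to the parser that matches its structure. A sketch of that dispatch, assuming file-extension routing (the Jira column names and paragraph-chunking rule here are illustrative, and a real pipeline would obtain PDF text via a library such as PDFPlumber):

```python
import csv
import io
from pathlib import Path

def parse_jira_csv(csv_text: str) -> list[dict]:
    """Structured path: Jira exports arrive as CSV rows with named columns."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def parse_pdf_text(text: str) -> list[dict]:
    """Unstructured path: treat each paragraph of already-extracted
    PDF text as one chunk for downstream metadata extraction."""
    return [{"chunk": p.strip()} for p in text.split("\n\n") if p.strip()]

def route(path: str, content: str) -> list[dict]:
    """Dispatch a document to the processing pipeline for its format."""
    if Path(path).suffix.lower() == ".csv":
        return parse_jira_csv(content)
    return parse_pdf_text(content)
```

Both paths emit the same `list[dict]` shape, so everything downstream of the router is format-agnostic.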
Achieves over 99% accuracy in metadata extraction through a combination of rule-based logic, semantic parsing, and model-driven classification with custom heuristics for document processing.
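The rule-based and heuristic layers can be sketched as a regex rule for structured identifiers plus keyword voting for a category. The ticket-key pattern, keyword lists, and category names below are assumptions for illustration, not Solidigm's actual rules:

```python
import re

# Hypothetical keyword lists per metadata category; in the real system
# these lists are user-configurable from the Settings/Dashboard view.
CATEGORY_KEYWORDS = {
    "thermal": ["overheat", "temperature", "thermal"],
    "firmware": ["firmware", "fw update", "bootloader"],
    "performance": ["latency", "throughput", "iops"],
}

def extract_metadata(text: str) -> dict:
    """Rule-based pass: regex for ticket IDs, keyword voting for category."""
    lowered = text.lower()
    # Rule: match Jira-style ticket keys such as SSD-1234.
    ticket_ids = re.findall(r"\b[A-Z]{2,}-\d+\b", text)
    # Heuristic: score each category by keyword hits; take the top scorer.
    scores = {
        cat: sum(kw in lowered for kw in kws)
        for cat, kws in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    category = best if scores[best] > 0 else "uncategorized"
    return {"ticket_ids": ticket_ids, "category": category}
```

In a full pipeline, documents that fall through to "uncategorized" would be handed to the model-driven classifier rather than left untagged.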
Settings/Dashboard interface enables users to define metadata categories, add custom keywords, and configure extraction parameters without requiring backend modifications.
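"Without backend modifications" implies the categories and keyword lists live in data rather than code. A minimal sketch of that idea, assuming a JSON settings payload saved by the dashboard (the field names `categories` and `min_keyword_hits` are hypothetical):

```python
import json

# Hypothetical settings payload as the Settings/Dashboard view might save it:
# extraction behavior is driven entirely by this data, not by backend code.
SETTINGS_JSON = """
{
  "categories": {
    "thermal": ["overheat", "thermal"],
    "power": ["power loss", "voltage"]
  },
  "min_keyword_hits": 1
}
"""

def load_settings(raw: str) -> dict:
    settings = json.loads(raw)
    # Basic validation so a malformed edit from the UI fails loudly.
    assert isinstance(settings["categories"], dict)
    return settings

def classify_with_settings(text: str, settings: dict) -> str:
    """Classify using whatever categories the settings currently define."""
    lowered = text.lower()
    for category, keywords in settings["categories"].items():
        hits = sum(kw in lowered for kw in keywords)
        if hits >= settings["min_keyword_hits"]:
            return category
    return "uncategorized"
```

Adding a new metadata category then means editing the settings payload in the dashboard; no redeploy is needed.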
Extraction View provides searchable history of past extractions with side-by-side layout for raw and processed data comparison, enabling manual correction and full transparency.
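The data model behind such a review view is simple: each history entry keeps the raw input next to its extracted metadata, and search matches against both. A sketch under that assumption (record fields and the correction helper are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    """One past extraction: raw input stored alongside parsed metadata so
    reviewers can compare them side by side and apply corrections."""
    doc_id: str
    raw_text: str
    metadata: dict

def search_history(history: list[ExtractionRecord],
                   query: str) -> list[ExtractionRecord]:
    """Match the query against both the raw text and metadata values."""
    q = query.lower()
    return [
        r for r in history
        if q in r.raw_text.lower()
        or any(q in str(v).lower() for v in r.metadata.values())
    ]

def correct(record: ExtractionRecord, field_name: str, value: str) -> None:
    """Manual override from the review UI replaces the extracted value."""
    record.metadata[field_name] = value
```

Keeping the raw text in the record is what makes the side-by-side layout and after-the-fact corrections possible without re-fetching source documents.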
We extend our sincere thanks to Solidigm and their incredible team for their collaboration and support throughout this rewarding data extraction project.