Solidigm AI Data Extraction Pipeline interface

AI Data Extraction Pipeline

An automated data extraction pipeline built for Solidigm's AI debugging tool. It processes unstructured engineering documents and service tickets, extracting metadata with over 99% accuracy to feed RAG models.

Tech Stack

Dataiku
LlamaIndex
PDFPlumber
JavaScript
HTML/CSS
Pandas
Azure VM
Figma
GitHub
Jira

Project Overview

During Spring 2025, our CodeLab UC Davis team partnered with Solidigm to build a data extraction pipeline as the foundation for their AI debugging tool. Working with lead engineer Ali Hashim, with support from Jasmin Vora and Charles Anyimi, we developed an automated system that streamlines data querying and stores service ticket metadata more efficiently.

The Challenge

Solidigm needed to develop an AI-powered diagnostic tool built on a Retrieval-Augmented Generation (RAG) architecture. The challenge was building a pipeline that could accurately extract, clean, classify, and vectorize raw, unstructured engineering data from PDFs and service tickets.
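To make the extract, clean, classify, and vectorize stages concrete, here is a minimal sketch of how such stages could be chained. It assumes text has already been extracted upstream. The Record structure, the keyword rules, and the term-frequency "vector" are all illustrative assumptions, not the project's actual code; a real RAG pipeline would call an embedding model in the last stage.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """One document moving through the pipeline (hypothetical structure)."""
    raw: str
    text: str = ""
    label: str = "unclassified"
    vector: dict = field(default_factory=dict)


def clean(record: Record) -> Record:
    # Collapse runs of whitespace (including newlines) and trim the ends.
    record.text = re.sub(r"\s+", " ", record.raw).strip()
    return record


def classify(record: Record) -> Record:
    # Placeholder rules with made-up keywords, for illustration only.
    lowered = record.text.lower()
    if "firmware" in lowered:
        record.label = "firmware"
    elif "ticket" in lowered:
        record.label = "service_ticket"
    return record


def vectorize(record: Record) -> Record:
    # Toy term-frequency vector standing in for a real embedding.
    tokens = re.findall(r"[a-z0-9]+", record.text.lower())
    record.vector = dict(Counter(tokens))
    return record


# Stages run in order; adding or swapping a stage only touches this list.
STAGES = [clean, classify, vectorize]


def run_pipeline(raw: str) -> Record:
    record = Record(raw=raw)
    for stage in STAGES:
        record = stage(record)
    return record


result = run_pipeline("  Firmware   update\nfailed on drive  ")
print(result.text, "|", result.label)
```

Keeping each stage as a plain function with the same signature is one simple way to get the modularity the project describes, since stages can be tested and replaced independently.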

The Solution

We built a comprehensive pipeline using Dataiku as the platform, LlamaIndex and PDFPlumber for PDF processing, and custom JavaScript interfaces. The system processes both structured CSV data from Jira and unstructured PDF documents, achieving over 99% accuracy in metadata extraction.

The Outcome

Delivered a scalable, automated pipeline with a dual-view plugin interface: a Settings/Dashboard view for configuration and an Extraction View for quality control. The system provides full transparency and manual correction capabilities, supporting long-term maintainability.

Project Visuals

Solidigm AI Settings Dashboard showing metadata categories and keyword management

The Settings/Dashboard interface allows engineers to control metadata categories and customize keyword lists for improved extraction accuracy.

Solidigm AI Extraction View showing side-by-side document comparison

The Extraction View provides side-by-side comparison of raw documents and extracted metadata for quality control and manual corrections.

CodeLab UC Davis team working on Solidigm AI project

The CodeLab UC Davis team collaborating on the Solidigm AI data extraction pipeline project.

Development Process

Research & User Interviews

Conducted direct interviews with Ali Hashim to identify workflows and pain points in service ticket management. Key insights included the need for customizable extraction criteria, visibility into accuracy, and streamlined review processes.

Design & Prototyping

Created wireframes for the dual-view structure: Settings/Dashboard for configuration and Extraction View for quality control. Developed lo-fi to hi-fi prototypes matching Solidigm's visual identity, focusing on minimizing friction for technical users.

Development & Integration

Built the pipeline on the existing Dataiku platform, implemented PDF processing with LlamaIndex and PDFPlumber, and created custom JavaScript interfaces. Achieved over 99% accuracy through rigorous experimentation and a modular pipeline architecture that handles diverse document formats.
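As an illustration of the PDF step, the sketch below pulls page text with pdfplumber and applies a small cleanup pass. The cleanup rules (rejoining hyphenated line breaks, normalizing whitespace) are assumptions about what such a pass might do, not the project's actual heuristics, and the LlamaIndex portion of the workflow is not shown.

```python
import re


def extract_pdf_text(path: str) -> list[str]:
    """Return the raw text of each page, using pdfplumber."""
    import pdfplumber  # imported lazily so clean_page works without it installed

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None or "" for pages with no text layer.
        return [page.extract_text() or "" for page in pdf.pages]


def clean_page(text: str) -> str:
    """Rejoin words hyphenated across line breaks and normalize whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # "con-\ntroller" -> "controller"
    text = re.sub(r"[ \t]+", " ", text)            # collapse spaces and tabs
    text = re.sub(r"\n{2,}", "\n\n", text)         # at most one blank line
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).strip()


cleaned = clean_page("Drive con-\ntroller   error\n\n\n\nSee log  ")
print(cleaned)
```

A caller would typically run clean_page over each string returned by extract_pdf_text before passing the text on to classification.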

Key Features

Dual-Format Processing

Handles both structured CSV data from Jira service tickets and unstructured PDF engineering documents, with a specialized processing pipeline for each format.
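A rough sketch of what format-specific paths could look like follows. Both paths produce records with the same shape so later stages do not need to care about the source. The column names ("Issue key", "Summary", "Description"), the record fields, and the sample ticket are assumptions made for this example. It uses the standard csv module to stay dependency-free; the same parsing could be done with Pandas.

```python
import csv
import io
from pathlib import Path


def parse_ticket_csv(csv_text: str) -> list[dict]:
    """Structured path: one record per row of a ticket export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "source": "csv",
            "id": row["Issue key"].strip(),
            "text": f'{row["Summary"].strip()}. {row["Description"].strip()}',
        }
        for row in reader
    ]


def parse_pdf_pages(doc_id: str, pages: list[str]) -> list[dict]:
    """Unstructured path: one record per non-empty page of extracted text."""
    return [
        {"source": "pdf", "id": f"{doc_id}#p{number}", "text": page.strip()}
        for number, page in enumerate(pages, start=1)
        if page.strip()
    ]


def route(filename: str) -> str:
    """Pick a processing path from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".pdf":
        return "pdf"
    raise ValueError(f"unsupported file type: {filename}")


# Invented sample data, not real tickets or documents.
sample = (
    "Issue key,Summary,Description\n"
    "SSD-101,Drive not detected,Fails after firmware update\n"
)
tickets = parse_ticket_csv(sample)
pages = parse_pdf_pages("spec-a", ["Power states overview", "   ", "Thermal limits"])
```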

99% Extraction Accuracy

Achieves over 99% accuracy in metadata extraction through a combination of rule-based logic, semantic parsing, and model-driven classification, with custom heuristics for document processing.
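One way rule-based logic and model-driven classification can be combined is to trust keyword rules when a single category clearly wins and defer to a model otherwise. The sketch below assumes that arrangement. The categories, keywords, and the `model` callable are hypothetical, and this is not a description of how the project's classifier actually works.

```python
import re
from typing import Callable, Optional

# Hypothetical categories and keywords, for illustration only.
KEYWORDS = {
    "firmware": ["firmware", "fw update", "bootloader"],
    "thermal": ["overheat", "thermal", "temperature"],
}


def rule_scores(text: str) -> dict:
    """Count whole-phrase keyword hits per category."""
    lowered = text.lower()
    scores = {}
    for category, phrases in KEYWORDS.items():
        scores[category] = sum(
            len(re.findall(rf"\b{re.escape(phrase)}\b", lowered))
            for phrase in phrases
        )
    return scores


def classify(text: str, model: Optional[Callable[[str], str]] = None) -> tuple:
    """Return (label, decided_by): rules if unambiguous, else the model."""
    ranked = sorted(rule_scores(text).items(), key=lambda item: item[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if best_hits > 0 and best_hits > second_hits:
        return best, "rules"
    if model is not None:
        # Any callable mapping text to a label, e.g. a wrapper around an LLM.
        return model(text), "model"
    # No clear rule winner and no model: flag for manual review.
    return "needs_review", "none"


print(classify("Firmware update stalled in bootloader"))
```

Returning which path made the decision alongside the label makes it easier to audit results later, for example by reviewing only the items the rules could not settle.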

Customizable Extraction Control

The Settings/Dashboard interface lets users define metadata categories, add custom keywords, and configure extraction parameters without backend modifications.
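Configuration without backend changes usually means the extraction rules live in data rather than code. As a sketch under that assumption, the example below merges user-supplied categories and keywords (here a JSON string, as a settings form might submit) over a set of defaults. The settings schema and every category name are hypothetical.

```python
import json

# Hypothetical defaults; the real categories are not documented here.
DEFAULT_SETTINGS = {
    "categories": {
        "firmware": ["firmware", "bootloader"],
        "thermal": ["thermal", "overheat"],
    }
}


def apply_user_settings(defaults: dict, user_json: str) -> dict:
    """Merge user-defined categories and keywords over the defaults."""
    user = json.loads(user_json)
    # Copy the keyword lists so the defaults are never mutated.
    merged = {name: list(words) for name, words in defaults["categories"].items()}
    for name, words in user.get("categories", {}).items():
        existing = merged.setdefault(name, [])
        for word in words:
            normalized = word.strip().lower()
            if normalized and normalized not in existing:
                existing.append(normalized)
    return {"categories": merged}


user_json = '{"categories": {"thermal": ["Throttling", "thermal"], "power": ["brownout"]}}'
settings = apply_user_settings(DEFAULT_SETTINGS, user_json)
print(settings["categories"])
```

Normalizing and de-duplicating keywords at merge time keeps the stored lists clean no matter how they were typed into the form.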

Quality Control & Review System

The Extraction View provides a searchable history of past extractions with a side-by-side layout for comparing raw and processed data, enabling manual correction and full transparency.
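The data behind a review screen like this could be modeled as extraction records that keep the raw text, the extracted metadata, and any manual corrections separately. The sketch below is one possible shape, with hypothetical field names and invented sample records. The `field_accuracy` helper is only an example of how reviewer corrections could be turned into an accuracy estimate, not how the figure reported above was measured.

```python
from dataclasses import dataclass, field


@dataclass
class Extraction:
    doc_id: str
    raw_text: str
    metadata: dict
    corrections: dict = field(default_factory=dict)

    def correct(self, key: str, value: str) -> None:
        """Record a manual fix without overwriting the extracted value."""
        self.corrections[key] = value

    def final(self) -> dict:
        """Extracted metadata with manual corrections applied on top."""
        return {**self.metadata, **self.corrections}


def search(history: list, term: str) -> list:
    """Case-insensitive match on document id, raw text, or final metadata values."""
    needle = term.lower()
    return [
        item for item in history
        if needle in item.doc_id.lower()
        or needle in item.raw_text.lower()
        or any(needle in str(value).lower() for value in item.final().values())
    ]


def field_accuracy(history: list) -> float:
    """Share of extracted fields that reviewers left unchanged."""
    total = sum(len(item.metadata) for item in history)
    changed = sum(
        1
        for item in history
        for key, value in item.corrections.items()
        if key in item.metadata and item.metadata[key] != value
    )
    return 1.0 if total == 0 else (total - changed) / total


# Invented sample records for illustration.
history = [
    Extraction("T-1", "Drive overheats under load", {"category": "firmware", "severity": "high"}),
    Extraction("T-2", "Firmware flash fails", {"category": "firmware", "severity": "low"}),
]
history[0].correct("category", "thermal")
```

Storing corrections next to the original values, rather than overwriting them, preserves the side-by-side comparison and leaves an audit trail of what reviewers changed.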

Interested in working together?

We extend our sincere thanks to Solidigm and their incredible team for their collaboration and support throughout this rewarding data extraction project.