A modular agentic pipeline that ingests trafficking evidence in every format — court records, text, images, audio, structured data — and turns it into an investigator-accessible knowledge graph.
Research vision
Trafficking evidence arrives in every format — court records, online posts and ads, images, audio, case-management data — but it's fragmented, siloed, and rarely analyzable across sources. My goal is one system that ingests all of it, organizes it semantically, and makes it queryable at scale.
The architecture is modality-aware: a router classifies each incoming item and sends it to the appropriate open-source model or method. Legal text goes to a legal-domain extractor, general text to an instruction-tuned LLM, audio through speech recognition, images and audio through ImageBind for cross-modal embedding, and structured records through schema mapping.
Every modality converges on a shared embedding space — indexed and searched with FAISS — and a unified knowledge graph (Neo4j), exposed through a retrieval-augmented interface, so an audio clip, a court-exhibit image, and a paragraph of an indictment can be retrieved and reasoned over in the same query.
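To make the single-query idea concrete, here is a minimal pure-Python sketch of exhaustive cosine-similarity search over a shared embedding space, which is conceptually what an exact (flat) inner-product index does over normalised vectors. The item ids, the 3-dimensional toy vectors, and the `Item` and `search` names are hypothetical; a real deployment would use model-produced embeddings and FAISS rather than this loop.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str    # hypothetical pointer back to the source artifact
    modality: str   # e.g. "LEGAL_TEXT", "IMAGE", "AUDIO"
    vector: tuple   # embedding in the shared space (toy values here)


def normalize(v):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return tuple(x / norm for x in v)


def search(query, items, k=2):
    """Return the k items most similar to the query, regardless of modality."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(item.vector))), item)
        for item in items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


ITEMS = [
    Item("indictment-para-12", "LEGAL_TEXT", (0.9, 0.1, 0.0)),
    Item("exhibit-photo-3", "IMAGE", (0.8, 0.2, 0.1)),
    Item("interview-clip-7", "AUDIO", (0.0, 0.1, 0.9)),
]

hits = search((1.0, 0.0, 0.0), ITEMS, k=2)
for score, item in hits:
    print(f"{item.item_id} ({item.modality}): {score:.3f}")
```

Because every item carries its `item_id`, each hit can be traced back to its source document, whatever modality it came from.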
Why legal dockets first
A system this broad can't be validated all at once, and the wrong starting modality stalls everything downstream. The dissertation deliberately begins with the most readily available, most structured, highest-signal source: federal trafficking legal dockets.
Court filings are public record, retrievable at scale, and dense with the exact entities the graph is built to hold — defendants, charges, parties, outcomes, jurisdictions. Proving the full ingestion → extraction → graph → RAG path on legal text first means the messier modalities extend a foundation that already works, rather than fighting an unproven architecture.
What I'm building
A modular, agentic pipeline that ingests dockets across both the civil and criminal strands, derives its schema from the real docket population, assembles structured records into a queryable Neo4j knowledge graph, and exposes it through a RAG interface for non-technical analysts.
The dissertation evaluates the legal-text path of the broader multimodal architecture end-to-end. The contribution is the working system itself, judged by whether it is operationally useful to investigators.
Research questions
Can the pipeline reliably convert heterogeneous federal trafficking dockets — civil and criminal alike — into faithful, schema-consistent structured data?
Does the resulting knowledge graph let investigators surface cross-case and cross-strand patterns that document-level tools cannot?
Can non-technical analysts actually retrieve, reason about, and act on case information better than with their current tools? This is the central validation question, answered through a formal user study.
Approach
Applied computer science: success is operational utility for investigators, not benchmark superiority. The model bake-offs and embedding evaluations are design decisions justified by their effect on the working system.
The pipeline is one path of a system architected to take in every trafficking evidence modality through a shared graph and embedding space.
Federal trafficking dockets are absent from every major legal NLP corpus — genuinely new territory rather than a re-benchmarking of well-trodden appellate text.
The under-studied, faster-growing civil strand — including third-party corporate liability theories developing in real time — is treated as a first-class case type.
Schema and graph carry both strands as equals, with a parallel_proceeding edge linking a civil suit to its corresponding criminal case so cross-strand queries are first-class.
The extraction schema is itself a research deliverable, derived from stratified analysis of the real docket distribution rather than assumed in advance.
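The cross-strand linkage described above can be illustrated with a small in-memory sketch. Only the `parallel_proceeding` edge name comes from the design; the `CaseGraph` class, the case ids, and the civil-to-criminal edge direction are assumptions made for illustration, and the actual system stores this in Neo4j.

```python
from dataclasses import dataclass, field


@dataclass
class CaseGraph:
    cases: dict = field(default_factory=dict)  # case_id -> "civil" | "criminal"
    edges: set = field(default_factory=set)    # (source, relation, target)

    def add_case(self, case_id, case_type):
        if case_type not in {"civil", "criminal"}:
            raise ValueError(f"unknown case type: {case_type!r}")
        self.cases[case_id] = case_type

    def link_parallel(self, civil_id, criminal_id):
        """Link a civil suit to its corresponding criminal case."""
        if self.cases.get(civil_id) != "civil":
            raise ValueError(f"{civil_id!r} is not a known civil case")
        if self.cases.get(criminal_id) != "criminal":
            raise ValueError(f"{criminal_id!r} is not a known criminal case")
        self.edges.add((civil_id, "parallel_proceeding", criminal_id))

    def criminal_counterparts(self, civil_id):
        """Cross-strand query: criminal cases linked to a given civil suit."""
        return sorted(
            target
            for source, relation, target in self.edges
            if source == civil_id and relation == "parallel_proceeding"
        )


graph = CaseGraph()
graph.add_case("civ-001", "civil")
graph.add_case("crim-001", "criminal")
graph.add_case("crim-002", "criminal")
graph.link_parallel("civ-001", "crim-001")

print(graph.criminal_counterparts("civ-001"))
```

Checking case types when the edge is created keeps the link meaningful: a `parallel_proceeding` edge can only ever connect one strand to the other.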
The system · five modular components
Modality router, acquisition, role-aware PII scrubbing, and case-type classification — emitting modality-tagged structured JSON. The router is the surface along which the system extends from legal text to the full multimodal vision.
Builds FAISS indexes for hybrid retrieval — text encoders for the dissertation, ImageBind cross-modal embeddings for the broader system.
Converts structured records into (subject, relation, object) triples with confidence scores and source-chunk provenance — case-type-aware, including parallel-proceeding linkage.
Writes triples into Neo4j under enforced schema constraints, with parallel civil and criminal node and edge types and full provenance.
Interprets natural-language investigative queries, retrieves subgraphs via FAISS + Neo4j Cypher, and synthesizes answers with source citations, a confidence score, and a reasoning chain.
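As a rough illustration of the triple-extraction and constrained-write steps, the sketch below converts one structured record into (subject, relation, object) triples carrying a confidence score and source-chunk provenance, then validates them against an allow-list before they would be written. The record fields, relation names, and `ALLOWED_RELATIONS` table are hypothetical placeholders, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    confidence: float   # extractor confidence in [0, 1]
    source_chunk: str   # provenance: id of the chunk the fact came from


# Hypothetical allow-list of relations per case type.
ALLOWED_RELATIONS = {
    "criminal": {"CHARGED_WITH", "FILED_IN"},
    "civil": {"BRINGS_CLAIM_UNDER", "FILED_IN"},
}


def validate(triple, case_type):
    """Reject triples that would violate the (assumed) schema constraints."""
    if triple.relation not in ALLOWED_RELATIONS[case_type]:
        raise ValueError(f"{triple.relation!r} not allowed for {case_type} cases")
    if not 0.0 <= triple.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {triple.confidence}")


def record_to_triples(record):
    """Turn one structured record into validated, provenance-tagged triples."""
    case_type = record["case_type"]
    if case_type not in ALLOWED_RELATIONS:
        raise ValueError(f"unknown case type: {case_type!r}")
    chunk = record["chunk_id"]
    # Case-type awareness: the statute relation differs by strand.
    statute_relation = (
        "CHARGED_WITH" if case_type == "criminal" else "BRINGS_CLAIM_UNDER"
    )
    # Court comes from docket metadata; treated as certain in this sketch.
    triples = [Triple(record["case_id"], "FILED_IN", record["court"], 1.0, chunk)]
    for item in record["statutes"]:
        triples.append(
            Triple(record["party"], statute_relation,
                   item["statute"], item["confidence"], chunk)
        )
    for triple in triples:
        validate(triple, case_type)
    return triples


RECORD = {
    "case_id": "crim-001",
    "case_type": "criminal",
    "court": "District Court X",
    "party": "Defendant A",
    "chunk_id": "crim-001#chunk-4",
    "statutes": [{"statute": "18 U.S.C. 1591", "confidence": 0.93}],
}

triples = record_to_triples(RECORD)
for triple in triples:
    print(triple)
```

Validating before the write means a malformed triple fails loudly at the pipeline boundary instead of silently entering the graph.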
Modality routing · the full system
| Modality | Input | Model / method | Output | Status |
|---|---|---|---|---|
| LEGAL_TEXT | Federal court filings, dockets (PDF) | Legal-domain structured extraction (RQ1 bake-off) | Structured JSON → KG | Active — dissertation scope |
| GENERAL_TEXT | Social posts, ads, NGO narrative reports | Instruction-tuned LLM (LLaMA-3) | Entities, trafficking indicators, URLs | Designed; future extension |
| AUDIO | Interviews, intercepted calls (MP3/WAV) | Whisper ASR + LLM extraction + ImageBind | Structured payload + 1024-dim embedding | Designed; future extension |
| IMAGE/VIDEO | Ads, location/vehicle images, keyframes | ImageBind + OCR | 1024-dim embedding + descriptors + OCR text | Designed; future extension |
| STRUCTURED_DATA | CSV/JSON from case systems, open data | Schema mapping (YAML, no LLM) | Structured JSON → KG | Designed; future extension |
All routes pass through the PII scrubbing gate before any model is invoked. ImageBind's cross-modal alignment is what lets audio, image, and text be retrieved in a single embedding space.
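The ordering constraint, scrub first and only then invoke a model, can be sketched as follows. This is an illustrative simplification: classifying by file extension, the single phone-number pattern, and the stub `handle` function are all assumptions, and the sketch only covers items that already have a text payload.

```python
import re
from pathlib import PurePath

# Hypothetical extension-based classification; the real router may use content.
EXTENSION_TO_MODALITY = {
    ".pdf": "LEGAL_TEXT",
    ".txt": "GENERAL_TEXT",
    ".mp3": "AUDIO",
    ".wav": "AUDIO",
    ".jpg": "IMAGE/VIDEO",
    ".png": "IMAGE/VIDEO",
    ".csv": "STRUCTURED_DATA",
    ".json": "STRUCTURED_DATA",
}

# Placeholder scrubber: one pattern standing in for role-aware PII scrubbing.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def classify(filename):
    """Map an incoming item to a modality tag from the routing table."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in EXTENSION_TO_MODALITY:
        raise ValueError(f"no route for {filename!r}")
    return EXTENSION_TO_MODALITY[suffix]


def scrub(text):
    """Redact PII before any downstream handler sees the payload."""
    return PHONE.sub("[PHONE]", text)


def handle(modality, scrubbed_text):
    """Stub for the per-modality model or method; receives scrubbed text only."""
    return {"modality": modality, "payload": scrubbed_text}


def route(filename, text):
    modality = classify(filename)
    scrubbed = scrub(text)            # the gate: always runs before handle()
    return handle(modality, scrubbed)


result = route("complaint.PDF", "Contact 555-123-4567.")
print(result)
```

Keeping `scrub` inside `route` rather than inside each handler means a newly added modality cannot bypass the gate by accident.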
Core technical experience
Developed across the broader multimodal program of work.
Encoding image and audio into a shared embedding space aligned with text, enabling queries that cross modality boundaries.
FAISS index construction and tuning (Flat / IVF / IVFPQ) for scalable semantic retrieval, with embedding-to-source traceability.
Neo4j schema design, ontology constraints, entity resolution via embedding-similarity merges, and provenance-tagged nodes and edges.
Hybrid vector + graph-traversal retrieval feeding an LLM synthesis layer that returns cited, auditable answers with reasoning chains.
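One way to picture entity resolution via embedding-similarity merges is the sketch below, which groups mentions whose cosine similarity clears a threshold using a union-find structure. The mention strings, 2-dimensional toy vectors, and the 0.95 threshold are hypothetical; the source does not specify the merge algorithm, so union-find here is an illustrative choice rather than the project's method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resolve(mentions, threshold=0.95):
    """Cluster mentions whose embeddings are at least `threshold` similar."""
    names = list(mentions)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine(mentions[first], mentions[second]) >= threshold:
                parent[find(second)] = find(first)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in clusters.values())


MENTIONS = {
    "Acme Hotels LLC": (0.90, 0.10),
    "ACME Hotels, L.L.C.": (0.88, 0.12),
    "Riverside Motel Inc.": (0.10, 0.90),
}

clusters = resolve(MENTIONS)
print(clusters)
```

Merges are transitive under this scheme, so a conservative threshold matters: one spurious high-similarity pair can join two otherwise distinct entities.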
Ethics as design requirements
In a criminal-justice or civil-litigation context, a system that fails these shouldn't be deployed regardless of technical performance.
Sources are limited to filed court documents, never open investigations or survivor service records.
Defendants retained as public record; victims, witnesses, minors, informants, and survivor-capacity plaintiffs tokenized with within-docket-consistent, cross-docket-unlinkable tokens; ambiguous cases flagged for review.
Corpus composition documented, with metrics disaggregated by case type, statute, jurisdiction, and document type.
Resolved court documents only; legal constructs rather than operational detail; a research tool, not a live feed. Extension to live multimodal data requires additional dual-use assessment.
Survivor-informed extensions to schema, query design, and governance are scoped as required steps before any operational deployment.
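The tokenization requirement above (consistent within a docket, unlinkable across dockets) could be met in several ways. A minimal sketch, assuming a keyed HMAC over the docket id and the normalised name, is shown here; the role labels, token format, `REVIEW_REQUIRED` sentinel, and demo key are hypothetical, and unlinkability holds only for parties who do not possess the key.

```python
import hashlib
import hmac

# Hypothetical role labels for parties whose names must be tokenized.
PROTECTED_ROLES = {"victim", "witness", "minor", "informant", "survivor_plaintiff"}


def pseudonymize(name, role, docket_id, secret_key):
    """Return the name to store for a party, depending on role."""
    if role == "defendant":
        return name                   # defendants retained as public record
    if role not in PROTECTED_ROLES:
        return "REVIEW_REQUIRED"      # ambiguous role: flag for human review
    # Binding the docket id into the message makes tokens docket-specific.
    message = f"{docket_id}\x1f{name.strip().lower()}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return f"{role.upper()}_{digest[:12]}"


KEY = b"demo-key-not-for-production"

same_docket_a = pseudonymize("Jane Roe", "victim", "docket-A", KEY)
same_docket_b = pseudonymize("  jane roe ", "victim", "docket-A", KEY)
other_docket = pseudonymize("Jane Roe", "victim", "docket-B", KEY)

print(same_docket_a == same_docket_b)   # consistent within a docket
print(same_docket_a == other_docket)    # different across dockets
```

Normalising the name before hashing keeps trivially different spellings of the same mention on one token, though real filings would need more careful name matching than `strip().lower()`.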