Jillian Bommarito

The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Apr 10, 2025
Via arXiv

Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

Apr 05, 2025
Via arXiv

KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

Mar 21, 2025
Via arXiv

Towards Best Practices for Open Datasets for LLM Training

Jan 14, 2025
Via arXiv

GPT as Knowledge Worker: A Zero-Shot Evaluation of CPA Capabilities

Jan 11, 2023
Via arXiv