Data Requirements

This document outlines the data requirements for training, fine-tuning, and using models in the Bluefly LLM ecosystem.

Overview

High-quality data is essential for effective machine learning. The sections below describe the requirements and best practices for preparing data for each of these uses within the Bluefly LLM ecosystem.

Data Types

The Bluefly LLM ecosystem works with various types of data:

Text Data

  • Plain Text: Unformatted text for general purposes
  • Structured Text: Text with defined format (JSON, XML, etc.)
  • Markdown: Formatted text with Markdown syntax
  • Code: Programming language source code

Document Data

  • PDF Documents: Portable Document Format files
  • Office Documents: Word, Excel, PowerPoint files
  • HTML Documents: Web pages and HTML content

Conversation Data

  • Chat Logs: Records of multi-turn conversations
  • Question-Answer Pairs: Questions matched with reference answers
  • Instructions and Responses: Instructions paired with the expected model responses

Data Formats

Training Data Format

Training data should be formatted as JSONL (JSON Lines) files, with each line containing a complete training example:

{"text": "This is a sample training example."}
{"text": "This is another training example."}

For instruction tuning, use the following format:

{"instruction": "Explain machine learning in simple terms.", "response": "Machine learning is a way for computers to learn from examples rather than being explicitly programmed..."}
{"instruction": "Summarize this article about climate change.", "response": "The article discusses the latest climate change research findings..."}
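As a sketch, records in this layout can be written with Python's json module, one object per line (the file name and example texts here are illustrative):

```python
import json

# Hypothetical instruction-tuning records matching the JSONL layout above.
examples = [
    {"instruction": "Explain machine learning in simple terms.",
     "response": "Machine learning is a way for computers to learn from examples..."},
    {"instruction": "Summarize this article about climate change.",
     "response": "The article discusses the latest climate change research findings..."},
]

# JSONL: one complete JSON object per line, newline-terminated.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Writing with `json.dumps` (rather than string formatting) guarantees each line is valid JSON, including correct escaping of quotes and newlines inside responses.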

Fine-tuning Data Format

Fine-tuning data follows a similar format but may include additional fields:

{"instruction": "Explain machine learning.", "response": "Machine learning is...", "category": "education", "difficulty": "beginner"}
{"instruction": "Write code for binary search.", "response": "def binary_search(arr, target):\n...", "category": "programming", "language": "python"}
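The additional metadata fields make it easy to inspect or filter the dataset before training. As a sketch, records could be grouped by their category field:

```python
import json

# Hypothetical fine-tuning records carrying the optional metadata fields.
records = [
    json.loads(line) for line in [
        '{"instruction": "Explain machine learning.", "response": "Machine learning is...", "category": "education", "difficulty": "beginner"}',
        '{"instruction": "Write code for binary search.", "response": "def binary_search(arr, target):\\n...", "category": "programming", "language": "python"}',
    ]
]

# Group records by category, e.g. to check coverage and balance per domain.
by_category = {}
for rec in records:
    by_category.setdefault(rec["category"], []).append(rec)
```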

Inference Data Format

For inference requests, use JSON format:

{
  "prompt": "Explain machine learning in simple terms.",
  "max_tokens": 150,
  "temperature": 0.7
}
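A request body like the one above can be assembled and serialized in Python; the field names mirror the example, while the surrounding transport (endpoint, headers) is omitted here:

```python
import json

# Build an inference request body matching the format shown above.
request = {
    "prompt": "Explain machine learning in simple terms.",
    "max_tokens": 150,
    "temperature": 0.7,
}
body = json.dumps(request)  # serialized JSON ready to send
```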

Data Quality Requirements

General Requirements

  • Accuracy: Data should be factually correct
  • Relevance: Data should be relevant to the task
  • Diversity: Data should cover a wide range of cases
  • Balance: Data should be balanced across categories
  • Representativeness: Data should represent real-world scenarios

Text Quality

  • Spelling and Grammar: Text should be well-written with minimal errors
  • Clarity: Text should be clear and understandable
  • Consistency: Style and terminology should be consistent
  • Length: Text should be of appropriate length for the task

Content Guidelines

  • Safety: Data should not contain harmful or unsafe content
  • Privacy: Data should not contain personally identifiable information
  • Ethics: Data should adhere to ethical guidelines
  • Legality: Data should comply with legal requirements

Data Preparation Process

Collection

  1. Source Identification: Identify reliable data sources
  2. Extraction: Extract data from sources
  3. Documentation: Document data provenance and collection methods

Cleaning

  1. Deduplication: Remove duplicate entries
  2. Error Correction: Fix spelling and grammatical errors
  3. Noise Removal: Remove irrelevant or noisy content
  4. Standardization: Standardize formatting and style
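The deduplication and standardization steps can be sketched for text-field JSONL records like so (a minimal illustration; real pipelines typically also apply fuzzy deduplication):

```python
import json

def clean_records(lines):
    """Deduplicate and lightly standardize JSONL records (illustrative sketch)."""
    seen = set()
    cleaned = []
    for line in lines:
        record = json.loads(line)
        # Standardization: collapse runs of whitespace in the text field.
        record["text"] = " ".join(record["text"].split())
        # Deduplication: keep only the first occurrence of each exact text.
        if record["text"] not in seen:
            seen.add(record["text"])
            cleaned.append(record)
    return cleaned

lines = [
    '{"text": "This is a   sample training example."}',
    '{"text": "This is a sample training example."}',
    '{"text": "This is another training example."}',
]
cleaned = clean_records(lines)  # the first two lines collapse to one record
```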

Preprocessing

  1. Tokenization: Convert text to tokens
  2. Normalization: Normalize text (case, punctuation, etc.)
  3. Filtering: Apply content filters
  4. Augmentation: Generate additional examples if needed
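The normalization and tokenization steps can be sketched as follows; this uses simple whitespace tokenization for illustration, whereas production models use subword tokenizers:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (a simple sketch)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(text.split())        # collapse whitespace

def tokenize(text):
    """Whitespace tokenization over normalized text."""
    return normalize(text).split()

tokens = tokenize("Machine Learning, in simple terms!")
```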

Validation

  1. Schema Validation: Verify data adheres to required format
  2. Quality Checks: Apply quality criteria
  3. Statistical Analysis: Analyze data distributions
  4. Manual Review: Review samples for quality
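Schema validation for instruction-tuning JSONL can be sketched with the standard library; the required fields here are an assumption based on the format examples earlier in this document:

```python
import json

# Assumed schema for instruction-tuning records (see the format section above).
REQUIRED_FIELDS = {"instruction": str, "response": str}

def validate_line(line):
    """Return a list of problems found in one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"wrong type for {field}")
        elif not record[field].strip():
            problems.append(f"empty field: {field}")
    return problems

ok = validate_line('{"instruction": "Explain ML.", "response": "ML is..."}')
bad = validate_line('{"instruction": "Explain ML."}')
```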

Data Volume Requirements

Minimum Requirements

Task                  Minimum Examples   Recommended Examples
Classification        100 per class      1,000+ per class
Summarization         500                5,000+
Question Answering    1,000              10,000+
Fine-tuning           1,000              50,000+

Quality vs. Quantity

While volume is important, quality is paramount. A smaller dataset of high-quality examples is often more effective than a larger dataset of low-quality examples.

Special Requirements by Task

Classification

  • Class Balance: Even distribution across classes
  • Edge Cases: Include boundary cases
  • Multi-label: Clear indication of multiple applicable labels
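A simple class-balance check can be sketched by comparing each class's frequency to the largest class; the 0.5 ratio threshold is an illustrative assumption, not a fixed requirement:

```python
from collections import Counter

def class_balance(labels, min_ratio=0.5):
    """Flag classes whose frequency falls below min_ratio of the largest class
    (illustrative heuristic; the threshold is an assumption)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: n / largest >= min_ratio for cls, n in counts.items()}

labels = ["positive"] * 100 + ["negative"] * 95 + ["neutral"] * 15
balance = class_balance(labels)  # "neutral" is flagged as underrepresented
```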

Question Answering

  • Question Diversity: Various question types and formats
  • Answer Completeness: Complete and accurate answers
  • Context: Necessary context for answering questions

Summarization

  • Source Length: Appropriate length of source text
  • Summary Quality: Well-written, concise summaries
  • Abstractiveness: Mix of extractive and abstractive summaries

Instruction Tuning

  • Instruction Clarity: Clear and unambiguous instructions
  • Response Quality: High-quality responses
  • Task Diversity: Coverage of various tasks and domains

Data Privacy and Security

Privacy Considerations

  • PII Removal: Remove personally identifiable information
  • Anonymization: Anonymize sensitive data
  • Consent: Ensure data is collected with appropriate consent
  • Compliance: Adhere to privacy regulations (GDPR, CCPA, etc.)
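Basic PII redaction can be sketched with regular expressions; these simplified patterns are illustrative assumptions and will miss many PII forms, so production anonymization should use dedicated tooling:

```python
import re

# Simplified redaction patterns (illustrative; not production-grade).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text):
    """Replace matched email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

safe = redact("Contact jane.doe@example.com or 555-123-4567.")
```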

Security Measures

  • Access Control: Restrict access to sensitive data
  • Encryption: Encrypt data at rest and in transit
  • Secure Transfer: Use secure methods for data transfer
  • Secure Storage: Store data in secure locations

Tools and Utilities

Data Preparation Tools

The Bluefly LLM ecosystem provides tools for data preparation:

  • Data Validator: Validates data format and quality
  • Data Cleaner: Helps clean and normalize data
  • Anonymizer: Helps remove personal information
  • Tokenizer: Tokenizes text for analysis

Command-line Utilities

# Validate dataset format
bfcli data validate --input dataset.jsonl --schema qa_schema.json

# Clean dataset
bfcli data clean --input raw_dataset.jsonl --output clean_dataset.jsonl

# Split dataset
bfcli data split --input dataset.jsonl --train 0.8 --val 0.1 --test 0.1

# Analyze dataset statistics
bfcli data analyze --input dataset.jsonl --output stats.json
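The 80/10/10 split performed by the CLI above can be sketched in plain Python; the fixed seed makes the split reproducible, and the test share is simply the remainder:

```python
import random

def split_dataset(records, train=0.8, val=0.1, seed=42):
    """Shuffle and split records into train/val/test portions.
    Ratios mirror the CLI example; the seed value is an illustrative choice."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
```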

Best Practices

General Best Practices

  • Start Small: Begin with a smaller, high-quality dataset
  • Iterate: Continuously improve data quality
  • Balance: Balance data across categories and types
  • Document: Thoroughly document data sources and processing
  • Test: Test data with models before full-scale training

Domain-Specific Best Practices

  • Technical Content: Include clear explanations and examples
  • Creative Content: Provide diverse creative examples
  • Educational Content: Cover different levels of complexity
  • Conversational Data: Ensure natural dialogue flow