Data Requirements¶
This document outlines the data requirements for training, fine-tuning, and using models in the Bluefly LLM ecosystem.
Overview¶
High-quality data is essential for effective machine learning. This document describes the requirements and best practices for preparing data for each of these uses within the Bluefly LLM ecosystem.
Data Types¶
The Bluefly LLM ecosystem works with various types of data:
Text Data¶
- Plain Text: Unformatted text for general purposes
- Structured Text: Text with defined format (JSON, XML, etc.)
- Markdown: Formatted text with Markdown syntax
- Code: Programming language source code
Document Data¶
- PDF Documents: Portable Document Format files
- Office Documents: Word, Excel, PowerPoint files
- HTML Documents: Web pages and HTML content
Conversation Data¶
- Chat Logs: Records of conversations
- Question-Answer Pairs: Questions matched with reference answers
- Instructions and Responses: Task instructions paired with the expected model responses
Data Formats¶
Training Data Format¶
Training data should be formatted as JSONL (JSON Lines) files, with each line containing a complete training example:
{"text": "This is a sample training example."}
{"text": "This is another training example."}
For instruction tuning, use the following format:
{"instruction": "Explain machine learning in simple terms.", "response": "Machine learning is a way for computers to learn from examples rather than being explicitly programmed..."}
{"instruction": "Summarize this article about climate change.", "response": "The article discusses the latest climate change research findings..."}
Fine-tuning Data Format¶
Fine-tuning data follows a similar format but may include additional fields:
{"instruction": "Explain machine learning.", "response": "Machine learning is...", "category": "education", "difficulty": "beginner"}
{"instruction": "Write code for binary search.", "response": "def binary_search(arr, target):\\n...", "category": "programming", "language": "python"}
Inference Data Format¶
For inference requests, use JSON format:
{
  "prompt": "Explain machine learning in simple terms.",
  "max_tokens": 150,
  "temperature": 0.7
}
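As an illustration only, the snippet below posts such a payload with the third-party requests package; the endpoint URL and authorization header are placeholders, so consult the API reference for your deployment rather than treating these values as the actual Bluefly interface.

```python
import requests

# Placeholder endpoint and token; substitute the values for your deployment.
API_URL = "https://llm.example.com/v1/completions"
API_TOKEN = "YOUR_API_TOKEN"

payload = {
    "prompt": "Explain machine learning in simple terms.",
    "max_tokens": 150,
    "temperature": 0.7,
}

response = requests.post(
    API_URL,
    json=payload,  # sent as the JSON request body
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```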
Data Quality Requirements¶
General Requirements¶
- Accuracy: Data should be factually correct
- Relevance: Data should be relevant to the task
- Diversity: Data should cover a wide range of cases
- Balance: Data should be balanced across categories
- Representativeness: Data should represent real-world scenarios
Text Quality¶
- Spelling and Grammar: Text should be well-written with minimal errors
- Clarity: Text should be clear and understandable
- Consistency: Style and terminology should be consistent
- Length: Text should be of appropriate length for the task
Content Guidelines¶
- Safety: Data should not contain harmful or unsafe content
- Privacy: Data should not contain personally identifiable information
- Ethics: Data should adhere to ethical guidelines
- Legality: Data should comply with legal requirements
Data Preparation Process¶
Collection¶
- Source Identification: Identify reliable data sources
- Extraction: Extract data from sources
- Documentation: Document data provenance and collection methods
Cleaning¶
- Deduplication: Remove duplicate entries (sketched below)
- Error Correction: Fix spelling and grammatical errors
- Noise Removal: Remove irrelevant or noisy content
- Standardization: Standardize formatting and style
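The deduplication, noise-removal, and standardization steps can be sketched as follows; the whitespace-collapsing rule and the exact-match duplicate check are simplifying assumptions, and production pipelines often use fuzzy or hash-based deduplication instead.

```python
import json
import re

def standardize(text: str) -> str:
    """Collapse runs of whitespace and trim; a minimal stand-in for fuller standardization."""
    return re.sub(r"\s+", " ", text).strip()

def clean_dataset(input_path: str, output_path: str) -> None:
    seen = set()
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # drop blank lines (noise removal)
            record = json.loads(line)
            record["text"] = standardize(record["text"])
            if record["text"] in seen:
                continue  # exact-match deduplication
            seen.add(record["text"])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")

clean_dataset("raw_dataset.jsonl", "clean_dataset.jsonl")
```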
Preprocessing¶
- Tokenization: Convert text to tokens
- Normalization: Normalize text (case, punctuation, etc.); see the sketch below
- Filtering: Apply content filters
- Augmentation: Generate additional examples if needed
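A minimal preprocessing pass over the cleaned file might look like the sketch below; the whitespace split stands in for the model's real tokenizer, and the length bounds and blocklist are illustrative values rather than official limits.

```python
import json

MIN_TOKENS, MAX_TOKENS = 5, 2048      # illustrative bounds, not official limits
BLOCKLIST = {"example-banned-term"}   # placeholder content filter

def preprocess(record):
    text = record["text"].lower().strip()  # simple normalization (case, whitespace)
    tokens = text.split()                  # stand-in for the model's real tokenizer
    if not MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
        return None                        # length filtering
    if BLOCKLIST & set(tokens):
        return None                        # content filtering
    return {"text": text, "num_tokens": len(tokens)}

with open("clean_dataset.jsonl", encoding="utf-8") as f:
    processed = [r for r in (preprocess(json.loads(l)) for l in f if l.strip()) if r]
```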
Validation¶
- Schema Validation: Verify that data adheres to the required format (sketched below)
- Quality Checks: Apply quality criteria
- Statistical Analysis: Analyze data distributions
- Manual Review: Review samples for quality
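Schema validation can be implemented with any JSON Schema library; the sketch below uses the third-party jsonschema package with a minimal illustrative schema, neither of which is the schema or tooling shipped with the ecosystem.

```python
import json
from jsonschema import validate, ValidationError  # third-party: pip install jsonschema

# Minimal illustrative schema; the real qa_schema.json may differ.
QA_SCHEMA = {
    "type": "object",
    "required": ["instruction", "response"],
    "properties": {
        "instruction": {"type": "string", "minLength": 1},
        "response": {"type": "string", "minLength": 1},
    },
}

errors = 0
with open("dataset.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue
        try:
            validate(instance=json.loads(line), schema=QA_SCHEMA)
        except (ValidationError, json.JSONDecodeError) as exc:
            errors += 1
            print(f"line {lineno}: {exc}")

print(f"{errors} invalid examples")
```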
Data Volume Requirements¶
Minimum Requirements¶
| Task | Minimum Examples | Recommended Examples |
|---|---|---|
| Classification | 100 per class | 1,000+ per class |
| Summarization | 500 | 5,000+ |
| Question Answering | 1,000 | 10,000+ |
| Fine-tuning | 1,000 | 50,000+ |
Quality vs. Quantity¶
While volume is important, quality is paramount. A smaller dataset of high-quality examples is often more effective than a larger dataset of low-quality examples.
Special Requirements by Task¶
Classification¶
- Class Balance: Even distribution across classes (see the balance check below)
- Edge Cases: Include boundary cases
- Multi-label: Clear indication of multiple applicable labels
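A quick balance check can count examples per class before training, as in the sketch below; the assumption that each example stores its label in a "category" field is illustrative.

```python
import json
from collections import Counter

# Assumes each example stores its label in a "category" field.
with open("dataset.jsonl", encoding="utf-8") as f:
    counts = Counter(json.loads(line)["category"] for line in f if line.strip())

for label, count in counts.most_common():
    print(f"{label}: {count}")

# Flag classes below the 100-example minimum from the table above.
low = [label for label, count in counts.items() if count < 100]
if low:
    print("Below minimum:", ", ".join(low))
```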
Question Answering¶
- Question Diversity: Various question types and formats
- Answer Completeness: Complete and accurate answers
- Context: Necessary context for answering questions
Summarization¶
- Source Length: Appropriate length of source text
- Summary Quality: Well-written, concise summaries
- Abstractiveness: Mix of extractive and abstractive summaries
Instruction Tuning¶
- Instruction Clarity: Clear and unambiguous instructions
- Response Quality: High-quality responses
- Task Diversity: Coverage of various tasks and domains
Data Privacy and Security¶
Privacy Considerations¶
- PII Removal: Remove personally identifiable information (see the sketch below)
- Anonymization: Anonymize sensitive data
- Consent: Ensure data is collected with appropriate consent
- Compliance: Adhere to privacy regulations (GDPR, CCPA, etc.)
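A rough sketch of regex-based PII scrubbing is shown below; the two patterns cover only emails and phone numbers, so treat them as illustrative and rely on a dedicated anonymization tool (such as the Anonymizer described later) for real datasets.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage
# (names, addresses, account numbers) and careful review.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(scrub_pii("Contact Jane at jane.doe@example.com or +1 555 123 4567."))
```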
Security Measures¶
- Access Control: Restrict access to sensitive data
- Encryption: Encrypt data at rest and in transit (an at-rest example follows this list)
- Secure Transfer: Use secure methods for data transfer
- Secure Storage: Store data in secure locations
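As one possible approach to encryption at rest, the sketch below uses the cryptography package's Fernet recipe; key management (where the key lives and how it is rotated) is out of scope here, and the file names are placeholders.

```python
from cryptography.fernet import Fernet  # third-party: pip install cryptography

# Generate the key once and keep it in a secret manager, never alongside the data.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("clean_dataset.jsonl", "rb") as f:
    encrypted = fernet.encrypt(f.read())

with open("clean_dataset.jsonl.enc", "wb") as f:
    f.write(encrypted)

# Decrypt with the same key before use.
original = fernet.decrypt(encrypted)
```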
Tools and Utilities¶
Data Preparation Tools¶
The Bluefly LLM ecosystem provides tools for data preparation:
- Data Validator: Validates data format and quality
- Data Cleaner: Helps clean and normalize data
- Anonymizer: Helps remove personal information
- Tokenizer: Tokenizes text for analysis
Command-line Utilities¶
# Validate dataset format
bfcli data validate --input dataset.jsonl --schema qa_schema.json
# Clean dataset
bfcli data clean --input raw_dataset.jsonl --output clean_dataset.jsonl
# Split dataset
bfcli data split --input dataset.jsonl --train 0.8 --val 0.1 --test 0.1
# Analyze dataset statistics
bfcli data analyze --input dataset.jsonl --output stats.json
Best Practices¶
General Best Practices¶
- Start Small: Begin with a smaller, high-quality dataset
- Iterate: Continuously improve data quality
- Balance: Balance data across categories and types
- Document: Thoroughly document data sources and processing
- Test: Test data with models before full-scale training
Domain-Specific Best Practices¶
- Technical Content: Include clear explanations and examples
- Creative Content: Provide diverse creative examples
- Educational Content: Cover different levels of complexity
- Conversational Data: Ensure natural dialogue flow