Inference Services¶
This document describes the inference services provided by the Bluefly LLM ecosystem.
Overview¶
The Bluefly inference services provide scalable, reliable, and efficient access to machine learning models. These services handle the complexities of model deployment, scaling, and monitoring, allowing applications to leverage machine learning capabilities through simple API calls.
Architecture¶
The inference services architecture consists of several components:
- API Gateway: Entry point for API requests
- Request Router: Routes requests to appropriate models
- Request Validator: Validates request format and parameters
- Model Manager: Manages model lifecycle and versioning
- Model Server: Hosts and serves model instances
- Monitoring: Tracks service health and performance
- Caching: Caches common requests to improve performance
- Rate Limiter: Controls request rates to prevent abuse
Available Models¶
The Bluefly inference services provide access to various models:
| Model ID | Description | Capabilities | Max Input Tokens | Max Output Tokens |
|---|---|---|---|---|
| bluefly-text-1 | General text processing | Text generation, summarization, classification | 4,096 | 4,096 |
| bluefly-text-1-long | Extended-context text processing | Text generation with long context | 16,384 | 4,096 |
| bluefly-doc-1 | Document processing | Document analysis, information extraction | 32,768 | 4,096 |
| bluefly-code-1 | Code processing | Code generation, explanation, review | 8,192 | 8,192 |
| bluefly-embedding-1 | Text embedding | Vector embeddings for semantic search | 8,192 | N/A |
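As a rough guide to choosing among the text models, the input limits above can drive a simple client-side selection heuristic. The sketch below is illustrative, not part of the API: `choose_text_model` is a hypothetical helper, and the ~4 characters-per-token estimate is an approximation, not the service's actual tokenizer.

```python
# Hypothetical helper: pick the smallest text model whose input limit
# fits the prompt, using the limits from the table above.
MODEL_LIMITS = {
    "bluefly-text-1": 4096,
    "bluefly-text-1-long": 16384,
    "bluefly-doc-1": 32768,
}

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def choose_text_model(prompt: str) -> str:
    """Return the smallest model whose input limit fits the prompt."""
    tokens = estimate_tokens(prompt)
    for model, limit in MODEL_LIMITS.items():
        if tokens <= limit:
            return model
    raise ValueError(f"Prompt (~{tokens} tokens) exceeds all model limits")
```

For precise accounting, rely on the `usage` field returned by the API rather than this estimate.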
API Reference¶
Base URL¶
https://api.bluefly.io/api/v1
Authentication¶
All API requests require authentication using API keys or JWT tokens. See the Authentication documentation for details.
Model Inference¶
Text Generation¶
POST /models/{model_id}/generate
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
  "prompt": "Write a short explanation of machine learning for beginners.",
  "max_tokens": 150,
  "temperature": 0.7,
  "top_p": 0.9,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": ["\n\n"],
  "format": "text"
}
Response:
{
  "id": "gen_123abc",
  "model": "bluefly-text-1",
  "output": "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Think of it like teaching a child...",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 138,
    "total_tokens": 150
  }
}
Text Embeddings¶
POST /models/bluefly-embedding-1/embeddings
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
  "input": "The quick brown fox jumps over the lazy dog.",
  "encoding_format": "float"
}
Response:
{
  "id": "emb_456def",
  "model": "bluefly-embedding-1",
  "object": "embedding",
  "embeddings": [
    [0.1234, -0.4567, 0.7890, ...]
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
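Since the embedding model targets semantic search, a typical client-side step after fetching embeddings is ranking candidates by cosine similarity to a query vector. A minimal sketch (the vectors below are toy values, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank candidate document embeddings against a query embedding.
query = [0.9, 0.1, 0.0]
candidates = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.0, 0.1, 0.9]}
ranked = sorted(candidates,
                key=lambda k: cosine_similarity(query, candidates[k]),
                reverse=True)
```

In practice the `embeddings` arrays from the response above drop straight into this kind of ranking, or into a vector database that computes the same metric.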
Deployment Options¶
Cloud Deployment¶
Bluefly inference services are available as a managed cloud service, accessible through the API:
- Standard Tier: Shared infrastructure with fair usage policies
- Dedicated Tier: Dedicated infrastructure for higher throughput and lower latency
- Enterprise Tier: Custom deployments with SLAs and support
On-Premises Deployment¶
For customers with specific security or compliance requirements, Bluefly offers on-premises deployment options:
- Docker Containers: Containerized deployment for Kubernetes environments
- VM Images: Virtual machine images for traditional infrastructure
- Air-Gapped Deployment: Fully isolated deployment for high-security environments
Model Configuration¶
The inference services offer various configuration options for model behavior:
Generation Parameters¶
| Parameter | Description | Default | Range |
|---|---|---|---|
| temperature | Controls randomness of output | 0.7 | 0.0 to 2.0 |
| top_p | Nucleus sampling parameter | 0.9 | 0.0 to 1.0 |
| frequency_penalty | Penalizes tokens in proportion to how often they have already appeared | 0.0 | -2.0 to 2.0 |
| presence_penalty | Penalizes any token that has already appeared in the text | 0.0 | -2.0 to 2.0 |
| max_tokens | Maximum number of tokens to generate | Model-specific | 1 to model max |
| stop | Sequences that stop generation | [] | Array of strings |
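The documented ranges can be enforced client-side before a request is sent, turning out-of-range values into clear local errors rather than API rejections. A hypothetical validation helper (not part of any Bluefly SDK):

```python
# Client-side validation mirroring the parameter ranges in the table above.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}

def build_generation_payload(prompt, **params):
    """Build a request body, rejecting out-of-range sampling parameters."""
    for name, value in params.items():
        if name in PARAM_RANGES:
            lo, hi = PARAM_RANGES[name]
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
    return {"prompt": prompt, **params}
```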
Model Versions¶
Each model has multiple versions available:
- Latest: The most recent version of the model
- Stable: The current stable version recommended for production
- Legacy: Previous versions for backward compatibility
Specify the version in the model ID:
bluefly-text-1-v2
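Version pinning can be wrapped in a small helper so production code always names its version explicitly; `model_id` below is a hypothetical convenience, and the `-v2` suffix format follows the example above:

```python
def model_id(base="bluefly-text-1", version=None):
    """Build a model ID; append a version suffix (e.g. 'v2') to pin it."""
    return f"{base}-{version}" if version else base
```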
Scaling and Performance¶
Auto-scaling¶
The inference services automatically scale based on demand:
- Horizontal Scaling: Adding more model instances during high load
- Model Caching: Keeping frequently used models in memory
- Request Batching: Combining multiple requests for efficient processing
Performance Optimization¶
The services include various optimizations:
- Quantization: Using lower precision for faster inference
- Model Distillation: Using smaller, faster models where appropriate
- Hardware Acceleration: Utilizing GPUs and specialized hardware
Performance Benchmarks¶
Typical performance metrics for the models:
| Model | Latency (P50) | Latency (P95) | Throughput |
|---|---|---|---|
| bluefly-text-1 | 150 ms | 250 ms | 100 req/s |
| bluefly-text-1-long | 300 ms | 500 ms | 50 req/s |
| bluefly-doc-1 | 500 ms | 800 ms | 30 req/s |
| bluefly-code-1 | 200 ms | 350 ms | 80 req/s |
| bluefly-embedding-1 | 50 ms | 100 ms | 200 req/s |
Security and Compliance¶
Data Security¶
The inference services implement robust security measures:
- Data Encryption: Encryption in transit and at rest
- Request Authentication: API key and JWT authentication
- Input Validation: Validation of all input parameters
Compliance¶
The services are designed for compliance with regulations:
- GDPR: Compliance with EU data protection regulations
- HIPAA: Compliance with healthcare data regulations (Enterprise tier)
- SOC 2: Service organization control compliance
Integration Examples¶
JavaScript/Node.js¶
const axios = require('axios');

async function generateText(prompt) {
  try {
    const response = await axios.post(
      'https://api.bluefly.io/api/v1/models/bluefly-text-1/generate',
      {
        prompt,
        max_tokens: 150,
        temperature: 0.7,
        format: 'text'
      },
      {
        headers: {
          'Authorization': `Bearer ${process.env.BLUEFLY_API_KEY}`,
          'Content-Type': 'application/json'
        },
        timeout: 30000 // axios has no default timeout; fail fast on a stalled connection
      }
    );
    return response.data;
  } catch (error) {
    console.error('Error generating text:', error.response?.data || error.message);
    throw error;
  }
}
Python¶
import os

import requests

def generate_text(prompt):
    try:
        response = requests.post(
            'https://api.bluefly.io/api/v1/models/bluefly-text-1/generate',
            json={
                'prompt': prompt,
                'max_tokens': 150,
                'temperature': 0.7,
                'format': 'text'
            },
            headers={
                'Authorization': f"Bearer {os.environ.get('BLUEFLY_API_KEY')}",
                'Content-Type': 'application/json'
            },
            timeout=30  # requests has no default timeout; avoid hanging forever
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error generating text: {e}")
        raise
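Because the service enforces rate limits, production clients usually retry throttled requests with exponential backoff. The sketch below assumes throttling surfaces as HTTP 429, which this document does not confirm; `with_retry` is a hypothetical wrapper that can be placed around the `generate_text` function above:

```python
import time

import requests

def with_retry(call, max_retries=3, backoff=time.sleep):
    """Invoke call(); on an HTTP 429 error, wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except requests.exceptions.HTTPError as e:
            status = e.response.status_code if e.response is not None else None
            if status == 429 and attempt < max_retries:
                backoff(2 ** attempt)  # exponential backoff between retries
            else:
                raise

# Example: result = with_retry(lambda: generate_text("Hello"))
```

The `backoff` parameter is injectable mainly to make the wrapper testable; in real use the default `time.sleep` applies.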