Inference Services

This document describes the inference services provided by the Bluefly LLM ecosystem.

Overview

The Bluefly inference services provide scalable, reliable, and efficient access to machine learning models. These services handle the complexities of model deployment, scaling, and monitoring, allowing applications to leverage machine learning capabilities through simple API calls.

Architecture

The inference services architecture consists of several components:

  1. API Gateway: Entry point for API requests
  2. Request Router: Routes requests to appropriate models
  3. Request Validator: Validates request format and parameters
  4. Model Manager: Manages model lifecycle and versioning
  5. Model Server: Hosts and serves model instances
  6. Monitoring: Tracks service health and performance
  7. Caching: Caches common requests to improve performance
  8. Rate Limiter: Controls request rates to prevent abuse (see the retry sketch below)
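
Because the rate limiter rejects requests that exceed allowed rates, client code usually retries with exponential backoff. A minimal sketch in Python, assuming the service signals rate limiting with HTTP 429 (the status code and retry policy are assumptions, not documented behavior):

import time
import requests

def post_with_retry(url, payload, headers, max_retries=5):
    """POST with exponential backoff, retrying when the rate limiter rejects a burst."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code != 429:  # assumption: 429 signals rate limiting
            response.raise_for_status()
            return response.json()
        time.sleep(delay)  # back off before the next attempt
        delay *= 2  # double the wait each time
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")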

Available Models

The Bluefly inference services provide access to various models:

| Model ID | Description | Capabilities | Max Input Tokens | Max Output Tokens |
| --- | --- | --- | --- | --- |
| bluefly-text-1 | General text processing | Text generation, summarization, classification | 4,096 | 4,096 |
| bluefly-text-1-long | Extended context text processing | Text generation with long context | 16,384 | 4,096 |
| bluefly-doc-1 | Document processing | Document analysis, information extraction | 32,768 | 4,096 |
| bluefly-code-1 | Code processing | Code generation, explanation, review | 8,192 | 8,192 |
| bluefly-embedding-1 | Text embedding | Vector embeddings for semantic search | 8,192 | N/A |
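
The token limits above can drive client-side model selection. Below is a sketch that picks the smallest text model whose context fits the input; the helper and its rough four-characters-per-token heuristic are illustrative assumptions, not part of the API:

# Max input tokens per text model, taken from the table above.
MODEL_INPUT_LIMITS = {
    "bluefly-text-1": 4096,
    "bluefly-text-1-long": 16384,
    "bluefly-doc-1": 32768,
}

def pick_text_model(prompt: str) -> str:
    """Return the smallest model whose input limit fits the prompt."""
    # Rough heuristic: ~4 characters per token (an assumption, not an official tokenizer).
    estimated_tokens = len(prompt) // 4 + 1
    for model, limit in MODEL_INPUT_LIMITS.items():  # dicts keep insertion order
        if estimated_tokens <= limit:
            return model
    raise ValueError("Prompt exceeds the largest supported context window")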

API Reference

Base URL

https://api.bluefly.io/api/v1

Authentication

All API requests require authentication using API keys or JWT tokens. See the Authentication documentation for details.
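
As a sketch, a small helper that builds the headers used throughout this document; it assumes JWTs are sent with the same Bearer scheme as API keys, which this page's examples suggest but do not state explicitly:

import os

def auth_headers() -> dict:
    """Build the request headers used in the examples below."""
    # Assumption: JWT tokens are passed the same way as API keys.
    return {
        "Authorization": f"Bearer {os.environ['BLUEFLY_API_KEY']}",
        "Content-Type": "application/json",
    }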

Model Inference

Text Generation

POST /models/{model_id}/generate
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "prompt": "Write a short explanation of machine learning for beginners.",
  "max_tokens": 150,
  "temperature": 0.7,
  "top_p": 0.9,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": ["\n\n"],
  "format": "text"
}

Response:

{
  "id": "gen_123abc",
  "model": "bluefly-text-1",
  "output": "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Think of it like teaching a child...",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 138,
    "total_tokens": 150
  }
}

Text Embeddings

POST /models/bluefly-embedding-1/embeddings
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "input": "The quick brown fox jumps over the lazy dog.",
  "encoding_format": "float"
}

Response:

{
  "id": "emb_456def",
  "model": "bluefly-embedding-1",
  "object": "embedding",
  "embeddings": [
    [0.1234, -0.4567, 0.7890, ...]  
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
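
Since bluefly-embedding-1 targets semantic search (see the model table above), a common pattern is to rank documents by cosine similarity between their embeddings and a query embedding. A minimal sketch; the request shape follows the example above, and the ranking logic is standard practice rather than a documented Bluefly feature:

import math
import os
import requests

API_URL = "https://api.bluefly.io/api/v1/models/bluefly-embedding-1/embeddings"
HEADERS = {
    "Authorization": f"Bearer {os.environ['BLUEFLY_API_KEY']}",
    "Content-Type": "application/json",
}

def embed(text):
    """Fetch one embedding vector, using the request shape shown above."""
    response = requests.post(
        API_URL,
        json={"input": text, "encoding_format": "float"},
        headers=HEADERS,
    )
    response.raise_for_status()
    return response.json()["embeddings"][0]

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank candidate documents against a query.
query = embed("How do I reset my password?")
docs = ["Password reset instructions", "Quarterly sales report"]
ranked = sorted(docs, key=lambda d: cosine_similarity(query, embed(d)), reverse=True)
print(ranked)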

Deployment Options

Cloud Deployment

Bluefly inference services are available as a managed cloud service, accessible through the API:

  • Standard Tier: Shared infrastructure with fair usage policies
  • Dedicated Tier: Dedicated infrastructure for higher throughput and lower latency
  • Enterprise Tier: Custom deployments with SLAs and support

On-Premises Deployment

For customers with specific security or compliance requirements, Bluefly offers on-premises deployment options:

  • Docker Containers: Containerized deployment for Kubernetes environments
  • VM Images: Virtual machine images for traditional infrastructure
  • Air-Gapped Deployment: Fully isolated deployment for high-security environments

Model Configuration

The inference services offer various configuration options for model behavior:

Generation Parameters

| Parameter | Description | Default | Range |
| --- | --- | --- | --- |
| temperature | Controls randomness of output | 0.7 | 0.0 to 2.0 |
| top_p | Nucleus sampling parameter | 0.9 | 0.0 to 1.0 |
| frequency_penalty | Penalizes repeated tokens | 0.0 | -2.0 to 2.0 |
| presence_penalty | Penalizes tokens already in the text | 0.0 | -2.0 to 2.0 |
| max_tokens | Maximum number of tokens to generate | Model-specific | 1 to model max |
| stop | Sequences that stop generation | [] | Array of strings |
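
These parameters are typically tuned together: low temperature for deterministic tasks, higher temperature and top_p for open-ended ones. Two illustrative request payloads using only the parameters documented above (the preset names and values are suggestions, not service defaults):

# Near-deterministic output, e.g. for classification or extraction.
precise = {
    "prompt": "Classify the sentiment of: 'Great service!'",
    "temperature": 0.0,  # minimal randomness
    "top_p": 1.0,
    "max_tokens": 10,
}

# More varied output, e.g. for brainstorming.
creative = {
    "prompt": "Suggest five names for a coffee shop.",
    "temperature": 1.2,  # within the documented 0.0 to 2.0 range
    "top_p": 0.95,
    "presence_penalty": 0.5,  # discourage repeating the same ideas
    "max_tokens": 150,
}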

Model Versions

Each model has multiple versions available:

  • Latest: The most recent version of the model
  • Stable: The current stable version recommended for production
  • Legacy: Previous versions for backward compatibility

Specify a version by appending its suffix to the model ID:

bluefly-text-1-v2
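
For example, pinning a request to a specific version rather than tracking the latest release (a sketch following the suffix pattern above):

# Pin to version 2 so behavior does not change when a new version ships.
model_id = "bluefly-text-1-v2"
url = f"https://api.bluefly.io/api/v1/models/{model_id}/generate"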

Scaling and Performance

Auto-scaling

The inference services automatically scale based on demand:

  • Horizontal Scaling: Adding more model instances during high load
  • Model Caching: Keeping frequently used models in memory
  • Request Batching: Combining multiple requests for efficient processing (see the client-side sketch below)
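
Because batching happens server-side, a client can simply issue requests concurrently and let the service group them. A minimal sketch; the concurrency pattern is a client-side suggestion, not a documented requirement:

import os
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {
    "Authorization": f"Bearer {os.environ['BLUEFLY_API_KEY']}",
    "Content-Type": "application/json",
}

def generate(prompt):
    """One generate call; see the Text Generation example above."""
    response = requests.post(
        "https://api.bluefly.io/api/v1/models/bluefly-text-1/generate",
        json={"prompt": prompt, "max_tokens": 50},
        headers=HEADERS,
    )
    response.raise_for_status()
    return response.json()

prompts = ["Summarize document A.", "Summarize document B.", "Summarize document C."]
# Overlapping requests gives the service the chance to batch them.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))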

Performance Optimization

The services include various optimizations:

  • Quantization: Using lower precision for faster inference
  • Model Distillation: Using smaller, faster models where appropriate
  • Hardware Acceleration: Utilizing GPUs and specialized hardware

Performance Benchmarks

Typical performance metrics for the models:

| Model | Latency (P50) | Latency (P95) | Throughput |
| --- | --- | --- | --- |
| bluefly-text-1 | 150ms | 250ms | 100 req/s |
| bluefly-text-1-long | 300ms | 500ms | 50 req/s |
| bluefly-doc-1 | 500ms | 800ms | 30 req/s |
| bluefly-code-1 | 200ms | 350ms | 80 req/s |
| bluefly-embedding-1 | 50ms | 100ms | 200 req/s |

Security and Compliance

Data Security

The inference services implement robust security measures:

  • Data Encryption: Encryption in transit and at rest
  • Request Authentication: API key and JWT authentication
  • Input Validation: Validation of all input parameters

Compliance

The services are designed for compliance with regulations:

  • GDPR: Compliance with EU data protection regulations
  • HIPAA: Compliance with healthcare data regulations (Enterprise tier)
  • SOC 2: Service organization control compliance

Integration Examples

JavaScript/Node.js

const axios = require('axios');

async function generateText(prompt) {
  try {
    const response = await axios.post(
      'https://api.bluefly.io/api/v1/models/bluefly-text-1/generate',
      {
        prompt,
        max_tokens: 150,
        temperature: 0.7,
        format: 'text'
      },
      {
        headers: {
          'Authorization': `Bearer ${process.env.BLUEFLY_API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    return response.data;
  } catch (error) {
    console.error('Error generating text:', error.response?.data || error.message);
    throw error;
  }
}

Python

import requests
import os

def generate_text(prompt):
    try:
        response = requests.post(
            'https://api.bluefly.io/api/v1/models/bluefly-text-1/generate',
            json={
                'prompt': prompt,
                'max_tokens': 150,
                'temperature': 0.7,
                'format': 'text'
            },
            headers={
                'Authorization': f"Bearer {os.environ.get('BLUEFLY_API_KEY')}",
                'Content-Type': 'application/json'
            }
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error generating text: {e}")
        raise
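
A minimal usage example for the Python client above; the prompt is illustrative, and the printed fields follow the response schema shown earlier:

if __name__ == "__main__":
    result = generate_text("Explain overfitting in one sentence.")
    print(result["output"])  # the generated text
    print(result["usage"]["total_tokens"])  # tokens consumed by the request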