Inference Services

This document describes the inference services provided by the Bluefly LLM ecosystem.

Overview

The Bluefly inference services provide scalable, reliable, and efficient access to machine learning models. These services handle the complexities of model deployment, scaling, and monitoring, allowing applications to leverage machine learning capabilities through simple API calls.

Architecture

The inference services architecture consists of several components:

  1. API Gateway: Entry point for API requests
  2. Request Router: Routes requests to appropriate models
  3. Request Validator: Validates request format and parameters
  4. Model Manager: Manages model lifecycle and versioning
  5. Model Server: Hosts and serves model instances
  6. Monitoring: Tracks service health and performance
  7. Caching: Caches common requests to improve performance
  8. Rate Limiter: Controls request rates to prevent abuse (see the retry sketch below)
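
Because the rate limiter rejects requests that exceed allowed rates, client code usually retries with exponential backoff. A minimal sketch in Python, assuming the service signals rate limiting with HTTP 429 (the status code and retry policy are assumptions, not documented behavior):

import time
import requests

def post_with_retry(url, payload, headers, max_retries=5):
    """POST with exponential backoff, retrying when the rate limiter rejects a burst."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code != 429:  # assumption: 429 signals rate limiting
            response.raise_for_status()
            return response.json()
        time.sleep(delay)  # back off before the next attempt
        delay *= 2  # double the wait each time
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")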

Available Models

The Bluefly inference services provide access to various models:

| Model ID | Description | Capabilities | Max Input Tokens | Max Output Tokens |
| --- | --- | --- | --- | --- |
| bluefly-text-1 | General text processing | Text generation, summarization, classification | 4,096 | 4,096 |
| bluefly-text-1-long | Extended context text processing | Text generation with long context | 16,384 | 4,096 |
| bluefly-doc-1 | Document processing | Document analysis, information extraction | 32,768 | 4,096 |
| bluefly-code-1 | Code processing | Code generation, explanation, review | 8,192 | 8,192 |
| bluefly-embedding-1 | Text embedding | Vector embeddings for semantic search | 8,192 | N/A |
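
The token limits above can drive client-side model selection. Below is a sketch that picks the smallest text model whose context fits the input; the helper and its rough four-characters-per-token heuristic are illustrative assumptions, not part of the API:

# Max input tokens per text model, taken from the table above.
MODEL_INPUT_LIMITS = {
    "bluefly-text-1": 4096,
    "bluefly-text-1-long": 16384,
    "bluefly-doc-1": 32768,
}

def pick_text_model(prompt: str) -> str:
    """Return the smallest model whose input limit fits the prompt."""
    # Rough heuristic: ~4 characters per token (an assumption, not an official tokenizer).
    estimated_tokens = len(prompt) // 4 + 1
    for model, limit in MODEL_INPUT_LIMITS.items():  # dicts keep insertion order
        if estimated_tokens <= limit:
            return model
    raise ValueError("Prompt exceeds the largest supported context window")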

API Reference

Base URL

https://api.bluefly.io/api/v1

Authentication

All API requests require authentication using API keys or JWT tokens. See the Authentication documentation for details.
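
As a sketch, a small helper that builds the headers used throughout this document; it assumes JWTs are sent with the same Bearer scheme as API keys, which this page's examples suggest but do not state explicitly:

import os

def auth_headers() -> dict:
    """Build the request headers used in the examples below."""
    # Assumption: JWT tokens are passed the same way as API keys.
    return {
        "Authorization": f"Bearer {os.environ['BLUEFLY_API_KEY']}",
        "Content-Type": "application/json",
    }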

Model Inference

Text Generation

POST /models/{model_id}/generate
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "prompt": "Write a short explanation of machine learning for beginners.",
  "max_tokens": 150,
  "temperature": 0.7,
  "top_p": 0.9,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": ["\n\n"],
  "format": "text"
}

Response:

{
  "id": "gen_123abc",
  "model": "bluefly-text-1",
  "output": "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Think of it like teaching a child...",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 138,
    "total_tokens": 150
  }
}

Text Embeddings

POST /models/bluefly-embedding-1/embeddings
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "input": "The quick brown fox jumps over the lazy dog.",
  "encoding_format": "float"
}

Response:

{
  "id": "emb_456def",
  "model": "bluefly-embedding-1",
  "object": "embedding",
  "embeddings": [
    [0.1234, -0.4567, 0.7890, ...]  
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
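
Since bluefly-embedding-1 targets semantic search (see the model table above), a common pattern is to rank documents by cosine similarity between their embeddings and a query embedding. A minimal sketch; the request shape follows the example above, and the ranking logic is standard practice rather than a documented Bluefly feature:

import math
import os
import requests

API_URL = "https://api.bluefly.io/api/v1/models/bluefly-embedding-1/embeddings"
HEADERS = {
    "Authorization": f"Bearer {os.environ['BLUEFLY_API_KEY']}",
    "Content-Type": "application/json",
}

def embed(text):
    """Fetch one embedding vector, using the request shape shown above."""
    response = requests.post(
        API_URL,
        json={"input": text, "encoding_format": "float"},
        headers=HEADERS,
    )
    response.raise_for_status()
    return response.json()["embeddings"][0]

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank candidate documents against a query.
query = embed("How do I reset my password?")
docs = ["Password reset instructions", "Quarterly sales report"]
ranked = sorted(docs, key=lambda d: cosine_similarity(query, embed(d)), reverse=True)
print(ranked)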

Deployment Options

Cloud Deployment

Bluefly inference services are available as a managed cloud service, accessible through the API:

  • Standard Tier: Shared infrastructure with fair usage policies
  • Dedicated Tier: Dedicated infrastructure for higher throughput and lower latency
  • Enterprise Tier: Custom deployments with SLAs and support

On-Premises Deployment

For customers with specific security or compliance requirements, Bluefly offers on-premises deployment options:

  • Docker Containers: Containerized deployment for Kubernetes environments
  • VM Images: Virtual machine images for traditional infrastructure
  • Air-Gapped Deployment: Fully isolated deployment for high-security environments

Model Configuration

The inference services offer various configuration options for model behavior:

Generation Parameters

| Parameter | Description | Default | Range |
| --- | --- | --- | --- |
| temperature | Controls randomness of output | 0.7 | 0.0 to 2.0 |
| top_p | Nucleus sampling parameter | 0.9 | 0.0 to 1.0 |
| frequency_penalty | Penalizes repeated tokens | 0.0 | -2.0 to 2.0 |
| presence_penalty | Penalizes tokens already in the text | 0.0 | -2.0 to 2.0 |
| max_tokens | Maximum number of tokens to generate | Model-specific | 1 to model max |
| stop | Sequences that stop generation | [] | Array of strings |
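
These parameters are typically tuned together: low temperature for deterministic tasks, higher temperature and top_p for open-ended ones. Two illustrative request payloads using only the parameters documented above (the preset names and values are suggestions, not service defaults):

# Near-deterministic output, e.g. for classification or extraction.
precise = {
    "prompt": "Classify the sentiment of: 'Great service!'",
    "temperature": 0.0,  # minimal randomness
    "top_p": 1.0,
    "max_tokens": 10,
}

# More varied output, e.g. for brainstorming.
creative = {
    "prompt": "Suggest five names for a coffee shop.",
    "temperature": 1.2,  # within the documented 0.0 to 2.0 range
    "top_p": 0.95,
    "presence_penalty": 0.5,  # discourage repeating the same ideas
    "max_tokens": 150,
}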

Model Versions

Each model has multiple versions available:

  • Latest: The most recent version of the model
  • Stable: The current stable version recommended for production
  • Legacy: Previous versions for backward compatibility

Specify a version by appending its suffix to the model ID:

bluefly-text-1-v2
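
For example, pinning a request to a specific version rather than tracking the latest release (a sketch following the suffix pattern above):

# Pin to version 2 so behavior does not change when a new version ships.
model_id = "bluefly-text-1-v2"
url = f"https://api.bluefly.io/api/v1/models/{model_id}/generate"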

Scaling and Performance

Auto-scaling

The inference services automatically scale based on demand:

  • Horizontal Scaling: Adding more model instances during high load
  • Model Caching: Keeping frequently used models in memory
  • Request Batching: Combining multiple requests for efficient processing (see the client-side sketch below)
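
Because batching happens server-side, a client can simply issue requests concurrently and let the service group them. A minimal sketch; the concurrency pattern is a client-side suggestion, not a documented requirement:

import os
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {
    "Authorization": f"Bearer {os.environ['BLUEFLY_API_KEY']}",
    "Content-Type": "application/json",
}

def generate(prompt):
    """One generate call; see the Text Generation example above."""
    response = requests.post(
        "https://api.bluefly.io/api/v1/models/bluefly-text-1/generate",
        json={"prompt": prompt, "max_tokens": 50},
        headers=HEADERS,
    )
    response.raise_for_status()
    return response.json()

prompts = ["Summarize document A.", "Summarize document B.", "Summarize document C."]
# Overlapping requests gives the service the chance to batch them.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))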

Performance Optimization

The services include various optimizations:

  • Quantization: Using lower precision for faster inference
  • Model Distillation: Using smaller, faster models where appropriate
  • Hardware Acceleration: Utilizing GPUs and specialized hardware

Performance Benchmarks

Typical performance metrics for the models:

| Model | Latency (P50) | Latency (P95) | Throughput |
| --- | --- | --- | --- |
| bluefly-text-1 | 150ms | 250ms | 100 req/s |
| bluefly-text-1-long | 300ms | 500ms | 50 req/s |
| bluefly-doc-1 | 500ms | 800ms | 30 req/s |
| bluefly-code-1 | 200ms | 350ms | 80 req/s |
| bluefly-embedding-1 | 50ms | 100ms | 200 req/s |

Security and Compliance

Data Security

The inference services implement robust security measures:

  • Data Encryption: Encryption in transit and at rest
  • Request Authentication: API key and JWT authentication
  • Input Validation: Validation of all input parameters

Compliance

The services are designed for compliance with regulations:

  • GDPR: Compliance with EU data protection regulations
  • HIPAA: Compliance with healthcare data regulations (Enterprise tier)
  • SOC 2: Service organization control compliance

Integration Examples

JavaScript/Node.js

const axios = require('axios');

async function generateText(prompt) {
  try {
    const response = await axios.post(
      'https://api.bluefly.io/api/v1/models/bluefly-text-1/generate',
      {
        prompt,
        max_tokens: 150,
        temperature: 0.7,
        format: 'text'
      },
      {
        headers: {
          'Authorization': `Bearer ${process.env.BLUEFLY_API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    return response.data;
  } catch (error) {
    console.error('Error generating text:', error.response?.data || error.message);
    throw error;
  }
}

Python

import requests
import os

def generate_text(prompt):
    try:
        response = requests.post(
            'https://api.bluefly.io/api/v1/models/bluefly-text-1/generate',
            json={
                'prompt': prompt,
                'max_tokens': 150,
                'temperature': 0.7,
                'format': 'text'
            },
            headers={
                'Authorization': f"Bearer {os.environ.get('BLUEFLY_API_KEY')}",
                'Content-Type': 'application/json'
            }
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error generating text: {e}")
        raise
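
A minimal usage example for the Python client above; the prompt is illustrative, and the printed fields follow the response schema shown earlier:

if __name__ == "__main__":
    result = generate_text("Explain overfitting in one sentence.")
    print(result["output"])  # the generated text
    print(result["usage"]["total_tokens"])  # tokens consumed by the request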