# Supported Providers
Flo AI supports multiple LLM providers through a consistent interface, so you can switch between models and providers with minimal code changes.

## OpenAI
### Basic Configuration
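
Flo AI's exact wrapper classes vary by version, so the sketch below configures the underlying `openai` SDK (v1+) directly; the same parameters (`api_key`, `model`) map onto whichever wrapper your `flo_ai` version exposes. It assumes `OPENAI_API_KEY` is set in the environment:

```python
import os

from openai import OpenAI

# Reads OPENAI_API_KEY from the environment if api_key is omitted.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",  # see "Available Models" below
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```
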
### Available Models
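
Commonly used chat models (also listed in the comparison table below):

- `gpt-4o` — strongest reasoning, higher cost
- `gpt-4o-mini` — fast, low-cost default for most tasks

Model IDs change over time; check OpenAI's model documentation for the current list.
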
### Streaming Support
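
A minimal streaming sketch using the `openai` SDK's `stream=True` flag; tokens are printed as they arrive instead of waiting for the full response:

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short haiku."}],
    stream=True,  # yields incremental chunks instead of one response
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
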
## Anthropic Claude
### Basic Configuration
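
A minimal sketch with the `anthropic` SDK (note that `max_tokens` is required on every request); adapt the parameters to whatever Claude wrapper your `flo_ai` version provides:

```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Anthropic API
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(message.content[0].text)
```
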
### Available Models
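
Models referenced in the comparison table below:

- `claude-3-5-sonnet-20241022` — strong creative writing and reasoning
- `claude-3-5-haiku-20241022` — low-cost, fast option for simple tasks

Anthropic versions model IDs by date; check their model documentation for current IDs.
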
## Google Gemini
### Basic Configuration
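
A minimal sketch using the `google-generativeai` package; it reads an API key from `GOOGLE_API_KEY` here, though the exact wrapper in `flo_ai` may differ:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Say hello in one sentence.")
print(response.text)
```
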
### Available Models
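
- `gemini-2.5-pro` — multimodal tasks (text, images, audio)
- `gemini-2.5-flash` — very fast, low-cost responses

See Google's Gemini documentation for the current model list.
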
## Google Vertex AI
### Configuration
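
Vertex AI serves Gemini models through Google Cloud rather than an API key: you authenticate with Application Default Credentials and specify a project and region. A sketch using the `vertexai` package, where `my-gcp-project` and `us-central1` are placeholder values:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Assumes Application Default Credentials are configured,
# e.g. via `gcloud auth application-default login`.
vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = GenerativeModel("gemini-2.5-pro")
response = model.generate_content("Say hello in one sentence.")
print(response.text)
```
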
## Ollama (Local)
### Configuration
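
Ollama runs models locally and listens on `http://localhost:11434` by default, so no API key is needed. A minimal sketch with the `ollama` Python package, assuming the Ollama server is running and the model has been pulled (`ollama pull llama3`):

```python
import ollama

# Talks to the local Ollama server at http://localhost:11434 by default.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response["message"]["content"])
```
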
### Popular Local Models
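
- `llama3` — general-purpose Meta model, a good default
- `mistral` — strong small model for general tasks
- `phi3` — lightweight model for constrained hardware
- `gemma` — Google's open-weight family

Pull any of these with `ollama pull <name>`; the full catalog is at https://ollama.com/library.
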
## Provider Comparison
| Model | Best For | Cost | Speed | Quality |
|---|---|---|---|---|
| GPT-4o | Complex reasoning | High | Medium | Excellent |
| GPT-4o-mini | Balanced tasks | Medium | Fast | Good |
| Claude-3.5-Sonnet | Creative writing | High | Medium | Excellent |
| Claude-3.5-Haiku | Simple tasks | Low | Fast | Good |
| Gemini-2.5-Pro | Multimodal tasks | Medium | Medium | Good |
| Gemini-2.5-Flash | Fast responses | Low | Very Fast | Good |
| Ollama (local models) | Privacy/Offline | Free | Variable | Variable |
## Model Selection Guide
### For Different Use Cases
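
One way to encode the comparison table as a lookup, so callers pick a model by use case rather than hard-coding model IDs. The mapping below just mirrors the table above; the model IDs are examples, not fixed recommendations:

```python
# Use-case -> model mapping derived from the comparison table above.
USE_CASE_MODELS = {
    "complex_reasoning": "gpt-4o",
    "balanced": "gpt-4o-mini",
    "creative_writing": "claude-3-5-sonnet-20241022",
    "simple_tasks": "claude-3-5-haiku-20241022",
    "multimodal": "gemini-2.5-pro",
    "low_latency": "gemini-2.5-flash",
    "private_offline": "llama3",  # served locally via Ollama
}

def model_for(use_case: str) -> str:
    """Return a model ID for a use case, defaulting to the balanced option."""
    return USE_CASE_MODELS.get(use_case, USE_CASE_MODELS["balanced"])
```
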
### Performance Optimization
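
Two levers apply to every provider: cap `max_tokens` so responses stay short and cheap, and lower `temperature` for deterministic tasks (which also makes caching effective). A sketch with the `openai` SDK; the values are illustrative:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the text below."}],
    max_tokens=256,   # hard cap on output length keeps latency and cost down
    temperature=0.2,  # near-deterministic output; helps response caching
)
```
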
## Environment Configuration
### API Keys
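
The provider SDKs conventionally read these environment variables (names may differ if your `flo_ai` version defines its own):

- `OPENAI_API_KEY` — OpenAI
- `ANTHROPIC_API_KEY` — Anthropic
- `GOOGLE_API_KEY` — Google Gemini

A quick startup check that fails fast when a key is missing:

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```
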
### Python Configuration
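
For local development it is common to keep keys in a `.env` file and load them with `python-dotenv` before constructing any clients:

```python
from dotenv import load_dotenv

# Loads KEY=value pairs from a local .env file into os.environ.
# Existing environment variables are not overwritten by default.
load_dotenv()
```
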
## Advanced Configuration
### Custom Headers
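
Some deployments need extra headers on every request (tracing IDs, gateway auth). The `openai` SDK accepts a `default_headers` mapping at client construction; the header name below is a hypothetical example:

```python
from openai import OpenAI

client = OpenAI(
    default_headers={
        "X-Request-Source": "flo-ai-app",  # hypothetical tracing header
    },
)
```
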
### Retry Configuration
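
A provider-agnostic retry helper with exponential backoff and jitter, where `call` is any zero-argument function that performs the request. This is a sketch; production code should catch the specific retryable exceptions your SDK raises rather than bare `Exception`:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0) -> T:
    """Retry `call` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # narrow this to your SDK's retryable errors
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError("unreachable")
```
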
### Rate Limiting
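
A minimal client-side limiter that spaces requests evenly to stay under a requests-per-minute budget; call `wait()` before each API call:

```python
import time

class RateLimiter:
    """Spaces calls evenly to respect a requests-per-minute budget."""

    def __init__(self, requests_per_minute: int):
        self.interval = 60.0 / requests_per_minute
        self._last_call = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self._last_call + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last_call = time.monotonic()

limiter = RateLimiter(requests_per_minute=60)
# limiter.wait()  # call this before each request
```
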
## Model Switching
### Dynamic Model Selection
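
A simple runtime heuristic that routes short, simple prompts to a cheap model and longer or explicitly analytical prompts to a stronger one; the thresholds are arbitrary examples to tune for your workload:

```python
def pick_model(prompt: str) -> str:
    """Route to gpt-4o for long/analytical prompts, gpt-4o-mini otherwise."""
    needs_reasoning = len(prompt) > 2000 or "step by step" in prompt.lower()
    return "gpt-4o" if needs_reasoning else "gpt-4o-mini"

print(pick_model("What is 2 + 2?"))                       # gpt-4o-mini
print(pick_model("Explain this proof step by step ..."))  # gpt-4o
```
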
### A/B Testing
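
For A/B tests you usually want sticky assignment, so the same user always hits the same model. Hashing the user ID gives a deterministic split without storing any state; the variant IDs below are examples:

```python
import hashlib

VARIANTS = ("gpt-4o-mini", "claude-3-5-haiku-20241022")  # example arms

def assign_model(user_id: str) -> str:
    """Deterministically assign a user to one model variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

# The same user always gets the same variant across sessions:
assert assign_model("user-42") == assign_model("user-42")
```
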
## Troubleshooting
### Common Issues
#### API Key Errors
Ensure your API keys are correctly set:
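
A quick check from a Python shell; a `MISSING` result means the key is not visible to your process (re-export it or reload your `.env`):

```python
import os

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```
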
#### Rate Limiting
If you hit rate limits, implement backoff:
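
A compact version of the backoff pattern from Retry Configuration above, catching the `openai` SDK's `RateLimitError` specifically:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

for attempt in range(5):
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Hello"}],
        )
        break
    except RateLimitError:
        if attempt == 4:
            raise  # give up after five attempts
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s
```
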
#### Model Not Found
Check that the model name is correct and available in your region:
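
You can list the models your key can actually access; with the `openai` SDK:

```python
from openai import OpenAI

client = OpenAI()

# Prints every model ID available to this API key.
for model in client.models.list():
    print(model.id)
```
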
## Best Practices
### Model Selection
- Start with GPT-4o-mini for most tasks
- Use GPT-4o for complex reasoning
- Try Claude for creative tasks
- Use Gemini for multimodal or fast responses
- Use Ollama for privacy-sensitive applications
### Cost Optimization
- Use appropriate models for task complexity
- Implement caching for repeated queries (see the sketch after this list)
- Set reasonable limits on max_tokens
- Monitor usage and costs
- Use streaming for long responses
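
A minimal caching sketch for the caching bullet above, using `functools.lru_cache`. It only pays off for byte-identical prompts with deterministic settings (e.g. `temperature=0`); `ask` is a hypothetical helper standing in for your actual completion call:

```python
from functools import lru_cache

from openai import OpenAI

client = OpenAI()

@lru_cache(maxsize=1024)
def ask(model: str, prompt: str) -> str:
    """Cached completion; repeated identical calls skip the API entirely."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes caching meaningful
    )
    return response.choices[0].message.content

# A second call with the same arguments is served from the cache:
# ask("gpt-4o-mini", "Define idempotency in one sentence.")
```
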
### Performance Tips
- Batch requests when possible
- Use connection pooling for high-volume applications
- Implement retry logic with exponential backoff
- Cache responses for identical inputs
- Monitor latency and optimize accordingly

