Quickstart
Introduction to Flora
Flora is an enterprise-grade RAG platform that keeps ingestion, embeddings, retrieval, and generation inside your infrastructure boundary. This documentation helps developers and platform teams deploy, configure, and operate the full stack with confidence.
Get running in minutes
The default deployment ships as a composable stack with clear service boundaries. Start with Docker Compose for local validation, then promote to Kubernetes with Helm when you are ready for production controls.
Architecture overview
A typical Flora environment is built around independent services that can scale separately by workload profile.
- Doctor handles ingestion and normalized extraction from enterprise document sources.
- TEI generates embeddings locally and feeds vectors into Qdrant collections.
- Qdrant enforces retrieval-time constraints with metadata and role-aware filters.
- vLLM serves low-latency generation with PagedAttention for efficient GPU usage.
Installation
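Use one of the following commands to bring up Flora in a controlled environment. The first starts the stack locally with Docker Compose; the second installs or upgrades the Helm release on Kubernetes. Replace image tags and values files according to your release policy.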
docker-compose up -d --build flora-stack
helm upgrade --install flora ./infra/charts/flora \
  --namespace flora-system --create-namespace
Ingestion Pipeline
Doctor connectors can be configured for internal knowledge systems so that parsing and normalization happen entirely inside your network perimeter.
Supported formats
Out of the box, Flora supports PDF, DOCX, TXT, and structured metadata attachments for traceable ingestion.
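A pre-flight check can separate files in the supported formats from everything else before a batch is submitted. The sketch below is illustrative only: `SUPPORTED_EXTENSIONS` and `split_by_support` are hypothetical names, not part of Flora or Doctor, and the check looks only at file extensions, not file contents.

```python
from pathlib import Path

# Assumption: the extensions below mirror the formats listed above (PDF, DOCX, TXT).
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def split_by_support(paths):
    """Split file paths into (accepted, rejected) names by extension, case-insensitively."""
    accepted, rejected = [], []
    for path in map(Path, paths):
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            accepted.append(path.name)
        else:
            rejected.append(path.name)
    return accepted, rejected


accepted, rejected = split_by_support(["handbook.PDF", "notes.txt", "budget.xlsx"])
print("accepted:", accepted)
print("rejected:", rejected)
```

Logging the rejected list instead of dropping it silently makes gaps in the indexed corpus easier to trace later.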
Embeddings
Deploy TEI close to your API layer to reduce embedding latency and keep all vectorization traffic internal.
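As a rough illustration, an embedding call to a TEI server is a JSON POST to its `/embed` route with an `inputs` field. The sketch below only builds the request and does not send it. The host name `tei.flora.internal` and port are placeholders, and `build_embed_request` is a hypothetical helper, not a Flora function.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://tei.flora.internal:8080"):
    """Build (but do not send) a POST /embed request for a TEI server.

    base_url is a placeholder for an internal address; adjust to your deployment.
    """
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embed_request(["What is our travel policy?"])
print(req.get_method(), req.full_url)

# To send it from inside the network:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     vectors = json.load(resp)  # one embedding (list of floats) per input text
```

Keeping the base URL on an internal address is what keeps vectorization traffic off external networks; the client code itself does not change.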
Model selection
Choose multilingual or domain-specific models based on recall targets, memory budget, and throughput requirements.
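The memory side of this trade-off can be estimated with simple arithmetic: raw vector storage is roughly the number of vectors times the embedding dimension times the bytes per value. The sketch below assumes float32 values (4 bytes) and ignores index and payload overhead, so treat the result as a lower bound. The function name and the example dimensions are illustrative, not tied to specific models.

```python
def vector_memory_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, excluding index and payload overhead."""
    return num_vectors * dim * bytes_per_value


GIB = 1024 ** 3

# Compare hypothetical embedding sizes for a corpus of one million chunks.
for dim in (384, 768, 1024):
    size = vector_memory_bytes(1_000_000, dim)
    print(f"dim={dim}: {size / GIB:.2f} GiB")
```

Doubling the embedding dimension doubles raw vector storage, so a larger model has to earn its footprint through measurable recall gains.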
Vector Storage
Define shard and replication strategy in Qdrant according to collection size, SLA, and fault tolerance targets.
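In Qdrant, shard count and replication are set when a collection is created, through the `shard_number` and `replication_factor` fields of the `PUT /collections/{collection_name}` request body. The sketch below builds such a body. The vector size, distance metric, and the helper name `collection_config` are assumptions chosen for illustration, not Flora defaults.

```python
import json


def collection_config(vector_size, shard_number, replication_factor, distance="Cosine"):
    """Build a request body for Qdrant's PUT /collections/{collection_name}."""
    if shard_number < 1 or replication_factor < 1:
        raise ValueError("shard_number and replication_factor must be >= 1")
    return {
        "vectors": {"size": vector_size, "distance": distance},
        "shard_number": shard_number,
        "replication_factor": replication_factor,
    }


# Example values only: 768-dim vectors, 3 shards, each shard stored on 2 nodes.
config = collection_config(vector_size=768, shard_number=3, replication_factor=2)
print(json.dumps(config, indent=2))
```

A replication factor of 2 or more lets a collection keep serving reads when a single node is lost, at the cost of proportionally more storage.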
RBAC filtering
Attach role metadata at ingestion time and apply payload filters during retrieval so unauthorized chunks never reach generation.
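One way to express this with Qdrant's filter syntax is a `must` condition that matches any of the caller's roles against a role field stored in each point's payload. The sketch below builds a body for `POST /collections/{collection_name}/points/search` and includes a local check that mirrors the filter semantics. The payload key `allowed_roles` and the helper names are assumptions, not Flora's actual schema.

```python
def role_filter(user_roles):
    """Qdrant filter: keep points whose 'allowed_roles' payload contains any caller role."""
    if not user_roles:
        # Fail closed: never fall back to an unfiltered search.
        raise ValueError("caller has no roles; refusing to build an unfiltered query")
    return {"must": [{"key": "allowed_roles", "match": {"any": sorted(user_roles)}}]}


def search_body(query_vector, user_roles, limit=5):
    """Request body for a filtered vector search."""
    return {
        "vector": query_vector,
        "filter": role_filter(user_roles),
        "limit": limit,
        "with_payload": True,
    }


def is_visible(payload, user_roles):
    """Local mirror of the filter: true if the chunk shares at least one role with the caller."""
    return bool(set(payload.get("allowed_roles", [])) & set(user_roles))


body = search_body([0.1, 0.2, 0.3], {"hr", "finance"})
print(body["filter"])
print(is_visible({"allowed_roles": ["hr"]}, {"hr", "finance"}))
print(is_visible({"allowed_roles": ["legal"]}, {"hr", "finance"}))
```

Because the filter is applied inside the vector search, restricted chunks are excluded before ranking rather than trimmed from the results afterwards.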
Inference
Tune max model length, batch size, and request concurrency to stabilize latency under peak load.
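In vLLM these knobs correspond to the engine arguments `--max-model-len`, `--max-num-batched-tokens`, and `--max-num-seqs`, with `--gpu-memory-utilization` bounding how much GPU memory the server may claim. The sketch below assembles a `vllm serve` argument list. The model name and all numeric values are placeholders rather than recommended settings, and `vllm_serve_args` is a hypothetical helper.

```python
def vllm_serve_args(model, max_model_len, max_num_seqs, max_num_batched_tokens,
                    gpu_memory_utilization=0.90):
    """Assemble an argument list for `vllm serve` from the main tuning knobs."""
    if min(max_model_len, max_num_seqs, max_num_batched_tokens) < 1:
        raise ValueError("length, concurrency, and batch limits must be positive")
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),                    # longest prompt + output
        "--max-num-seqs", str(max_num_seqs),                      # concurrent sequences per step
        "--max-num-batched-tokens", str(max_num_batched_tokens),  # token budget per step
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Placeholder model name and values; tune against your own latency measurements.
args = vllm_serve_args("your-org/your-model", max_model_len=8192,
                       max_num_seqs=64, max_num_batched_tokens=16384)
print(" ".join(args))
```

Lower `--max-num-seqs` trades throughput for steadier per-request latency; change one knob at a time and measure under representative load.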
PagedAttention
PagedAttention stores the KV cache in fixed-size blocks that do not need to be contiguous in GPU memory. This reduces fragmentation and allows high-throughput serving with predictable GPU utilization.
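The idea can be illustrated with a toy allocator: each sequence holds a block table that maps its tokens to fixed-size physical blocks, a new block is claimed only when the current one fills, and blocks return to a shared pool when the sequence finishes. This is a simplified model for intuition only, not vLLM's implementation. The class name and block sizes are made up for the example.

```python
class PagedKVCache:
    """Toy model of block-based KV cache allocation (bookkeeping only, no tensors)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of the given sequence."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(16):
    cache.append_token("b")  # 16 tokens fit in 1 block
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), len(cache.free_blocks))
cache.free("a")
print(len(cache.free_blocks))
```

Because memory is claimed one block at a time, a sequence wastes at most one partially filled block instead of a reservation sized for the maximum context length.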
API Reference
The Flora API surface includes ingestion, search, answer generation, and admin endpoints. Use this section to integrate services and automate platform workflows.