Data Management on Qdrant - Vector Database

Airbyte

info@qdrant.tech (Andrey Vasnetsov) — Mon, 01 Jan 0001 00:00:00 +0000

Airbyte is an open-source data integration platform that helps you replicate your data between different systems. It has a growing list of connectors that can be used to ingest data from multiple sources. Building data pipelines is also crucial for managing the data in Qdrant, and Airbyte is a great tool for this purpose.

Airbyte can handle data ingestion from a selected source, while Qdrant lets you build a search engine on top of it. Data can be ingested into Qdrant in three supported modes:

Apache Airflow

Apache Airflow is an open-source platform for authoring, scheduling and monitoring data and computing workflows. Airflow uses Python to create workflows that can be easily scheduled and monitored.

Qdrant is available as a provider in Airflow to interface with the database.

Prerequisites

Before configuring Airflow, you need:

A Qdrant instance to connect to. You can set one up by following our installation guide.
A running Airflow instance. You can set one up by following the Airflow Quick Start Guide.
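
Before configuring the Airflow side, it can help to confirm that the Qdrant instance from the first prerequisite is reachable. The following is a minimal sketch using only the Python standard library. It assumes that Qdrant's REST API listens on http://localhost:6333 and answers on a /healthz endpoint; adjust both for your deployment. The function name qdrant_is_reachable is hypothetical and is not part of the Airflow provider.

```python
import urllib.request


def qdrant_is_reachable(base_url: str = "http://localhost:6333", timeout: float = 2.0) -> bool:
    """Return True if the (assumed) /healthz endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, DNS failures, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


if __name__ == "__main__":
    print("Qdrant reachable:", qdrant_is_reachable())
```

A check like this only verifies network reachability. It does not validate credentials or any Airflow connection settings.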

Apache Spark

Spark is a distributed computing framework designed for big data processing and analytics. The Qdrant-Spark connector enables Qdrant to be a storage destination in Spark.

Installation

To integrate the connector into your Spark environment, get the JAR file from one of the sources listed below.

GitHub Releases

The packaged JAR file with all the required dependencies is available on the project's GitHub Releases page.

Building from Source

To build the JAR from source, you need JDK 8 and Maven installed. Once these requirements are satisfied, run the following command in the project root.

Chonkie

Chonkie is a no-nonsense, ultra-light, and lightning-fast chunking library designed for RAG (Retrieval-Augmented Generation) applications.

Chonkie integrates seamlessly with Qdrant through the QdrantHandshake class, allowing you to chunk, embed, and store text data without ever leaving the Chonkie SDK.

Setup

Install Chonkie with Qdrant support:

pip install "chonkie[qdrant]"

Basic Usage

The QdrantHandshake provides a simple interface for storing and searching chunks:

from chonkie import QdrantHandshake, SemanticChunker

# Initialize handshake with custom embedding model
handshake = QdrantHandshake(
    url="http://localhost:6333",
    collection_name="my_documents",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Create and write chunks
chunker = SemanticChunker()
chunks = chunker.chunk("Your text content here...")
handshake.write(chunks)

# Search using natural language
results = handshake.search(query="your search query", limit=5)
for result in results:
    print(f"{result['score']}: {result['text']}")

Qdrant Cloud

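from chonkie import QdrantHandshake

# Connect to a Qdrant Cloud cluster instead of a local instance
handshake = QdrantHandshake(
    url="https://your-cluster.qdrant.io",
    api_key="your-api-key",
    collection_name="my_collection",
    embedding_model="BAAI/bge-small-en-v1.5",  # Change to your preferred model
)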

Complete RAG Pipeline

Build end-to-end RAG pipelines using Chonkie’s fluent Pipeline API:

CocoIndex

CocoIndex is a high-performance ETL framework that transforms data for AI, with real-time incremental processing.

Qdrant is available as a native built-in vector database to store and retrieve embeddings.

Install CocoIndex:

pip install -U cocoindex

Install Postgres with Docker Compose:

docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d

CocoIndex is a stateful ETL framework and only processes data that has changed. It uses Postgres as a metadata store to track the state of the data.
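
The idea of processing only the data that has changed can be illustrated with a short sketch. This is not CocoIndex's implementation: a plain dictionary stands in for the Postgres metadata store, and the helper names fingerprint and changed_keys are hypothetical. It only shows the general technique of comparing content fingerprints against stored state.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable SHA-256 digest of the source content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_keys(sources: dict, state: dict) -> list:
    """Return keys whose content is new or modified, updating the state in place."""
    changed = []
    for key, content in sources.items():
        digest = fingerprint(content)
        if state.get(key) != digest:
            changed.append(key)
            state[key] = digest  # remember what was processed
    return changed


# The dictionary below is a stand-in for a persistent metadata store.
state = {}
docs = {"a.md": "hello", "b.md": "world"}

first_run = changed_keys(docs, state)   # both documents are new
docs["b.md"] = "world!"
second_run = changed_keys(docs, state)  # only the edited document

print(first_run, second_run)
```

Only the keys returned by each run would need to be re-embedded and written to the vector database. Unchanged documents are skipped.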

Confluent Kafka

Built by the original creators of Apache Kafka®, Confluent Cloud is a cloud-native and complete data streaming platform available on AWS, Azure, and Google Cloud. The platform includes a fully managed, elastically scaling Kafka engine, 120+ connectors, serverless Apache Flink®, enterprise-grade security controls, and a robust governance suite.

With our Qdrant-Kafka Sink Connector, Qdrant is part of the Connect with Confluent technology partner program. It brings fully managed data streams directly to organizations from Confluent Cloud, making it easier for organizations to stream any data to Qdrant with a fully managed Apache Kafka service.

DLT

DLT (Data Load Tool)

DLT is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.

With the DLT-Qdrant integration, you can now select Qdrant as a DLT destination to load data into.

DLT Enables

Automated maintenance - with schema inference, alerts and short declarative code, maintenance becomes simple.
Run it where Python runs - on Airflow, serverless functions, notebooks. Scales on micro and large infrastructure alike.
User-friendly, declarative interface that removes knowledge obstacles for beginners while empowering senior professionals.

Usage

To get started, install dlt with the qdrant extra.

InfinyOn Fluvio

InfinyOn Fluvio is an open-source platform written in Rust for high-speed, real-time data processing. It is cloud native and designed to work with any infrastructure type, from bare-metal hardware to containerized platforms.

Usage with Qdrant

With the Qdrant Fluvio Connector, you can stream records from Fluvio topics to Qdrant collections, leveraging Fluvio’s delivery guarantees and high throughput.

Prerequisites

A Fluvio installation. You can refer to the Fluvio Quickstart for instructions.
A Qdrant server to connect to. You can set up a local instance or a free cloud instance at cloud.qdrant.io.

Downloading the connector

Run the following commands after setting up Fluvio.

Redpanda Connect

Redpanda Connect is a declarative data-agnostic streaming service designed for efficient, stateless processing steps. It offers transaction-based resiliency with back pressure, ensuring at-least-once delivery when connecting to at-least-once sources with sinks, without the need to persist messages during transit.

Connect pipelines are configured using a YAML file, which organizes components hierarchically. Each section represents a different component type, such as inputs, processors and outputs, and these can have nested child components and dynamic values.
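
The hierarchical layout described above can be sketched as a nested structure. The example below is an illustration under stated assumptions, not a verified configuration: the stdin input, the mapping processor, and the qdrant output field names (grpc_host, collection_name, id, vector_mapping, payload_mapping) are recalled from the connector documentation and should be checked against the current Redpanda Connect reference. The structure is printed as JSON, which YAML parsers generally accept; in practice you would write the same hierarchy directly in a YAML file.

```python
import json

# Each top-level key is one component type; children are nested beneath it.
config = {
    "input": {
        "stdin": {},  # assumed input: read messages from standard input
    },
    "pipeline": {
        "processors": [
            {"mapping": "root = this"},  # assumed pass-through processor
        ],
    },
    "output": {
        "qdrant": {  # field names below are assumptions to verify
            "grpc_host": "localhost:6334",
            "collection_name": "my_collection",
            "id": "root = uuid_v4()",                 # dynamic value
            "vector_mapping": "root = this.vector",    # dynamic value
            "payload_mapping": "root = this.payload",  # dynamic value
        },
    },
}

rendered = json.dumps(config, indent=2)
print(rendered)
```

The mapping strings are the dynamic values the text refers to: expressions evaluated per message rather than fixed settings.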

Unstructured

Unstructured is a library designed to help preprocess and structure unstructured text documents for downstream machine learning tasks.

Qdrant can be used as an ingestion destination in Unstructured.

Setup

Install Unstructured with the qdrant extra.

pip install "unstructured-ingest[qdrant]"

Usage

Depending on your use case, you can run Unstructured from the command line or use it within your application.

CLI

unstructured-ingest \
 local \
 --input-path $LOCAL_FILE_INPUT_DIR \
 --chunking-strategy by_title \
 --embedding-provider huggingface \
 --partition-by-api \
 --api-key $UNSTRUCTURED_API_KEY \
 --partition-endpoint $UNSTRUCTURED_API_URL \
 --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
 qdrant-cloud \
 --url $QDRANT_URL \
 --api-key $QDRANT_API_KEY \
 --collection-name $QDRANT_COLLECTION \
 --batch-size 50 \
 --num-processes 1

For a full list of the options the CLI accepts, run unstructured-ingest <upstream connector> qdrant --help
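
When the CLI is launched from a script, the hand-escaped JSON in --additional-partition-args is easy to get wrong. The sketch below assembles the same kind of argument list in Python and lets json.dumps produce the JSON value. It reuses only flags that appear in the command above and leaves out the partition-by-API flags for brevity; the function name build_ingest_command and the placeholder values are hypothetical.

```python
import json
import shlex


def build_ingest_command(input_dir, qdrant_url, qdrant_api_key, collection, partition_args):
    """Return the unstructured-ingest argument list for a local source and Qdrant Cloud."""
    return [
        "unstructured-ingest",
        "local",
        "--input-path", input_dir,
        "--chunking-strategy", "by_title",
        "--embedding-provider", "huggingface",
        # json.dumps handles quoting, so no manual backslash escaping is needed
        "--additional-partition-args", json.dumps(partition_args),
        "qdrant-cloud",
        "--url", qdrant_url,
        "--api-key", qdrant_api_key,
        "--collection-name", collection,
        "--batch-size", "50",
        "--num-processes", "1",
    ]


partition_args = {
    "split_pdf_page": "true",
    "split_pdf_allow_failed": "true",
    "split_pdf_concurrency_level": 15,
}
command = build_ingest_command(
    "./docs",                          # placeholder input directory
    "https://your-cluster.qdrant.io",  # placeholder cluster URL
    "your-api-key",                    # placeholder key; load from the environment in practice
    "my_collection",
    partition_args,
)

# shlex.join gives a shell-safe rendering for inspection.
# It includes the API key, so avoid writing it to shared logs.
preview = shlex.join(command)
```

The list can be passed to subprocess.run(command, check=True) once the CLI is installed. Passing a list rather than a single string avoids shell quoting issues.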