Qdrant Tutorial: Text Similarity Search


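What is Qdrant?

Qdrant is a high-performance search engine and database built in Rust and designed for vector similarity. It delivers fast, reliable performance even under high load, which makes it a good fit for applications that need speed and scalability. Qdrant can turn your embeddings or neural-network encoders into full applications for matching, search, recommendation, or other complex operations on large datasets. Its extended filtering support makes it well suited to faceted search and semantic matching, and its user-friendly API keeps it simple to use. Qdrant Cloud offers a managed solution that needs minimal setup and maintenance, so applications are easy to deploy and manage.

For more information about Qdrant, check out our dedicated AI technology page.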

What are we going to do?


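In this tutorial we will use the Qdrant vector database to store embeddings produced by a Cohere model and search them by cosine similarity. We will access the model through the Cohere SDK. So, without further ado, let's get started!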
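To make "cosine similarity" concrete before any database is involved, here is a minimal pure-Python sketch of the measure. The function name `cosine_similarity` and the three toy vectors are made up for illustration; real embeddings have thousands of dimensions, and Qdrant computes the score itself once a collection is configured with the cosine distance.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration only
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

print(cosine_similarity(query, same_direction))  # very close to 1
print(cosine_similarity(query, orthogonal))      # exactly 0
```

The measure depends only on direction, not on vector length, which is why two texts of very different sizes can still score as similar.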

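Prerequisites

I will use Qdrant Cloud to host my database. It is worth noting that Qdrant offers 1 GB of memory free forever, so go ahead and use Qdrant Cloud. You can learn how here.

Now let's create a new virtual environment in the project directory and install the required packages: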

    python3 -m venv venv
    source venv/bin/activate
    pip install cohere qdrant-client python-dotenv

Then create a .py file for the project.

Data


We will store the data in JSON format, in the data.json file that the script reads later. Feel free to copy it:

    [
      { "key": "Lion", "desc": "Majestic big cat with golden fur and a loud roar." },
      { "key": "Penguin", "desc": "Flightless bird with a tuxedo-like black and white coat." },
      { "key": "Gorilla", "desc": "Intelligent primate with muscular build and gentle nature." },
      { "key": "Elephant", "desc": "Large mammal with a long trunk and gray skin." },
      { "key": "Koala", "desc": "Cute and cuddly marsupial with fluffy ears and a big nose." },
      { "key": "Dolphin", "desc": "Playful marine mammal known for its intelligence and acrobatics." },
      { "key": "Orangutan", "desc": "Shaggy-haired great ape found in the rainforests of Borneo and Sumatra." },
      { "key": "Giraffe", "desc": "Tallest land animal with a long neck and spots on its fur." },
      { "key": "Hippopotamus", "desc": "Large, semi-aquatic mammal with a wide mouth and stubby legs." },
      { "key": "Kangaroo", "desc": "Marsupial with powerful hind legs and a long tail for balance." },
      { "key": "Crocodile", "desc": "Large reptile with sharp teeth and a tough, scaly hide." },
      { "key": "Chimpanzee", "desc": "Closest relative to humans, known for its intelligence and tool use." },
      { "key": "Tiger", "desc": "Striped big cat with incredible speed and agility." },
      { "key": "Zebra", "desc": "Striped mammal with a distinctive mane and tail." },
      { "key": "Ostrich", "desc": "Flightless bird with long legs and a big, fluffy tail." },
      { "key": "Rhino", "desc": "Large, thick-skinned mammal with a horn on its nose." },
      { "key": "Cheetah", "desc": "Fastest land animal with a spotted coat and sleek build." },
      { "key": "Polar Bear", "desc": "Arctic bear with a thick white coat and webbed paws for swimming." },
      { "key": "Peacock", "desc": "Colorful bird with a vibrant tail of feathers." },
      { "key": "Kangaroo", "desc": "Marsupial with powerful hind legs and a long tail for balance." },
      { "key": "Octopus", "desc": "Intelligent sea creature with eight tentacles and the ability to change color." },
      { "key": "Whale", "desc": "Enormous marine mammal with a blowhole on top of its head." },
      { "key": "Sloth", "desc": "Slow-moving mammal found in the rainforests of South America." },
      { "key": "Flamingo", "desc": "Tall, pink bird with long legs and a curved beak." }
    ]
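Before indexing, it can be worth checking the records. The list above contains the Kangaroo entry twice, and because the indexing code later in this tutorial gives every record a fresh random ID, both copies would be stored as separate points. The sketch below is one possible sanity check; `check_records` is a hypothetical helper, not part of Qdrant or Cohere, and the three-record sample is an excerpt inlined so the snippet runs on its own.

```python
import json
from collections import Counter
from typing import Dict, List


def check_records(records: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems found in the records (empty if none)."""
    problems = []
    # Every record needs a non-empty string for both fields
    for i, record in enumerate(records):
        for field in ("key", "desc"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{field}'")
    # Report keys that occur more than once
    counts = Counter(
        r.get("key") for r in records if isinstance(r.get("key"), str)
    )
    for key, n in counts.items():
        if n > 1:
            problems.append(f"key '{key}' appears {n} times")
    return problems


# A three-record excerpt of the data, inlined for illustration
sample = json.loads("""
[
  {"key": "Lion", "desc": "Majestic big cat with golden fur and a loud roar."},
  {"key": "Kangaroo", "desc": "Marsupial with powerful hind legs and a long tail for balance."},
  {"key": "Kangaroo", "desc": "Marsupial with powerful hind legs and a long tail for balance."}
]
""")

for problem in check_records(sample):
    print(problem)
```

Whether a duplicate is an error or intentional is up to you; the helper only reports it.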

Environment variables

Create a .env file and store your Cohere API key, Qdrant API key, and Qdrant host in it:

    QDRANT_API_KEY=<qdrant-api-key>
    QDRANT_HOST=<qdrant-host>
    COHERE_API_KEY=<cohere-api-key>

Importing the libraries

    import json
    import os
    import uuid
    from typing import Dict, List

    import cohere
    from dotenv import load_dotenv
    from qdrant_client import QdrantClient
    from qdrant_client.http import models

Loading the environment variables

    load_dotenv()

    QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
    QDRANT_HOST = os.getenv("QDRANT_HOST")
    COHERE_API_KEY = os.getenv("COHERE_API_KEY")
    COHERE_SIZE_VECTOR = 4096  # Larger model

    if not QDRANT_API_KEY:
        raise ValueError("QDRANT_API_KEY is not set")

    if not QDRANT_HOST:
        raise ValueError("QDRANT_HOST is not set")

    if not COHERE_API_KEY:
        raise ValueError("COHERE_API_KEY is not set")

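How do we index the data and search it later?

I will implement a SearchClient class that can index our data and query it. The class contains everything we need: indexing, searching, and converting the data into the format Qdrant expects.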

    class SearchClient:
        def __init__(
            self,
            qdrant_api_key: str = QDRANT_API_KEY,
            qdrant_host: str = QDRANT_HOST,
            cohere_api_key: str = COHERE_API_KEY,
            collection_name: str = "animal",
        ):
            self.qdrant_client = QdrantClient(host=qdrant_host, api_key=qdrant_api_key)
            self.collection_name = collection_name

            # Drops any existing collection with this name and creates a fresh one
            self.qdrant_client.recreate_collection(
                collection_name=self.collection_name,
                vectors_config=models.VectorParams(
                    size=COHERE_SIZE_VECTOR, distance=models.Distance.COSINE
                ),
            )

            self.co_client = cohere.Client(api_key=cohere_api_key)

        # Qdrant requires data in float format
        def _float_vector(self, vector: List[float]):
            return list(map(float, vector))

        # Embedding using Cohere Embed model
        def _embed(self, text: str):
            return self.co_client.embed(texts=[text]).embeddings[0]

        # Prepare Qdrant Points
        def _qdrant_format(self, data: List[Dict[str, str]]):
            points = [
                models.PointStruct(
                    id=uuid.uuid4().hex,
                    payload={"key": point["key"], "desc": point["desc"]},
                    vector=self._float_vector(self._embed(point["desc"])),
                )
                for point in data
            ]
            return points

        # Index data
        def index(self, data: List[Dict[str, str]]):
            """
            data: list of dict with keys: "key" and "desc"
            """
            points = self._qdrant_format(data)
            result = self.qdrant_client.upsert(
                collection_name=self.collection_name, points=points
            )
            return result

        # Search using text query
        def search(self, query_text: str, limit: int = 3):
            query_vector = self._embed(query_text)
            return self.qdrant_client.search(
                collection_name=self.collection_name,
                query_vector=self._float_vector(query_vector),
                limit=limit,
            )
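The class above embeds one description per API call. Since the embed call already takes a list of texts, one possible refinement is to send the descriptions in batches. The sketch below shows the batching logic only and is not part of the original tutorial: `embed_in_batches`, `chunks`, the `EmbedFn` type, and the default batch size of 32 are all assumptions, and `fake_embed` is a stand-in so the snippet runs offline without an API key.

```python
from typing import Callable, Iterator, List, Sequence

# Hypothetical type: any function mapping a list of texts to one vector per text
EmbedFn = Callable[[List[str]], List[List[float]]]


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 32
) -> List[List[float]]:
    """Embed texts batch by batch, keeping the original order."""
    vectors: List[List[float]] = []
    for batch in chunks(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        # Convert every component to float, as the tutorial's _float_vector does
        vectors.extend([float(v) for v in vec] for vec in batch_vectors)
    return vectors


# Stand-in embedder that records batch sizes instead of calling an API
calls: List[int] = []


def fake_embed(batch: List[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[len(text), 1] for text in batch]


vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2
)
print(len(vectors), calls)  # 5 vectors from batches of 2, 2 and 1
```

With the real client, `embed_fn` could be something like `lambda batch: co_client.embed(texts=batch).embeddings`, which mirrors the call used in `_embed`. Check the Cohere documentation for the maximum number of texts allowed per request before picking a batch size.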

Let's use our code!

Let's read the data from the data.json file, process it, and index it. Then we can search the database and get the top 3 results!

    if __name__ == "__main__":
        client = SearchClient()

        # import data from data.json file
        with open("data.json", "r") as f:
            data = json.load(f)

        index_result = client.index(data)
        print(index_result)

        print("====")

        search_result = client.search(
            "Tallest animal in the world, quite long neck.",
        )
        print(search_result)

The results!

    operation_id=0 status=<UpdateStatus.COMPLETED: 'completed'>
    ===
    [ScoredPoint(id='d17eb61c-8764-4bdb-bb26-ac66c3ffa220', version=0, score=0.8677041, payload={'desc': 'Tallest land animal with a long neck and spots on its fur.', 'key': 'Giraffe'}, vector=None),
     ScoredPoint(id='4934a842-8c55-42bc-938f-a839be2505de', version=0, score=0.71296465, payload={'desc': 'Large, semi-aquatic mammal with a wide mouth and stubby legs.', 'key': 'Hippopotamus'}, vector=None),
     ScoredPoint(id='05d7e73c-a8bf-44f9-a8b4-af82e06719d0', version=0, score=0.69240415, payload={'desc': 'Large, thick-skinned mammal with a horn on its nose.', 'key': 'Rhino'}, vector=None)]

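As the first line shows, the indexing operation completed successfully. We got 3 results, which matches the default limit we defined. The first one is, as expected, the giraffe. We also got the hippopotamus and the rhino. They are big too, but I think the giraffe is the tallest 😆.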
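The raw ScoredPoint output is hard to scan. As a small sketch, the hits can be reduced to a ranked list of names and scores, optionally dropping weak matches. Everything here is illustrative: `Hit` is a stand-in dataclass with only the `score` and `payload` fields seen in the output above, `summarize` is a hypothetical helper, and the 0.7 threshold is an arbitrary choice, not a recommended value.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Hit:
    """Stand-in carrying the two ScoredPoint fields used below."""
    score: float
    payload: Dict[str, Any]


def summarize(hits: List[Hit], min_score: float = 0.0) -> List[str]:
    """Format hits as 'rank. key (score)', skipping scores below min_score."""
    lines = []
    for rank, hit in enumerate(hits, start=1):
        if hit.score < min_score:
            continue
        lines.append(f"{rank}. {hit.payload['key']} ({hit.score:.3f})")
    return lines


# Scores and keys copied from the search output above
hits = [
    Hit(0.8677041, {"key": "Giraffe"}),
    Hit(0.71296465, {"key": "Hippopotamus"}),
    Hit(0.69240415, {"key": "Rhino"}),
]

for line in summarize(hits, min_score=0.7):
    print(line)
# 1. Giraffe (0.868)
# 2. Hippopotamus (0.713)
```

Because the helper only reads `score` and `payload`, the list returned by `client.search(...)` should work in place of `hits`.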

Got it, so... what's next?


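To practice your Qdrant skills, I suggest building an API that lets your application index data, add new records, and search. I think FastAPI would be a good choice for this!

If you want to try out your new skills, I suggest using them to build an AI-based application during the Cohere x Qdrant AI hackathon this weekend!

Join our community of innovators and creators shaping the future with AI, and check out our other events!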

Thank you!
