LLM: Concepts and Terminology You Need to Know to Build a Private GPT




  • GGML: Developed by Georgi Gerganov (the name combines his initials, GG, with ML for machine learning; it is often expanded as "GPT-Generated Model Language"), GGML is a tensor library for machine learning that supports large models and high performance on a range of hardware, including Apple Silicon.

  • Pros

    • Early Innovation: GGML represented an early attempt to create a file format for GPT models.

    • Single File Sharing: It enabled sharing models in a single file, enhancing convenience.

    • CPU Compatibility: GGML models could run on CPUs, broadening accessibility.

  • Cons

    • Limited Flexibility: GGML struggled with adding extra information about the model.
    • Compatibility Issues: Introduction of new features often led to compatibility problems with older models.
    • Manual Adjustments Required: Users frequently had to modify settings like rope-freq-base, rope-freq-scale, gqa, and rms-norm-eps, which could be complex.


GGUF (GPT-Generated Unified Format) was introduced on 21 August 2023 as the successor to the GGML file format. It stores a model's tensors together with key-value metadata in a single file, which makes large language models easier to store, share, and load.

  • Pros

    • Addresses GGML Limitations: GGUF is designed to overcome GGML’s shortcomings and enhance user experience.
    • Extensibility: It allows for the addition of new features while maintaining compatibility with older models.
    • Stability: GGUF focuses on eliminating breaking changes, easing the transition to newer versions.
    • Versatility: Supports various models, extending beyond the scope of llama models.
  • Cons

    • Transition Time: Converting existing models to GGUF may require significant time.
    • Adaptation Required: Users and developers must become accustomed to this new format.
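
To make the "single file with metadata" idea concrete, here is a minimal sketch of reading the fixed-size header at the start of a GGUF file. It assumes the version 2 or later layout (4-byte magic `GGUF`, then a uint32 version, a uint64 tensor count, and a uint64 metadata key-value count) in little-endian byte order. The function name `parse_gguf_header` and the sample bytes are made up for illustration; no real model file is read.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes v2+ layout, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # "<" disables padding: uint32 version, uint64 tensor count, uint64 KV count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only (values are invented).
sample_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header_info = parse_gguf_header(sample_header)
print(header_info)
```

The key-value metadata that follows this header is where settings such as rope-freq-base can be recorded, which is why GGUF users no longer need to pass them by hand.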



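Embedding: a machine learning concept for mapping data into a high-dimensional space in which semantically similar items are placed close together.
Embedding Model

  • Usually a deep neural network from BERT or another member of the Transformer family
  • Represents the semantics of text, images, and other data types as a sequence of numbers called a vector.
  • Its key property is that the mathematical distance between vectors in the high-dimensional space reflects the semantic similarity of the original text or images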
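
The "distance reflects similarity" property can be sketched with cosine similarity, one common choice of measure. The 4-dimensional vectors below are invented toy values rather than the output of any real embedding model, and the variable names are only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings: "cat" and "kitten" point in similar directions,
# "car" points elsewhere.
cat = [0.90, 0.80, 0.10, 0.00]
kitten = [0.85, 0.75, 0.20, 0.05]
car = [0.10, 0.00, 0.90, 0.80]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With these toy values the cat/kitten pair scores much higher than the cat/car pair, which is the behavior a real embedding model is trained to produce for semantically related inputs.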


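  • Dense embedding: a natural language processing technique that represents words or phrases as continuous, dense vectors in a high-dimensional space, capturing semantic relationships

    • Most embedding models represent information as floating-point vectors with hundreds to thousands of dimensions.
    • Output: a dense vector, so called because most dimensions have non-zero values.
    • For example, the popular open-source embedding model BAAI/bge-base-en-v1.5 outputs a vector of 768 floating-point numbers (a 768-dimensional float vector).
  • Sparse embedding: represents words or phrases as vectors in which most elements are zero; each non-zero element indicates the presence of a specific term in the vocabulary. Sparse embeddings are efficient and interpretable, which makes them well suited to tasks that rely on exact term matching

    • Usually has a much higher dimension (tens of thousands or more), depending on the size of the token vocabulary
    • Output: a sparse vector, so called because most dimensions have the value 0.
    • Generated in one of two ways:
      • By a deep neural network
      • By statistical analysis of a text corpus
    • Because of their interpretability and better out-of-domain generalization, sparse embeddings are increasingly adopted by developers as a complement to dense embeddings
  • Commonly used embedding functions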

    Embedding Function      Type      API or Open-sourced
    openai                  Dense     API
    sentence-transformer    Dense     Open-sourced
    bm25                    Sparse    Open-sourced
    Splade                  Sparse    Open-sourced
    bge-m3                  Hybrid    Open-sourced
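
As a rough illustration of the statistical route to sparse embeddings, the sketch below builds raw term-count vectors over a tiny invented corpus and scores a query by dot product. This is plain term counting, not BM25 or Splade; the helper names (`build_vocabulary`, `sparse_embed`, `sparse_dot`) are hypothetical.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Map every distinct lowercase token in the corpus to a dimension index."""
    tokens = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def sparse_embed(text, vocab):
    """Return {dimension index: term count}; absent dimensions are implicitly 0."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}


def sparse_dot(a, b):
    """Dot product that only visits dimensions stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[idx] for idx, value in a.items() if idx in b)


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply",
]
vocab = build_vocabulary(corpus)
doc_vectors = [sparse_embed(doc, vocab) for doc in corpus]

query_vector = sparse_embed("cat on the mat", vocab)
scores = [sparse_dot(query_vector, vec) for vec in doc_vectors]
print(len(vocab), "dimensions;", len(query_vector), "non-zero in the query")
print(scores)
```

Each non-zero dimension maps back to a concrete word, which is where the interpretability of sparse embeddings comes from. Note that the common word "the" contributes to the scores here; weighting schemes such as BM25 exist to reduce the influence of such terms.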



LlamaIndex is the leading data framework for building LLM applications



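LlamaCPP: Inference of Meta's LLaMA model (and others) in pure C/C++. It is an inference framework written in pure C/C++, originally built around Meta's LLaMA model, and is used mainly for model inference.

It was first designed for Meta's LLaMA family (for example LLaMA 2 and Code Llama) and also supports models from other vendors with different architectures, such as Falcon and Baichuan. A model must first be converted to a supported format (such as GGUF) before LlamaCPP can run inference on it.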


The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.


Poetry: a Python packaging and dependency management tool.



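ASGI (Asynchronous Server Gateway Interface) is an interface standard for communication between asynchronous Python web servers and applications. Compared with the traditional WSGI (Web Server Gateway Interface), ASGI is better suited to applications with high concurrency and real-time requirements, such as chat applications, live notifications, and online games.

  • Django ASGI: Django's ASGI support, which allows Django applications to handle requests and responses asynchronously.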
  • uvicorn: an ASGI web server implementation for Python.
  • Hypercorn is an ASGI web server based on the sans-io hyper, h11, h2, and wsproto libraries and inspired by Gunicorn. Hypercorn supports HTTP/1, HTTP/2, WebSockets (over HTTP/1 and HTTP/2), ASGI/2, and ASGI/3 specifications. Hypercorn can utilise asyncio, uvloop, or trio worker types.
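
To show what the interface looks like in practice, here is a minimal sketch of an ASGI application: a single async callable that takes `scope`, `receive`, and `send`. The `call_app` helper is a hypothetical stand-in for a real server and exists only so the example can run without installing one.

```python
import asyncio


async def app(scope, receive, send):
    """Minimal ASGI application: answer every HTTP request with plain text."""
    if scope["type"] != "http":
        return  # lifespan and websocket scopes are ignored for brevity
    body = f"Hello from {scope['path']}".encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [
            (b"content-type", b"text/plain; charset=utf-8"),
            (b"content-length", str(len(body)).encode()),
        ],
    })
    await send({"type": "http.response.body", "body": body})


async def call_app(path):
    """Drive the app by hand and collect the messages it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent


messages = asyncio.run(call_app("/ping"))
print(messages[0]["status"], messages[1]["body"])
```

In a real deployment the server (uvicorn or Hypercorn, for instance) builds the `scope` and supplies `receive` and `send`; frameworks such as FastAPI and Django expose an object with this same callable shape.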


FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.8+ based on standard Python type hints.


Chroma is the open-source embedding database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.



Chroma gives you the tools to:

  • store embeddings and their metadata
  • embed documents and queries
  • search embeddings


Chroma prioritizes:

  • simplicity and developer productivity
  • analysis on top of search
  • speed: it also happens to be very quick





Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!




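At the end of January 2024, OpenAI released new embedding models and noted that they support shortening the embedding dimension. The embeddings can be shortened without compromising their representational power because the models use Matryoshka Representation Learning.


Matryoshka Representation Learning (MRL) is a paper published in 2022 that drew wide attention after OpenAI adopted it; its co-first authors even wrote a blog post explaining how MRL works. The open-source text embedding model nomic-embed-text-v1.5 also applies MRL, so its embedding dimension can be adjusted at inference time.

  1. Matryoshka Representation Learning
  2. Blog post by the paper's authors
  3. Zhihu article; HN discussion
  4. Nomic's open-source contrastors, which implements a contrastive-learning version of MRL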
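
On the consumer side, shortening an MRL-style embedding usually amounts to keeping the leading dimensions and rescaling the result to unit length. The sketch below shows only that step, under the assumption that the vector came from a model trained with MRL; the 8-dimensional vector is invented, and `shorten_embedding` is a hypothetical helper, not part of any library.

```python
import math


def shorten_embedding(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Made-up 8-dimensional embedding, for illustration only.
full = [0.42, -0.31, 0.55, 0.10, -0.05, 0.02, 0.01, -0.03]
short = shorten_embedding(full, 4)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(short), round(short_norm, 6))
```

Truncation is only meaningful when training concentrated the most important information in the leading dimensions, which is what MRL does; truncating an embedding from a model trained without it can lose far more quality.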


