63 questions
0 votes · 0 answers · 43 views
Open-WebUI Not Detecting Model from llama.cpp
I have Open-WebUI running in Docker and connected it to the llama.cpp server API.
I followed the instructions from this URL:
https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/#step-1-...
0 votes · 0 answers · 73 views
Missing tokenizer.json file when converting model to gguf file format [closed]
When converting mistralai/Mistral-Small-3.2-24B-Instruct-2506 to GGUF (via llama_cpp), I get an error saying the tokenizer.json file is missing. After re-examining the HF repo, there is, in fact, not a ...
0 votes · 0 answers · 119 views
Problem with llama_backend_init in Flutter with llama.cpp
I'm trying to make an app with Flutter/Dart for Android.
The app I'm building is a chatbot (like ChatGPT) using a built-in model (the AI model will have a .gguf extension). So the AI model will ...
2 votes · 0 answers · 304 views
Error when trying to include llama.cpp in my C++ application on Windows using vcpkg
I am trying to write a simple application that uses llama.cpp, and I am including it in my application using CMake and vcpkg.
My CMakeLists.txt is:
cmake_minimum_required(VERSION 4.0)
if(WIN32)
File(...
0 votes · 1 answer · 473 views
can't import llama-cpp-python
I plan to install llama-cpp-python. However, I get error about "Could not find module 'D:\anaconda\Lib\site-packages\llama_cpp\lib\llama.dll' (or one of its dependencies). "
I have Microsoft ...
0 votes · 1 answer · 1k views
llama.cpp server and curl requests for multimodal models
I have llama-server up and running on a VPS with Ubuntu 24.04. I can send curl requests from an external IP and get answers for text embedding for instance. Now I want to use multimodal models through ...
3 votes · 0 answers · 214 views
Cannot run inference with images on llama-cpp-python
I am new to this. I have been trying but could not get the model to answer about images.
from llama_cpp import Llama
import torch
from PIL import Image
import base64
llm = Llama(
model_path='Holo1-...
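A minimal multimodal sketch with llama-cpp-python, assuming a LLaVA-style model whose vision projector ships as a separate mmproj GGUF; the file names here are placeholders, not the asker's actual model:

# Sketch: image chat via a LLaVA-style chat handler (paths are placeholders).
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    # Encode a local image as a base64 data URI accepted by the chat API.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embeddings
)
response = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": image_to_data_uri("photo.png")}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(response["choices"][0]["message"]["content"])

A plain Llama instance without a chat handler cannot see images, which is the usual cause of text-only answers.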
1 vote · 1 answer · 1k views
Unsloth doesn't find Llama.cpp to convert fine-tuned LLM to GGUF
I am executing on an Azure VM this notebook from the Unsloth docs:
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb
Where in the end they save the ...
0 votes · 0 answers · 345 views
Very slow inference using LLAVA with LLama.cpp vs LM Studio
I'm looking for help understanding how to correctly query LLaVA models via llama.cpp.
(Note: I had to remove direct links to not have the post auto deleted)
Here’s what I’ve done so far:
I downloaded ...
0 votes · 1 answer · 3k views
How to write chat template for llama.cpp? [closed]
I am trying to run the llama-cli tool in llama.cpp. However, I am encountering problems when talking to my model, codellama-7b-instruct.Q5_K_M.gguf. So I decided to use the conversation template. ...
-2 votes · 1 answer · 975 views
How to create prompt with /chat endpoint for llama.cpp? [closed]
I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it on llama.cpp's server, but unfortunately it is responding with really weird answers, which look like it is trying to ...
0 votes · 0 answers · 75 views
Sub-4-bit quantized model on NVIDIA GPU
I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB VRAM each) with llama.cpp. It runs with partial GPU offload, but GPU utilization is at 9-10% and ...
-1 votes · 2 answers · 604 views
Getting an error on a Windows PC while running pip install llama-cpp-python
Creating directory "llava_shared.dir\Release".
Structured output is enabled. The formatting of compiler diagnostics will reflect the error hierarchy. See https://aka.ms/cpp/structured-output ...
0 votes · 0 answers · 99 views
Generating an n-gram dataset based on an LLM
I want a dataset of common n-grams and their log likelihoods. Normally I would download the Google Books Ngram Exports, but I wonder if I can generate a better dataset using a large language model. ...
0 votes · 0 answers · 284 views
How do you enable runtime-repack in llama cpp python?
After updating llama-cpp-python, I am getting an error when trying to run an ARM-optimized GGUF model: "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking". After looking into it, the error comes from ...
1 vote · 0 answers · 300 views
Is llama able to choose tools automatically?
I run the following code expecting llama will decide whether to use the tool or not depending on my prompt:
from llama_cpp import Llama,ChatCompletionNamedToolChoice
llm = Llama(
model_path="/...
0 votes · 0 answers · 183 views
Unable to set top_k value in Llama cpp Python server
I start llama cpp Python server with the command:
python -m llama_cpp.server --model D:\Mistral-7B-Instruct-v0.3.Q4_K_M.gguf --n_ctx 8192 --chat_format functionary
Then I run my Python script which ...
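A hedged sketch of setting top_k per request instead of on the command line, assuming the llama_cpp.server OpenAI-compatible endpoint accepts the extra sampling field (host, port and prompt are placeholders):

# Sketch: pass top_k in the request body of the OpenAI-style chat endpoint.
import requests

payload = {
    "messages": [{"role": "user", "content": "Name three colors."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_k": 10,  # extra llama.cpp sampling parameter
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])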
2 votes · 1 answer · 985 views
How to use `llama-cpp-python` to output list of candidate tokens and their probabilities?
I want to choose my tokens manually, instead of letting llama-cpp-python automatically choose one for me.
This requires me to see a list of candidate next tokens, along with their probabilities, ...
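A hedged sketch using the logprobs parameter of create_completion, which returns the top candidate tokens and their log-probabilities for each generated position (the model path is a placeholder):

# Sketch: inspect candidate next tokens instead of letting the sampler pick one.
import math
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048)  # placeholder path
out = llm.create_completion(
    "The capital of France is",
    max_tokens=1,
    logprobs=10,      # top 10 candidates per generated position
    temperature=0.0,
)
top = out["choices"][0]["logprobs"]["top_logprobs"][0]  # dict: token -> logprob
for token, logprob in sorted(top.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token!r}: p={math.exp(logprob):.4f}")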
2 votes · 1 answer · 4k views
How to quantize a HF safetensors model and save it to llama.cpp GGUF format with less than q8_0 quantization?
I'm developing LLM agents using llama.cpp as inference engine. Sometimes I want to use models in safetensors format and there is a python script (https://github.com/ggerganov/llama.cpp/blob/master/...
3 votes · 1 answer · 4k views
How to use llm models downloaded with ollama with llama.cpp?
I'm considering switching from Ollama to llama.cpp, but I have a question before making the move. I've already downloaded several LLM models using Ollama, and I'm working with a low-speed internet ...
0 votes · 1 answer · 604 views
Does langchain with llama-cpp-python fail to work with very long prompts?
I'm trying to create a service using the llama3-70b model by combining langchain and llama-cpp-python on a server workstation. While the model works well with short prompts (question1, question2), it ...
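Very long prompts typically fail once they exceed the context window rather than because of LangChain itself; a minimal sketch raising n_ctx on the LlamaCpp wrapper (path and sizes are assumptions):

# Sketch: enlarge the context window so long prompts are not truncated.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="llama3-70b.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,       # must cover prompt + answer
    max_tokens=512,   # room reserved for the answer
    n_gpu_layers=-1,  # offload everything if the build has GPU support
)
print(llm.invoke("...very long prompt..."))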
1 vote · 2 answers · 763 views
Unable to make llama.cpp on M1 Mac
When I try installing llama.cpp, I get the following error:
ld: warning: ignoring file '/Users/krishparikh/Projects/LLM/llama.cpp/ggml/src/ggml-metal-embed.o': found architecture 'x86_64', required ...
0 votes · 0 answers · 194 views
Loading int8 version of llama3 from llama.cpp
I'm trying to load an 8-bit quantized version of llama3 on my local laptop (Linux) from llama.cpp, but the process is getting killed because it exceeds available memory.
Is there any way around this?
I've already ...
0 votes · 1 answer · 523 views
Long response time with llama-server (40–60sec)
I managed to run the Llama server with the following command:
./llama-server -m models/7B/ggml-model.gguf -c 2048
My request looks like this:
time curl --request POST --url http://localhost:8080/...
0 votes · 1 answer · 297 views
Prompt Template for Sequence matching using LlamaIndex and Llama3-70B-Instruct
I'm trying to get llama3-70b to find all sequences that match a given list. The list contains multiple terms (which range from one word to twelve words). I want the model to match all terms in a given ...
1 vote · 1 answer · 514 views
Why is LlamaCPP freezing during inference?
I'm using the following code to try and receive a response from LlamaCPP, used through the LlamaIndex library. My model is stored locally in a gguf file. I'm trying to do inference on the CPU as my ...
0 votes · 0 answers · 184 views
How to deploy a finetuned model on a private server?
I have a project where I need to fine-tune a Large Language Model (LLM) such as LLAMA3 for a specific task and then deploy it on the company's server as a chatbot to recommend 'questionnaires / ...
0 votes · 1 answer · 1k views
Pass raw prompt in Llama 3 prompting format to llama.cpp webserver
I have an application running a local llama.cpp instance with llama 3. LLama 3 uses a new prompting format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a ...
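A hedged sketch of sending the raw Llama 3-formatted prompt to the server's /completion endpoint, so no template is applied on the server side (host, port and stop handling are assumptions):

# Sketch: build the Llama 3 prompt by hand and post it as a raw completion request.
import requests

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Why is the sky blue?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
r = requests.post(
    "http://localhost:8080/completion",  # default llama-server port
    json={"prompt": prompt, "n_predict": 256, "stop": ["<|eot_id|>"]},
)
print(r.json()["content"])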
0 votes · 1 answer · 651 views
How to get the response from the AI Model
I adapted this code from https://www.datacamp.com/tutorial/llama-cpp-tutorial
from llama_cpp import Llama
# GLOBAL VARIABLES
my_model_path = "./model/zephyr-7b-beta.Q4_0.gguf"
CONTEXT_SIZE = ...
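For reference, a minimal sketch of reading the generated text out of the completion dictionary that Llama's call interface returns (parameters only loosely mirror the tutorial):

# Sketch: the call returns a dict; the text lives under choices[0]["text"].
from llama_cpp import Llama

llm = Llama(model_path="./model/zephyr-7b-beta.Q4_0.gguf", n_ctx=512)
result = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"].strip())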
0 votes · 1 answer · 661 views
Unable to send multiple inputs using LlamaCPP and llama-index
I am using the Mistral 7B-Instruct model with llama-index and loading it using LlamaCPP, and when I try to run multiple inputs or prompts (open 2 websites and send 2 prompts), it gives me ...
0 votes · 1 answer · 272 views
Is there a way to automate formatting terminal input when talking to an LLM model?
Currently when the user sends input to the LLM it has to be in the format of:
<|im_start|>user\
User input here.<|im_end|>
This is very cumbersome and makes prompt engineering very ...
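A minimal sketch of wrapping raw terminal input in the ChatML markers automatically so the user never types them (the system prompt is an assumption):

# Sketch: wrap raw terminal input in ChatML markers before sending it to the model.
def to_chatml(user_text: str, system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

while True:
    raw = input("> ")
    if raw.strip().lower() in {"quit", "exit"}:
        break
    prompt = to_chatml(raw)
    # hand `prompt` to the model here, e.g. llm(prompt, stop=["<|im_end|>"])
    print(prompt)

With llama-cpp-python, create_chat_completion applies the model's chat template itself, which avoids hand-building these markers at all.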
2 votes · 2 answers · 4k views
Detecting GPU availability in llama-cpp-python
Question
How can I programmatically check if llama-cpp-python is installed with support for a CUDA-capable GPU?
Context
In my program, I am trying to warn the developers when they fail to configure ...
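A hedged sketch, assuming the installed llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding, which reports whether the compiled backend can offload layers to a GPU:

# Sketch: ask the compiled backend whether GPU offload is available.
# llama_supports_gpu_offload is a low-level binding; its availability depends
# on the installed llama-cpp-python version (assumption).
from llama_cpp import llama_cpp as _lib

def has_gpu_support() -> bool:
    return bool(_lib.llama_supports_gpu_offload())

if not has_gpu_support():
    print("Warning: llama-cpp-python was built without GPU offload support.")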
1 vote · 0 answers · 872 views
How can I get just the main answer from llama-3-8B-Instruct and stop it talking to itself?
I want to use llama-3 with llama-cpp-python and get just the main answer to user questions, like llama-2.
But the answers generated by llama-3 are not just the main answer, as with llama-2:
Output: Hey! 👋 What can I help you ...
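Rambling usually means the Llama 3 end-of-turn token is not being honored; a hedged sketch pinning the chat format and stop token (the model path is a placeholder):

# Sketch: use the llama-3 chat format and its end-of-turn token so generation
# stops after the assistant's answer instead of continuing the conversation.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    chat_format="llama-3",
    n_ctx=4096,
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hey, how are you?"}],
    max_tokens=256,
    stop=["<|eot_id|>"],  # Llama 3 end-of-turn token
)
print(resp["choices"][0]["message"]["content"])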
3 votes · 1 answer · 4k views
Llama.cpp GPU Offloading Issue - Unexpected Switch to CPU
I'm reaching out to the community for some assistance with an issue I'm encountering in llama.cpp. Previously, the program was successfully utilizing the GPU for execution. However, recently, it seems ...
2 votes · 1 answer · 4k views
Running Local LLMs in Production and handling multiple requests
I am trying to run a RAG setup with the Gemma LLM locally. It is running fine, but the problem is I can't handle more than one request at a time.
Is there a way to handle concurrent requests while utilizing resources ...
1 vote · 1 answer · 3k views
Is there any method to fully load the GGUF models on GPU
I have been using LlamaCPP to load my LLM models; the llama-index library provides methods to offload some layers onto the GPU. Why does it not provide any method to fully load the model on the GPU? If ...
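A hedged sketch: the llama-index LlamaCPP wrapper forwards model_kwargs to llama-cpp-python, so n_gpu_layers=-1 requests offloading every layer; the path is a placeholder and the underlying build must have CUDA/Metal support:

# Sketch: request full GPU offload by passing n_gpu_layers=-1 through model_kwargs.
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="model.gguf",            # placeholder path
    model_kwargs={"n_gpu_layers": -1},  # -1 = offload all layers
    context_window=4096,
    max_new_tokens=256,
)
print(llm.complete("Hello").text)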
0 votes · 0 answers · 241 views
Kernel dying repeatedly while initializing llm in local
Hi, I'm using Jupyter Notebook and trying to create an instance of llama-2-7b-chat.q4_K_M.gguf (a quantized model) from Hugging Face. I'm running the following code:
from langchain_community.llms ...
1 vote · 0 answers · 3k views
I have a problem using n_gpu_layers in the llama_cpp Llama function
I am attempting to load the Zephyr model into llama_cpp Llama, and while everything functions correctly, the performance is slow. The GPU appears to be underutilized, especially when compared to its ...
1 vote · 3 answers · 2k views
Inconsistent completion for identical prompts and params with llama.cpp python and ctransformer
I've been comparing various langchain compatible llama2 runtimes, using langchain llm chain.
Having the following parameter overrides:
# llama.cpp:
model_path="../llama.cpp/models/generated/...
0 votes · 0 answers · 1k views
Gradio UI is not displaying/working properly; how do I fix that?
I followed these instructions to install privateGPT:
git clone https://github.com/imartinez/privateGPT.git
conda create -n privategpt python=3.11
conda activate privategpt
#loading modules
...
0 votes · 1 answer · 6k views
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based project
While installing llama-cpp-python in VS Code I am getting an error:
4 Warning(s)
257 Error(s)
Time Elapsed 00:00:02.56
*** CMake build failed
[end of output]
...
1 vote · 1 answer · 2k views
llama-cpp-python Log printing on Ubuntu
I use llama-cpp-python to run LLMs locally on Ubuntu. While generating responses, it prints its logs.
How do I stop it from printing logs?
I found a way to stop log printing for llama.cpp but not for llama-...
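A minimal sketch: constructing Llama with verbose=False silences most of llama-cpp-python's log output (the model path is a placeholder):

# Sketch: verbose=False suppresses the load/inference logs printed to stderr.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)
print(llm("Hello, world!", max_tokens=16)["choices"][0]["text"])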
1 vote · 1 answer · 6k views
Streaming local LLM with FastAPI, Llama.cpp and Langchain
I have set up FastAPI with llama.cpp and LangChain. Now I want to enable streaming in the FastAPI responses. Streaming works with llama.cpp in my terminal, but I wasn't able to implement it with a ...
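A hedged sketch of streaming llama-cpp-python output through FastAPI's StreamingResponse, bypassing LangChain for the streaming path (model path and route are assumptions):

# Sketch: stream tokens from llama-cpp-python through a FastAPI endpoint.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="model.gguf", n_ctx=2048)  # placeholder path

def token_stream(prompt: str):
    # stream=True yields partial completion chunks as they are generated
    for chunk in llm(prompt, max_tokens=256, stream=True):
        yield chunk["choices"][0]["text"]

@app.get("/generate")
def generate(prompt: str):
    return StreamingResponse(token_stream(prompt), media_type="text/plain")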
1 vote · 0 answers · 568 views
llama.cpp conversion of fine-tuned HF (Hugging Face) model fails for the LLaMA2-7B model
I use the simple https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py with some custom data and llama-2-7b-hf as the base model. Post-training, it invokes trainer.save_model and the ...
0 votes · 2 answers · 2k views
CMAKE in requirements.txt file: Install llama-cpp-python for Mac
I have put my application into a Docker container and therefore I have created a requirements.txt file. Now I need to install llama-cpp-python for Mac, as I am loading my LLM with from langchain.llms import ...
1 vote · 0 answers · 1k views
llama-cpp-python on GPU: Delay between prompt submission and first token generation with longer prompts
I've been building a RAG pipeline using the llama-cpp-python OpenAI compatible server functionality and have been working my way up from running on just a laptop to running this on a dedicated ...
0 votes · 2 answers · 1k views
Persist VectorStoreIndex (LlamaIndex) locally
I am trying to run this
import logging
import sys
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
import torch
from llama_index.llms import LlamaCPP
from llama_index....
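A hedged sketch of persisting and reloading the index with the same legacy llama_index imports the snippet uses; the persist directory is an assumption and the LLM/embedding wiring via ServiceContext is omitted for brevity:

# Sketch: persist the index to disk once, then reload it instead of re-embedding.
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # assumed location

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=PERSIST_DIR)

# Later (or in another process): reload without rebuilding the embeddings.
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
print(query_engine.query("What is in my documents?"))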
0 votes · 1 answer · 491 views
langchain with llama2 local slow inference
I am using Langchain with llama-2-13B. I have set up llama2 on an AWS machine with 240 GB RAM and 4x16 GB Tesla V100 GPUs. It takes around 20 s to make an inference. I want to make it faster, ...
4 votes · 1 answer · 4k views
No GPU support while running llama-cpp-python inside a docker container
I'm trying to run llama index with llama cpp by following the installation docs but inside a docker container.
Following this repo for installation of llama_cpp_python==0.2.6.
DOCKERFILE
# Use the ...
2 votes · 0 answers · 1k views
llama-index: multiple calls to query_engine.query always gives "Empty Response"
I have the following code that works as expected
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
llm = LlamaCPP(model_url=...