0 votes
0 answers
43 views

I have Open-WebUI running in Docker and connected it to the llama.cpp server API. I followed the instructions from this URL: https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/#step-1-...
asked by Namasivayam Chinnapillai
0 votes
0 answers
73 views

When converting mistralai/Mistral-Small-3.2-24B-Instruct-2506 to GGUF (via llama_cpp), I get an error saying the tokenizer.json file is missing. After re-examining the HF repo, there, in fact, is not a ...
asked by s3dev • 9,871
0 votes
0 answers
119 views

I'm trying to make an app with Flutter/Dart for Android. The app I will make is a chatbot (like ChatGPT) using a built-in model (the AI model will have a .gguf extension). So the AI model will ...
asked by Fajar Alam
2 votes
0 answers
304 views

I am trying to write a simple application that uses llama.cpp, and I am including it in my application using CMake and vcpkg. My CMakeLists.txt is: cmake_minimum_required(VERSION 4.0) if(WIN32) File(...
asked by mans • 18.4k
0 votes
1 answer
473 views

I plan to install llama-cpp-python. However, I get an error about "Could not find module 'D:\anaconda\Lib\site-packages\llama_cpp\lib\llama.dll' (or one of its dependencies)." I have Microsoft ...
asked by PuiHean
0 votes
1 answer
1k views

I have llama-server up and running on a VPS with Ubuntu 24.04. I can send curl requests from an external IP and get answers for text embedding for instance. Now I want to use multimodal models through ...
asked by user3102556
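A minimal sketch of what a multimodal request to llama-server could look like, assuming the server was started with a vision-capable GGUF plus its projector file (e.g. --mmproj) and exposes the OpenAI-compatible /v1/chat/completions endpoint; the host, port, and file names below are placeholders.

    # Sketch: send an image to llama-server's OpenAI-compatible chat endpoint.
    # Assumes the server was launched with a multimodal model and projector, e.g.:
    #   llama-server -m model.gguf --mmproj mmproj.gguf --host 0.0.0.0 --port 8080
    import base64
    import requests

    with open("photo.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }
        ],
        "max_tokens": 256,
    }

    resp = requests.post("http://YOUR_VPS_IP:8080/v1/chat/completions",
                         json=payload, timeout=120)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])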
3 votes
0 answers
214 views

I am new to this. I have been trying but could not make the model answer questions about images. from llama_cpp import Llama import torch from PIL import Image import base64 llm = Llama( model_path='Holo1-...
asked by Abhash Rai
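For reference, the usual LLaVA-style pattern for local image Q&A in llama-cpp-python is a sketch like the one below; it assumes the model ships a separate CLIP/projector GGUF and that the installed version provides Llava15ChatHandler. File names are placeholders.

    # Sketch: LLaVA-style image Q&A with llama-cpp-python (paths are placeholders).
    import base64
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava15ChatHandler

    chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
    llm = Llama(
        model_path="llava-v1.5-7b.Q4_K_M.gguf",
        chat_handler=chat_handler,
        n_ctx=4096,          # room for the image tokens plus the answer
    )

    with open("image.jpg", "rb") as f:
        data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    out = llm.create_chat_completion(messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "What is shown in this image?"},
        ]},
    ])
    print(out["choices"][0]["message"]["content"])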
1 vote
1 answer
1k views

I am executing this notebook from the Unsloth docs on an Azure VM: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb where at the end they save the ...
asked by rikyeah • 2,118
0 votes
0 answers
345 views

I'm looking for help understanding how to correctly query LLaVA models via llama.cpp. (Note: I had to remove direct links to not have the post auto deleted) Here’s what I’ve done so far: I downloaded ...
asked by Sandor Mezil
0 votes
1 answer
3k views

I am trying to run the llama-cli tool in llama.cpp. However, I am encountering problems when talking to my model codellama-7b-instruct.Q5_K_M.gguf, so I decided to use the conversation template. ...
asked by Dominik Szkotland
-2 votes
1 answer
975 views

I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it with llama.cpp's server, but unfortunately it is responding with really weird answers, which look like it is trying to ...
asked by Dominik Szkotland
0 votes
0 answers
75 views

I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB VRAM each) with llama.cpp. It runs with partial GPU offload, but GPU utilization is at 9-10% and ...
asked by afsara_ben
-1 votes
2 answers
604 views

Creating directory "llava_shared.dir\Release". Structured output is enabled. The formatting of compiler diagnostics will reflect the error hierarchy. See https://aka.ms/cpp/structured-output ...
asked by sandeep • 161
0 votes
0 answers
99 views

I want a dataset of common n-grams and their log likelihoods. Normally I would download the Google Books Ngram Exports, but I wonder if I can generate a better dataset using a large language model. ...
asked by evashort
0 votes
0 answers
284 views

After updating llama-cpp-python I am getting an error when trying to run an ARM-optimized GGUF model: "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking". After looking into it, the error comes from ...
asked by ekcrisp • 1,931
1 vote
0 answers
300 views

I run the following code expecting llama will decide whether to use the tool or not depending on my prompt: from llama_cpp import Llama,ChatCompletionNamedToolChoice llm = Llama( model_path="/...
asked by VanechikSpace
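A rough sketch of letting the model decide whether to call a tool: keep tool_choice as "auto" rather than a named tool, and use a chat format with function-calling support (chatml-function-calling is assumed here). The model path and tool schema are placeholders.

    # Sketch: let the model decide whether to call the tool (tool_choice="auto").
    from llama_cpp import Llama

    llm = Llama(
        model_path="/path/to/model.gguf",          # placeholder
        chat_format="chatml-function-calling",     # a format with tool-call support
        n_ctx=4096,
    )

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=tools,
        tool_choice="auto",   # "auto" lets the model skip the tool for unrelated prompts
    )
    print(out["choices"][0]["message"])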
0 votes
0 answers
183 views

I start the llama-cpp-python server with the command: python -m llama_cpp.server --model D:\Mistral-7B-Instruct-v0.3.Q4_K_M.gguf --n_ctx 8192 --chat_format functionary Then I run my Python script, which ...
asked by Jengi829
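For context, the llama_cpp.server process exposes an OpenAI-compatible API (port 8000 by default), so a client script along these lines is the usual pattern; the port, model name, and messages below are assumptions.

    # Sketch: talk to the llama_cpp.server OpenAI-compatible endpoint (default port 8000).
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="not-needed",      # the local server does not check the key
    )

    resp = client.chat.completions.create(
        model="local",             # the single-model server accepts any model name
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)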
2 votes
1 answer
985 views

I want to choose my tokens manually, instead of letting llama-cpp-python automatically choose one for me. This requires me to see a list of candidate next tokens, along with their probabilities, ...
asked by caveman • 464
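One way to see candidate next tokens with their probabilities in llama-cpp-python is the completion-style logprobs parameter, sketched below; the model path is a placeholder, and the exact shape of the returned dict may vary between versions.

    # Sketch: inspect the top candidate next tokens instead of sampling blindly.
    import math
    from llama_cpp import Llama

    llm = Llama(model_path="/path/to/model.gguf", n_ctx=2048)  # placeholder path

    out = llm(
        "The capital of France is",
        max_tokens=1,      # only generate one step
        logprobs=10,       # ask for the 10 most likely tokens at that step
        temperature=0.0,
    )

    top = out["choices"][0]["logprobs"]["top_logprobs"][0]   # {token: logprob}
    for token, logprob in sorted(top.items(), key=lambda kv: -kv[1]):
        print(f"{token!r}: p={math.exp(logprob):.4f}")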
2 votes
1 answer
4k views

I'm developing LLM agents using llama.cpp as the inference engine. Sometimes I want to use models in safetensors format, and there is a Python script (https://github.com/ggerganov/llama.cpp/blob/master/...
asked by arkuzo • 41
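A minimal sketch of driving that conversion from Python, assuming a recent llama.cpp checkout where the script is named convert_hf_to_gguf.py (older trees call it convert-hf-to-gguf.py); the paths are placeholders.

    # Sketch: convert a Hugging Face safetensors checkpoint to GGUF via llama.cpp's script.
    import subprocess
    import sys

    subprocess.run(
        [
            sys.executable,
            "llama.cpp/convert_hf_to_gguf.py",   # script name differs in older checkouts
            "/path/to/hf-model-dir",             # directory with config.json + *.safetensors
            "--outfile", "model-f16.gguf",
            "--outtype", "f16",                  # quantize afterwards with llama-quantize
        ],
        check=True,
    )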
3 votes
1 answer
4k views

I'm considering switching from Ollama to llama.cpp, but I have a question before making the move. I've already downloaded several LLM models using Ollama, and I'm working with a low-speed internet ...
asked by Monet Geoffrey
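If it helps, Ollama stores downloaded weights as plain GGUF blobs, so a sketch like this can locate the file for llama.cpp to reuse; it assumes Ollama's default storage layout under ~/.ollama/models, which may differ on your system.

    # Sketch: find the GGUF blob Ollama downloaded for a given model/tag.
    # Assumes the default layout: manifests under ~/.ollama/models/manifests,
    # blobs under ~/.ollama/models/blobs (named sha256-<digest>).
    import json
    from pathlib import Path

    model, tag = "llama3", "latest"          # placeholders
    root = Path.home() / ".ollama" / "models"
    manifest = root / "manifests" / "registry.ollama.ai" / "library" / model / tag

    layers = json.loads(manifest.read_text())["layers"]
    digest = next(l["digest"] for l in layers
                  if l["mediaType"].endswith("image.model"))   # the weights layer
    blob = root / "blobs" / digest.replace(":", "-")
    print(blob)   # pass this path to llama.cpp, e.g. llama-cli -m <blob>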
0 votes
1 answer
604 views

I'm trying to create a service using the llama3-70b model by combining langchain and llama-cpp-python on a server workstation. While the model works well with short prompts (question1, question2), it ...
asked by bibiibibin
1 vote
2 answers
763 views

When I try installing Llama.cpp, I get the following error: ld: warning: ignoring file '/Users/krishparikh/Projects/LLM/llama.cpp/ggml/src/ggml-metal-embed.o': found architecture 'x86_64', required ...
asked by Krish Parikh
0 votes
0 answers
194 views

I'm trying to load an 8-bit quantized version of llama3 on my local laptop (Linux) from llama.cpp, but the process is getting killed because it exceeds available memory. Is there any way around this? I've already ...
asked by Anagha • 1
0 votes
1 answer
523 views

I managed to run the Llama server with the following command: ./llama-server -m models/7B/ggml-model.gguf -c 2048 My request looks like this: time curl --request POST --url http://localhost:8080/...
asked by didinko • 572
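For reference, the equivalent request from Python against llama-server's native /completion endpoint looks roughly like this; the prompt and n_predict values are placeholders.

    # Sketch: POST to llama-server's native /completion endpoint from Python.
    import requests

    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": "Building a website can be done in 10 simple steps:",
            "n_predict": 128,        # number of tokens to generate
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["content"])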
0 votes
1 answer
297 views

I'm trying to get llama3-70b to find all sequences that match a given list. The list contains multiple terms (which range from one word to twelve words). I want the model to match all terms in a given ...
asked by joshpopelka20
1 vote
1 answer
514 views

I'm using the following code to try and receive a response from LlamaCPP, used through the LlamaIndex library. My model is stored locally in a gguf file. I'm trying to do inference on the CPU as my ...
asked by Calder Johnson
0 votes
0 answers
184 views

I have a project where I need to fine-tune a Large Language Model (LLM) such as LLAMA3 for a specific task and then deploy it on the company's server as a chatbot to recommend 'questionnaires / ...
asked by Akram H • 71
0 votes
1 answer
1k views

I have an application running a local llama.cpp instance with Llama 3. Llama 3 uses a new prompting format: <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a ...
asked by user2741831 • 2,482
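One way to avoid assembling the Llama 3 special-token prompt by hand is to let the library apply the template, sketched here for llama-cpp-python; chat_format="llama-3" is available in recent versions, and the model path is a placeholder.

    # Sketch: let llama-cpp-python apply the Llama 3 prompt template for you.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/path/to/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder
        chat_format="llama-3",   # builds the <|begin_of_text|>/<|start_header_id|> layout
        n_ctx=8192,
    )

    out = llm.create_chat_completion(messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a GGUF file is in two sentences."},
    ])
    print(out["choices"][0]["message"]["content"])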
0 votes
1 answer
651 views

I adapted this code from https://www.datacamp.com/tutorial/llama-cpp-tutorial from llama_cpp import Llama # GLOBAL VARIABLES my_model_path = "./model/zephyr-7b-beta.Q4_0.gguf" CONTEXT_SIZE = ...
asked by nicomp • 4,741
0 votes
1 answer
661 views

I am using the Mistral 7B-Instruct model with llama-index and load the model using LlamaCPP, and when I try to run multiple inputs or prompts (open 2 websites and send 2 prompts), it gives me ...
asked by HelloALive
0 votes
1 answer
272 views

Currently when the user sends input to the LLM it has to be in the format of: <|im_start|>user\ User input here.<|im_end|> This is very cumbersome and makes prompt engineering very ...
asked by Fred Åberg
2 votes
2 answers
4k views

Question How can I programmatically check if llama-cpp-python is installed with support for a CUDA-capable GPU? Context In my program, I am trying to warn the developers when they fail to configure ...
asked by Programmer.zip
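One approach, assuming your installed llama-cpp-python exposes the llama_supports_gpu_offload binding from the underlying C API, is a check along these lines; treat it as a heuristic about how the library was built, not proof that a usable GPU is present.

    # Sketch: warn developers when llama-cpp-python appears to be a CPU-only build.
    import llama_cpp

    def gpu_offload_build() -> bool:
        # llama_supports_gpu_offload() reflects how the bundled llama.cpp was compiled
        # (CUDA/Metal/etc.); it does not check for an actual GPU at runtime.
        fn = getattr(llama_cpp, "llama_supports_gpu_offload", None)
        return bool(fn()) if fn is not None else False

    if not gpu_offload_build():
        print("llama-cpp-python appears to be a CPU-only build; "
              "reinstall with CMAKE_ARGS='-DGGML_CUDA=on' for GPU offload.")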
1 vote
0 answers
872 views

I want to use llama-3 with llama-cpp-python and get the main answer to user questions, as with llama-2. But the answers generated by llama-3 are not the main answer like llama-2's: Output: Hey! 👋 What can I help you ...
asked by Dalipboy M
3 votes
1 answer
4k views

I'm reaching out to the community for some assistance with an issue I'm encountering in llama.cpp. Previously, the program was successfully utilizing the GPU for execution. However, recently, it seems ...
asked by Montassar Jaziri
2 votes
1 answer
4k views

I am trying to run a RAG with the Gemma LLM locally; it is running fine, but the issue is that I can't handle more than one request at a time. Is there a way to handle concurrent requests while utilizing resources ...
asked by khalidwalamri
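A rough sketch of one way to serve several users at once: run llama-server with multiple parallel slots and fan requests out from the client. The flags, port, and slot count here are assumptions to adapt.

    # Sketch: issue concurrent requests against a llama-server started with parallel slots,
    # e.g.:  llama-server -m model.gguf -c 8192 --parallel 4 --cont-batching
    from concurrent.futures import ThreadPoolExecutor
    import requests

    PROMPTS = [f"Summarise document {i} in one sentence." for i in range(8)]

    def ask(prompt: str) -> str:
        r = requests.post("http://localhost:8080/completion",
                          json={"prompt": prompt, "n_predict": 64}, timeout=120)
        r.raise_for_status()
        return r.json()["content"]

    with ThreadPoolExecutor(max_workers=4) as pool:   # match the server's slot count
        for answer in pool.map(ask, PROMPTS):
            print(answer)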
1 vote
1 answer
3k views

I have been using LlamaCPP to load my LLM models; the llama-index library provides methods to offload some layers onto the GPU. Why does it not provide any method to fully load the model on the GPU? If ...
asked by Shighra Sahil
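For what it's worth, LlamaIndex's LlamaCPP wrapper forwards extra keyword arguments to llama_cpp.Llama, so full offload is usually requested via model_kwargs as sketched below; the import path and model path depend on your llama-index version and are assumptions here.

    # Sketch: ask llama-index's LlamaCPP wrapper to put every layer on the GPU.
    from llama_index.llms.llama_cpp import LlamaCPP   # older releases: from llama_index.llms import LlamaCPP

    llm = LlamaCPP(
        model_path="/path/to/model.gguf",          # placeholder
        context_window=4096,
        max_new_tokens=256,
        model_kwargs={"n_gpu_layers": -1},         # -1 = offload all layers (needs a GPU build)
    )
    print(llm.complete("Hello!").text)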
0 votes
0 answers
241 views

Hi, I'm using Jupyter Notebook and trying to create an instance of llama-2-7b-chat.q4_K_M.gguf (this is a quantized model) from Hugging Face. I'm running the following code: from langchain_community.llms ...
asked by Avish Wagde
1 vote
0 answers
3k views

I am attempting to load the Zephyr model into llama_cpp Llama, and while everything functions correctly, the performance is slow. The GPU appears to be underutilized, especially when compared to its ...
asked by reach • 21
1 vote
3 answers
2k views

I've been comparing various langchain-compatible llama2 runtimes, using a langchain llm chain, with the following parameter overrides: # llama.cpp: model_path="../llama.cpp/models/generated/...
asked by JayabalanAaron
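For comparison, a typical LangChain LlamaCpp configuration looks something like the sketch below; the path and parameter values are placeholders, and n_gpu_layers only matters for a GPU-enabled build.

    # Sketch: a typical LangChain LlamaCpp setup for comparing llama.cpp runtimes.
    from langchain_community.llms import LlamaCpp

    llm = LlamaCpp(
        model_path="../llama.cpp/models/generated/model.gguf",  # placeholder
        n_ctx=4096,          # context window
        n_batch=512,         # prompt-processing batch size
        n_gpu_layers=-1,     # offload all layers when built with GPU support
        temperature=0.2,
        verbose=False,
    )
    print(llm.invoke("Name three uses of quantized models."))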
0 votes
0 answers
1k views

I followed these instructions to install privateGPT: git clone https://github.com/imartinez/privateGPT.git conda create -n privategpt python=3.11 conda activate privategpt #loading modules ...
asked by anikaM • 429
0 votes
1 answer
6k views

While installing llama-cpp-python in VS Code I am getting an error: 4 Warning(s) 257 Error(s) Time Elapsed 00:00:02.56 *** CMake build failed [end of output] ...
asked by Kaustubh Ratna
1 vote
1 answer
2k views

I use llama-cpp-python to run LLMs locally on Ubuntu. While generating responses it prints its logs. How do I stop the log printing? I found a way to stop log printing for llama.cpp but not for llama-...
asked by San Vik • 11
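In llama-cpp-python the usual switch is the verbose flag on the Llama constructor, sketched below with a placeholder model path. Any remaining C-level load messages can additionally be silenced through the low-level llama_log_set binding, if your installed version exposes it.

    # Sketch: quiet down llama-cpp-python's logging.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/path/to/model.gguf",   # placeholder
        verbose=False,                      # suppresses the per-call performance logs
    )
    print(llm("Hello", max_tokens=16)["choices"][0]["text"])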
1 vote
1 answer
6k views

I have set up FastAPI with Llama.cpp and Langchain. Now I want to enable streaming in the FastAPI responses. Streaming works with Llama.cpp in my terminal, but I wasn't able to implement it with a ...
asked by Maxl Gemeinderat
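A minimal sketch of streaming llama-cpp-python output through FastAPI without LangChain in the middle; the model path is a placeholder, and the client is assumed to read the response incrementally.

    # Sketch: stream tokens from llama-cpp-python through a FastAPI endpoint.
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from llama_cpp import Llama

    app = FastAPI()
    llm = Llama(model_path="/path/to/model.gguf", n_ctx=2048, verbose=False)  # placeholder

    @app.get("/generate")
    def generate(prompt: str):
        def token_stream():
            # stream=True makes the call return an iterator of partial completions
            for chunk in llm(prompt, max_tokens=256, stream=True):
                yield chunk["choices"][0]["text"]
        return StreamingResponse(token_stream(), media_type="text/plain")

    # Run with:  uvicorn app:app --reload
    # Then:      curl -N "http://127.0.0.1:8000/generate?prompt=Hi"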
1 vote
0 answers
568 views

I use the simple https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py with some custom data and llama-2-7b-hf as the base model. Post-training, it invokes trainer.save_model and the ...
asked by Vikram Murthy
0 votes
2 answers
2k views

I have put my application into a Docker and therefore I have created a requirements.txt file. Now I need to install llama-cpp-python for Mac, as I am loading my LLM with from langchain.llms import ...
asked by Maxl Gemeinderat
1 vote
0 answers
1k views

I've been building a RAG pipeline using the llama-cpp-python OpenAI compatible server functionality and have been working my way up from running on just a laptop to running this on a dedicated ...
asked by jhthompson12
0 votes
2 answers
1k views

I am trying to run this import logging import sys from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext import torch from llama_index.llms import LlamaCPP from llama_index....
asked by 6core • 1
0 votes
1 answer
491 views

I am using Langchain with llama-2-13B. I have set up llama2 on an AWS machine with 240 GB RAM and 4×16 GB Tesla V100 GPUs. It takes around 20 seconds to make an inference. I want to make it faster, ...
asked by Muhammad Muneeb Ur Rahman
4 votes
1 answer
4k views

I'm trying to run llama index with llama cpp by following the installation docs but inside a docker container. Following this repo for installation of llama_cpp_python==0.2.6. DOCKERFILE # Use the ...
asked by Pratyush
2 votes
0 answers
1k views

I have the following code that works as expected model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf" llm = LlamaCPP(model_url=...
asked by Jamie Dixon • 4,302