63 questions
0 votes · 0 answers · 43 views
Open-WebUI Not Detecting Model from llama.cpp
I have Open-WebUI running in Docker and connected it to the llama.cpp server API.
I followed the instructions from this URL:
https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/#step-1-...
0 votes · 0 answers · 73 views
Missing tokenizer.json file when converting model to gguf file format [closed]
When converting mistralai/Mistral-Small-3.2-24B-Instruct-2506 to GGUF (via llama_cpp), I get an error saying the tokenizer.json file is missing. After re-examining the HF repo, there is, in fact, not a ...
0 votes · 0 answers · 119 views
Problem with llama_backend_init in Flutter with llama.cpp
I'm trying to make an app with Flutter/Dart for Android.
The app I'm building is a chatbot (like ChatGPT) using a built-in model (the AI model will have a .gguf extension). So the AI model will ...
2 votes · 0 answers · 304 views
Error when trying to include llama.cpp in my C++ application on Windows using vcpkg
I am trying to write a simple application that uses llama.cpp, and I am including it in my application using CMake and vcpkg.
My CMakeLists.txt is:
cmake_minimum_required(VERSION 4.0)
if(WIN32)
File(...
0 votes · 1 answer · 473 views
can't import llama-cpp-python
I plan to install llama-cpp-python. However, I get error about "Could not find module 'D:\anaconda\Lib\site-packages\llama_cpp\lib\llama.dll' (or one of its dependencies). "
I have Microsoft ...
0 votes · 1 answer · 1k views
llama.cpp server and curl requests for multimodal models
I have llama-server up and running on a VPS with Ubuntu 24.04. I can send curl requests from an external IP and get answers for text embedding for instance. Now I want to use multimodal models through ...
3 votes · 0 answers · 214 views
Cannot run inference with images on llama-cpp-python
I am new to this. I have been trying but could not get the model to answer about images.
from llama_cpp import Llama
import torch
from PIL import Image
import base64
llm = Llama(
model_path='Holo1-...
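A minimal multimodal sketch with llama-cpp-python, assuming a LLaVA-style model whose vision projector ships as a separate mmproj GGUF; the file names here are placeholders, not the asker's actual model:

# Sketch: image chat via a LLaVA-style chat handler (paths are placeholders).
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def image_to_data_uri(path: str) -> str:
    # Encode a local image as a base64 data URI accepted by the chat API.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embeddings
)
response = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": image_to_data_uri("photo.png")}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(response["choices"][0]["message"]["content"])

A plain Llama instance without a chat handler cannot see images, which is the usual cause of text-only answers.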
1 vote · 1 answer · 1k views
Unsloth doesn't find Llama.cpp to convert fine-tuned LLM to GGUF
I am executing on an Azure VM this notebook from the Unsloth docs:
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb
Where in the end they save the ...
0 votes · 0 answers · 345 views
Very slow inference using LLAVA with LLama.cpp vs LM Studio
I'm looking for help understanding how to correctly query LLaVA models via llama.cpp.
(Note: I had to remove direct links to not have the post auto deleted)
Here’s what I’ve done so far:
I downloaded ...
0 votes · 1 answer · 3k views
How to write chat template for llama.cpp? [closed]
I am trying to run the llama-cli tool in llama.cpp. However, I am encountering problems when talking to my model, codellama-7b-instruct.Q5_K_M.gguf. So I decided to use the conversation template. ...
-2 votes · 1 answer · 975 views
How to create prompt with /chat endpoint for llama.cpp? [closed]
I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it on llama.cpp's server, but unfortunately it is responding with really weird answers, which look like it is trying to ...
0 votes · 0 answers · 75 views
Sub-4-bit quantized model on NVIDIA GPU
I was trying to run deepseek-r1-distill-llama70b-bf16.gguf (131 GB on disk) on two A6000 GPUs (48 GB VRAM each) with llama.cpp. It runs with partial GPU offload, but GPU utilization is at 9-10% and ...
-1 votes · 2 answers · 604 views
Getting an error on a Windows PC while running pip install llama-cpp-python
Creating directory "llava_shared.dir\Release".
Structured output is enabled. The formatting of compiler diagnostics will reflect the error hierarchy. See https://aka.ms/cpp/structured-output ...
0 votes · 0 answers · 99 views
Generating an n-gram dataset based on an LLM
I want a dataset of common n-grams and their log likelihoods. Normally I would download the Google Books Ngram Exports, but I wonder if I can generate a better dataset using a large language model. ...
0 votes · 0 answers · 284 views
How do you enable runtime-repack in llama cpp python?
After updating llama-cpp-python, I am getting an error when trying to run an ARM-optimized GGUF model: "TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking". After looking into it, the error comes from ...
1 vote · 0 answers · 300 views
Is llama able to choose tools automatically?
I run the following code expecting llama will decide whether to use the tool or not depending on my prompt:
from llama_cpp import Llama,ChatCompletionNamedToolChoice
llm = Llama(
model_path="/...
0 votes · 0 answers · 183 views
Unable to set top_k value in Llama cpp Python server
I start llama cpp Python server with the command:
python -m llama_cpp.server --model D:\Mistral-7B-Instruct-v0.3.Q4_K_M.gguf --n_ctx 8192 --chat_format functionary
Then I run my Python script which ...
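A hedged sketch of setting top_k per request instead of on the command line, assuming the llama_cpp.server OpenAI-compatible endpoint accepts the extra sampling field (host, port and prompt are placeholders):

# Sketch: pass top_k in the request body of the OpenAI-style chat endpoint.
import requests

payload = {
    "messages": [{"role": "user", "content": "Name three colors."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_k": 10,  # extra llama.cpp sampling parameter
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])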
2 votes · 1 answer · 985 views
How to use `llama-cpp-python` to output list of candidate tokens and their probabilities?
I want to choose my tokens manually, instead of letting llama-cpp-python automatically choose one for me.
This requires me to see a list of candidate next tokens, along with their probabilities, ...
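A hedged sketch using the logprobs parameter of create_completion, which returns the top candidate tokens and their log-probabilities for each generated position (the model path is a placeholder):

# Sketch: inspect candidate next tokens instead of letting the sampler pick one.
import math
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048)  # placeholder path
out = llm.create_completion(
    "The capital of France is",
    max_tokens=1,
    logprobs=10,      # top 10 candidates per generated position
    temperature=0.0,
)
top = out["choices"][0]["logprobs"]["top_logprobs"][0]  # dict: token -> logprob
for token, logprob in sorted(top.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token!r}: p={math.exp(logprob):.4f}")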
2 votes · 1 answer · 4k views
How to quantize a HF safetensors model and save it to llama.cpp GGUF format with less than q8_0 quantization?
I'm developing LLM agents using llama.cpp as inference engine. Sometimes I want to use models in safetensors format and there is a python script (https://github.com/ggerganov/llama.cpp/blob/master/...
3 votes · 1 answer · 4k views
How to use llm models downloaded with ollama with llama.cpp?
I'm considering switching from Ollama to llama.cpp, but I have a question before making the move. I've already downloaded several LLM models using Ollama, and I'm working with a low-speed internet ...
0 votes · 1 answer · 604 views
Does langchain with llama-cpp-python fail to work with very long prompts?
I'm trying to create a service using the llama3-70b model by combining langchain and llama-cpp-python on a server workstation. While the model works well with short prompts (question1, question2), it ...
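Very long prompts typically fail once they exceed the context window rather than because of LangChain itself; a minimal sketch raising n_ctx on the LlamaCpp wrapper (path and sizes are assumptions):

# Sketch: enlarge the context window so long prompts are not truncated.
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="llama3-70b.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,       # must cover prompt + answer
    max_tokens=512,   # room reserved for the answer
    n_gpu_layers=-1,  # offload everything if the build has GPU support
)
print(llm.invoke("...very long prompt..."))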
1 vote · 2 answers · 763 views
Unable to make llama.cpp on M1 Mac
When I try installing llama.cpp, I get the following error:
ld: warning: ignoring file '/Users/krishparikh/Projects/LLM/llama.cpp/ggml/src/ggml-metal-embed.o': found architecture 'x86_64', required ...
0 votes · 0 answers · 194 views
Loading int8 version of llama3 from llama.cpp
I'm trying to load an 8-bit quantized version of llama3 on my local laptop (Linux) from llama.cpp, but the process is getting killed because it exceeds available memory.
Is there any way around this?
I've already ...
0 votes · 1 answer · 523 views
Long response time with llama-server (40–60sec)
I managed to run the Llama server with the following command:
./llama-server -m models/7B/ggml-model.gguf -c 2048
My request looks like this:
time curl --request POST --url http://localhost:8080/...
0 votes · 1 answer · 297 views
Prompt Template for Sequence matching using LlamaIndex and Llama3-70B-Instruct
I'm trying to get llama3-70b to find all sequences that match a given list. The list contains multiple terms (which range from one word to twelve words). I want the model to match all terms in a given ...
1 vote · 1 answer · 514 views
Why is LlamaCPP freezing during inference?
I'm using the following code to try and receive a response from LlamaCPP, used through the LlamaIndex library. My model is stored locally in a gguf file. I'm trying to do inference on the CPU as my ...
0 votes · 0 answers · 184 views
How to deploy a finetuned model on a private server?
I have a project where I need to fine-tune a Large Language Model (LLM) such as LLAMA3 for a specific task and then deploy it on the company's server as a chatbot to recommend 'questionnaires / ...
0 votes · 1 answer · 1k views
Pass raw prompt in Llama 3 prompting format to llama.cpp webserver
I have an application running a local llama.cpp instance with llama 3. LLama 3 uses a new prompting format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a ...
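A hedged sketch of sending the raw Llama 3-formatted prompt to the server's /completion endpoint, so no template is applied on the server side (host, port and stop handling are assumptions):

# Sketch: build the Llama 3 prompt by hand and post it as a raw completion request.
import requests

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Why is the sky blue?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
r = requests.post(
    "http://localhost:8080/completion",  # default llama-server port
    json={"prompt": prompt, "n_predict": 256, "stop": ["<|eot_id|>"]},
)
print(r.json()["content"])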
0 votes · 1 answer · 651 views
How to get the response from the AI Model
I adapted this code from https://www.datacamp.com/tutorial/llama-cpp-tutorial
from llama_cpp import Llama
# GLOBAL VARIABLES
my_model_path = "./model/zephyr-7b-beta.Q4_0.gguf"
CONTEXT_SIZE = ...
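For reference, a minimal sketch of reading the generated text out of the completion dictionary that Llama's call interface returns (parameters only loosely mirror the tutorial):

# Sketch: the call returns a dict; the text lives under choices[0]["text"].
from llama_cpp import Llama

llm = Llama(model_path="./model/zephyr-7b-beta.Q4_0.gguf", n_ctx=512)
result = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"].strip())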
0 votes · 1 answer · 661 views
Unable to send multiple inputs using LlamaCPP and llama-index
I am using the Mistral 7B-Instruct model with llama-index and loading it using LlamaCPP, and when I try to run multiple inputs or prompts (open 2 websites and send 2 prompts), it gives me ...
0 votes · 1 answer · 272 views
Is there a way to automate formatting terminal input when talking to an LLM model?
Currently when the user sends input to the LLM it has to be in the format of:
<|im_start|>user\
User input here.<|im_end|>
This is very cumbersome and makes prompt engineering very ...
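A minimal sketch of wrapping raw terminal input in the ChatML markers automatically so the user never types them (the system prompt is an assumption):

# Sketch: wrap raw terminal input in ChatML markers before sending it to the model.
def to_chatml(user_text: str, system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

while True:
    raw = input("> ")
    if raw.strip().lower() in {"quit", "exit"}:
        break
    prompt = to_chatml(raw)
    # hand `prompt` to the model here, e.g. llm(prompt, stop=["<|im_end|>"])
    print(prompt)

With llama-cpp-python, create_chat_completion applies the model's chat template itself, which avoids hand-building these markers at all.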
2 votes · 2 answers · 4k views
Detecting GPU availability in llama-cpp-python
Question
How can I programmatically check if llama-cpp-python is installed with support for a CUDA-capable GPU?
Context
In my program, I am trying to warn the developers when they fail to configure ...
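A hedged sketch, assuming the installed llama-cpp-python version exposes the low-level llama_supports_gpu_offload binding, which reports whether the compiled backend can offload layers to a GPU:

# Sketch: ask the compiled backend whether GPU offload is available.
# llama_supports_gpu_offload is a low-level binding; its availability depends
# on the installed llama-cpp-python version (assumption).
from llama_cpp import llama_cpp as _lib

def has_gpu_support() -> bool:
    return bool(_lib.llama_supports_gpu_offload())

if not has_gpu_support():
    print("Warning: llama-cpp-python was built without GPU offload support.")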
1 vote · 0 answers · 872 views
How can I get just the main answer from llama-3-8B-Instruct and stop it talking to itself?
I want to use llama-3 with llama-cpp-python and get just the main answer to user questions, like llama-2.
But the answers generated by llama-3 are not just the main answer, as with llama-2:
Output: Hey! 👋 What can I help you ...
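Rambling usually means the Llama 3 end-of-turn token is not being honored; a hedged sketch pinning the chat format and stop token (the model path is a placeholder):

# Sketch: use the llama-3 chat format and its end-of-turn token so generation
# stops after the assistant's answer instead of continuing the conversation.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder path
    chat_format="llama-3",
    n_ctx=4096,
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hey, how are you?"}],
    max_tokens=256,
    stop=["<|eot_id|>"],  # Llama 3 end-of-turn token
)
print(resp["choices"][0]["message"]["content"])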
3 votes · 1 answer · 4k views
Llama.cpp GPU Offloading Issue - Unexpected Switch to CPU
I'm reaching out to the community for some assistance with an issue I'm encountering in llama.cpp. Previously, the program was successfully utilizing the GPU for execution. However, recently, it seems ...
2 votes · 1 answer · 4k views
Running Local LLMs in Production and handling multiple requests
I am trying to run a RAG setup with the Gemma LLM locally. It is running fine, but the problem is I can't handle more than one request at a time.
Is there a way to handle concurrent requests while utilizing resources ...
1 vote · 1 answer · 3k views
Is there any method to fully load the GGUF models on GPU
I have been using LlamaCPP to load my LLM models; the llama-index library provides methods to offload some layers onto the GPU. Why does it not provide any method to fully load the model on the GPU? If ...
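A hedged sketch: the llama-index LlamaCPP wrapper forwards model_kwargs to llama-cpp-python, so n_gpu_layers=-1 requests offloading every layer; the path is a placeholder and the underlying build must have CUDA/Metal support:

# Sketch: request full GPU offload by passing n_gpu_layers=-1 through model_kwargs.
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="model.gguf",            # placeholder path
    model_kwargs={"n_gpu_layers": -1},  # -1 = offload all layers
    context_window=4096,
    max_new_tokens=256,
)
print(llm.complete("Hello").text)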
0 votes · 0 answers · 241 views
Kernel dying repeatedly while initializing llm in local
Hi, I'm using Jupyter Notebook and trying to create an instance of llama-2-7b-chat.q4_K_M.gguf (a quantized model) from Hugging Face. I'm running the following code:
from langchain_community.llms ...
1 vote · 0 answers · 3k views
I have a problem using n_gpu_layers in the llama_cpp Llama function
I am attempting to load the Zephyr model into llama_cpp Llama, and while everything functions correctly, the performance is slow. The GPU appears to be underutilized, especially when compared to its ...
1 vote · 3 answers · 2k views
Inconsistent completion for identical prompts and params with llama.cpp python and ctransformer
I've been comparing various langchain compatible llama2 runtimes, using langchain llm chain.
Having the following parameter overrides:
# llama.cpp:
model_path="../llama.cpp/models/generated/...
0 votes · 0 answers · 1k views
Gradio UI is not displaying/working properly; how do I fix that?
I followed these instructions to install privateGPT:
git clone https://github.com/imartinez/privateGPT.git
conda create -n privategpt python=3.11
conda activate privategpt
#loading modules
...
0 votes · 1 answer · 6k views
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based project
While installing llama-cpp-python in VS Code I am getting an error:
4 Warning(s)
257 Error(s)
Time Elapsed 00:00:02.56
*** CMake build failed
[end of output]
...
1 vote · 1 answer · 2k views
llama-cpp-python Log printing on Ubuntu
I use llama-cpp-python to run LLMs locally on Ubuntu. While generating responses, it prints its logs.
How do I stop it from printing logs?
I found a way to stop log printing for llama.cpp but not for llama-...
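A minimal sketch: constructing Llama with verbose=False silences most of llama-cpp-python's log output (the model path is a placeholder):

# Sketch: verbose=False suppresses the load/inference logs printed to stderr.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)
print(llm("Hello, world!", max_tokens=16)["choices"][0]["text"])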
1 vote · 1 answer · 6k views
Streaming local LLM with FastAPI, Llama.cpp and Langchain
I have set up FastAPI with llama.cpp and LangChain. Now I want to enable streaming in the FastAPI responses. Streaming works with llama.cpp in my terminal, but I wasn't able to implement it with a ...
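A hedged sketch of streaming llama-cpp-python output through FastAPI's StreamingResponse, bypassing LangChain for the streaming path (model path and route are assumptions):

# Sketch: stream tokens from llama-cpp-python through a FastAPI endpoint.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="model.gguf", n_ctx=2048)  # placeholder path

def token_stream(prompt: str):
    # stream=True yields partial completion chunks as they are generated
    for chunk in llm(prompt, max_tokens=256, stream=True):
        yield chunk["choices"][0]["text"]

@app.get("/generate")
def generate(prompt: str):
    return StreamingResponse(token_stream(prompt), media_type="text/plain")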
1 vote · 0 answers · 568 views
llama.cpp conversion of fine-tuned HF (Hugging Face) model fails for the LLaMA2-7B model
I use the simple https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py with some custom data and llama-2-7b-hf as the base model. Post-training, it invokes trainer.save_model and the ...
0 votes · 2 answers · 2k views
CMAKE in requirements.txt file: Install llama-cpp-python for Mac
I have put my application into a Docker container and therefore I have created a requirements.txt file. Now I need to install llama-cpp-python for Mac, as I am loading my LLM with from langchain.llms import ...
1 vote · 0 answers · 1k views
llama-cpp-python on GPU: Delay between prompt submission and first token generation with longer prompts
I've been building a RAG pipeline using the llama-cpp-python OpenAI compatible server functionality and have been working my way up from running on just a laptop to running this on a dedicated ...
0 votes · 2 answers · 1k views
Persist VectorStoreIndex (LlamaIndex) locally
I am trying to run this
import logging
import sys
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
import torch
from llama_index.llms import LlamaCPP
from llama_index....
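A hedged sketch of persisting and reloading the index with the same legacy llama_index imports the snippet uses; the persist directory is an assumption and the LLM/embedding wiring via ServiceContext is omitted for brevity:

# Sketch: persist the index to disk once, then reload it instead of re-embedding.
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # assumed location

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=PERSIST_DIR)

# Later (or in another process): reload without rebuilding the embeddings.
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
print(query_engine.query("What is in my documents?"))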
0 votes · 1 answer · 491 views
langchain with llama2 local slow inference
I am using Langchain with llama-2-13B. I have set up llama2 on an AWS machine with 240 GB RAM and 4x16 GB Tesla V100 GPUs. It takes around 20 s to make an inference. I want to make it faster, ...
4 votes · 1 answer · 4k views
No GPU support while running llama-cpp-python inside a docker container
I'm trying to run llama index with llama cpp by following the installation docs but inside a docker container.
Following this repo for installation of llama_cpp_python==0.2.6.
DOCKERFILE
# Use the ...
2 votes · 0 answers · 1k views
llama-index: multiple calls to query_engine.query always gives "Empty Response"
I have the following code that works as expected
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
llm = LlamaCPP(model_url=...