1,095 questions
0
votes
0
answers
98
views
Torch example transformer with TransformerDecoder
In the torch example provided here https://github.com/pytorch/examples/tree/main/word_language_model, the transformer only uses torch.TransformerEncoder and torch.TransformerDecoder is overwritten with a ...
1
vote
1
answer
2k
views
AttributeError: 'DynamicCache' object has no attribute 'seen_tokens'
I'm following the Hands-On Large Language Models book to learn more about LLMs. I'm trying to generate text using the "microsoft/Phi-3-mini-4k-instruct" model which is used in the book. ...
0
votes
0
answers
159
views
Why did my Transformer model not work well when dealing with single-cell multi-omic data?
The complete code and data are available at: Google Disk
I'm working on a high-dimensional regression problem and have built a Transformer-based model in PyTorch. While the model trains, I'm observing ...
0
votes
0
answers
239
views
Cannot import `QwenForCausalLM` after installing `v4.51.3-Qwen2.5-Omni-preview` tag; pip installs 4.52.0.dev0 instead
Description:
I am trying to install the Hugging Face Transformers version that supports the Qwen2.5-Omni model. According to the official docs, the correct tag to install is v4.51.3-Qwen2.5-Omni-...
1
vote
1
answer
122
views
Can I use a custom attention layer while still leveraging a pre-trained BERT model?
In the paper “Using Prior Knowledge to Guide BERT’s Attention in Semantic Textual Matching Tasks”, they multiply a similarity matrix with the attention scores inside the attention layer. I want to ...
1
vote
1
answer
198
views
(NVIDIA/nv-embed-v2) ImportError: cannot import name 'MISTRAL_INPUTS_DOCSTRING' from 'transformers.models.mistral.modeling_mistral'
My code:
from transformers import AutoTokenizer, AutoModel
model_name = "NVIDIA/nv-embed-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(...
0
votes
1
answer
44
views
Is Multi-Head Self-Attention in a Transformer permutation-invariant or equivariant, and how can I see it in practice?
I read that a function f is equivariant if f(P(x)) = P(f(x)), where P is a permutation.
So, to check what equivariant and permutation-invariant mean, I wrote the following code:
import torch
import torch....
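For reference, here is a minimal sketch of the kind of check the question describes, assuming a bare nn.MultiheadAttention with no positional encoding (self-attention alone should be permutation-equivariant: permuting the input tokens permutes the output rows the same way):
import torch
import torch.nn as nn
torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
mha.eval()
x = torch.randn(1, 5, 16)                 # (batch, tokens, dim)
perm = torch.randperm(5)
out_x, _ = mha(x, x, x)                   # self-attention on the original order
out_px, _ = mha(x[:, perm], x[:, perm], x[:, perm])  # same attention on permuted tokens
# Equivariance: permuting the input permutes the output in the same way.
print(torch.allclose(out_px, out_x[:, perm], atol=1e-6))  # expected: True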
0
votes
0
answers
41
views
Does Temporal Fusion Transformer Learn Global Trends Across the Entire Time Series?
I'm using the Temporal Fusion Transformer (TFT) to train on time series data, aiming to make real-time forecasts for a specific process unit at any point in time during operation.
However, for ...
0
votes
0
answers
67
views
Why does adding token and positional embeddings in transformers work?
In transformer models, I've noticed that token embeddings and positional embeddings are added together before being passed into the attention layers:
import torch
import torch.nn as nn
class ...
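As a reference point, a minimal sketch of the pattern being asked about (learned positional embeddings summed with token embeddings; all sizes here are made up for illustration):
import torch
import torch.nn as nn
vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)   # one vector per token id ("what")
pos_emb = nn.Embedding(max_len, d_model)      # one vector per position ("where")
ids = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len)
positions = torch.arange(ids.size(1))
# Both embeddings live in the same d_model-dimensional space, so their sum is a
# single vector carrying both signals; the model can learn to keep them in
# roughly separate subspaces.
x = tok_emb(ids) + pos_emb(positions)         # (2, 16, 64), fed to the attention layers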
0
votes
0
answers
83
views
Why is attention scaled by sqrt(d_k) in Transformer architectures?
I have this code in a transformer model:
keys = x @ W_key
queries = x @ W_query
values = x @ W_value
attention_scores = queries @ keys.T
# keys.shape[-1]**0.5: used to scale the attention scores before ...
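A small self-contained sketch of the usual explanation: the dot product of two random d_k-dimensional vectors has variance on the order of d_k, so dividing by sqrt(d_k) keeps the softmax inputs in a moderate range and avoids saturated, near-one-hot attention weights with tiny gradients:
import torch
d_k = 64
q = torch.randn(8, d_k)
k = torch.randn(8, d_k)
scores_raw = q @ k.T                      # element variance grows roughly with d_k
scores_scaled = scores_raw / d_k ** 0.5   # dividing by sqrt(d_k) brings variance back to ~1
print(scores_raw.var().item(), scores_scaled.var().item())
# Large-magnitude scores would push softmax into saturation; the scaled scores do not.
attn = torch.softmax(scores_scaled, dim=-1)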
0
votes
0
answers
98
views
Training and validation losses do not reduce when fine-tuning ViTPose from huggingface
I am trying to fine-tune a transformer/encoder based pose estimation model available here at:
https://huggingface.co/docs/transformers/en/model_doc/vitpose
When passing "labels" attribute to ...
2
votes
0
answers
60
views
Why is day_size set to 32 in temporal embedding code?
I am trying to understand the code for temporal embedding inside the Autoformer implementation in PyTorch.
https://github.com/thuml/Autoformer/blob/main/layers/Embed.py
class TemporalEmbedding(nn....
2
votes
1
answer
86
views
Logits Don't Change in a Custom Reimplementation of a CLIP model [PyTorch]
The problem
The similarity scores are almost the same for texts that describe both a photo of a cat and a dog (the photo is of a cat).
Cat similarity: tensor([[-3.5724]], grad_fn=<MulBackward0>)
...
2
votes
1
answer
193
views
I keep getting this error even though CUDA is available: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu'
I'm training a transformer model using RLlib's PPO algorithm, but I encounter a device mismatch error:
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, ...
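Not RLlib-specific, but a minimal sketch of the usual cause and fix (the model lives on cuda:0 while some input or buffer is still on the CPU); the names below are placeholders, not RLlib objects:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)   # move the model once
obs = torch.randn(4, 8)                    # inputs often arrive on the CPU (e.g. from the env)
logits = model(obs.to(device))             # move every input tensor to the same device
# A quick way to find the offender: print the device of parameters and inputs.
print(next(model.parameters()).device, obs.device)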
0
votes
0
answers
88
views
SageMaker Real-Time Endpoint Timeout Issues with Lambda for Parallel Data Processing
I’m new to AWS and struggling with an architecture involving AWS Lambda and a SageMaker real-time endpoint. I’m trying to process large batches of data rows efficiently, but I’m running into timeout ...
0
votes
2
answers
224
views
SFTTrainer Error : prepare_model_for_kbit_training() got an unexpected keyword argument 'gradient_checkpointing_kwargs'
I'm trying to fine-tune a model using SFTTrainer from trl.
This is what my SFTConfig arguments look like:
from trl import SFTConfig
training_arguments = SFTConfig(
output_dir=output_dir,
...
0
votes
0
answers
44
views
PyTorch Transformer Stuck in Local Minima Occasionally
I am working on a project to pre-train a custom transformer model I developed and then fine-tune it for a downstream task. I am pre-training the model on an H100 cluster and this is working great. ...
1
vote
1
answer
71
views
model.eval() returns a NoneType object when using DeepSpeed
When I tried to accelerate model training with DeepSpeed, a problem occurred when evaluating the model on the validation dataset. Here is the problematic code snippet:
def evaluate(self, ...
0
votes
0
answers
173
views
How do I resolve the ImportError "Using bitsandbytes 4bit quantization requires the latest version of bitsandbytes" despite having version 0.45.3 installed?
I am trying to use the bitsandbytes library for 4-bit quantization in my model loading function, but I keep encountering an ImportError. The error message says, "Using bitsandbytes 4-bit ...
0
votes
1
answer
36
views
Anomalous behavior of the attention layer for different input vectors
I am currently trying to implement the attention layer from the transformer architecture but it is not working as I expect. I have been unable to figure out what the problem is for several days now. ...
0
votes
0
answers
35
views
How do I change the last layer in a fine-tuned model?
When I fine-tuned the HuBERT model to detect phonemes, I chose a fine-tuned ASR HuBERT model, removed the last two layers, and added a linear layer sized to the phoneme vocab_size from the config. What is ...
0
votes
1
answer
58
views
Why does my keras model with multiple inputs accept the shape of the training data for .call() but not for .evaluate()?
I'm currently investigating the effect of masking attention scores in MultiHeadAttention layers in a Transformer model for classification of time series data. I have built a model that accepts a time ...
0
votes
1
answer
194
views
Trouble understanding the formula for estimating dense self-attention FLOPS per Token given as 6LH(2QT)
In Appendix B of the PaLM paper (https://arxiv.org/pdf/2204.02311), it describes a metric called "Model FLOPs Utilization (MFU)" and the formula for estimating it. Its computation makes ...
0
votes
0
answers
48
views
Does the Temporal Fusion Transformer use the ground truth of the target during testing?
In this neural network structure, I want the model to train and validate without using the historical target values, and to make predictions directly from the covariates, so I set these ...
2
votes
1
answer
499
views
What to do when the gradient explodes in a Transformer model?
General question (hopefully useful for people coming from google): What to do when the gradient explodes? When working with transformers and deep NNs (with PyTorch), do you have a mental checklist of ...
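One item that usually tops that checklist, sketched minimally (clip the global gradient norm between backward() and the optimizer step, typically combined with a lower learning rate and warmup; the tiny model here is just a stand-in):
import torch
model = torch.nn.Linear(16, 16)                     # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
# Clip the global grad norm before stepping; also check for NaN/inf in the data and loss.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()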
0
votes
0
answers
63
views
Error using the SegFormer model with keras-hub: the structure of `inputs` doesn't match the expected structure
I am trying a semantic segmentation task using the SegFormer model with the pretrained model 'mit_b3_cityscapes_1024'.
encoder = keras_hub.models.MiTBackbone.from_preset(
"mit_b3_cityscapes_1024&...
2
votes
0
answers
319
views
Timestamps reset every 30 seconds when using distil-whisper with return_timestamps=True
Problem
distil-large-v3#sequential-long-form
I'm using distil-whisper through the 🤗 Transformers pipeline for speech recognition. When setting return_timestamps=True, the timestamps reset to 0 every ...
0
votes
0
answers
30
views
Why can't I access Konva.Transformer using `konvaStore.transformer!.nodes([target])` after initialization in Qwik component?
I'm building an app using Qwik, TypeScript, and Konva, and I'm trying to implement a transformer that allows users to click on a shape and resize or transform it. My goal is to create and access the ...
0
votes
0
answers
12
views
Reverse Mapping of Table Elements from screenshot | Table Transformer
I am working on an end-to-end (E2E) project for websites that involves:
Capturing Tight Screenshots of Data Tables: The project automatically detects and takes precise screenshots of all the data ...
1
vote
1
answer
149
views
Low accuracy using vision transformers for image classification
I've been learning the workings of a Vision Transformer (building the ViT from scratch) and couldn't get it to run at first. Somehow I managed to scramble together code, but it shows very low accuracy (3%).
...
1
vote
1
answer
469
views
ValueError: Exception encountered when calling layer 'tf_bert_model' (type TFBertModel)
I have been trying to run TFBertModel from Transformers, but it kept on throwing me this error
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ...
0
votes
1
answer
39
views
How do I pass information to a TestNG IAnnotationTransformer?
I just discovered that the implementation of a listener with IAnnotationTransformer is executed as soon as the tests are launched, even before the very first test is executed.
Background: I ...
0
votes
1
answer
711
views
How to correctly apply LayerNorm after MultiheadAttention with different input shapes (batch_first vs default) in PyTorch?
I’m working on an audio recognition task using a Transformer-based model in PyTorch. My input features are generated by a CNN-based embedding layer and have the shape [batch_size, d_model, n_token], ...
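A minimal sketch of one way to reconcile the shapes described in the question (assuming the CNN output really is [batch_size, d_model, n_token]): permute so that d_model is the last axis, since both batch_first MultiheadAttention and LayerNorm expect the feature dimension last:
import torch
import torch.nn as nn
batch, d_model, n_token = 4, 256, 50
x = torch.randn(batch, d_model, n_token)     # CNN-style output: (batch, d_model, n_token)
mha = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
norm = nn.LayerNorm(d_model)                 # LayerNorm normalizes over the last dim
x = x.permute(0, 2, 1)                       # -> (batch, n_token, d_model)
attn_out, _ = mha(x, x, x)
x = norm(x + attn_out)                       # residual connection + LayerNorm over d_model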
1
vote
0
answers
79
views
Profiling of the Llama 3.1 8B model on an AI accelerator
I have the profiling results of the inference of the Llama 3.1 8B model by Meta. I deployed the model on the AI accelerator. I managed to create a memory trace of the whole model from the host to the ...
0
votes
1
answer
58
views
Compare two consecutive rows in DataStage and drop the rows that don't meet a condition
I'm reading a file using a Sequential File stage in DataStage and doing some transformation of the data using a Transformer stage. I want to compare the current row with the previous row to check a value of ...
0
votes
1
answer
164
views
scaled_dot_product_attention with a higher head count costs much more memory
I found that scaled_dot_product_attention costs much more memory when the number of heads is large (>=16). This is my code to reproduce the issue.
import torch
length = 10000
dim = 64
head_num1 = 8
head_num2 = ...
1
vote
0
answers
72
views
Why does my trained t5-small model generate a mess after I save and load the checkpoint?
I was distilling my student model (base model t5-small) based on a fine-tuned T5-xxl. Here is the config
student_model = AutoModelForSeq2SeqLM.from_pretrained(
args.student_model_name_or_path,
...
0
votes
1
answer
36
views
Unable to figure out the hardware requirements (cloud or on-prem) for open-source inference for multiple users
I am trying to budget for setting up an LLM-based RAG application which will serve a dynamic number of users (anything from 100 to 2000).
I am able to figure out the GPU requirement to host a certain llm[...
0
votes
0
answers
61
views
Extracting a Swin-ViT backbone
I am wondering whether there is a way to extract a Swin-ViT backbone, similar to ResNet.
I am attempting to train a few self-supervised learning algorithms, where I need to get just the backbone (...
1
vote
0
answers
35
views
PyTorch quantized linear function gives an invalid shape error
I am trying to write a simple quantized tensor linear multiplication. Assume the weight matrix w3 has shape (14336, 4096) and the input tensor x has shape (2, 512, 4096), where the first dim is ...
0
votes
1
answer
195
views
Inference error after training an IP-Adapter plus model
I downloaded the packages from https://github.com/tencent-ailab/IP-Adapter
and ran the commands to train an IP-Adapter Plus model (input: text + image, output: image):
accelerate launch --num_processes 2 --...
0
votes
1
answer
105
views
KV caching for varying length texts
I am trying to do some structured text extraction using some KV caching tricks. For this example I will use the following model and data:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = ...
0
votes
0
answers
232
views
Should I interleave sin and cosine in sinusoidal positional encoding?
I'm trying to implement a sinusoidal positional encoding. I found two solutions that give different encodings. I am wondering if one of them is wrong or both are correct. I showcase visual figures of ...
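Both layouts show up in real implementations; here is a small sketch suggesting they contain the same values and differ only by a fixed reordering of the embedding columns, so either is fine as long as it is applied consistently:
import torch
def pe_interleaved(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1)
    i = torch.arange(0, d_model, 2)
    angles = pos / 10000 ** (i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even columns: sin
    pe[:, 1::2] = torch.cos(angles)   # odd columns: cos
    return pe
def pe_blocked(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1)
    i = torch.arange(0, d_model, 2)
    angles = pos / 10000 ** (i / d_model)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [sin | cos]
a, b = pe_interleaved(10, 8), pe_blocked(10, 8)
perm = torch.arange(8).reshape(2, 4).T.reshape(-1)   # [0, 4, 1, 5, 2, 6, 3, 7]
print(torch.allclose(a, b[:, perm]))                 # True: same encoding, columns reordered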
0
votes
1
answer
99
views
TensorFlow executes slowly on GPU - retracing issue?
I am trying to develop a transformer sequence-to-vector model but am encountering performance issues.
I am working with a Tesla V100-PCIE-16GB. Whenever the model encounters an unseen sequence length, the (...
0
votes
2
answers
248
views
Why, even with anisotropy, can we still compute embedding similarity in (some RAG) projects?
I have just noticed that the token/sentence embeddings trained from Transformer-based models have a strong anisotropy problem, which means most of the embeddings are close to each other in the vector ...
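A toy sketch of one common intuition: when every embedding shares a large common component, the absolute cosine similarities are all inflated, but the ranking is still driven by the content-specific part (mean-centering the embeddings makes this visible):
import torch
import torch.nn.functional as F
torch.manual_seed(0)
# Simulate anisotropic embeddings: a large shared offset plus small content-specific parts.
common = torch.randn(64) * 10
content = torch.randn(3, 64)
emb = common + content                          # all vectors point in roughly the same direction
print(F.cosine_similarity(emb[0:1], emb, dim=-1))        # all close to 1 (anisotropy) ...
centered = emb - emb.mean(dim=0)                # remove the shared component
print(F.cosine_similarity(centered[0:1], centered, dim=-1))  # ... yet the ordering reflects content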
0
votes
1
answer
269
views
Exploding Gradient (NaN Training Loss And Validation Loss) In Multi Head Self Attention - Vision Transformer
This multi-head self-attention code causes the training loss and validation loss to become NaN, but when I remove this part, everything goes back to normal. I know that when the training loss and ...
3
votes
1
answer
519
views
Get the attention scores of a pretrained transformer in pytorch
I've been trying to look at the attention scores of a pretrained transformer when I pass specific data in. It's specifically a PyTorch Transformer. I've tried using forward hooks, but I'm only able to ...
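One workaround often suggested for nn.MultiheadAttention-based models, sketched here as an illustration rather than a guaranteed recipe: wrap each attention module's forward to force need_weights=True (note the fused fast path used in eval/inference mode may bypass this):
import torch
import torch.nn as nn
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
attn_maps = []
def patch(mha):
    orig_forward = mha.forward
    def wrapped(*args, **kwargs):
        kwargs["need_weights"] = True            # ask MHA to return the attention weights
        kwargs["average_attn_weights"] = False   # keep one map per head
        out, weights = orig_forward(*args, **kwargs)
        attn_maps.append(weights.detach())
        return out, weights
    mha.forward = wrapped
for m in encoder.modules():
    if isinstance(m, nn.MultiheadAttention):
        patch(m)
x = torch.randn(1, 10, 32)
encoder(x)
print(len(attn_maps), attn_maps[0].shape)        # 2 layers, (batch, heads, tgt_len, src_len)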
1
vote
1
answer
8k
views
cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub'
I've been using LLAMA 2 for research for a few months now and I import as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda")
tokenizer = ...
0
votes
1
answer
120
views
Inference workflow in compile mode using transformers.pipeline()
I am trying to run an inference workflow of a Llama model in compile mode using transformers.pipeline(). I am using the following lines of code to run the inference workflow in compile mode:
model = ...
0
votes
1
answer
159
views
How to use SegFormer encoder and decoder separately?
I am trying to understand the SegFormer model and would like to use the encoder and decoder separately with different models.
I have tried looking into the official implementation, which is based on mmseg, and ...