1,095 questions
0
votes
0
answers
98
views
Torch example transformer with TransformerDecoder
In the torch example provided here https://github.com/pytorch/examples/tree/main/word_language_model, the transformer only uses torch.TransformerEncoder and torch.TransformerDecoder is overwritten with a ...
1
vote
1
answer
2k
views
AttributeError: 'DynamicCache' object has no attribute 'seen_tokens'
I'm following the Hands-On Large Language Models book to learn more about LLMs. I'm trying to generate text using the "microsoft/Phi-3-mini-4k-instruct" model which is used in the book. ...
0
votes
0
answers
159
views
Why did my Transformer model not work well when dealing with single-cell multi-omic data?
The complete code and data are available at: Google Disk
I'm working on a high-dimensional regression problem and have built a Transformer-based model in PyTorch. While the model trains, I'm observing ...
0
votes
0
answers
239
views
Cannot import `QwenForCausalLM` after installing `v4.51.3-Qwen2.5-Omni-preview` tag; pip installs 4.52.0.dev0 instead
Description:
I am trying to install the Hugging Face Transformers version that supports the Qwen2.5-Omni model. According to the official docs, the correct tag to install is v4.51.3-Qwen2.5-Omni-...
1
vote
1
answer
122
views
Can I use a custom attention layer while still leveraging a pre-trained BERT model?
In the paper “Using Prior Knowledge to Guide BERT’s Attention in Semantic Textual Matching Tasks”, they multiply a similarity matrix with the attention scores inside the attention layer. I want to ...
1
vote
1
answer
198
views
(NVIDIA/nv-embed-v2) ImportError: cannot import name 'MISTRAL_INPUTS_DOCSTRING' from 'transformers.models.mistral.modeling_mistral'
My code:
from transformers import AutoTokenizer, AutoModel
model_name = "NVIDIA/nv-embed-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(...
0
votes
1
answer
44
views
Is Multi-Head Self-Attention in a Transformer permutation-invariant or equivariant, and how can I see it in practice?
I read that a function f is equivariant if f(P(x)) = P(f(x)), where P is a permutation.
So, to check what equivariant and permutation-invariant mean, I wrote the following code:
import torch
import torch....
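For reference, here is a minimal sketch of the kind of check the question describes, assuming a bare nn.MultiheadAttention with no positional encoding (self-attention alone should be permutation-equivariant: permuting the input tokens permutes the output rows the same way):
import torch
import torch.nn as nn
torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
mha.eval()
x = torch.randn(1, 5, 16)                 # (batch, tokens, dim)
perm = torch.randperm(5)
out_x, _ = mha(x, x, x)                   # self-attention on the original order
out_px, _ = mha(x[:, perm], x[:, perm], x[:, perm])  # same attention on permuted tokens
# Equivariance: permuting the input permutes the output in the same way.
print(torch.allclose(out_px, out_x[:, perm], atol=1e-6))  # expected: True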
0
votes
0
answers
41
views
Does Temporal Fusion Transformer Learn Global Trends Across the Entire Time Series?
I'm using the Temporal Fusion Transformer (TFT) to train on time series data, aiming to make real-time forecasts for a specific process unit at any point in time during operation.
However, for ...
0
votes
0
answers
67
views
Why does adding token and positional embeddings in transformers work?
In transformer models, I've noticed that token embeddings and positional embeddings are added together before being passed into the attention layers:
import torch
import torch.nn as nn
class ...
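As a reference point, a minimal sketch of the pattern being asked about (learned positional embeddings summed with token embeddings; all sizes here are made up for illustration):
import torch
import torch.nn as nn
vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)   # one vector per token id ("what")
pos_emb = nn.Embedding(max_len, d_model)      # one vector per position ("where")
ids = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len)
positions = torch.arange(ids.size(1))
# Both embeddings live in the same d_model-dimensional space, so their sum is a
# single vector carrying both signals; the model can learn to keep them in
# roughly separate subspaces.
x = tok_emb(ids) + pos_emb(positions)         # (2, 16, 64), fed to the attention layers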
0
votes
0
answers
83
views
Why is attention scaled by sqrt(d_k) in Transformer architectures?
I have this code in a transformer model:
keys = x @ W_key
queries = x @ W_query
values = x @ W_value
attention_scores = queries @ keys.T
# keys.shape[-1]**0.5: used to scale the attention scores before ...
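A small self-contained sketch of the usual explanation: the dot product of two random d_k-dimensional vectors has variance on the order of d_k, so dividing by sqrt(d_k) keeps the softmax inputs in a moderate range and avoids saturated, near-one-hot attention weights with tiny gradients:
import torch
d_k = 64
q = torch.randn(8, d_k)
k = torch.randn(8, d_k)
scores_raw = q @ k.T                      # element variance grows roughly with d_k
scores_scaled = scores_raw / d_k ** 0.5   # dividing by sqrt(d_k) brings variance back to ~1
print(scores_raw.var().item(), scores_scaled.var().item())
# Large-magnitude scores would push softmax into saturation; the scaled scores do not.
attn = torch.softmax(scores_scaled, dim=-1)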
0
votes
0
answers
98
views
Training and validation losses do not reduce when fine-tuning ViTPose from huggingface
I am trying to fine-tune a transformer/encoder based pose estimation model available here at:
https://huggingface.co/docs/transformers/en/model_doc/vitpose
When passing "labels" attribute to ...
2
votes
0
answers
60
views
Why is day_size set to 32 in temporal embedding code?
I am trying to understand the code for temporal embedding inside the Autoformer implementation in PyTorch.
https://github.com/thuml/Autoformer/blob/main/layers/Embed.py
class TemporalEmbedding(nn....
2
votes
1
answer
86
views
Logits Don't Change in a Custom Reimplementation of a CLIP model [PyTorch]
The problem
The similarity scores are almost the same for texts that describe both a photo of a cat and a dog (the photo is of a cat).
Cat similarity: tensor([[-3.5724]], grad_fn=<MulBackward0>)
...
2
votes
1
answer
193
views
I keep getting this error even though CUDA is available: 'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu'
I'm training a transformer model using RLlib's PPO algorithm, but I encounter a device mismatch error:
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, ...
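Not RLlib-specific, but a minimal sketch of the usual cause and fix (the model lives on cuda:0 while some input or buffer is still on the CPU); the names below are placeholders, not RLlib objects:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)   # move the model once
obs = torch.randn(4, 8)                    # inputs often arrive on the CPU (e.g. from the env)
logits = model(obs.to(device))             # move every input tensor to the same device
# A quick way to find the offender: print the device of parameters and inputs.
print(next(model.parameters()).device, obs.device)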
0
votes
0
answers
88
views
SageMaker Real-Time Endpoint Timeout Issues with Lambda for Parallel Data Processing
I’m new to AWS and struggling with an architecture involving AWS Lambda and a SageMaker real-time endpoint. I’m trying to process large batches of data rows efficiently, but I’m running into timeout ...
0
votes
2
answers
224
views
SFTTrainer Error : prepare_model_for_kbit_training() got an unexpected keyword argument 'gradient_checkpointing_kwargs'
I'm trying to fine-tune a model using SFTTrainer from trl.
This is what my SFTConfig arguments look like:
from trl import SFTConfig
training_arguments = SFTConfig(
output_dir=output_dir,
...
0
votes
0
answers
44
views
PyTorch Transformer Stuck in Local Minima Occasionally
I am working on a project to pre-train a custom transformer model I developed and then fine-tune it for a downstream task. I am pre-training the model on an H100 cluster and this is working great. ...
1
vote
1
answer
71
views
model.eval() returns a NoneType object when using DeepSpeed
When I tried to accelerate model training with DeepSpeed, a problem occurred when evaluating the model on the validation dataset. Here is the problematic code snippet:
def evaluate(self, ...
0
votes
0
answers
173
views
How do I resolve the ImportError "Using bitsandbytes 4bit quantization requires the latest version of bitsandbytes" despite having version 0.45.3 installed?
I am trying to use the bitsandbytes library for 4-bit quantization in my model loading function, but I keep encountering an ImportError. The error message says, "Using bitsandbytes 4-bit ...
0
votes
1
answer
36
views
Anomalous behavior of the attention layer for different input vectors
I am currently trying to implement the attention layer from the transformer architecture but it is not working as I expect. I have been unable to figure out what the problem is for several days now. ...
0
votes
0
answers
35
views
How do I change the last layer in a fine-tuned model?
When I fine-tuned the HuBERT model to detect phonemes, I chose a fine-tuned ASR HuBERT model, removed the last two layers, and added a linear layer sized to the phoneme vocab_size from the config. What is ...
0
votes
1
answer
58
views
Why does my keras model with multiple inputs accept the shape of the training data for .call() but not for .evaluate()?
I'm currently investigating the effect of masking attention scores in MultiHeadAttention layers in a Transformer model for classification of time series data. I have built a model that accepts a time ...
0
votes
1
answer
194
views
Trouble understanding the formula for estimating dense self-attention FLOPS per Token given as 6LH(2QT)
In Appendix B of the PaLM paper (https://arxiv.org/pdf/2204.02311), it describes a metric called "Model FLOPs Utilization (MFU)" and the formula for estimating it. Its computation makes ...
0
votes
0
answers
48
views
Does the Temporal Fusion Transformer use the ground truth of the target during testing?
In this neural network structure, I want the model to train and validate without using the historical target values, and to make predictions directly from the covariates, so I set these ...
2
votes
1
answer
499
views
What to do when the gradient explodes in a Transformer model?
General question (hopefully useful for people coming from google): What to do when the gradient explodes? When working with transformers and deep NNs (with PyTorch), do you have a mental checklist of ...
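One item that usually tops that checklist, sketched minimally (clip the global gradient norm between backward() and the optimizer step, typically combined with a lower learning rate and warmup; the tiny model here is just a stand-in):
import torch
model = torch.nn.Linear(16, 16)                     # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
# Clip the global grad norm before stepping; also check for NaN/inf in the data and loss.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()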
0
votes
0
answers
63
views
Error using the SegFormer model with keras-hub: the structure of `inputs` doesn't match the expected structure
I am trying a semantic segmentation task using the SegFormer model with the pretrained model 'mit_b3_cityscapes_1024'.
encoder = keras_hub.models.MiTBackbone.from_preset(
"mit_b3_cityscapes_1024&...
2
votes
0
answers
319
views
Timestamps reset every 30 seconds when using distil-whisper with return_timestamps=True
Problem
distil-large-v3#sequential-long-form
I'm using distil-whisper through the 🤗 Transformers pipeline for speech recognition. When setting return_timestamps=True, the timestamps reset to 0 every ...
0
votes
0
answers
30
views
Why can't I access Konva.Transformer using `konvaStore.transformer!.nodes([target])` after initialization in Qwik component?
I'm building an app using Qwik, TypeScript, and Konva, and I'm trying to implement a transformer that allows users to click on a shape and resize or transform it. My goal is to create and access the ...
0
votes
0
answers
12
views
Reverse Mapping of Table Elements from screenshot | Table Transformer
I am working on an end-to-end (E2E) project for websites that involves:
Capturing Tight Screenshots of Data Tables: The project automatically detects and takes precise screenshots of all the data ...
1
vote
1
answer
149
views
Low accuracy using vision transformers for image classification
I've been learning the workings of a Vision Transformer (building the ViT from scratch) and couldn't get it to run at first. Somehow I managed to scramble together code, but it shows very low accuracy (3%).
...
1
vote
1
answer
469
views
ValueError: Exception encountered when calling layer 'tf_bert_model' (type TFBertModel)
I have been trying to run TFBertModel from Transformers, but it kept on throwing me this error
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ...
0
votes
1
answer
39
views
How do I pass information to a TestNG IAnnotationTransformer?
I just discovered that the implementation of a listener with IAnnotationTransformer is executed as soon as the tests are launched, even before the very first test is executed.
Background: I ...
0
votes
1
answer
711
views
How to correctly apply LayerNorm after MultiheadAttention with different input shapes (batch_first vs default) in PyTorch?
I’m working on an audio recognition task using a Transformer-based model in PyTorch. My input features are generated by a CNN-based embedding layer and have the shape [batch_size, d_model, n_token], ...
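A minimal sketch of one way to reconcile the shapes described in the question (assuming the CNN output really is [batch_size, d_model, n_token]): permute so that d_model is the last axis, since both batch_first MultiheadAttention and LayerNorm expect the feature dimension last:
import torch
import torch.nn as nn
batch, d_model, n_token = 4, 256, 50
x = torch.randn(batch, d_model, n_token)     # CNN-style output: (batch, d_model, n_token)
mha = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
norm = nn.LayerNorm(d_model)                 # LayerNorm normalizes over the last dim
x = x.permute(0, 2, 1)                       # -> (batch, n_token, d_model)
attn_out, _ = mha(x, x, x)
x = norm(x + attn_out)                       # residual connection + LayerNorm over d_model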
1
vote
0
answers
79
views
Profiling of the Llama 3.1 8B model on an AI accelerator
I have the profiling results of the inference of the Llama 3.1 8B model by Meta. I deployed the model on the AI accelerator. I managed to create a memory trace of the whole model from the host to the ...
0
votes
1
answer
58
views
Compare two consecutive rows in DataStage and drop the rows that don't meet a condition
I'm reading a file using a Sequential File stage in DataStage and doing some transformation of the data using a Transformer stage. I want to compare the current row with the previous row to check a value of ...
0
votes
1
answer
164
views
scaled_dot_product_attention with a higher head count costs much more memory
I found that scaled_dot_product_attention costs much more memory when the number of heads is large (>=16). This is my code to reproduce the issue.
import torch
length = 10000
dim = 64
head_num1 = 8
head_num2 = ...
1
vote
0
answers
72
views
Why does my trained t5-small model generate a mess after I save and load the checkpoint?
I was distilling my student model (base model t5-small) based on a fine-tuned T5-xxl. Here is the config
student_model = AutoModelForSeq2SeqLM.from_pretrained(
args.student_model_name_or_path,
...
0
votes
1
answer
36
views
Unable to figure out the hardware requirements (cloud or on-prem) for open-source inference for multiple users
I am trying to budget for setting up an LLM-based RAG application which will serve a dynamic number of users (anything from 100 to 2000).
I am able to figure out the GPU requirement to host a certain llm[...
0
votes
0
answers
61
views
Extracting a Swin-ViT backbone
I am wondering whether there is a way to extract a Swin-ViT backbone, similar to ResNet.
I am attempting to train a few self-supervised learning algorithms, where I need to get just the backbone (...
1
vote
0
answers
35
views
PyTorch quantized linear function gives an invalid shape error
I am trying to write a simple quantized tensor linear multiplication. Assume the weight matrix w3 has shape (14336, 4096) and the input tensor x has shape (2, 512, 4096), where the first dim is ...
0
votes
1
answer
195
views
Inference error after training an IP-Adapter plus model
I downloaded the packages from https://github.com/tencent-ailab/IP-Adapter
and ran the commands to train an IP-Adapter Plus model (input: text + image, output: image):
accelerate launch --num_processes 2 --...
0
votes
1
answer
105
views
KV caching for varying length texts
I am trying to do some structured text extraction using some KV caching tricks. For this example I will use the following model and data:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = ...
0
votes
0
answers
232
views
Should I interleave sin and cosine in sinusoidal positional encoding?
I'm trying to implement a sinusoidal positional encoding. I found two solutions that give different encodings. I am wondering if one of them is wrong or both are correct. I showcase visual figures of ...
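Both layouts show up in real implementations; here is a small sketch suggesting they contain the same values and differ only by a fixed reordering of the embedding columns, so either is fine as long as it is applied consistently:
import torch
def pe_interleaved(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1)
    i = torch.arange(0, d_model, 2)
    angles = pos / 10000 ** (i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even columns: sin
    pe[:, 1::2] = torch.cos(angles)   # odd columns: cos
    return pe
def pe_blocked(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1)
    i = torch.arange(0, d_model, 2)
    angles = pos / 10000 ** (i / d_model)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [sin | cos]
a, b = pe_interleaved(10, 8), pe_blocked(10, 8)
perm = torch.arange(8).reshape(2, 4).T.reshape(-1)   # [0, 4, 1, 5, 2, 6, 3, 7]
print(torch.allclose(a, b[:, perm]))                 # True: same encoding, columns reordered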
0
votes
1
answer
99
views
TensorFlow executes slowly on GPU - retracing issue?
I am trying to develop a transformer sequence-to-vector model but am encountering performance issues.
I am working with a Tesla V100-PCIE-16GB. Whenever the model encounters an unseen sequence length, the (...
0
votes
2
answers
248
views
Why, even with anisotropy, can we still compute embedding similarity in (some RAG) projects?
I have just noticed that the token/sentence embeddings trained from Transformer-based models have a strong anisotropy problem, which means most of the embeddings are close to each other in the vector ...
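A toy sketch of one common intuition: when every embedding shares a large common component, the absolute cosine similarities are all inflated, but the ranking is still driven by the content-specific part (mean-centering the embeddings makes this visible):
import torch
import torch.nn.functional as F
torch.manual_seed(0)
# Simulate anisotropic embeddings: a large shared offset plus small content-specific parts.
common = torch.randn(64) * 10
content = torch.randn(3, 64)
emb = common + content                          # all vectors point in roughly the same direction
print(F.cosine_similarity(emb[0:1], emb, dim=-1))        # all close to 1 (anisotropy) ...
centered = emb - emb.mean(dim=0)                # remove the shared component
print(F.cosine_similarity(centered[0:1], centered, dim=-1))  # ... yet the ordering reflects content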
0
votes
1
answer
269
views
Exploding Gradient (NaN Training Loss And Validation Loss) In Multi Head Self Attention - Vision Transformer
This multi-head self-attention code causes the training loss and validation loss to become NaN, but when I remove this part, everything goes back to normal. I know that when the training loss and ...
3
votes
1
answer
519
views
Get the attention scores of a pretrained transformer in pytorch
I've been trying to look at the attention scores of a pretrained transformer when I pass specific data in. It's specifically a PyTorch Transformer. I've tried using forward hooks, but I'm only able to ...
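One workaround often suggested for nn.MultiheadAttention-based models, sketched here as an illustration rather than a guaranteed recipe: wrap each attention module's forward to force need_weights=True (note the fused fast path used in eval/inference mode may bypass this):
import torch
import torch.nn as nn
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
attn_maps = []
def patch(mha):
    orig_forward = mha.forward
    def wrapped(*args, **kwargs):
        kwargs["need_weights"] = True            # ask MHA to return the attention weights
        kwargs["average_attn_weights"] = False   # keep one map per head
        out, weights = orig_forward(*args, **kwargs)
        attn_maps.append(weights.detach())
        return out, weights
    mha.forward = wrapped
for m in encoder.modules():
    if isinstance(m, nn.MultiheadAttention):
        patch(m)
x = torch.randn(1, 10, 32)
encoder(x)
print(len(attn_maps), attn_maps[0].shape)        # 2 layers, (batch, heads, tgt_len, src_len)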
1
vote
1
answer
8k
views
cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub'
I've been using LLAMA 2 for research for a few months now and I import as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda")
tokenizer = ...
0
votes
1
answer
120
views
Inference workflow in compile mode using transformers.pipeline()
I am trying to run an inference workflow of a Llama model in compile mode using transformers.pipeline(). I am using the following lines of code to run the inference workflow in compile mode:
model = ...
0
votes
1
answer
159
views
How to use SegFormer encoder and decoder separately?
I am trying to understand the SegFormer model and would like to use the encoder and decoder separately with different models.
I have tried looking into the official implementation, which is based on mmseg, and ...