23,946 questions
-2
votes
0
answers
26
views
How to efficiently train a CNN based CV model? [closed]
I would say I'm moderately experienced in deep learning and computer vision. However, I have a task to mask certain textured objects for instance segmentation. I'm not sure how to train a better model....
Advice
0
votes
0
replies
16
views
HRNet trained from scratch on MPII high PCKh but poor visual predictions
I'm training an HRNet model from scratch on the MPII dataset for human pose estimation. The model succeeded in reaching high accuracy, around 0.93 PCKh@0.5; however, the visual results are not really promising....
1
vote
0
answers
17
views
instance segmentation on custom coco dataset using pytorch maskrcnn + fpn for 83 categories (+background)
I am training an instance segmentation model on a custom COCO dataset using PyTorch Mask R-CNN + FPN for 83 categories (+background).
What is the problem with the following setup, and why is the RPN head not ...
0
votes
1
answer
32
views
Installing Python with PyTorch - PowerShell not recognizing Torch package
I have a problem where I am trying to use PyCharm for PyTorch. I have installed Python separately (quite a task as it tried to install it in the Microsoft/AppData folder?).
In PyCharm, I have to first ...
0
votes
0
answers
14
views
How do I implement multi-task fixed Gaussian noise in GPyTorch?
I'm attempting to perform multi-task Gaussian process regression using GPyTorch. I have, for each of N training examples, its corresponding (T x T) cross-task noise covariance matrix. I aim to ...
0
votes
0
answers
61
views
Performance problem with automatic parallelization in `pytorch`
I'm having problems with python code that uses pytorch. The details are a bit complicated (the code is part of a quantum mechanical calculation) but the code structure is very straightforward and ...
0
votes
1
answer
232
views
PyTorch not recognizing RTX 5090 (sm_120) on Windows 11 – CUDA error: no kernel image available
I'm trying to use PyTorch with an NVIDIA GeForce RTX 5090 (Blackwell architecture, CUDA Compute Capability sm_120) on Windows 11, and I keep running into compatibility issues. PyTorch detects CUDA, ...
0
votes
0
answers
61
views
How do I visualize the latent representation produced by the Stable Diffusion VAE?
I am trying to visualize the latent representation produced by the VAE inside a Stable Diffusion pipeline
from diffusers import StableDiffusionPipeline
import torch
# A CUDA ordinal is simply the ...
0
votes
0
answers
33
views
AWS SageMaker PyTorch Model Deployment - is entry_point needed?
I'm trying to deploy a pre-trained PyTorch model to SageMaker using the Python SDK. I have a model.tar.gz file that is uploaded to S3, with the following structure:
code/
code/requirements.txt
code/...
Tooling
0
votes
0
replies
56
views
Good packages for bounded Linear Quantile Regression?
I'm looking for a good package to train a linear quantile regression model, i.e. $\hat y = \sum_{i=1}^n w_i \cdot x_i$, where $x_i$ are the input features and $w_i$ are the bounded trainable weights. ...
0
votes
0
answers
33
views
Attribution Error when using Huggingface transformers Trainer with FSDP
I am now trying to use FSDP in Huggingface transformers Trainer. The training script is something like
train_dataset = Mydataset(...)
args = TrainingArguments(...)
model = LlamaForCausalLM....
2
votes
1
answer
90
views
Having trouble with R's torch and tensor dimensions
I am trying to follow along with this webpage: https://jtr13.github.io/cc21fall2/tutorial-on-r-torch-package.html
I am trying to understand R's implementation of PyTorch.
I am having some trouble with ...
0
votes
0
answers
57
views
How to force NCCL build to embed PTX for all kernels (prevent linker from stripping ncclDevKernel PTX)?
I am compiling NCCL 2.27.5-1 (I also tried 2.28.9-1) from source for a V100 GPU (sm_70). My goal is to have libnccl.so contain compute_70 PTX for every kernel.
Despite passing explicit -gencode=arch=...
1
vote
0
answers
113
views
PyTorch installed via uv project shows CPU-only version on Windows with CUDA specification in pyproject.toml
I'm trying to set up a Python project using uv and pyproject.toml on Windows. I want to install the CUDA-enabled PyTorch, but after installing, when I check the version, it shows CPU-only.
Here’s my ...
Advice
0
votes
0
replies
30
views
When using TensorDictPrioritizedReplayBuffer, should I apply the priority weight manually or not?
With Prioritized Experience Replay (PER), we use the beta parameter to compute the weights that offset the bias introduced by PER. Now, with PyTorch's TensorDictPrioritizedReplayBuffer, I ...
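For readers unfamiliar with the weight mentioned in this excerpt, here is a minimal plain-Python sketch of the importance-sampling weight from the original PER formulation, w_i = (N * P(i))^(-beta), normalized by the largest weight. The function name `importance_weights` and the default `alpha` and `beta` values are assumptions for illustration. It does not show how TensorDictPrioritizedReplayBuffer behaves internally, and it does not answer whether that buffer applies the weight for you.

```python
def importance_weights(priorities, alpha=0.6, beta=0.4):
    """Return normalized importance-sampling weights, one per stored transition.

    Hypothetical helper for illustration only; not part of any library.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]           # sampling probability P(i)
    n = len(priorities)
    raw = [(n * p) ** (-beta) for p in probs]     # (N * P(i))^(-beta)
    max_w = max(raw)
    return [w / max_w for w in raw]               # scale so the largest weight is 1


# With alpha = beta = 1 the weights are inversely proportional to the priorities.
weights = importance_weights([1.0, 2.0, 4.0], alpha=1.0, beta=1.0)

# With beta = 0 there is no correction and every weight is 1.
uncorrected = importance_weights([1.0, 2.0, 4.0], alpha=1.0, beta=0.0)
```

In a training loop these weights would typically multiply the per-sample loss before averaging, so that frequently sampled high-priority transitions contribute less per update.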
1
vote
2
answers
130
views
pytorch Module B=A, A.to('cpu'), but the tensor in B is still in GPU, why?
After converting module A to CPU, the original parameter tensor still stays on the GPU? When is it released? Is it wrong if I reuse the parameter?
My code:
import torch.nn as nn
class A(nn.Module):
...
2
votes
1
answer
28
views
PyTorch .view() operation to manipulate tensor dimensions vis a vis using torch.unbind followed by torch.cat
In Torch, .view() reshapes the tensor. However, there are multiple ways to reshape a multi-dimensional tensor to a target shape. How does it decide between those different ways?
For example, in Torch, ...
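As a small illustration of the question in this excerpt, the sketch below assumes a contiguous 2 x 3 tensor and compares `.view()` with `torch.unbind` followed by `torch.cat`. On a contiguous tensor, `.view()` does not choose between orderings: it reinterprets the existing row-major storage without copying, whereas unbinding along a dimension and concatenating builds a new tensor with a different element order. The variable names are illustrative only.

```python
import torch

# A contiguous 2 x 3 tensor: [[0, 1, 2], [3, 4, 5]]
x = torch.arange(6).view(2, 3)

# view() keeps the row-major element order of the underlying storage.
as_3x2 = x.view(3, 2)

# unbind along dim=1 yields the three columns; cat joins them end to end.
via_cat = torch.cat(torch.unbind(x, dim=1))

# view() returns a view on the same memory, so no data is copied.
shares_storage = as_3x2.data_ptr() == x.data_ptr()

print(as_3x2.tolist())   # [[0, 1], [2, 3], [4, 5]]
print(via_cat.tolist())  # [0, 3, 1, 4, 2, 5]
```

The two results differ because `via_cat` walks the tensor column by column, while `.view()` can only expose orderings that match the existing memory layout.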
2
votes
1
answer
1k
views
PyTorch fails on Windows Server 2019: “Error loading c10.dll” (works fine on Windows 10)
I'm trying to deploy a Python project on Windows Server 2019, but PyTorch fails to import with a DLL loading error.
On my local machine (Windows 10, same Python version), everything works perfectly.
...
1
vote
1
answer
61
views
.so file built on same CPU but different EC2 instances lead to missing symbols
I am building a wheel of PyTorch from source, based on their https://github.com/pytorch/pytorch/blob/v2.6.0/.ci/manywheel/build_common.sh CI build script. I tested on a "local" instance of a ...
Advice
0
votes
2
replies
48
views
Fixing a UNET in pytorch that doesn't work in eval mode due to BatchNorm2d layers
I have a UNET model trained in pytorch (by someone else) that produces quite different results in eval mode to train mode (train mode results look good, eval mode they are rubbish). A bit of googling ...
0
votes
0
answers
55
views
Given groups=1, weight of size [64, 1024, 1, 1], expected input[1, 256, 1, 1] to have 1024 channels, but got 256 channels instead
I have encountered this issue and searched the forums, but I couldn't solve it. How can I solve this problem?
I tried to add a CBAM module to YOLOv12 for my custom dataset to improve accuracy. I ...
0
votes
0
answers
103
views
My SimSiam is collapsing- SimSiam on CUB-200-2011 with ViT
I'm trying to implement SimSiam using a ViT backbone on the CUB-200-2011 dataset. However, during training, the embeddings collapse to a single direction despite using stop-gradient. Here’s what I ...
-1
votes
0
answers
25
views
How to use the models from huggingface from local machine server
I am trying to use the Emotion Llama model and to understand how to download the models from Hugging Face and place them in the right directory. It actually suggests downloading three models in ...
1
vote
1
answer
76
views
Is passing ray resources as options when calling the function equivalent to setting them in the function's decorator?
Is
@ray.remote
def run_experiment(...):
    (...)
if __name__ == '__main__':
    ray.init()
    exp_config = sys.argv[1]
    params_tuples, num_cpus, num_gpus = load_exp_config(exp_config)
    ray.get(...
0
votes
0
answers
48
views
Unclear formulation in Temporal Fusion Transformer paper
I am currently trying to implement the Temporal Fusion Transformer using PyTorch.
This paper (https://arxiv.org/pdf/1912.09363) is my reference.
Currently I am stuck with the variable selection ...
0
votes
0
answers
31
views
Where is EXECUTORCH_LIBRARY defined in ExecuTorch v1.0?
I’m trying to register a custom operator for ExecuTorch (v1.0, built from the PyTorch 2.5 source tree).
My goal is to create a shared library that defines a few quantum operators and runs them from a ....
0
votes
0
answers
49
views
Torch 2.4.1 doesn't utilize my system memory after CUDA memory runs out
I wrote a lot of scripts to test the compatibility of my system with PyTorch 2.4.1, and they all indicate I can run it. I don't have enough memory on my GPU, so I tried enabling expandable_segments so ...
1
vote
1
answer
150
views
How to configure uv via pyproject.toml to lock PyTorch (+cu118) to a custom index and prevent uv run from using the CPU-only version?
I am managing a project with uv (v0.9.4) that requires a specific PyTorch CUDA build. The generic installation works, but using uv run causes a package conflict, despite the environment being correct.
...
0
votes
0
answers
82
views
IndexError: index -1 is out of bounds for dimension 0 with size 0
I am currently experimenting with modifying the KV cache of the LLaVA model in order to perform controlled interventions during generation (similar to cache-steering methods in recent research). The ...
0
votes
1
answer
35
views
How can I get torch.set_grad_enabled(True) to work in ComfyUI?
I just spent hours figuring out that the following code fails when included in a ComfyUI custom node, but works perfectly fine outside (using the same Python venv). I finally found out that someone ...
0
votes
1
answer
80
views
Unable to step into torch.nn.functional.linear using VS Code debugging
I want to step into the linear function using VS Code's "Step Into", but the debugger skips over it automatically when I click "Step Into". Could anyone help me with this?
I used DEBUG=1 when compiling PyTorch.
...
1
vote
0
answers
68
views
Should I use torch.inference_mode() in a prediction method even when using model.eval()? [duplicate]
I'm following the book "Deep Learning with PyTorch Step By Step" and I have a question about the predict method in the StepByStep class (from this repository: GitHub).
The current ...
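To make the distinction behind this question concrete, here is a minimal sketch with a hypothetical two-layer model (not the StepByStep class from the book). It assumes only standard PyTorch behavior: `model.eval()` switches layers such as dropout to evaluation behavior, while `torch.inference_mode()` separately disables gradient tracking.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model for illustration only.
model = nn.Sequential(nn.Linear(4, 2), nn.Dropout(p=0.5))
x = torch.randn(3, 4)

# eval() turns dropout off, but autograd still records the forward pass.
model.eval()
out_eval = model(x)

# inference_mode() additionally skips gradient tracking for this forward pass.
with torch.inference_mode():
    out_inference = model(x)

# The numerical predictions agree; only the autograd bookkeeping differs.
same_values = torch.allclose(out_eval.detach(), out_inference)

print(out_eval.requires_grad)       # True
print(out_inference.requires_grad)  # False
```

Under these assumptions the two settings are complementary: one controls layer behavior, the other controls whether a computation graph is built.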
1
vote
0
answers
186
views
Transformers 'could not import module pipeline' to jupyter notebook
I need to run a series of pre-trained, fine-tuned models from Hugging Face in a Jupyter notebook. I have updated to the latest version of both PyTorch and Transformers, but when I run the code
from ...
Advice
2
votes
0
replies
89
views
How should I balance DSA, ML fundamentals, PyTorch implementation, and Kaggle practice for ML Engineer interviews?
I’m a Computer Science graduate preparing for ML/AI Engineer roles.
I’m facing a dilemma about what to focus on, how much time to allocate to each area, and what exact roadmap to follow to prepare ...
2
votes
0
answers
117
views
I get the error "ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory" when I try to use torch in a virtual env
I have installed CUDA 13 on Fedora 42.
When I use PyTorch locally, torch works fine, but when I create a virtualenv, PyTorch can't find the libcudnn files.
I get the error
ImportError: libcudnn.so.9: ...
2
votes
2
answers
94
views
Decoder only model AI making repetitive responses
I am building a decoder-only transformer using PyTorch, and my dataset of choice is the full English dataset from Kaggle, Plaintext Wikipedia (full English).
The problem is that my model output is ...
0
votes
1
answer
77
views
Generating response with KV Cached System Prompt throws error when Input Tokens are less than Prompt Tokens
I am trying to run Mistral-7B-Instruct-v0.2.
Each run is PROMPT + details[i].
PROMPT has instructions on how to generate JSON based on details.
As the prefix part of each input is the same, kind of like a ...
2
votes
1
answer
39
views
AttributeError: 'NoneType' object has no attribute 'blocks' when running Cache-DiT example with Wan2.2 model
I’m trying to use Cache-DiT to accelerate inference for the Wan2.2 model.
However, when I run the example script,
python run_wan_2.2_i2v.py --steps 28 --cache
I get the following error.
Namespace(...
0
votes
0
answers
39
views
How do I interpret Gaussian process parameters?
I'm performing Gaussian process regression using GPyTorch. I'm modeling two correlated tasks as follows:
class MyModel(gpytorch.models.ExactGP):
    def __init__(self, X, Y, likelihood):
        super(...
2
votes
0
answers
59
views
Having problems computing PDE Residuals
I'm computing PDE residuals for The_Well datasets (e.g. turbulent_radiative_layer_2D and shear_flow) using finite differences, but the residuals are much larger than I expect. The data are generated ...
0
votes
1
answer
29
views
Can I avoid setting-up and tearing down processes when using PyTorch DataLoader?
In my scenario I use multiple DataLoaders with multiple Datasets to evaluate models against each other (I want to test models with multiple resolutions, which means each dataset has a distinct ...
1
vote
1
answer
125
views
Can uv integrate with e.g. pytorch prebuilt docker env?
So, pytorch requires a rather large bundle of packages. The prebuilt docker pytorch gpu images (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html) are quite helpful in ...
1
vote
0
answers
186
views
Why does “Command Buffer Full” appear in PyTorch CUDA kernel launches?
I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline, some kernels show “Command Buffer Full”. This causes the cudaLaunchKernel time to become very long, as shown ...
0
votes
0
answers
98
views
ModuleNotFoundError: No module named 'losses.loss'; 'losses' is not a package error when training MAT model (PyTorch, NVIDIA repo)
I'm trying to fine-tune the MAT (Masked Attention Transformer) model from the official repository:
https://github.com/fenglinglwb/MAT
However, I keep getting the following error during training:
...
-1
votes
0
answers
38
views
open3d.ml build for torch==2.10 (for sm_120 GPU architecture)
I have an NVIDIA GeForce RTX 5060 with the "Blackwell" architecture and compute capability 12.0, which is why I have to use the nightly build pytorch=2.10.0.dev20251017+cu128, which has support for ...
0
votes
0
answers
98
views
Torch example transformer with TransformerDecoder
In the torch example provided here https://github.com/pytorch/examples/tree/main/word_language_model, the transformer only uses torch.TransformerEncoder, and torch.TransformerDecoder is overwritten with a ...
0
votes
0
answers
40
views
T5-small generates only padding tokens during validation/test in PyTorch Lightning
I'm fine-tuning T5-small using PyTorch Lightning and encountering a strange issue during validation and test steps.
The Problem:
During validation_step and test_step, model.generate() consistently ...
-1
votes
0
answers
75
views
Torchvision save segmentation masks to png
There is a tutorial I am trying to follow, https://docs.pytorch.org/tutorials/intermediate/torchvision_tutorial.html, which works with .png files as segmentation masks.
The png files can be found here:
https://...
2
votes
1
answer
123
views
Fast vectorized maximal independent set greedy algorithm [closed]
I need a really fast vectorized maximal independent set algorithm implemented in pytorch, so I can use it for tasks with thousands of nodes in reasonable time.
I cannot use networkx, it is way too ...
1
vote
0
answers
68
views
How to pass P_map: dict[str, torch.Tensor] to PEFT (LoRA)?
My proxy goal is to change LoRA from h = (W + BA)x to h = (W + BAP)x. Preliminary code is attached for your reference.
My actual goal is to train a model with the following loss: $\tilde{\Theta} = \arg\min_{\hat{\Delta}} \lVert f_{(\ldots)} \ldots$