How can this output of free -m be explained?
              total        used        free      shared  buff/cache   available
Mem:          32036        1012         225           3        8400       31024
Swap:         32767       24138        8629
I understand that low free memory is no cause for alarm, since Linux uses otherwise unused memory for buffers and file system caches (buff/cache). What matters is having enough available memory.
But why is the kernel not swapping back in again? Nearly all memory is reported as available.
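As far as I understand, the available column is just the kernel's MemAvailable estimate (memory usable without swapping), so it can be cross-checked directly; a minimal sketch:

    # "available" in free is taken from MemAvailable in /proc/meminfo
    grep MemAvailable /proc/meminfo
    # same value in MiB, from the Mem: line of free -m (7th column)
    free -m | awk 'NR==2 {print "available (MiB):", $7}'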
I took this output from a continuous log-to-disk that I had set up as an every-minute cron job. At that point in time the system was so unresponsive that I could not even log in locally anymore. After slowly typing username and password, there was a timeout (Login timed out after 60 seconds.), so I could not reach a shell and had to power-cycle the server to recover.
The journal is full of "took too long", timeout and broken pipe messages, as everything on the system is crawling and therefore malfunctioning.
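For reference, such logging boils down to something like this cron.d entry (file and log path are just examples, not my exact setup):

    # /etc/cron.d/free-log -- note that % must be escaped in crontab lines
    * * * * * root { date '+\%F \%T'; /usr/bin/free -m; } >> /var/log/free-m.log 2>&1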
I played around with vm.swappiness, reducing the default value of 60 to 10 (to push the kernel more towards "only swap if it's really necessary"), but I get similar results.
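For reference, changing that amounts to something like the following (the sysctl.d file name is just an example):

    # current value
    sysctl vm.swappiness
    # change it for the running system (not persistent across reboots)
    sysctl -w vm.swappiness=10
    # make it persistent
    echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
    sysctl --system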
I was hesitant to try swapoff && swapon to force the swapped-out pages back into RAM and bring the available memory back into play. Does the OOM killer take over if not everything fits into RAM? Or does the system crash then?
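If I ever try it, a rough sanity check beforehand would presumably be to compare the swap in use against MemAvailable, since swapoff -a has to pull everything back into RAM; a sketch, using the plain /proc/meminfo values in kB:

    avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
    swap_used=$(awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
    echo "MemAvailable: ${avail} kB, swap in use: ${swap_used} kB"
    # only if avail comfortably exceeds swap_used:
    # swapoff -a && swapon -a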
A little more background information about the concrete case:
I have a Proxmox setup and am evaluating how stable everything runs. I really stress the machine by allocating more RAM to the VMs in total than I physically have. To my understanding, this should still work, at the small price of using swap space and slowing things down.
I noticed that everything works as stably as I expected. I play around with suspending VMs to disk, then starting other VMs. Swap gets used if needed, and when VMs are suspended, swap is freed again.
But lately I added backups to my evaluation, and this really brings the machine down. Overnight, when the PVE backup starts, more and more RAM becomes "available" while swap consumption keeps growing. Backup speed falls from "1% per few seconds" to "1% per several hours" and eventually no progress is made at all. The machine becomes unresponsive with the memory picture shown above. The VMs are still running, but their applications are malfunctioning too, as their systems get errors like "interrupt took 2.2s", "Watchdog timeout (limit 3min)!", "CPU stuck for 23s!". In the morning I find myself with an unresponsive host.
cat /proc/meminfo would give a lot more detail than free and might give a clue how the "available" memory is actually being used.
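For example, the fields that usually matter for this kind of picture can be pulled out like this:

    grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|Shmem|SwapTotal|SwapFree|SwapCached|Dirty|Writeback)' /proc/meminfo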