Diagnosis of random reboot
What to do when your Slimbook restarts by itself

Random reboots are often caused by hardware problems (overheating, RAM failures, damaged disks or unstable power supplies) or by software conflicts (incompatible drivers, kernel errors or outdated firmware). External factors, such as electrical fluctuations, can also play a role. Analyzing system logs and using diagnostic tools is key to identifying the cause.

In this tutorial you will find the steps to diagnose unexpected reboots.

System logs are essential to understand what happens just before a reboot.

1. Open a terminal.
2. Collect the logs.

sudo journalctl --since "2 days ago" > logs_reboot.txt

What to look for in the logs:

- Messages indicating critical errors (`Critical`, `Kernel Panic`, `OOM Killer`).
- Hardware-related failures (`GPU`, `CPU`, `thermal`, `power supply`) or driver errors.
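To get a quick overview before reading the whole file, the matches for a few of these terms can be counted. The following is a minimal sketch: `count_log_keywords` is a hypothetical helper name, and the keyword list is an assumption that should be adapted to your case.

```shell
# Sketch only: count_log_keywords is a hypothetical helper, and the keyword
# list is an assumption based on the terms mentioned above.
count_log_keywords() {
    local logfile="$1" kw
    for kw in "critical" "kernel panic" "oom-kill" "thermal"; do
        # grep -i: ignore case, -c: print the number of matching lines
        printf '%s: %s\n' "$kw" "$(grep -ic -- "$kw" "$logfile")"
    done
}

# Usage (not run here):
#   count_log_keywords logs_reboot.txt
```

If the journal is stored persistently, `journalctl -b -1 -p err` should limit the output to error-level messages from the boot before the current one, which is usually the boot that ended in the unexpected reboot.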

Check kernel messages

The kernel often logs hardware-related problems. Extract kernel messages:

sudo dmesg > logs_dmesg.txt

Analyze the `logs_dmesg.txt` file, looking for:

- Hardware-related error messages: `nouveau`, `nvidia`, `radeon`, `amdgpu`, `thermal throttling`, etc.
- Device disconnect/reconnect messages (`X disconnected`).
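Keep in mind that `dmesg` only holds messages from the current boot, so after a reboot the messages that preceded it are no longer in its buffer. The sketch below filters kernel messages read from standard input, so it can be fed either from `dmesg` or from the previous boot's kernel log in the journal (which requires a persistent journal). `filter_hw_messages` is a hypothetical name, and the pattern is an assumption built from the terms listed above.

```shell
# Sketch only: keep the lines that mention GPU drivers, thermal events or
# device disconnects. Reads from standard input.
filter_hw_messages() {
    # -i: ignore case, -E: extended regular expression
    grep -iE 'nouveau|nvidia|radeon|amdgpu|thermal|throttl|disconnect'
}

# Usage (not run here):
#   sudo dmesg | filter_hw_messages
#   sudo journalctl -k -b -1 | filter_hw_messages   # kernel messages, previous boot
```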

Monitoring temperatures and overheating

Abrupt reboots are common when the system reaches critical temperatures.

1. Install s-tui and stress:

sudo apt install s-tui stress      # Ubuntu
sudo dnf install s-tui stress      # Fedora
sudo pacman -S s-tui stress    # Manjaro

2. Monitor sensors and temperatures in real time, watching for critical values above 90-95 ºC.
The maximum supported temperature depends on the CPU model. For more information, we recommend checking the official AMD or Intel website and searching for "Maximum operating temperature".

sudo s-tui

Run a load test by choosing the stress option for 30 minutes, and monitor the temperature while it runs.
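If you also want a written record of the temperatures during the test, the kernel exposes them under `/sys/class/thermal`, in thousandths of a degree Celsius. The following is a minimal sketch under that assumption: `millideg_to_c` and `hot_zones` are hypothetical helper names, the 90 ºC default is taken from the range mentioned above, and the thermal zones available vary from one machine to another.

```shell
# Convert a sysfs reading in millidegrees Celsius to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print every thermal zone at or above a limit (default: 90 degrees C).
# The base directory is a parameter so the function can be tried on sample data.
hot_zones() {
    local limit="${1:-90}" base="${2:-/sys/class/thermal}" zone temp
    for zone in "$base"/thermal_zone*; do
        # Skip zones without a readable temperature file
        [ -r "$zone/temp" ] || continue
        temp=$(millideg_to_c "$(cat "$zone/temp")")
        if [ "$temp" -ge "$limit" ]; then
            echo "$(basename "$zone"): ${temp} C"
        fi
    done
}

# Usage (not run here), in a second terminal while the load test runs:
#   while sleep 5; do date; hot_zones 90; done | tee temps.log
```

Because the log is written to disk as the test runs, the last readings should survive even if the laptop reboots in the middle of it.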


Diagnosing RAM

RAM errors may cause reboots. Use memtest86+ following this tutorial:

Rule out RAM failure/error with Memtest86+ | SLIMBOOK

Check disk status

Hard disk or SSD problems can cause instability. For NVMe drives, use nvme-cli to check:

1. Install nvme-cli:

sudo apt install nvme-cli         # Ubuntu
sudo dnf install nvme-cli         # Fedora
sudo pacman -S nvme-cli       # Manjaro

2. Check disk status:

sudo nvme smart-log /dev/nvme0

Change `/dev/nvme0` to your disk identifier (e.g., `/dev/nvme0`, `/dev/nvme1`, etc.).


3. Check indicators such as:

critical_warning: This field indicates whether there are critical problems on the NVMe device. It is a bitmask in the NVMe SMART / Health Information log, where each bit represents a critical condition, so several bits can be set at once. The individual values are:

- 0x00: Everything is fine; there are no critical warnings.
- 0x01: Available spare capacity has fallen below the threshold (few spare blocks left).
- 0x02: Temperature is above the over-temperature threshold or below the under-temperature threshold.
- 0x04: Device reliability is degraded due to significant media errors or an internal error.
- 0x08: The media has been placed in read-only mode.
- 0x10: The volatile memory backup device has failed.
- 0x20: The Persistent Memory Region has become read-only or unreliable (NVMe 1.4 and later).
- 0x40 and higher: Reserved for future use.
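Because the field is a bitmask, a single reported value can combine several entries of the list; for example, 0x05 is 0x01 plus 0x04. The following minimal sketch splits a reported value into its individual bits so that each one can be looked up in the list above. `set_bits` is a hypothetical helper name, not part of nvme-cli.

```shell
# Print the individual bit values that are set in a critical_warning value.
# Accepts decimal (5) or hexadecimal (0x05) input.
set_bits() {
    local value=$(( $1 )) bit
    for bit in 1 2 4 8 16 32 64 128; do
        # A non-zero bitwise AND means this bit is set
        if [ $(( value & bit )) -ne 0 ]; then
            printf '0x%02x\n' "$bit"
        fi
    done
}

# Usage (not run here):
#   set_bits 0x05
```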

num_err_log_entries: This field shows the cumulative number of entries in the device error log. It is useful for checking whether operational errors have occurred, such as:

- Read or write failures.
- Host connection problems.
- NVMe controller internal errors.

Warning Temperature Time: Indicates the cumulative time (in minutes, according to the NVMe specification) that the NVMe drive has been working outside the recommended temperature range, but not necessarily in a critical state. There are two temperature-related thresholds for NVMe drives:

- Warning Temperature: An upper or lower range that is not ideal, but not fatal either.
- Critical Temperature: An extreme temperature that can damage the device.

Update firmware/BIOS

Outdated firmware may cause hardware problems. Update the BIOS and EC following this tutorial:

How to update BIOS and EC on your Slimbook | SLIMBOOK

What to do if reboots continue

1. Provide us with:

   - journalctl and dmesg logs.
   - Results of the stress and s-tui tests.
   - Disk status from nvme-cli.

2. Check for possible hardware failures:

   - Power supply.
   - Battery failures.

3. Consider contacting us through support if you cannot identify the cause.
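To send everything in one attachment, the collected files can be packed into a single compressed archive. The following is a minimal sketch: `bundle_reports` is a hypothetical helper, and both the archive name and `nvme_smart.txt` (for example, the output of `sudo nvme smart-log /dev/nvme0` redirected to a file) are assumed names.

```shell
# Pack the given report files into one .tar.gz archive.
# Stops with an error if any of the files is missing.
bundle_reports() {
    local archive="$1" f
    shift
    for f in "$@"; do
        [ -f "$f" ] || { echo "missing file: $f" >&2; return 1; }
    done
    # -c: create, -z: gzip compression, -f: archive file name
    tar -czf "$archive" "$@"
}

# Usage (not run here):
#   bundle_reports slimbook_reboot_report.tar.gz logs_reboot.txt logs_dmesg.txt nvme_smart.txt
```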

Vaja Benidze Slimbook
17 December, 2024