# Robot B-Side Boot Chain

This directory contains the robot-side boot and recovery scripts.

Normal usage is:

```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl start blitz-robot.target
```

After installation, `blitz-robot.target` is enabled and will start automatically on reboot.

To stop the chain now and disable boot-time autostart for future reboots:

```bash
sudo bash scripts/boot/disable-systemd.sh
```

## Current Startup Order

The current cold-start chain is:

1. `blitz-boot-gate.service`
2. `blitz-5g-dial.service`
3. `blitz-ros-receiver.service`
4. `blitz-b-side-omnid.service`
5. `blitz-watchdog.service`

There is no longer any automatic time-sync step in the boot chain.

## What Each Script Does

- `robot-boot.env`: default boot configuration
- `robot-boot.env.local`: machine-local overrides
- `common.sh`: shared env loading, logging, and helper functions
- `boot-gate.sh`: fixed startup delay gate
- `5g-dial.sh`: brings up the 5G modem path and verifies routing
- `start-ros-receiver-service.sh`: boot wrapper for ROS receiver
- `wait-for-unix-socket.sh`: waits for the ROS receiver unix socket
- `start-b-side-omnid-service.sh`: boot wrapper for `b_side_omnid`
- `blitz-watchdog.sh`: runtime health watchdog and recovery orchestrator
- `blitz-fault-inject.sh`: fault injection entrypoint
- `install-systemd.sh`: installs systemd units into `/etc/systemd/system`
- `disable-systemd.sh`: stops the boot chain and disables autostart

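As an illustration of the socket wait listed above, here is a minimal sketch of a polling loop with a timeout. It is not the actual content of `wait-for-unix-socket.sh`; the function name and the one-second polling interval are assumptions, and the default timeout simply reuses `BLITZ_ROS_SOCKET_WAIT_SEC`.

```bash
# Sketch only: the real wait-for-unix-socket.sh may be implemented differently.
# wait_for_unix_socket PATH [TIMEOUT_SEC]
# Polls once per second until PATH exists as a unix socket or the timeout elapses.
wait_for_unix_socket() {
  local socket_path="$1"
  local timeout_sec="${2:-${BLITZ_ROS_SOCKET_WAIT_SEC:-20}}"
  local waited=0

  while [ "$waited" -lt "$timeout_sec" ]; do
    if [ -S "$socket_path" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done

  # Final check so a socket that appears during the last sleep still counts.
  [ -S "$socket_path" ]
}
```

A caller would gate the `b_side_omnid` start on the return code, for example `wait_for_unix_socket "$socket_path" 20 || exit 1`, where `socket_path` is whatever path the ROS receiver is configured to create.
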
## Important Configuration

Most machine-specific overrides should go into:

```text
scripts/boot/robot-boot.env.local
```

Typical settings:

```bash
BLITZ_BOOT_DELAY_SEC="30"
BLITZ_LOG_FILE="/var/log/blitz-robot/startup.log"
BLITZ_RUNTIME_DIR="/run/blitz-robot"

BLITZ_5G_DIAL_DIR="${OMNISOCKETGO_ROOT}/scripts/boot"
BLITZ_5G_SERIAL_PORT="/dev/ttyUSB2"
BLITZ_5G_INTERFACE=""
BLITZ_5G_MODEM_SUBNET="192.168.224.0/22"
BLITZ_5G_GATEWAY="192.168.225.1"
BLITZ_5G_REMOVE_DEFAULT_ROUTE="1"
BLITZ_5G_ROUTE_TARGETS="106.55.173.235"
BLITZ_5G_INFO_JSON="${OMNISOCKETGO_ROOT}/scripts/boot/modem_network_info.json"

BLITZ_TIME_SERVER_IP="81.70.156.140"

BLITZ_ROS_USER="nvidia"
BLITZ_ROS_SOCKET_WAIT_SEC="20"
BLITZ_WATCHDOG_INTERVAL_SEC="5"
BLITZ_HEALTH_STALE_SEC="15"
BLITZ_OMNID_THREAD_HEARTBEAT_TIMEOUT_SEC="15"
BLITZ_NETWORK_FAIL_THRESHOLD="3"
BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC="30"
BLITZ_GPS_MONITOR_ENABLED="1"
BLITZ_GPS_DEVICE_GLOB="/dev/ttyCH341USB*"
BLITZ_GPS_CHECK_INTERVAL_SEC="10"
BLITZ_GPS_RESTART_UNITS="gpsd.socket gpsd.service"
BLITZ_WATCHDOG_ALLOW_FAULT_INJECTION="0"
```

`BLITZ_TIME_SERVER_IP` is still used, but only as the 5G route/ping health-check target. It is no longer used for automatic clock synchronization.

If `BLITZ_TIME_SERVER_IP` is left empty, the scripts fall back to the host part of `ROBOT_SIDE_OMNISOCKET_SERVER_ADDR`.

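The override order and the `BLITZ_TIME_SERVER_IP` fallback can be sketched as follows. This is an illustration, not the code in `common.sh`: the function names `load_boot_env` and `resolve_health_target` are hypothetical, and the sketch assumes `ROBOT_SIDE_OMNISOCKET_SERVER_ADDR` has the form `host:port` (IPv6 literals are not handled).

```bash
# Sketch only: function names are hypothetical, not taken from common.sh.

# Source the defaults first, then the optional local file so its values win.
load_boot_env() {
  local boot_dir="$1"
  . "${boot_dir}/robot-boot.env"
  if [ -f "${boot_dir}/robot-boot.env.local" ]; then
    . "${boot_dir}/robot-boot.env.local"
  fi
}

# Print the health-check target: BLITZ_TIME_SERVER_IP if it is set, otherwise
# the host part of ROBOT_SIDE_OMNISOCKET_SERVER_ADDR (assumed to be host:port).
resolve_health_target() {
  local addr="${ROBOT_SIDE_OMNISOCKET_SERVER_ADDR:-}"
  if [ -n "${BLITZ_TIME_SERVER_IP:-}" ]; then
    printf '%s\n' "$BLITZ_TIME_SERVER_IP"
  else
    # Strip the shortest trailing ":..." suffix, i.e. the port.
    printf '%s\n' "${addr%:*}"
  fi
}
```

Because the local file is sourced last, any variable it sets replaces the default, including setting a value to the empty string to trigger the fallback.
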
## Install Or Upgrade

Run:

```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl daemon-reload
sudo systemctl restart blitz-robot.target
```

`install-systemd.sh` will also remove any old `blitz-time-sync.service` unit left over from earlier versions.

## Disable Autostart

To stop the currently running services and disable autostart for future reboots:

```bash
sudo bash scripts/boot/disable-systemd.sh
```

To re-enable later:

```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl start blitz-robot.target
```

## Logs

All boot-chain and watchdog logs are appended to:

```text
/var/log/blitz-robot/startup.log
```

Follow the log live:

```bash
sudo tail -f /var/log/blitz-robot/startup.log
```

Check service state:

```bash
sudo systemctl status blitz-robot.target
sudo systemctl status blitz-5g-dial.service
sudo systemctl status blitz-ros-receiver.service
sudo systemctl status blitz-b-side-omnid.service
sudo systemctl status blitz-watchdog.service
```

Check systemd journal:

```bash
sudo journalctl -u blitz-robot.target -u blitz-5g-dial.service \
  -u blitz-ros-receiver.service -u blitz-b-side-omnid.service \
  -u blitz-watchdog.service -f
```

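For a one-line-per-unit overview instead of the full `systemctl status` output, a small helper along these lines may be convenient. It is a sketch rather than part of the shipped scripts: the `print_chain_state` name and the `BLITZ_UNITS` variable are hypothetical, and the unit list is copied from "Current Startup Order".

```bash
# Sketch only: helper and variable names are hypothetical.
# Units in cold-start order, as listed under "Current Startup Order".
BLITZ_UNITS="blitz-boot-gate.service blitz-5g-dial.service blitz-ros-receiver.service blitz-b-side-omnid.service blitz-watchdog.service"

# Print "<unit> <state>" for each unit. `systemctl is-active` exits non-zero
# for inactive or failed units, so "|| true" keeps the loop going.
print_chain_state() {
  local unit state
  for unit in $BLITZ_UNITS; do
    state="$(systemctl is-active "$unit" 2>/dev/null || true)"
    printf '%-32s %s\n' "$unit" "${state:-unknown}"
  done
}
```

Calling `print_chain_state` after a reboot gives a quick view of how far the chain progressed before looking at `startup.log` in detail.
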
## Runtime Status Files

The runtime status directory is:

```text
/run/blitz-robot
```

Key files:

- `b-side-omnid.status.json`
- `ros-receiver.status.json`
- `watchdog.status.json`

`watchdog.status.json` now also records `gps_ok` and `gps_device_present` so you can quickly tell whether the GPS USB serial node is currently visible and whether the last `gpsd` reconnect attempt succeeded.

Pretty-print them:

```bash
sudo python3 -m json.tool /run/blitz-robot/watchdog.status.json
sudo python3 -m json.tool /run/blitz-robot/b-side-omnid.status.json
sudo python3 -m json.tool /run/blitz-robot/ros-receiver.status.json
```

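When a status file looks suspicious, it can help to check how old it is. The sketch below treats a file as stale when it is missing or when its modification time is older than `BLITZ_HEALTH_STALE_SEC`. That criterion is an assumption made for illustration: the watchdog may instead compare a timestamp stored inside the JSON. The function name is hypothetical, and `stat -c %Y` assumes GNU or BusyBox `stat`.

```bash
# Sketch only: assumes staleness is judged by file modification time.
# status_file_is_stale FILE [MAX_AGE_SEC]
# Succeeds (exit 0) if FILE is missing or older than MAX_AGE_SEC seconds.
status_file_is_stale() {
  local file="$1"
  local max_age_sec="${2:-${BLITZ_HEALTH_STALE_SEC:-15}}"
  local now mtime

  [ -f "$file" ] || return 0
  now="$(date +%s)"
  mtime="$(stat -c %Y "$file")" || return 0
  [ $((now - mtime)) -gt "$max_age_sec" ]
}
```

For example, `status_file_is_stale /run/blitz-robot/b-side-omnid.status.json && echo stale` prints `stale` only when the file is absent or has not been rewritten within the threshold.
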
## Fault Injection

Available test commands:

```bash
sudo bash scripts/boot/blitz-fault-inject.sh bside-crash
sudo bash scripts/boot/blitz-fault-inject.sh bside-process-freeze
sudo bash scripts/boot/blitz-fault-inject.sh bside-video-thread-stall
sudo bash scripts/boot/blitz-fault-inject.sh bside-control-thread-stall
sudo bash scripts/boot/blitz-fault-inject.sh ros-crash
sudo bash scripts/boot/blitz-fault-inject.sh ros-freeze
```

For synthetic network fault injection, first enable it in `robot-boot.env.local`:

```bash
BLITZ_WATCHDOG_ALLOW_FAULT_INJECTION="1"
```

Then restart watchdog and inject:

```bash
sudo systemctl restart blitz-watchdog.service
sudo bash scripts/boot/blitz-fault-inject.sh network-down on
sudo bash scripts/boot/blitz-fault-inject.sh network-down off
```

## Recovery Behavior Summary

- If `b_side_omnid` dies or its status file goes stale, watchdog first tries a targeted `b_side` restart.
- If ROS receiver dies, loses its socket, or its heartbeat goes stale, watchdog performs an ordered full restart:
  - stop `b_side`
  - restart ROS receiver
  - wait for unix socket
  - start `b_side`
- If network checks fail repeatedly, watchdog stops `b_side`, runs `5g-dial.sh`, waits for route recovery, and then restores services.
- While 5G is healthy, watchdog keeps a host route for every target listed in `BLITZ_TIME_SERVER_IP` and `BLITZ_5G_ROUTE_TARGETS` pinned to the resolved 5G interface. When 5G becomes unhealthy, watchdog deletes those host routes so traffic can fall back to the remaining default network path. If that fallback path is still reachable, watchdog keeps `b_side_omnid` running instead of treating the situation as a full network outage.
- Whenever watchdog changes or restores those host routes, it logs a `route-path` line for each target so you can see which interface Linux currently chooses for `81.70.156.140`, `106.55.173.235`, and any other configured 5G-pinned target.
- If GPS monitoring is enabled, watchdog checks `BLITZ_GPS_DEVICE_GLOB` every `BLITZ_GPS_CHECK_INTERVAL_SEC` seconds. When the GPS serial device disappears and later reappears, watchdog restarts the units in `BLITZ_GPS_RESTART_UNITS` so `gpsd` can bind to the new device node again.
- Camera disappearance is logged as a degraded state. Reappearance triggers a `b_side` restart once the device is stable.

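The "fail repeatedly" rule for network checks can be pictured as a consecutive-failure counter combined with a cooldown. The sketch below is one plausible reading of `BLITZ_NETWORK_FAIL_THRESHOLD` and `BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC`, not the watchdog's actual logic: the function name, the state variables, and the decision to reset the counter after a recovery are all assumptions.

```bash
# Sketch only: names other than the BLITZ_* settings are hypothetical.
BLITZ_NETWORK_FAIL_THRESHOLD="${BLITZ_NETWORK_FAIL_THRESHOLD:-3}"
BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC="${BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC:-30}"

network_fail_count=0     # consecutive failed checks
last_recovery_epoch=0    # epoch seconds of the last recovery run

# record_network_check RESULT NOW
# RESULT is "ok" or "fail"; NOW is the current time in epoch seconds.
# Returns 0 when a recovery run should start, 1 otherwise.
record_network_check() {
  local result="$1"
  local now="$2"

  if [ "$result" = "ok" ]; then
    network_fail_count=0
    return 1
  fi

  network_fail_count=$((network_fail_count + 1))
  # Not enough consecutive failures yet.
  if [ "$network_fail_count" -lt "$BLITZ_NETWORK_FAIL_THRESHOLD" ]; then
    return 1
  fi
  # Threshold reached, but the previous recovery is too recent.
  if [ $((now - last_recovery_epoch)) -lt "$BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC" ]; then
    return 1
  fi

  last_recovery_epoch="$now"
  network_fail_count=0
  return 0
}
```

A watchdog loop would call it once per interval, for example `record_network_check fail "$(date +%s)" && run_recovery`, where `run_recovery` stands in for the stop, redial, and restore sequence described above.
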
## Notes

- `time-sync.sh` and `blitz-time-sync.service` are intentionally removed from the automatic boot path.
- `b_side_omnid` must already be built before boot-time startup.
- A missing `bin/b_side_omnid` binary, a missing ROS environment, or a missing modem script will each show up in `startup.log`.