feat: autostart and self-recovery mechanism

2026-04-13 21:55:40 +08:00
parent 2f507a7546
commit 25c68530ba
19 changed files with 1151 additions and 451 deletions


@@ -1,385 +1,210 @@
# Robot B-Side Boot Autostart Guide
# Robot B-Side Boot Chain
This directory provides boot-time autostart for the robot side.
This directory contains the robot-side boot and recovery scripts.
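The extra scripts and `systemd` units here are not meant to be run by hand one at a time. They split the boot flow into separately managed steps so that:
1. The startup order is fixed
2. A failed step can be retried on its own
3. Every action is written to a single local log file
4. If the "fixed 30-second delay" is later replaced by "wait for the robot's built-in self-check to finish", only the gate at the front needs to change
So in normal use, only these two steps need to be run by hand: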
Normal usage is:
```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl start blitz-robot.target
```
After that, you no longer need to run these scripts by hand when the robot reboots.
After installation, `blitz-robot.target` is enabled and will start automatically on reboot.
## Startup Order
To stop the chain now and disable boot-time autostart for future reboots:
The current boot chain is as follows:
```bash
sudo bash scripts/boot/disable-systemd.sh
```
## Current Startup Order
The current cold-start chain is:
1. `blitz-boot-gate.service`
2. `blitz-5g-dial.service`
3. `blitz-time-sync.service`
4. `blitz-ros-receiver.service`
5. `blitz-b-side-omnid.service`
3. `blitz-ros-receiver.service`
4. `blitz-b-side-omnid.service`
5. `blitz-watchdog.service`
The corresponding functional sequence is:
There is no longer any automatic time-sync step in the boot chain.
1. Wait a fixed 30 seconds to make way for the robot's built-in self-check and autostart programs
2. Run 5G auto-dial
3. Run clock synchronization
4. Start `start-ros-receiver.sh`
5. Start `start-b-side-omnid.sh`
## What Each Script Does
## Log File
- `robot-boot.env`: default boot configuration
- `robot-boot.env.local`: machine-local overrides
- `common.sh`: shared env loading, logging, and helper functions
- `boot-gate.sh`: fixed startup delay gate
- `5g-dial.sh`: brings up the 5G modem path and verifies routing
- `start-ros-receiver-service.sh`: boot wrapper for ROS receiver
- `wait-for-unix-socket.sh`: waits for the ROS receiver unix socket
- `start-b-side-omnid-service.sh`: boot wrapper for `b_side_omnid`
- `blitz-watchdog.sh`: runtime health watchdog and recovery orchestrator
- `blitz-fault-inject.sh`: fault injection entrypoint
- `install-systemd.sh`: installs systemd units into `/etc/systemd/system`
- `disable-systemd.sh`: stops the boot chain and disables autostart
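The `wait-for-unix-socket.sh` step in the list above can be pictured as a simple polling loop. The following is a minimal sketch, not the script's actual contents: the function name `wait_for_unix_socket`, its argument order, and the half-second polling interval are assumptions made for illustration. In the boot chain the real inputs would presumably be `ROBOT_RECEIVER_LOCAL_SOCKET_PATH` and `BLITZ_ROS_SOCKET_WAIT_SEC`.

```bash
# Hypothetical helper (not taken from the repository): poll until a unix
# socket exists at the given path, or give up after a timeout.
# Usage: wait_for_unix_socket <socket-path> [timeout-seconds]
wait_for_unix_socket() {
  local socket_path="$1"
  local timeout_sec="${2:-20}"
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout_sec ))

  while (( SECONDS < deadline )); do
    # -S is true only when the path exists and is a socket.
    if [[ -S "$socket_path" ]]; then
      return 0
    fi
    sleep 0.5
  done

  # Final check so a socket created during the last sleep still counts.
  [[ -S "$socket_path" ]]
}
```

A wrapper could call `wait_for_unix_socket "$ROBOT_RECEIVER_LOCAL_SOCKET_PATH" "$BLITZ_ROS_SOCKET_WAIT_SEC"` and refuse to start `b_side_omnid` when it returns non-zero.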
All key operations are written to this single local file:
## Important Configuration
```text
/var/log/blitz-robot/startup.log
```
Each log line has the following format:
```text
timestamp | step | action | result | details | exit_code
```
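The log records:
- What was done
- Which command was actually executed
- Whether the pre-checks passed
- Success or failure
- The failure reason
- The exit code
- Whether a retry occurred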
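To make the pipe-separated format above concrete, here is a minimal sketch of a logging helper. It is an illustration only: the function names `blitz_log_line` and `blitz_log_to_file` and the timestamp layout are assumptions, and the real helpers in `common.sh` may look different.

```bash
# Hypothetical helper: emit one record in the documented format
#   timestamp | step | action | result | details | exit_code
# Usage: blitz_log_line <step> <action> <result> <details> <exit_code>
blitz_log_line() {
  local step="$1" action="$2" result="$3" details="$4" exit_code="$5"
  local timestamp
  # Local time with a numeric UTC offset; the exact layout is an assumption.
  timestamp="$(date '+%Y-%m-%dT%H:%M:%S%z')"
  printf '%s | %s | %s | %s | %s | %s\n' \
    "$timestamp" "$step" "$action" "$result" "$details" "$exit_code"
}

# Append a record to the log file, defaulting to the documented path.
blitz_log_to_file() {
  blitz_log_line "$@" >> "${BLITZ_LOG_FILE:-/var/log/blitz-robot/startup.log}"
}
```

For example, `blitz_log_to_file 5g-dial route_check ok "route present" 0` appends a single line whose last five fields are exactly the arguments given.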
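## What These Files Are For
- `robot-boot.env`: default boot autostart configuration
- `robot-boot.env.local`: machine-local overrides; put your own settings here
- `common.sh`: shared environment loading and unified logging functions
- `boot-gate.sh`: startup gate; the current logic is a fixed 30-second wait
- `5g-dial.sh`: waits for the 5G serial port to appear, runs `rndis_dial.py`, removes the 5G default route, adds host routes for the target hosts, then checks that the routes are actually up
- `time-sync.sh`: points `chrony` at the whitelisted server IP and port, and runs one synchronization
- `start-ros-receiver-service.sh`: boot-time launch wrapper for the ROS receiver
- `wait-for-unix-socket.sh`: waits for the ROS receiver to create its local unix socket
- `start-b-side-omnid-service.sh`: boot-time launch wrapper for `b_side_omnid`
- `install-systemd.sh`: installs the `systemd` units into `/etc/systemd/system`
- `systemd/*.service.in`, `systemd/*.target.in`: `systemd` template files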
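## Prerequisites
As you mentioned earlier, everything except clock synchronization should already be set up. On that basis, only the prerequisites that must be confirmed are listed here.
### 1. Conditions that must already hold on the robot
The following are assumed to be in place:
- The system is Ubuntu and uses `systemd`
- The `OmniSocketGo` repository is already on the robot
- `scripts/dev/start-ros-receiver.sh` already starts correctly on its own
- `scripts/dev/start-b-side-omnid.sh` already starts correctly on its own
- `bin/b_side_omnid` has been built in advance
- The 5G dial script exists: `/home/nvidia/5g-test/5G/rndis_dial.py`
- The 5G serial device is: `/dev/ttyUSB7`
Note:
- Boot mode does not build `b_side_omnid` automatically
- If `bin/b_side_omnid` does not exist, the service fails immediately and writes a log entry
### 2. Prerequisite installation for clock synchronization
The clock synchronization step depends on `chrony`.
If it is not installed on the robot, install it first: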
```bash
sudo apt update
sudo apt install -y chrony
```
After installation, it is recommended to verify:
```bash
systemctl status chrony
chronyc tracking
```
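### 3. Prerequisites on the cloud server
Because your 5G link is a whitelisted network, clock synchronization cannot rely on public domain names or the default NTP pool; it must use only your whitelisted cloud server IP.
The cloud server must satisfy the following:
- `chronyd` is running on the server
- The security group / firewall allows the UDP port you actually use
- The robot can reach this server's IP
If `chrony` is not yet installed on the cloud server, you can use: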
```bash
sudo apt update
sudo apt install -y chrony
sudo systemctl enable chrony
sudo systemctl restart chrony
```
If you cannot use the standard `123/udp`, you can switch to a port of your own, for example `10910/udp`.
For example, change `/etc/chrony/chrony.conf` on the cloud server to listen on 10910:
```conf
port 10910
allow 0/0
```
Then restart:
```bash
sudo systemctl restart chrony
```
On the robot, configure the following in `robot-boot.env.local`:
```bash
BLITZ_TIME_SERVER_IP="your-cloud-server-IP"
BLITZ_TIME_SERVER_PORT="10910"
```
With that, `time-sync.sh` automatically generates:
```conf
server your-cloud-server-IP port 10910 iburst
```
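Note: this must be a `chronyd` server under your own control. Public standard NTP services usually listen only on `123/udp` and cannot be asked to move to `10910`.
## Which Settings to Change
Do not edit `robot-boot.env` directly; it is better to create: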
Most machine-specific overrides should go into:
```text
scripts/boot/robot-boot.env.local
```
The settings most commonly changed are:
Typical settings:
```bash
BLITZ_BOOT_DELAY_SEC="30"
BLITZ_LOG_FILE="/var/log/blitz-robot/startup.log"
BLITZ_RUNTIME_DIR="/run/blitz-robot"
BLITZ_5G_DIAL_DIR="/home/nvidia/5g-test/5G"
BLITZ_5G_SERIAL_PORT="/dev/ttyUSB7"
BLITZ_5G_DIAL_DIR="${OMNISOCKETGO_ROOT}/scripts/boot"
BLITZ_5G_SERIAL_PORT="/dev/ttyUSB2"
BLITZ_5G_INTERFACE=""
BLITZ_5G_MODEM_SUBNET="192.168.224.0/22"
BLITZ_5G_GATEWAY="192.168.225.1"
BLITZ_5G_REMOVE_DEFAULT_ROUTE="1"
BLITZ_5G_ROUTE_TARGETS="106.55.173.235"
BLITZ_5G_INFO_JSON="${OMNISOCKETGO_ROOT}/scripts/boot/modem_network_info.json"
BLITZ_TIME_SERVER_IP="your-whitelisted-cloud-server-IP"
BLITZ_TIME_SERVER_PORT="10910"
BLITZ_TIME_SERVER_IP="81.70.156.140"
BLITZ_ROS_USER="nvidia"
BLITZ_ROS_SOCKET_WAIT_SEC="20"
BLITZ_WATCHDOG_INTERVAL_SEC="5"
BLITZ_HEALTH_STALE_SEC="15"
BLITZ_OMNID_THREAD_HEARTBEAT_TIMEOUT_SEC="15"
BLITZ_NETWORK_FAIL_THRESHOLD="3"
BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC="30"
BLITZ_WATCHDOG_ALLOW_FAULT_INJECTION="0"
```
If `BLITZ_TIME_SERVER_IP` is left empty, the script automatically falls back to the IP part of `ROBOT_SIDE_OMNISOCKET_SERVER_ADDR`.
`BLITZ_TIME_SERVER_IP` is still used, but only as the 5G route/ping health-check target. It is no longer used for automatic clock synchronization.
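When `BLITZ_5G_REMOVE_DEFAULT_ROUTE="1"`, the script removes the default route on the 5G interface after dialing completes, so that the machine's default egress does not switch to 5G. In this mode, the target IPs in `BLITZ_TIME_SERVER_IP` and `BLITZ_5G_ROUTE_TARGETS` are explicitly routed over 5G, while all other traffic keeps using the wired or Wi-Fi default route.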
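The routing behavior described above can be sketched as a dry-run helper that only prints the `ip route` commands instead of running them. This is an illustration under stated assumptions, not the logic of `5g-dial.sh`: the function name `blitz_print_route_plan` and the interface name `usb0` are hypothetical, and using `/32` host routes via `BLITZ_5G_GATEWAY` is one plausible way to pin targets to the 5G path.

```bash
# Hypothetical dry-run helper: print the iproute2 commands rather than
# executing them, so the plan can be reviewed without root.
# Usage: blitz_print_route_plan <interface> <gateway> <target-ip>...
blitz_print_route_plan() {
  local iface="$1" gateway="$2"
  shift 2

  if [[ "${BLITZ_5G_REMOVE_DEFAULT_ROUTE:-1}" == "1" ]]; then
    # Remove only the default route that leaves through the 5G interface.
    echo "ip route del default dev ${iface}"
  fi

  local target
  for target in "$@"; do
    # Pin each whitelisted target to the 5G gateway as a host route.
    echo "ip route replace ${target}/32 via ${gateway} dev ${iface}"
  done
}
```

With `BLITZ_5G_REMOVE_DEFAULT_ROUTE` unset or set to `"1"`, calling `blitz_print_route_plan usb0 192.168.225.1 81.70.156.140 106.55.173.235` prints one `ip route del default` line followed by two `ip route replace` lines.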
If `BLITZ_TIME_SERVER_IP` is left empty, the scripts fall back to the host part of `ROBOT_SIDE_OMNISOCKET_SERVER_ADDR`.
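The fallback can be illustrated with plain parameter expansion. This is a minimal sketch that assumes `ROBOT_SIDE_OMNISOCKET_SERVER_ADDR` has the form `host:port`; the function name `blitz_health_target` is hypothetical and the actual scripts may parse the address differently.

```bash
# Hypothetical helper: choose the health-check target.
# Assumes ROBOT_SIDE_OMNISOCKET_SERVER_ADDR looks like "host:port".
blitz_health_target() {
  # An explicit BLITZ_TIME_SERVER_IP always wins.
  if [[ -n "${BLITZ_TIME_SERVER_IP:-}" ]]; then
    printf '%s\n' "$BLITZ_TIME_SERVER_IP"
    return 0
  fi

  local addr="${ROBOT_SIDE_OMNISOCKET_SERVER_ADDR:-}"
  if [[ -z "$addr" ]]; then
    echo "no health-check target configured" >&2
    return 1
  fi

  # Strip the trailing ":port"; an address without a port is kept as-is.
  printf '%s\n' "${addr%:*}"
}
```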
## How to Install and Use
## Install Or Upgrade
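The following assumes your current directory is the root of the `OmniSocketGo` repository.
### Step 1: Prepare the machine-local configuration
First create: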
```bash
cp scripts/boot/robot-boot.env scripts/boot/robot-boot.env.local
```
Then edit:
```bash
vim scripts/boot/robot-boot.env.local
```
At a minimum, confirm that these values are correct:
- `BLITZ_5G_DIAL_DIR`
- `BLITZ_5G_SERIAL_PORT`
- `BLITZ_TIME_SERVER_IP`
- `BLITZ_TIME_SERVER_PORT`
- `BLITZ_ROS_USER`
### Step 2: Install the systemd units
Run:
Run:
```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl daemon-reload
sudo systemctl restart blitz-robot.target
```
The install script does the following:
`install-systemd.sh` will also remove any old `blitz-time-sync.service` unit left over from earlier versions.
1. Creates the log directory and log file
2. Renders the `systemd` templates
3. Copies the unit files to `/etc/systemd/system`
4. Runs `systemctl daemon-reload`
5. Runs `systemctl enable blitz-robot.target`
## Disable Autostart
### Step 3: Start it once right away
Run:
To stop the currently running services and disable autostart for future reboots:
```bash
sudo bash scripts/boot/disable-systemd.sh
```
To re-enable later:
```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl start blitz-robot.target
```
### Step 4: Automatic on later reboots
## Logs
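Because the install script has already run `enable`, the chain comes up automatically on later reboots, with no manual steps needed.
To confirm manually, you can also run: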
```bash
sudo systemctl enable blitz-robot.target
```
## How to Check That Everything Is Working
### View the main log file
The most direct way:
```bash
tail -f /var/log/blitz-robot/startup.log
```
### View each service's status
```bash
systemctl status blitz-robot.target
systemctl status blitz-boot-gate.service
systemctl status blitz-5g-dial.service
systemctl status blitz-time-sync.service
systemctl status blitz-ros-receiver.service
systemctl status blitz-b-side-omnid.service
```
### View the journal
```bash
journalctl -u blitz-robot.target -u blitz-boot-gate.service -u blitz-5g-dial.service \
-u blitz-time-sync.service -u blitz-ros-receiver.service \
-u blitz-b-side-omnid.service -f
```
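## What Clock Synchronization Currently Does
The current logic of `time-sync.sh` is:
1. Read `BLITZ_TIME_SERVER_IP`
2. Read `BLITZ_TIME_SERVER_PORT`
3. Modify `/etc/chrony/chrony.conf`
4. Comment out the existing `pool` and `server` entries
5. Keep a backup file: `/etc/chrony/chrony.conf.blitz-bak`
6. Write to: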
All boot-chain and watchdog logs are appended to:
```text
/etc/chrony/sources.d/blitz-robot.sources
/var/log/blitz-robot/startup.log
```
7. Generate a line like the following:
```conf
server your-cloud-server-IP port 10910 iburst
```
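8. Restart `chrony`
9. Run `chronyc burst`
10. Run `chronyc waitsync`
Note:
- If synchronization times out, it is logged as `soft_fail`
- It does not block the subsequent ROS and `b_side_omnid` startup
## FAQ
### 1. Why are there suddenly so many scripts?
Because the boot flow has been split into several small, stable steps:
- Easier to find which step failed
- Easier for `systemd` to restart a step automatically
- Easier to record complete logs
- Easier to later replace the "30-second delay" with a real robot-ready condition
You do not normally need to run these scripts one by one by hand.
### 2. Do I need to run `5g-dial.sh`, `time-sync.sh`, and `start-ros-receiver-service.sh` by hand?
Normally, no.
You only need: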
Follow the log live:
```bash
sudo bash scripts/boot/install-systemd.sh
sudo systemctl start blitz-robot.target
sudo tail -f /var/log/blitz-robot/startup.log
```
### 3. What if clock synchronization fails?
First check:
Check service state:
```bash
tail -f /var/log/blitz-robot/startup.log
systemctl status blitz-time-sync.service
chronyc sources -v
chronyc tracking
sudo systemctl status blitz-robot.target
sudo systemctl status blitz-5g-dial.service
sudo systemctl status blitz-ros-receiver.service
sudo systemctl status blitz-b-side-omnid.service
sudo systemctl status blitz-watchdog.service
```
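Check these first:
- Whether `BLITZ_TIME_SERVER_IP` is set correctly
- Whether `BLITZ_TIME_SERVER_PORT` is set correctly
- Whether the cloud server is actually running `chronyd`
- Whether the cloud server firewall / security group allows the UDP port you configured, for example `10910`
- Whether the 5G whitelist really allows access to this server IP
### 4. What if the ROS receiver does not start?
First check: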
Check systemd journal:
```bash
systemctl status blitz-ros-receiver.service
tail -f /var/log/blitz-robot/startup.log
sudo journalctl -u blitz-robot.target -u blitz-5g-dial.service \
-u blitz-ros-receiver.service -u blitz-b-side-omnid.service \
-u blitz-watchdog.service -f
```
Then check:
## Runtime Status Files
- Whether `/opt/ros/${ROS_DISTRO}/setup.bash` exists
- Whether `${ROS_CONTROL_PY_DIR}/install/setup.bash` exists
- Whether the socket at `ROBOT_RECEIVER_LOCAL_SOCKET_PATH` has appeared
The runtime status directory is:
### 5. What if b_side_omnid does not start?
```text
/run/blitz-robot
```
First check:
Key files:
- `b-side-omnid.status.json`
- `ros-receiver.status.json`
- `watchdog.status.json`
Pretty-print them:
```bash
systemctl status blitz-b-side-omnid.service
tail -f /var/log/blitz-robot/startup.log
sudo python3 -m json.tool /run/blitz-robot/watchdog.status.json
sudo python3 -m json.tool /run/blitz-robot/b-side-omnid.status.json
sudo python3 -m json.tool /run/blitz-robot/ros-receiver.status.json
```
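As an illustration of how a status file could be judged stale, the sketch below compares the file's modification time against `BLITZ_HEALTH_STALE_SEC`. This is an assumption made for the example: the watchdog may instead read a timestamp field inside the JSON, and the function name `blitz_status_is_stale` is hypothetical. The `stat -c %Y` call is the GNU coreutils form.

```bash
# Hypothetical helper: exit 0 when a status file is missing or has not been
# modified within max-age seconds, exit 1 when it is fresh.
# Usage: blitz_status_is_stale <status-file> [max-age-seconds]
blitz_status_is_stale() {
  local status_file="$1"
  local max_age_sec="${2:-${BLITZ_HEALTH_STALE_SEC:-15}}"
  local now mtime

  # A missing file is treated as stale.
  [[ -e "$status_file" ]] || return 0

  now="$(date +%s)"
  # GNU stat: %Y is the modification time in seconds since the epoch.
  mtime="$(stat -c %Y "$status_file")"
  (( now - mtime > max_age_sec ))
}
```

For example, `blitz_status_is_stale /run/blitz-robot/b-side-omnid.status.json` succeeds when the file has not been rewritten within the configured window.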
Then check:
## Fault Injection
- Whether `bin/b_side_omnid` has been built in advance
- Whether the camera device exists
- Whether the address settings in `robot-remote.env` / `robot-boot.env.local` are correct
Available test commands:
```bash
sudo bash scripts/boot/blitz-fault-inject.sh bside-crash
sudo bash scripts/boot/blitz-fault-inject.sh bside-process-freeze
sudo bash scripts/boot/blitz-fault-inject.sh bside-video-thread-stall
sudo bash scripts/boot/blitz-fault-inject.sh bside-control-thread-stall
sudo bash scripts/boot/blitz-fault-inject.sh ros-crash
sudo bash scripts/boot/blitz-fault-inject.sh ros-freeze
```
For synthetic network fault injection, first enable it in `robot-boot.env.local`:
```bash
BLITZ_WATCHDOG_ALLOW_FAULT_INJECTION="1"
```
Then restart watchdog and inject:
```bash
sudo systemctl restart blitz-watchdog.service
sudo bash scripts/boot/blitz-fault-inject.sh network-down on
sudo bash scripts/boot/blitz-fault-inject.sh network-down off
```
## Recovery Behavior Summary
- If `b_side_omnid` dies or its status file goes stale, watchdog first tries a targeted `b_side` restart.
- If ROS receiver dies, loses its socket, or its heartbeat goes stale, watchdog performs an ordered full restart:
- stop `b_side`
- restart ROS receiver
- wait for unix socket
- start `b_side`
- If network checks fail repeatedly, watchdog stops `b_side`, runs `5g-dial.sh`, waits for route recovery, and then restores services.
- Camera disappearance is logged as degraded state. Reappearance triggers a `b_side` restart after the device is stable.
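The network branch above can be sketched as a small decision function. This is a hedged illustration rather than the watchdog's actual code: the function name `blitz_network_decision` is hypothetical, and the reading of `BLITZ_NETWORK_FAIL_THRESHOLD` as "consecutive failed checks" and `BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC` as "minimum gap between recoveries" is inferred from the variable names.

```bash
# Hypothetical decision helper for the network branch of the watchdog.
# Prints "recover" when a recovery should run now, otherwise "wait".
# Usage: blitz_network_decision <consecutive-failures> <now-epoch> <last-recovery-epoch>
blitz_network_decision() {
  local failures="$1" now="$2" last_recovery="$3"
  local threshold="${BLITZ_NETWORK_FAIL_THRESHOLD:-3}"
  local cooldown="${BLITZ_NETWORK_RECOVERY_COOLDOWN_SEC:-30}"

  # Too few consecutive failed checks so far: keep observing.
  if (( failures < threshold )); then
    echo "wait"
    return 0
  fi

  # A recovery ran recently: let it settle before trying again.
  if (( now - last_recovery < cooldown )); then
    echo "wait"
    return 0
  fi

  echo "recover"
}
```

Keeping the decision separate from the side effects (stopping `b_side`, running `5g-dial.sh`) makes the thresholds easy to exercise without touching the modem.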
## Notes
- `time-sync.sh` and `blitz-time-sync.service` are intentionally removed from the automatic boot path.
- `b_side_omnid` must already be built before boot-time startup.
- A missing `bin/b_side_omnid`, missing ROS environment, or missing modem script is reported in `startup.log`.