Symptoms → Cause → Resolution. Start with the section that matches your situation; use the Error Reference for specific error messages.
Common Startup Problems
API container fails to start or stays unhealthy
Check the logs first:
docker compose logs api
Cause: cannot reach PostgreSQL
fail: Microsoft.EntityFrameworkCore[...] — Connection refused
The API exits immediately if it cannot reach PostgreSQL on startup. Verify PostgreSQL is healthy before the API starts:
docker compose ps # is postgres "healthy"?
docker compose logs postgres
docker exec workflowengine-postgres pg_isready -U postgres
If PostgreSQL is not healthy: check disk space, check the Docker volume is mounted, and check POSTGRES_PASSWORD matches between the central server and station .env files.
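The password check in the last point can be scripted. The following is a minimal sketch that assumes both .env files use plain KEY=VALUE lines; the helper names are hypothetical and not part of Maestro.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes, which some .env files use.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def env_key_matches(central_env: str, station_env: str,
                    key: str = "POSTGRES_PASSWORD") -> bool:
    """Return True only when both files define the key with the same value."""
    central = parse_env(central_env).get(key)
    station = parse_env(station_env).get(key)
    return central is not None and central == station
```

Pass the contents of the two files, for example `env_key_matches(open("central/.env").read(), open("station/.env").read())`, where both paths are placeholders for your own layout.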
Cause: migration failure
System.Exception: An error occurred while applying migrations
The API applies EF Core migrations on every startup. A failed migration leaves the database in a partial state. Do not restart in a loop — restore from a backup and contact support with the full stack trace from docker compose logs api.
Cause: port already in use
System.Net.Sockets.SocketException: Address already in use
Another process is listening on port 7000. Find and stop it:
# Windows
netstat -ano | findstr :7000
# Linux
sudo ss -tlnp | grep :7000
UI is blank or shows a connection error on load
The Blazor UI requires the API to be reachable at its configured ApiUrl. Open the browser developer console (F12) and check for network errors.
Check ApiUrl configuration:
docker compose exec ui env | grep ApiUrl
# Should show: ApiUrl=http://api:7000 (Docker) or http://localhost:7000 (local)
If the UI and API are on different machines, ApiUrl must use the API machine's IP address, not localhost.
Check the API health endpoint:
curl http://localhost:7000/health
# Expected: {"status":"Healthy"}
SignalR connection shows a persistent red dot
The UI shows a red connection indicator (🔴) when the WebSocket connection to the API drops. It reconnects automatically — if the dot stays red for more than 10 seconds:
- Verify the API is running: curl http://localhost:7000/health
- If behind a reverse proxy, confirm the proxy forwards the Upgrade and Connection headers (required for WebSocket)
- Check for a network timeout or idle-connection reset policy on the proxy or load balancer; SignalR keepalives require the WebSocket to stay open
- Check the browser console for "WebSocket connection failed" errors
Runner does not appear in system health
curl http://localhost:7000/api/health/version
# Expected: "runners": [{"identifier": "dotnet", "healthy": true}, ...]
If a runner is missing or "healthy": false:
docker compose logs dotnet-runner # startup errors
docker compose logs python-runner
docker compose ps # is the runner container running?
docker compose restart dotnet-runner
Cause: runner container failed to start. Look for DLL load errors, missing volume mounts, or port conflicts. The gRPC ports default to 7050 (.NET) and 7051 (Python).
Cause: runner not yet warm after fresh install. After docker compose up the runners take 5–15 seconds to load assemblies before accepting gRPC connections. If get_system_health shows unhealthy immediately after startup, wait 15 seconds and recheck before investigating further.
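The wait-and-recheck advice above can be automated with a small polling loop. This is a sketch only: it assumes the /api/health/version response has the "runners" shape shown above and that the Python runner reports the identifier "python" (the source only shows "dotnet"). The function names are hypothetical.

```python
import time
from typing import Callable


def unhealthy_runners(payload: dict,
                      expected: tuple[str, ...] = ("dotnet", "python")) -> list[str]:
    """Return expected runner identifiers that are missing or not healthy."""
    healthy = {
        runner.get("identifier")
        for runner in payload.get("runners", [])
        if runner.get("healthy") is True
    }
    return [name for name in expected if name not in healthy]


def wait_for_runners(fetch: Callable[[], dict],
                     expected: tuple[str, ...] = ("dotnet", "python"),
                     timeout_s: float = 15.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> list[str]:
    """Poll until every expected runner is healthy or the timeout elapses.

    Returns the runners still unhealthy at the end (empty list on success).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            pending = unhealthy_runners(fetch(), expected)
        except OSError:
            # API not reachable yet: treat every runner as pending.
            pending = list(expected)
        if not pending or time.monotonic() >= deadline:
            return pending
        sleep(interval_s)
```

In practice `fetch` could wrap the documented endpoint, for example `lambda: json.load(urllib.request.urlopen("http://localhost:7000/api/health/version"))`. Injecting it keeps the loop testable without a running stack.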
pgAdmin cannot connect to PostgreSQL
pgAdmin is a convenience tool; it has no effect on Maestro operation. For connection issues:
- Open http://localhost:5088
- In pgAdmin, go to Servers → right-click → Connect Server
- Hostname: postgres (Docker network name, not localhost)
- Username: postgres
- Password: the value of POSTGRES_PASSWORD in your .env
See the pgAdmin guide in the Maestro documentation for a full setup walkthrough.
Common Execution Failures
"File not found on server" — test will not start
The Test Monitor validates the YAML file path against the API server before enabling the Start button.
Cause: The path entered does not exist on the server.
Resolution:
- Call GET /api/packages/test-files to list every available YAML file with its exact absolute server-side path
- Use that path verbatim; do not guess or construct paths manually
- Verify the package has been activated (POST /api/packages/{name}/activate)
Test starts but immediately shows ABORTED or UNDETERMINED
Cause 1: runner unavailable at execution time
The API sent a gRPC ExecuteStep request and received no response (timeout or connection refused). Check runner health:
docker compose logs dotnet-runner --tail 50
curl http://localhost:7000/api/health/version
If the runner is shown as healthy but executions still abort, check whether timeout_ms on the failing step is too short for the step's realistic duration.
Cause 2: first run after a fresh package install
The .NET runner loads assemblies lazily on first use. The very first execution after installing a new package may fail with runner unavailable while the CLR loads the new DLL. This is normal. Wait 10 seconds and retry.
Cause 3: YAML references a runner or assembly that does not exist
gRPC status: NOT_FOUND — assembly 'MyTests.dll' not registered
The YAML step references an assembly or class that is not in the installed package. Check the assembly:, class:, and method: fields against what is actually deployed in the package's assemblies/ directory.
A step hangs indefinitely
Cause: timeout_ms not set on a hardware step
A step that communicates with an instrument but has no timeout_ms declared will block the runner indefinitely if the instrument stops responding. The run never transitions to a terminal state.
Resolution:
- Click Abort in the Test Monitor UI, or call POST /api/testexecution/{id}/abort
- Add timeout_ms to the step. Every step that calls hardware must have one:
  timeout_ms: 10000 # 10 seconds; adjust to the realistic worst-case duration
Cause: prompt step waiting for operator input in an automated run
A type: prompt step without unattendedMode: true blocks until a button is clicked. In a CI/CD or automated context, no one is clicking. Add unattendedMode: true to the execution request, and ensure all prompt steps in the test have either a Continue button or a declared input.default.
Step fails with verdict FAIL but the instrument is reading correctly
Cause 1: limits inverted in YAML
low_limit: 5.25
high_limit: 4.75 # inverted: lower than low_limit
The validator catches this (low_limit must be < high_limit), but it is possible to deploy an unvalidated file. Expand the measurement in the Test Monitor or Test Results to check the stored limit values.
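For files that bypassed the validator, the same rule can be checked offline. The sketch below is not the validator's implementation. It assumes the YAML has already been parsed into a list of step dictionaries with an optional measurement block; that layout and the function name are assumptions.

```python
def find_inverted_limits(steps: list[dict]) -> list[str]:
    """Return names of steps whose numeric low_limit is not below high_limit."""
    problems = []
    for step in steps:
        measurement = step.get("measurement") or {}
        low = measurement.get("low_limit")
        high = measurement.get("high_limit")
        # Skip one-sided limits and dynamic limits such as "{{computed_lower}}".
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (low, high)
        )
        if numeric and low >= high:
            problems.append(step.get("name", "<unnamed>"))
    return problems
```

Dynamic limits are skipped deliberately, since their values are only known at run time.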
Cause 2: wrong variable used in measurement.value
The measurement evaluates {{variable_name}}, but the variable was not set by the step (typo in outputs:, or the runner returned a different key). Expand the step in Test Results and check the actual stored value — if it is 0, empty, or a default, the variable was not resolved.
Cause 3: measurement type mismatch
A string or boolean measurement evaluated as numeric, or vice versa. Check the type: field in the YAML measurement block.
Execution completes but no measurements appear in results
Cause: step uses type: mock without a measurement: block
Mock steps without a measurement block record no data. This is intentional for stub steps — add a measurement: block if the step should record a result.
Cause: runner returned measurements but in an unrecognised format
The .NET or Python runner must return measurement objects using the SDK types (NumericMeasurementPoint, BooleanMeasurementPoint, StringMeasurementPoint). Returning a plain Dictionary<string, string> stores output variables but not measurements.
Cause: step was skipped
A skipped step (precondition false, enabled: false) records no measurements. Check the step status column in the results table.
Variable substitution produces literal {{variable_name}}
The template was not resolved. Causes, in order of likelihood:
- Variable not declared. The variable must appear in the variables: block at the top of the YAML. If it is set only by a runner step, it still needs to be declared (with any default value) in variables:.
- Legacy Scriban syntax. {{test.variable}} is old syntax; use {{variable}} (flat name, no dot prefix).
- Dot-namespaced variable in wrong context. {{cfg.X}} works in parameters: and measurement.value:; it does not work in precondition:.
- Arithmetic in template. {{voltage + offset}} is not supported; compute in runner code and return the result as a variable.
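Three of the causes listed above (undeclared variable, legacy test. prefix, arithmetic) can be spotted by scanning the raw YAML text. The sketch below is a rough pre-check, not Maestro's template engine. The function name and finding codes are hypothetical, and dot-namespaced references such as {{cfg.X}} are ignored because their validity depends on where they appear.

```python
import re

TEMPLATE = re.compile(r"\{\{\s*(.*?)\s*\}\}")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*(\.[A-Za-z_]\w*)*")


def lint_templates(text: str, declared: set[str]) -> list[tuple[str, str]]:
    """Return (expression, finding) pairs for suspicious {{...}} references."""
    findings = []
    for match in TEMPLATE.finditer(text):
        expr = match.group(1)
        if not IDENTIFIER.fullmatch(expr):
            # Anything other than a plain name, e.g. "voltage + offset".
            findings.append((expr, "expression"))
        elif expr.startswith("test."):
            findings.append((expr, "legacy-syntax"))
        elif "." in expr:
            # Namespaced reference such as cfg.X: context-dependent, not checked here.
            continue
        elif expr not in declared:
            findings.append((expr, "undeclared"))
    return findings
```

Pass the names from the variables: block as `declared`. An empty result means none of these three patterns was found, not that every template will resolve.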
Precondition always evaluates to false (or throws)
Cause 1: dot-namespaced variable
precondition: "cfg.DMM_VISA != ''" # ❌ — DynamicExpresso reads this as member access
Copy the value into a bare variable first:
variables:
dmm_visa: ""
# setup step sets dmm_visa from cfg.DMM_VISA...
precondition: "dmm_visa != ''" # ✅
Cause 2: type mismatch in comparison
Variables from runner output are always strings. precondition: "count == 3" will fail if count is the string "3" — use precondition: "count == '3'" or convert explicitly in runner code.
Cause 3: exception during evaluation
If the precondition expression throws (e.g. referencing an undefined variable), the step is marked ABORTED. Check the step's log output in Test Results for the exception message.
Package download fails or hangs
Cause: registry refresh not run before download
The registry must be refreshed before packages appear in the catalog. Click ↻ Refresh Registry in the Packages UI or call POST /api/packages/refresh.
Cause: Git authentication failure
The API runs git clone using the system Git binary. If authentication is not configured, the clone hangs waiting for credentials (or fails with 401).
Test Git connectivity directly on the server:
docker compose exec api git clone <package-git-url> /tmp/test-clone
If this fails, configure a deploy key or embed a token in the URL (see installation guide, §4.4).
Cause: RegistryUrl not configured
RegistryRefreshFailed: RegistryUrl is not configured
Add PackageRegistry:RegistryUrl to appsettings.json or set the environment variable PackageRegistry__RegistryUrl in docker-compose.yml.
Performance Issues
Steps are slow to start (high inter-step latency)
Expected inter-step overhead is 5–20 ms. Significantly higher values indicate:
Database latency. The API writes a step_result row after every step. If PostgreSQL is on a slow network link (multi-station deployment), this adds latency to every step. The database should be on the same LAN segment as the station.
Redis latency. Variable reads and writes go to Redis synchronously during step execution. Verify Redis is healthy and local:
docker exec workflowengine-redis redis-cli PING # → PONG
docker exec workflowengine-redis redis-cli --latency -i 1
Expected latency from within the Docker network: < 1 ms.
Runner startup time. The first step in a test after a cold container start includes runtime warm-up (CLR start-up for the .NET runner, interpreter start-up for the Python runner). This is a one-time cost per session, not per step.
Test Results search is slow
The measurements table grows quickly at high throughput. Without adequate indexes or table partitioning, large-range queries slow down.
Check query plan:
EXPLAIN ANALYZE
SELECT * FROM measurements
WHERE measurement_name = 'VOUT_3V3'
AND timestamp > NOW() - INTERVAL '30 days';
If the query plan shows a sequential scan, verify the indexes exist:
\d measurements -- in psql — check for index on (measurement_name, timestamp)
For sites running > 1,000 executions per day, partition the measurements table by month and archive older partitions. See the scaling section in How Maestro Works.
Blazor UI is slow or unresponsive during a test
The Blazor UI receives SignalR events on every step and every measurement. A test with many rapid steps (< 100 ms per step) or many measurements per step can saturate the browser's event loop.
Mitigation:
- Reduce measurement count per step: log only measurements that need limit evaluation; use operator: log for informational values
- Close unused browser tabs pointed at the Test Monitor
- Use the REST API for polling in CI/CD pipelines instead of the Blazor UI
Intermittent / Flaky Behaviour
A step sometimes times out but usually succeeds
Cause: timeout_ms set too close to the instrument's typical response time
Add headroom. If an instrument typically responds in 800 ms, set timeout_ms: 3000. Instrument response time varies with network conditions, temperature, and measurement range.
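One way to pick a value with headroom is to scale the worst observed response time and round up. The factor of 3 and the 500 ms rounding step below are assumptions chosen to roughly match the 800 ms to 3000 ms example above, not Maestro recommendations. The function name is hypothetical.

```python
import math


def suggest_timeout_ms(samples_ms: list[float],
                       headroom: float = 3.0,
                       round_to_ms: int = 500) -> int:
    """Suggest a timeout: worst observed response time times a headroom factor,
    rounded up to the next multiple of round_to_ms."""
    if not samples_ms:
        raise ValueError("need at least one observed response time")
    worst = max(samples_ms)
    return math.ceil(worst * headroom / round_to_ms) * round_to_ms
```

For example, samples of 780, 800, and 950 ms give a suggested timeout_ms of 3000. Collect samples across the measurement ranges and conditions the station will actually see.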
Cause: instrument in an unready state on retry
The previous step can leave the instrument in an intermediate state, and the instrument needs time to settle before the next measurement. Use a type: delay step (not Thread.Sleep inside runner code) between steps to let it settle:
- name: "Settle delay"
type: delay
duration: 0.5 # 500 ms
Cause: MSTest parallelization enabled
If test code is being developed with MSTest and multiple methods run concurrently, they can issue conflicting commands to the same instrument. Add [assembly: DoNotParallelize] to MSTestSettings.cs. See the SDK debugging guide.
Results are inconsistent across runs on the same unit
Cause: station configuration changed between runs
Every execution records a config snapshot. Compare the config_snapshot JSONB field on two execution records to see if any cfg.* values changed between runs:
SELECT id, started_at, config_snapshot
FROM test_executions
WHERE serial_number = 'UNIT-042'
ORDER BY started_at DESC
LIMIT 5;
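Once two config_snapshot values have been fetched, a small diff makes the changed cfg.* values obvious. This sketch assumes each snapshot is a flat JSON object of key/value pairs. The actual snapshot layout is not specified here, and the function name is hypothetical.

```python
def diff_snapshots(old: dict, new: dict) -> dict[str, tuple]:
    """Map each changed key to (old_value, new_value).

    A key missing from one snapshot shows up as None on that side.
    """
    changed = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed
```

An empty result means the two runs used the same configuration, so the inconsistency has another cause.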
Cause: variable from a previous run persisted
Variables are scoped to the execution realm in Redis and deleted when the test completes. If a test aborted before Redis cleanup, the realm may have stale entries. Redis TTLs should expire these within minutes. If the issue persists, list the keys with redis-cli --scan --pattern 'realm:*' (SCAN iterates incrementally, whereas KEYS blocks the server while it walks the whole keyspace) and manually delete orphaned realms.
Cause: force: "pass" or force: "fail" left in YAML
The force: field is a development override that makes the YAML validator emit a warning but does not block the run. Search YAML files for force: before promoting a package to Released.
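The search can be scripted as a pre-release check. The sketch below walks a package directory and reports every uncommented force: line in .yaml and .yml files. The function name is hypothetical, and the line-based match is an approximation rather than a YAML parse.

```python
import re
from pathlib import Path

# Matches "force:" at the start of a mapping entry, optionally after a list dash.
FORCE_LINE = re.compile(r"^\s*(?:-\s*)?force\s*:")


def find_force_overrides(root: Path) -> list[tuple[str, int]]:
    """Return (relative path, line number) for each force: line under root."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".yaml", ".yml"}:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if FORCE_LINE.match(line):
                hits.append((path.relative_to(root).as_posix(), number))
    return hits
```

A release script could fail the promotion whenever the returned list is not empty.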
SignalR events are delayed or dropped
Cause: network congestion or proxy buffering
SignalR events are sent over WebSocket. A proxy or firewall that buffers WebSocket frames can delay live step updates. Verify the proxy configuration passes WebSocket traffic without buffering.
Cause: browser tab was in the background
Chrome and Edge throttle JavaScript in background tabs. The SignalR client catches up on events when the tab becomes active again, so the final result is correct but the live stream appears to pause. This does not affect test execution or result storage; only the live display is affected.
Runner reports a different version after upgrade
After docker compose pull && docker compose up -d, verify the runner images were actually replaced:
docker compose images # check the IMAGE ID column changed
docker compose exec dotnet-runner dotnet --list-runtimes # confirm the .NET runtime version (also works in images without the SDK)
If the old image is still running, the containers were probably not recreated after the pull, or Docker is still using the previously pulled image. Pull again and force recreation:
docker compose pull
docker compose up -d --force-recreate
Configuration Conflicts
Station-local config not overriding global config
Global config has StationId = NULL; station-local has StationId = "ST-01". The merge rule is: station-local wins. Check that:
- The station-local entry has the correct StationId; a typo here means it never matches the running station
- The StationId in appsettings.json (or STATION_NAME in .env) matches exactly what was used when creating the station-local entry
# View the merged config for the running station
Invoke-RestMethod http://localhost:7000/api/config/merged/ST-01
# Compare against raw global config
Invoke-RestMethod http://localhost:7000/api/config?stationId=null
# Compare against raw station-local
Invoke-RestMethod http://localhost:7000/api/config?stationId=ST-01
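The merge rule can be reproduced offline to see why an entry is not taking effect. This sketch assumes each config entry has StationId, Key, and Value fields. Only StationId appears in this guide, so the other two names are assumptions, and the function is not the API's implementation.

```python
def merge_config(entries: list[dict], station_id: str) -> dict[str, str]:
    """Merge global (StationId is None) and station-local entries.

    Station-local values win; matching on StationId is exact.
    """
    merged = {
        e["Key"]: e["Value"] for e in entries if e.get("StationId") is None
    }
    merged.update(
        {e["Key"]: e["Value"] for e in entries if e.get("StationId") == station_id}
    )
    return merged
```

Because the match is exact, an entry stored under "ST-O1" (letter O) is ignored when merging for "ST-01", and the global value shows through.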
A config value change has no effect
Configuration is merged and injected at test start. Changes to Station Config take effect on the next execution — not mid-run. If a test is already running, it uses the snapshot taken at the moment it started.
Changes to appsettings.json or .env files require an API container restart before they take effect.
Multiple stations writing conflicting config
In a shared-database deployment, all stations write to the same station_config table. Global config entries (StationId = NULL) are shared. If two stations both write a global key simultaneously, the last write wins. Use station-local entries for any value that should differ per station, and global entries only for values that truly apply to all stations.
Reading Logs in Maestro
Logs in the Test Monitor UI
During an active test, the Logs panel on the Test Monitor page shows structured log entries streamed in real time from runner code. Entries are colour-coded by level:
| Level | Colour | Meaning |
|---|---|---|
| Information | White | Normal progress messages |
| Warning | Yellow | Non-fatal issues; investigate, but execution continues |
| Error | Red | Exceptions and failures; usually indicates a step about to fail |
Click any log row to expand the full message. Log entries are also stored in PostgreSQL and accessible via GET /api/testexecution/{id}/logs after the run completes.
Logs in Test Results
Open any historical result and scroll to the Logs section. Logs are stored per execution and filtered by step — use the step filter dropdown to isolate log output from a specific step.
Container logs
For infrastructure-level diagnostics (startup errors, container crashes, OOM kills):
# Follow all service logs
docker compose logs -f
# Single service
docker compose logs api
docker compose logs dotnet-runner
docker compose logs python-runner
docker compose logs postgres
docker compose logs redis
# Last N lines
docker compose logs --tail 100 api
MCP / AI assistant log access
The get_service_logs MCP tool retrieves recent log lines from any named service without requiring SSH access:
get_service_logs(service="dotnet-runner", tail=200)
get_service_logs(service="api", tail=100)
The get_system_events MCP tool returns recent Docker lifecycle events (container starts, exits, OOM kills) — useful for diagnosing crash-loop patterns.
Where Logs Are Stored
| Log type | Storage location | Access method |
|---|---|---|
| Execution logs (runner output) | PostgreSQL | UI: Test Results → Logs tab; API: |
| Step results | PostgreSQL | UI: Test Results; API: |
| API application logs | Container stdout/stderr | |
| Runner application logs | Container stdout/stderr | |
| Database logs | PostgreSQL container | |
| System events | Docker daemon | MCP |
Logs are not written to disk files by default. All structured logging goes to container stdout (captured by Docker). To forward logs to a centralised log system (Seq, Elasticsearch, Splunk), add a logging sink to the API's appsettings.json:
"Serilog": {
"WriteTo": [
{ "Name": "Console" },
{ "Name": "Seq", "Args": { "serverUrl": "http://seq-server:5341" } }
]
}
Error Reference
Specific error messages, their causes, and resolutions. Messages are quoted as they appear in logs or the UI.
RegistryUrl is not configured
Symptom: Clicking Refresh Registry in the Packages UI shows a failure banner with this message.
Cause: The PackageRegistry:RegistryUrl setting is missing from the API configuration.
Resolution: Add to appsettings.json:
"PackageRegistry": {
"RegistryUrl": "https://gitlab.example.com/testdevelopment/tat-registry.git"
}
Or set the environment variable PackageRegistry__RegistryUrl in docker-compose.yml. Restart the API container after changing appsettings.json.
Authentication failed for repository
Symptom: Registry refresh or package download fails with a Git authentication error.
Cause: The git clone or git pull running inside the API container cannot authenticate to GitLab.
Resolution:
- Test Git connectivity from inside the container: docker compose exec api git clone <url> /tmp/test
- Configure authentication: embed a token in the URL (https://oauth2:<TOKEN>@gitlab.example.com/...), or mount an SSH key into the container
- For production: use SSH deploy keys (read-only) on the registry and each package repository
low_limit must be less than high_limit
Symptom: YAML Validator returns this error; or the step fails with this error at runtime.
Cause: The low_limit value is greater than or equal to high_limit in a measurement block.
Resolution: Swap the values or correct whichever limit is wrong. Run py validate.py before committing YAML.
Undeclared variable 'x'
Symptom: YAML Validator returns this error for a {{x}} template reference.
Cause: x is used in a template but not declared in the variables: block.
Resolution: Add x to the variables: block with an appropriate default value. Variables set only by runner output still need to be declared in variables: for the validator to accept them.
Unknown step type 'foo'
Symptom: YAML Validator rejects a step definition.
Cause: The type: field contains an unrecognised value.
Resolution: Use one of the supported step types: delay, mock, prompt, sequence. For runner steps, omit type: entirely and use the runner: field (runner: dotnet, runner: python).
max_iterations is required
Symptom: YAML Validator rejects a step with repeat: defined.
Cause: A repeat: block is present but max_iterations: was not declared.
Resolution: Add max_iterations: N to the repeat: block. This is a mandatory safety cap — there is no default. Choose a value high enough for the realistic worst case but low enough to prevent runaway loops.
runner unavailable / gRPC status: UNAVAILABLE
Symptom: A step fails immediately with this error; or a test aborts at the first runner step.
Cause 1: The runner container is not running. Check docker compose ps and start it.
Cause 2: The runner is still initialising (cold start after package install). Wait 10–15 seconds and retry.
Cause 3: The PythonRunnerUrl or .NET runner gRPC address in appsettings.json is wrong. Verify the ports match the runner containers.
Cause 4: A firewall or container network policy is blocking the gRPC port (7050/7051).
assembly 'MyAssembly.dll' not found
Symptom: A .NET runner step fails with this message.
Cause: The YAML step's assembly: field references a DLL that is not present in the activated package's assemblies/ directory.
Resolution:
- Verify the DLL was built and included in the package
- Confirm the package is activated (POST /api/packages/{name}/activate)
- Check the exact file name; the match is case-sensitive on Linux
- After fixing and re-deploying, call trigger_package_refresh and re-activate
Unattended mode: prompt step '...' requires value input but no default is defined
Symptom: A step in an unattended run fails with this error and records a FAIL verdict.
Cause: A type: prompt step with a value-input control (input.mode: number / text / boolean / list) does not declare input.default. In unattended mode, there is no operator to type a value, so the step cannot proceed.
Resolution: Add input.default: to the prompt step:
input:
mode: number
variable: ambient_temp
default: "25"
unit: "°C"
Or remove the value-input from the step if it is not needed for automated runs.
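To find such steps before an unattended run, the parsed YAML can be scanned for prompt steps that take value input without a default. The sketch below assumes steps are parsed into dictionaries shaped like the snippet above; the function name is hypothetical.

```python
VALUE_INPUT_MODES = {"number", "text", "boolean", "list"}


def prompts_blocking_unattended(steps: list[dict]) -> list[str]:
    """Return names of prompt steps with value input but no input.default."""
    blocking = []
    for step in steps:
        if step.get("type") != "prompt":
            continue
        control = step.get("input") or {}
        if control.get("mode") in VALUE_INPUT_MODES and "default" not in control:
            blocking.append(step.get("name", "<unnamed>"))
    return blocking
```

Any step it returns would fail with the error above in an unattended run.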
[UNATTENDED] Auto-responding to prompt '...'
Symptom: This warning appears in the execution logs. It is not an error.
Cause: Unattended mode is active and the executor auto-clicked a prompt button. This is the designed behaviour.
Action: Review whether the auto-selected button (Continue, Pass, or Fail) is appropriate for this prompt in an automated context. If Fail was selected, it means no Continue or Pass button exists in the step — add one to give the executor a safe path through.
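The selection described above can be pictured as a simple preference order. The text only establishes that Fail is chosen when neither Continue nor Pass exists. The relative order of Continue and Pass in this sketch is an assumption, and the function is illustrative rather than the executor's code.

```python
from typing import Optional

# Assumed preference order; only "Fail is the last resort" comes from the text.
AUTO_RESPONSE_ORDER = ("Continue", "Pass", "Fail")


def pick_auto_response(buttons: list[str]) -> Optional[str]:
    """Return the first preferred button present on the prompt, or None."""
    for candidate in AUTO_RESPONSE_ORDER:
        if candidate in buttons:
            return candidate
    return None
```

A prompt that offers only Retry and Fail would resolve to Fail under this model, which is why adding a Continue or Pass button gives the executor a safe path.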
Station ID mismatch / config changes not appearing
Symptom: Station-local configuration is set for ST-01 but changes are not visible in the merged config.
Cause: The StationId configured in appsettings.json or STATION_NAME in .env does not match the StationId used when creating the config entries.
Resolution:
- Check the station ID the API is using: GET /api/config/station-id
- Check the key used in the config entry: GET /api/config?stationId=ST-01
- If they do not match, update the config entries to use the actual station ID, or correct STATION_NAME in .env and restart
no matching manifest for linux/arm/v7
Symptom: docker compose pull fails on a Raspberry Pi.
Cause: The 32-bit (armhf) Raspberry Pi OS is installed. Maestro images are built for linux/amd64 and linux/arm64 only.
Resolution: Reinstall the operating system using the 64-bit Raspberry Pi OS (Raspberry Pi OS (64-bit) in Raspberry Pi Imager). The 32-bit OS cannot be upgraded in place.
pg_isready: could not connect to server
Symptom: Station cannot connect to the central PostgreSQL database.
Cause: PostgreSQL port 5432 is not reachable from the station.
Diagnosis:
docker run --rm postgres:16-alpine \
pg_isready -h <central-ip> -p 5432 -U postgres
Resolution:
- Verify the central server is running: docker compose ps on the central server
- Check firewall rules on the central server allow inbound TCP on port 5432 from the station IP
- Verify POSTGRES_HOST in the station .env is the correct IP address of the central server
force: "pass" in released YAML
Symptom: A step always returns PASS regardless of the actual measurement value. The verdict shows VerdictForced = true in the detailed report.
Cause: The force: "pass" field was left in the YAML after development. This bypasses measurement evaluation entirely.
Resolution: Remove force: from the step before promoting the package to Released. Search all YAML files: grep -r "force:" . before any release promotion.
Test execution hangs with no step progress
Symptom: The Test Monitor shows the execution as Running but no steps are updating. The Logs panel shows nothing new.
Cause 1: timeout_ms missing on a hardware step. The runner is waiting for an instrument that will never respond. Click Abort, add timeout_ms to the step, and redeploy.
Cause 2: prompt step in an automated run without unattended mode. The engine is waiting for a button click that will never come. Click Abort in the UI, or call POST /api/testexecution/{id}/abort. Add unattendedMode: true to future automated executions.
Cause 3: runner process crash. The gRPC connection from the API to the runner was dropped mid-execution. Check docker compose logs dotnet-runner for an exception or OOM kill. The API will not automatically detect a runner crash during a step — it waits for timeout_ms before aborting.
type: mock step in a production run records no real measurement
Symptom: A step completes with PASS but the measured value is always the same static value from the YAML target: field.
Cause: The step type is type: mock, which is a development stub that returns the target: value without calling any runner code.
Resolution: Change type: mock to runner: dotnet or runner: python with the correct assembly:/module:, class:/function:, and method: fields. Mock steps must never appear in a Released package.
Measurements stored with correct verdict but wrong limit values
Symptom: The verdict column in the database is correct, but the low_limit / high_limit columns do not match what you expected.
Cause: The YAML used dynamic limits via variable substitution (e.g. low_limit: "{{computed_lower}}"). The stored limit columns reflect the YAML text, not the runtime-resolved value. The verdict is always computed against the resolved runtime value.
Resolution: For post-hoc analysis comparing actual values against stored limits, use the verdict column (authoritative) rather than re-evaluating actual_value against the stored limit columns when dynamic limits are in use.