Maestro Helpcenter

Troubleshooting & Errors

Symptoms → Cause → Resolution. Start with the section that matches your situation; use the Error Reference for specific error messages.


Common Startup Problems

API container fails to start or stays unhealthy

Check the logs first:

Bash
docker compose logs api

Cause: cannot reach PostgreSQL

fail: Microsoft.EntityFrameworkCore[...] — Connection refused

The API exits immediately if it cannot reach PostgreSQL on startup. Verify PostgreSQL is healthy before the API starts:

Bash
docker compose ps           # is postgres "healthy"?
docker compose logs postgres
docker exec workflowengine-postgres pg_isready -U postgres

If PostgreSQL is not healthy: check disk space, check the Docker volume is mounted, and check POSTGRES_PASSWORD matches between the central server and station .env files.

Cause: migration failure

System.Exception: An error occurred while applying migrations

The API applies EF Core migrations on every startup. A failed migration leaves the database in a partial state. Do not restart in a loop — restore from a backup and contact support with the full stack trace from docker compose logs api.

Cause: port already in use

System.Net.Sockets.SocketException: Address already in use

Another process is listening on port 7000. Find and stop it:

PowerShell
# Windows
netstat -ano | findstr :7000

# Linux
sudo ss -tlnp | grep :7000

UI is blank or shows a connection error on load

The Blazor UI requires the API to be reachable at its configured ApiUrl. Open the browser developer console (F12) and check for network errors.

Check ApiUrl configuration:

Bash
docker compose exec ui env | grep ApiUrl
# Should show: ApiUrl=http://api:7000  (Docker) or http://localhost:7000 (local)

If the UI and API are on different machines, ApiUrl must use the API machine's IP address, not localhost.

Check the API health endpoint:

Bash
curl http://localhost:7000/health
# Expected: {"status":"Healthy"}

SignalR connection shows a persistent red dot

The UI shows a red connection indicator (🔴) when the WebSocket connection to the API drops. It reconnects automatically — if the dot stays red for more than 10 seconds:

  1. Verify the API is running: curl http://localhost:7000/health

  2. If behind a reverse proxy, confirm the proxy forwards Upgrade and Connection headers (required for WebSocket)

  3. Check for a network timeout or idle-connection reset policy on the proxy or load balancer — SignalR keepalives require WebSocket to stay open

  4. Check browser console for WebSocket connection failed errors


Runner does not appear in system health

Bash
curl http://localhost:7000/api/health/version
# Expected: "runners": [{"identifier": "dotnet", "healthy": true}, ...]

If a runner is missing or "healthy": false:

Bash
docker compose logs dotnet-runner   # startup errors
docker compose logs python-runner

docker compose ps                   # is the runner container running?
docker compose restart dotnet-runner

Cause: runner container failed to start. Look for DLL load errors, missing volume mounts, or port conflicts. The gRPC ports default to 7050 (.NET) and 7051 (Python).

Cause: runner not yet warm after fresh install. After docker compose up the runners take 5–15 seconds to load assemblies before accepting gRPC connections. If get_system_health shows unhealthy immediately after startup, wait 15 seconds and recheck before investigating further.
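
The wait-and-recheck advice above can be scripted. What follows is a minimal Python sketch, assuming the /api/health/version response is a JSON object with a "runners" list shaped like the expected output shown earlier. The function names and the fetch callable are hypothetical and not part of Maestro; fetch stands in for whatever HTTP call you use to load and decode that response.

```python
import time
from typing import Callable, Iterable


def unhealthy_runners(payload: dict, expected: Iterable[str]) -> list[str]:
    """Return the expected runner identifiers that are missing or not healthy."""
    healthy = {
        runner.get("identifier")
        for runner in payload.get("runners", [])
        if runner.get("healthy") is True
    }
    return [name for name in expected if name not in healthy]


def wait_for_runners(
    fetch: Callable[[], dict],
    expected: Iterable[str],
    timeout_s: float = 15.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Poll until every expected runner is healthy or timeout_s has been waited.

    Returns the runners still unhealthy at the end (an empty list means all are healthy).
    """
    expected = list(expected)
    waited = 0.0
    while True:
        pending = unhealthy_runners(fetch(), expected)
        if not pending or waited >= timeout_s:
            return pending
        sleep(interval_s)
        waited += interval_s
```

If the returned list is still non-empty after the timeout, continue with the container log checks above. The sketch does not catch exceptions raised by fetch, so an unreachable API surfaces as an error rather than as an unhealthy runner.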


pgAdmin cannot connect to PostgreSQL

pgAdmin is a convenience tool; it has no effect on Maestro operation. For connection issues:

  1. Open http://localhost:5088

  2. In pgAdmin, go to Servers → right-click → Connect Server

  3. Hostname: postgres (Docker network name, not localhost)

  4. Username: postgres

  5. Password: the value of POSTGRES_PASSWORD in your .env

See the pgAdmin guide in the Maestro documentation for a full setup walkthrough.


Common Execution Failures

"File not found on server" — test will not start

The Test Monitor validates the YAML file path against the API server before enabling the Start button.

Cause: The path entered does not exist on the server.

Resolution:

  1. Call GET /api/packages/test-files to list every available YAML file with its exact absolute server-side path

  2. Use that path verbatim — do not guess or construct paths manually

  3. Verify the package has been activated (POST /api/packages/{name}/activate)


Test starts but immediately shows ABORTED or UNDETERMINED

Cause 1: runner unavailable at execution time

The API sent a gRPC ExecuteStep request and received no response (timeout or connection refused). Check runner health:

Bash
docker compose logs dotnet-runner --tail 50
curl http://localhost:7000/api/health/version

If the runner is shown as healthy but executions still abort, check whether timeout_ms on the failing step is too short for the step's realistic duration.

Cause 2: first run after a fresh package install

The .NET runner loads assemblies lazily on first use. The very first execution after installing a new package may fail with runner unavailable while the CLR loads the new DLL. This is normal. Wait 10 seconds and retry.

Cause 3: YAML references a runner or assembly that does not exist

gRPC status: NOT_FOUND — assembly 'MyTests.dll' not registered

The YAML step references an assembly or class that is not in the installed package. Check the assembly:, class:, and method: fields against what is actually deployed in the package's assemblies/ directory.


A step hangs indefinitely

Cause: timeout_ms not set on a hardware step

A step that communicates with an instrument but has no timeout_ms declared will block the runner indefinitely if the instrument stops responding. The run never transitions to a terminal state.

Resolution:

  1. Click Abort in the Test Monitor UI, or call POST /api/testexecution/{id}/abort

  2. Add timeout_ms to the step — every step that calls hardware must have one:

    YAML
    timeout_ms: 10000   # 10 seconds — adjust to the realistic worst-case duration
    

Cause: prompt step waiting for operator input in an automated run

A type: prompt step blocks until an operator clicks a button unless the run was started with unattendedMode: true. In a CI/CD or other automated context, nobody is there to click. Add unattendedMode: true to the execution request, and ensure every prompt step in the test has either a Continue button or a declared input.default.


Step fails with verdict FAIL but the instrument is reading correctly

Cause 1: limits inverted in YAML

low_limit: 5.25
high_limit: 4.75   ← inverted

The validator catches this (low_limit must be < high_limit), but it is possible to deploy an unvalidated file. Expand the measurement in the Test Monitor or Test Results to check the stored limit values.
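
To see why inverted limits fail every reading, here is a minimal Python sketch of the check. It is an illustration only and not the Maestro validator: the function names are hypothetical, and the inclusive range comparison is an assumption about how limits are evaluated.

```python
from typing import Optional


def limit_error(low_limit: float, high_limit: float) -> Optional[str]:
    """Return an error message if the limits are inverted or equal, else None."""
    if low_limit >= high_limit:
        return (
            "low_limit must be less than high_limit "
            f"(got low_limit={low_limit}, high_limit={high_limit})"
        )
    return None


def in_limits(value: float, low_limit: float, high_limit: float) -> bool:
    """Range check; inclusive bounds are an assumption of this sketch."""
    return low_limit <= value <= high_limit
```

With the inverted pair above, in_limits(5.0, 5.25, 4.75) is False for a perfectly good 5.0 V reading, because no value can be both at least 5.25 and at most 4.75.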

Cause 2: wrong variable used in measurement.value

The measurement evaluates {{variable_name}}, but the variable was not set by the step (typo in outputs:, or the runner returned a different key). Expand the step in Test Results and check the actual stored value — if it is 0, empty, or a default, the variable was not resolved.

Cause 3: measurement type mismatch

A string or boolean measurement evaluated as numeric, or vice versa. Check the type: field in the YAML measurement block.


Execution completes but no measurements appear in results

Cause: step uses type: mock without a measurement: block

Mock steps without a measurement block record no data. This is intentional for stub steps — add a measurement: block if the step should record a result.

Cause: runner returned measurements but in an unrecognised format

The .NET or Python runner must return measurement objects using the SDK types (NumericMeasurementPoint, BooleanMeasurementPoint, StringMeasurementPoint). Returning a plain Dictionary<string, string> stores output variables but not measurements.

Cause: step was skipped

A skipped step (precondition false, enabled: false) records no measurements. Check the step status column in the results table.


Variable substitution produces literal {{variable_name}}

The template was not resolved. Causes, in order of likelihood:

  1. Variable not declared. The variable must appear in the variables: block at the top of the YAML. If it is set only by a runner step, it still needs to be declared (with any default value) in variables:.

  2. Legacy Scriban syntax. {{test.variable}} is old syntax — use {{variable}} (flat name, no dot prefix).

  3. Dot-namespaced variable in wrong context. {{cfg.X}} works in parameters: and measurement.value:; it does not work in precondition:.

  4. Arithmetic in template. {{voltage + offset}} is not supported — compute in runner code and return the result as a variable.
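
Three of the causes above (undeclared variables, legacy test. prefixes, and arithmetic) can be spotted with a text scan before deployment. The following Python sketch is a hypothetical helper and not the Maestro validator. It assumes you pass in the set of names from the variables: block yourself, and it deliberately skips other dotted names such as cfg.X, because whether those resolve depends on where they are used.

```python
import re

# Matches {{ ... }} and captures the inner expression without surrounding spaces.
TEMPLATE = re.compile(r"\{\{\s*(.*?)\s*\}\}")
# A plain or dotted identifier such as voltage or cfg.DMM_VISA.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def lint_templates(text: str, declared: set[str]) -> list[str]:
    """Flag {{...}} templates that are likely to be left unresolved."""
    problems = []
    for match in TEMPLATE.finditer(text):
        expr = match.group(1)
        shown = "{{" + expr + "}}"
        if not IDENTIFIER.match(expr):
            problems.append(f"{shown}: expressions are not supported in templates")
        elif expr.startswith("test."):
            problems.append(f"{shown}: legacy syntax, use the flat variable name")
        elif "." not in expr and expr not in declared:
            problems.append(f"{shown}: not declared in variables:")
    return problems
```

For example, lint_templates('value: "{{voltage + offset}}"', {"voltage", "offset"}) reports the arithmetic expression even though both names are declared.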


Precondition always evaluates to false (or throws)

Cause 1: dot-namespaced variable

YAML
precondition: "cfg.DMM_VISA != ''"   # ❌ — DynamicExpresso reads this as member access

Copy the value into a bare variable first:

YAML
variables:
  dmm_visa: ""

# setup step sets dmm_visa from cfg.DMM_VISA...

precondition: "dmm_visa != ''"   # ✅

Cause 2: type mismatch in comparison

Variables from runner output are always strings. precondition: "count == 3" will fail if count is the string "3" — use precondition: "count == '3'" or convert explicitly in runner code.

Cause 3: exception during evaluation

If the precondition expression throws (e.g. referencing an undefined variable), the step is marked ABORTED. Check the step's log output in Test Results for the exception message.


Package download fails or hangs

Cause: registry refresh not run before download

The registry must be refreshed before packages appear in the catalog. Click ↻ Refresh Registry in the Packages UI or call POST /api/packages/refresh.

Cause: Git authentication failure

The API runs git clone using the system Git binary. If authentication is not configured, the clone hangs waiting for credentials (or fails with 401).

Test Git connectivity directly on the server:

Bash
docker compose exec api git clone <package-git-url> /tmp/test-clone

If this fails, configure a deploy key or embed a token in the URL (see installation guide, §4.4).

Cause: RegistryUrl not configured

RegistryRefreshFailed: RegistryUrl is not configured

Add PackageRegistry:RegistryUrl to appsettings.json or set the environment variable PackageRegistry__RegistryUrl in docker-compose.yml.


Performance Issues

Steps are slow to start (high inter-step latency)

Expected inter-step overhead is 5–20 ms. Significantly higher values indicate:

Database latency. The API writes a step_result row after every step. If PostgreSQL is on a slow network link (multi-station deployment), this adds latency to every step. The database should be on the same LAN segment as the station.

Redis latency. Variable reads and writes go to Redis synchronously during step execution. Verify Redis is healthy and local:

Bash
docker exec workflowengine-redis redis-cli PING   # → PONG
docker exec workflowengine-redis redis-cli --latency -i 1

Expected latency from within the Docker network: < 1 ms.

Runner startup time. The first step in a test after a cold container start includes runtime warm-up (for example, CLR start-up in the .NET runner). This is a one-time cost per session, not per step.


Test Results search is slow

The measurements table grows quickly at high throughput. Without adequate indexes or table partitioning, large-range queries slow down.

Check query plan:

SQL
EXPLAIN ANALYZE
SELECT * FROM measurements
WHERE measurement_name = 'VOUT_3V3'
  AND timestamp > NOW() - INTERVAL '30 days';

If the query plan shows a sequential scan, verify the indexes exist:

SQL
-- psql meta-command: look for an index on (measurement_name, timestamp)
\d measurements

For sites running > 1,000 executions per day, partition the measurements table by month and archive older partitions. See the scaling section in How Maestro Works.


Blazor UI is slow or unresponsive during a test

The Blazor UI receives SignalR events on every step and every measurement. A test with many rapid steps (< 100 ms per step) or many measurements per step can saturate the browser's event loop.

Mitigation:

  • Reduce measurement count per step — log only measurements that need limit evaluation; use operator: log for informational values

  • Close unused browser tabs pointed at the Test Monitor

  • Use the REST API for polling in CI/CD pipelines instead of the Blazor UI


Intermittent / Flaky Behaviour

A step sometimes times out but usually succeeds

Cause: timeout_ms set too close to the instrument's typical response time

Add headroom. If an instrument typically responds in 800 ms, set timeout_ms: 3000. Instrument response time varies with network conditions, temperature, and measurement range.
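
One way to pick the headroom is to derive it from observed response times. The Python sketch below is a rule of thumb and not a Maestro feature: the factor of 3 and the rounding up to a whole second are assumptions chosen so that the 800 ms example above yields 3000, and the function name is hypothetical.

```python
import math


def suggest_timeout_ms(observed_ms: list[float], factor: float = 3.0) -> int:
    """Suggest a timeout: slowest observed response times factor, rounded up to a whole second."""
    if not observed_ms:
        raise ValueError("need at least one observed response time")
    return math.ceil(max(observed_ms) * factor / 1000) * 1000
```

Feed it response times measured under realistic conditions (worst-case range, warm and cold instrument), since the suggestion is only as good as the slowest sample it has seen.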

Cause: instrument in an unready state on retry

If the previous step leaves the instrument in an intermediate state, it needs time to settle before the next measurement. Use a type: delay step (not Thread.Sleep inside runner code) between the steps to let it settle:

YAML
- name: "Settle delay"
  type: delay
  duration: 0.5   # 500 ms

Cause: MSTest parallelization enabled

If test code is being developed with MSTest and multiple methods run concurrently, they can issue conflicting commands to the same instrument. Add [assembly: DoNotParallelize] to MSTestSettings.cs. See the SDK debugging guide.


Results are inconsistent across runs on the same unit

Cause: station configuration changed between runs

Every execution records a config snapshot. Compare the config_snapshot JSONB field on two execution records to see if any cfg.* values changed between runs:

SQL
SELECT id, started_at, config_snapshot
FROM test_executions
WHERE serial_number = 'UNIT-042'
ORDER BY started_at DESC
LIMIT 5;
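
Once two config_snapshot values have been retrieved, comparing them by eye is error-prone. The Python sketch below shows one way to diff them, assuming each snapshot decodes to a flat JSON object of key/value pairs. That shape, the function name, and the sample keys and values are all assumptions made for illustration.

```python
import json


def diff_snapshots(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs.

    A key present on only one side is reported with None for the missing value.
    """
    changed = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed


# Two hypothetical snapshots, as they might be read from the JSONB column.
older = json.loads('{"DMM_VISA": "TCPIP0::10.0.0.5::INSTR", "OFFSET": "0.10"}')
newer = json.loads('{"DMM_VISA": "TCPIP0::10.0.0.9::INSTR", "OFFSET": "0.10"}')
for key, (before, after) in diff_snapshots(older, newer).items():
    print(f"{key}: {before!r} -> {after!r}")
```

An empty result means the configuration was identical for both runs, so the inconsistency has another cause.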

Cause: variable from a previous run persisted

Variables are scoped to the execution realm in Redis and deleted when the test completes. If a test aborted before Redis cleanup, the realm may have stale entries. Redis TTLs should expire these within minutes. If the issue persists, list the realm keys from the Redis CLI (KEYS realm:*, or redis-cli --scan --pattern 'realm:*' to avoid blocking a busy server) and delete the orphaned realms manually.

Cause: force: "pass" or force: "fail" left in YAML

The force: field is a development override. The YAML validator emits a warning when it is present but does not block the run. Search YAML files for force: before promoting a package to Released.


SignalR events are delayed or dropped

Cause: network congestion or proxy buffering

SignalR events are sent over WebSocket. A proxy or firewall that buffers WebSocket frames can delay live step updates. Verify the proxy configuration passes WebSocket traffic without buffering.

Cause: browser tab was in the background

Chrome and Edge throttle JavaScript in background tabs. The SignalR client processes the pending events when the tab becomes active again, so the final result is correct even though the live stream appeared to pause. This does not affect test execution or result storage; only the live display is affected.


Runner reports a different version after upgrade

After docker compose pull && docker compose up -d, verify the runner images were actually replaced:

Bash
docker compose images   # check the IMAGE ID column changed

docker compose exec dotnet-runner dotnet --list-runtimes   # confirm .NET runtime version (dotnet --version requires the SDK)

If the old image is still running, the container was probably not recreated from the newly pulled image. Pull again and force the containers to be recreated:

Bash
docker compose pull
docker compose up -d --force-recreate

Configuration Conflicts

Station-local config not overriding global config

Global config has StationId = NULL; station-local has StationId = "ST-01". The merge rule is: station-local wins. Check that:

  1. The station-local entry has the correct StationId — a typo here means it never matches the running station

  2. The StationId in appsettings.json (or STATION_NAME in .env) matches exactly what was used when creating the station-local entry

PowerShell
# View the merged config for the running station
Invoke-RestMethod http://localhost:7000/api/config/merged/ST-01

# Compare against raw global config
Invoke-RestMethod http://localhost:7000/api/config?stationId=null

# Compare against raw station-local
Invoke-RestMethod http://localhost:7000/api/config?stationId=ST-01
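
The merge rule can be pictured with a few lines of Python. This is a sketch of the behaviour described above and not Maestro's implementation. The entry layout (StationId or None, key, value), the function name, and the sample keys are assumptions.

```python
from typing import Optional

# (StationId, or None for a global entry; key; value)
Entry = tuple[Optional[str], str, str]


def merged_for(entries: list[Entry], station_id: str) -> dict[str, str]:
    """Merge config entries for one station; station-local values win."""
    merged = {}
    # Apply global entries first.
    for sid, key, value in entries:
        if sid is None:
            merged[key] = value
    # Then overlay station-local entries, which must match station_id exactly.
    for sid, key, value in entries:
        if sid == station_id:
            merged[key] = value
    return merged
```

An entry stored under "ST-1" is ignored when the running station is "ST-01", which is exactly the typo case described in the checklist: the global value silently stays in effect.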

A config value change has no effect

Configuration is merged and injected at test start. Changes to Station Config take effect on the next execution — not mid-run. If a test is already running, it uses the snapshot taken at the moment it started.

Changes to appsettings.json or .env files require an API container restart before they take effect.


Multiple stations writing conflicting config

In a shared-database deployment, all stations write to the same station_config table. Global config entries (StationId = NULL) are shared. If two stations both write a global key simultaneously, the last write wins. Use station-local entries for any value that should differ per station, and global entries only for values that truly apply to all stations.


Reading Logs in Maestro

Logs in the Test Monitor UI

During an active test, the Logs panel on the Test Monitor page shows structured log entries streamed in real time from runner code. Entries are colour-coded by level:

| Level | Colour | Meaning |
| --- | --- | --- |
| Information | White | Normal progress messages |
| Warning | Yellow | Non-fatal issues; investigate, but execution continues |
| Error | Red | Exceptions and failures; usually indicates a step is about to fail |

Click any log row to expand the full message. Log entries are also stored in PostgreSQL and accessible via GET /api/testexecution/{id}/logs after the run completes.

Logs in Test Results

Open any historical result and scroll to the Logs section. Logs are stored per execution and filtered by step — use the step filter dropdown to isolate log output from a specific step.

Container logs

For infrastructure-level diagnostics (startup errors, container crashes, OOM kills):

Bash
# Follow all service logs
docker compose logs -f

# Single service
docker compose logs api
docker compose logs dotnet-runner
docker compose logs python-runner
docker compose logs postgres
docker compose logs redis

# Last N lines
docker compose logs --tail 100 api

MCP / AI assistant log access

The get_service_logs MCP tool retrieves recent log lines from any named service without requiring SSH access:

get_service_logs(service="dotnet-runner", tail=200)
get_service_logs(service="api", tail=100)

The get_system_events MCP tool returns recent Docker lifecycle events (container starts, exits, OOM kills) — useful for diagnosing crash-loop patterns.


Where Logs Are Stored

| Log type | Storage location | Access method |
| --- | --- | --- |
| Execution logs (runner output) | PostgreSQL execution_logs table | UI: Test Results → Logs tab; API: GET /api/testexecution/{id}/logs |
| Step results | PostgreSQL step_results table | UI: Test Results; API: GET /api/testexecution/{id}/report |
| API application logs | Container stdout/stderr | docker compose logs api |
| Runner application logs | Container stdout/stderr | docker compose logs dotnet-runner / python-runner |
| Database logs | PostgreSQL container | docker compose logs postgres |
| System events | Docker daemon | MCP get_system_events; or docker events on the host |

Logs are not written to disk files by default. All structured logging goes to container stdout (captured by Docker). To forward logs to a centralised log system (Seq, Elasticsearch, Splunk), add a logging sink to the API's appsettings.json:

JSON
"Serilog": {
  "WriteTo": [
    { "Name": "Console" },
    { "Name": "Seq", "Args": { "serverUrl": "http://seq-server:5341" } }
  ]
}

Error Reference

Specific error messages, their causes, and resolutions. Messages are quoted as they appear in logs or the UI.


RegistryUrl is not configured

Symptom: Clicking Refresh Registry in the Packages UI shows a failure banner with this message.

Cause: The PackageRegistry:RegistryUrl setting is missing from the API configuration.

Resolution: Add to appsettings.json:

JSON
"PackageRegistry": {
  "RegistryUrl": "https://gitlab.example.com/testdevelopment/tat-registry.git"
}

Or set the environment variable PackageRegistry__RegistryUrl in docker-compose.yml. Restart the API container after changing appsettings.json.


Authentication failed for repository

Symptom: Registry refresh or package download fails with a Git authentication error.

Cause: The git clone or git pull running inside the API container cannot authenticate to GitLab.

Resolution:

  1. Test Git connectivity from inside the container: docker compose exec api git clone <url> /tmp/test

  2. Configure authentication: embed a token in the URL (https://oauth2:<TOKEN>@gitlab.example.com/...), or mount an SSH key into the container

  3. For production: use SSH deploy keys (read-only) on the registry and each package repository


low_limit must be less than high_limit

Symptom: YAML Validator returns this error; or the step fails with this error at runtime.

Cause: The low_limit value is greater than or equal to high_limit in a measurement block.

Resolution: Swap the values or correct whichever limit is wrong. Run py validate.py before committing YAML.


Undeclared variable 'x'

Symptom: YAML Validator returns this error for a {{x}} template reference.

Cause: x is used in a template but not declared in the variables: block.

Resolution: Add x to the variables: block with an appropriate default value. Variables set only by runner output still need to be declared in variables: for the validator to accept them.


Unknown step type 'foo'

Symptom: YAML Validator rejects a step definition.

Cause: The type: field contains an unrecognised value.

Resolution: Use one of the supported step types: delay, mock, prompt, sequence. For runner steps, omit type: entirely and use the runner: field (runner: dotnet, runner: python).


max_iterations is required

Symptom: YAML Validator rejects a step with repeat: defined.

Cause: A repeat: block is present but max_iterations: was not declared.

Resolution: Add max_iterations: N to the repeat: block. This is a mandatory safety cap — there is no default. Choose a value high enough for the realistic worst case but low enough to prevent runaway loops.
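
The role of the cap is easiest to see in a generic bounded loop. The Python sketch below is not the Maestro executor and makes no claim about how repeat: evaluates its exit condition. It assumes a hypothetical step callable that returns True when the loop should stop.

```python
from typing import Callable


def repeat_until(step: Callable[[], bool], max_iterations: int) -> tuple[bool, int]:
    """Run step() until it returns True or max_iterations attempts have been made.

    Returns (succeeded, attempts_made).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    for attempt in range(1, max_iterations + 1):
        if step():
            return True, attempt
    return False, max_iterations
```

Without the cap, a condition that never becomes true (for example, an instrument that never reaches its target) would keep the loop running indefinitely.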


runner unavailable / gRPC status: UNAVAILABLE

Symptom: A step fails immediately with this error; or a test aborts at the first runner step.

Cause 1: The runner container is not running. Check docker compose ps and start it.

Cause 2: The runner is still initialising (cold start after package install). Wait 10–15 seconds and retry.

Cause 3: The PythonRunnerUrl or .NET runner gRPC address in appsettings.json is wrong. Verify the ports match the runner containers.

Cause 4: A firewall or container network policy is blocking the gRPC port (7050/7051).


assembly 'MyAssembly.dll' not found

Symptom: A .NET runner step fails with this message.

Cause: The YAML step's assembly: field references a DLL that is not present in the activated package's assemblies/ directory.

Resolution:

  1. Verify the DLL was built and included in the package

  2. Confirm the package is activated (POST /api/packages/{name}/activate)

  3. Check the exact file name — the match is case-sensitive on Linux

  4. After fixing and re-deploying, call trigger_package_refresh and re-activate


Unattended mode: prompt step '...' requires value input but no default is defined

Symptom: A step in an unattended run fails with this error and records a FAIL verdict.

Cause: A type: prompt step with a value-input control (input.mode: number / text / boolean / list) does not declare input.default. In unattended mode, there is no operator to type a value, so the step cannot proceed.

Resolution: Add input.default: to the prompt step:

YAML
input:
  mode: number
  variable: ambient_temp
  default: "25"
  unit: "°C"

Or remove the value-input from the step if it is not needed for automated runs.


[UNATTENDED] Auto-responding to prompt '...'

Symptom: This warning appears in the execution logs. It is not an error.

Cause: Unattended mode is active and the executor auto-clicked a prompt button. This is the designed behaviour.

Action: Review whether the auto-selected button (Continue, Pass, or Fail) is appropriate for this prompt in an automated context. If Fail was selected, it means no Continue or Pass button exists in the step — add one to give the executor a safe path through.
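
The selection behaviour described above can be sketched as a simple preference list. This Python example is illustrative only. The text implies that Fail is chosen only when neither Continue nor Pass exists, but the relative order of Continue and Pass is an assumption, as is the function name.

```python
from typing import Optional

# Assumed preference order; the source only implies that Fail is the last resort.
PREFERENCE = ("Continue", "Pass", "Fail")


def pick_auto_button(buttons: list[str]) -> Optional[str]:
    """Return the button an unattended run would click, or None if none qualifies."""
    for candidate in PREFERENCE:
        if candidate in buttons:
            return candidate
    return None
```

A prompt that offers only Retry and Fail therefore resolves to Fail, which is the case the warning asks you to review.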


Station ID mismatch / config changes not appearing

Symptom: Station-local configuration is set for ST-01 but changes are not visible in the merged config.

Cause: The StationId configured in appsettings.json or STATION_NAME in .env does not match the StationId used when creating the config entries.

Resolution:

  1. Check the station ID the API is using: GET /api/config/station-id

  2. Check the key used in the config entry: GET /api/config?stationId=ST-01

  3. If they do not match, update the config entries to use the actual station ID, or correct STATION_NAME in .env and restart


no matching manifest for linux/arm/v7

Symptom: docker compose pull fails on a Raspberry Pi.

Cause: The 32-bit (armhf) Raspberry Pi OS is installed. Maestro images are built for linux/amd64 and linux/arm64 only.

Resolution: Reinstall the operating system using the 64-bit Raspberry Pi OS (listed as "Raspberry Pi OS (64-bit)" in Raspberry Pi Imager). The 32-bit OS cannot be upgraded in place.


pg_isready: could not connect to server

Symptom: Station cannot connect to the central PostgreSQL database.

Cause: PostgreSQL port 5432 is not reachable from the station.

Diagnosis:

Bash
docker run --rm postgres:16-alpine \
  pg_isready -h <central-ip> -p 5432 -U postgres

Resolution:

  • Verify the central server is running: docker compose ps on the central server

  • Check firewall rules on the central server allow inbound TCP on port 5432 from the station IP

  • Verify POSTGRES_HOST in the station .env is the correct IP address of the central server


force: "pass" in released YAML

Symptom: A step always returns PASS regardless of the actual measurement value. The verdict shows VerdictForced = true in the detailed report.

Cause: The force: "pass" field was left in the YAML after development. This bypasses measurement evaluation entirely.

Resolution: Remove force: from the step before promoting the package to Released. Search all YAML files: grep -r "force:" . before any release promotion.
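
For a release checklist or CI job, the same search can be done in Python so that commented-out lines are ignored and each hit carries a line number. This is a hypothetical helper and not a Maestro tool. It assumes test files use the .yaml or .yml extension.

```python
import re
from pathlib import Path

# A force: key at the start of a line, optionally as the first key of a list item.
FORCE_LINE = re.compile(r"^\s*(-\s*)?force\s*:")


def find_force_overrides(root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, line text) for every force: key under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".yaml", ".yml"}:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if FORCE_LINE.match(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

A release gate could call find_force_overrides on the package directory and refuse promotion when the list is non-empty.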


Test execution hangs with no step progress

Symptom: The Test Monitor shows the execution as Running but no steps are updating. The Logs panel shows nothing new.

Cause 1: timeout_ms missing on a hardware step. The runner is waiting for an instrument that will never respond. Click Abort, add timeout_ms to the step, and redeploy.

Cause 2: prompt step in an automated run without unattended mode. The engine is waiting for a button click that will never come. Click Abort in the UI, or call POST /api/testexecution/{id}/abort. Add unattendedMode: true to future automated executions.

Cause 3: runner process crash. The gRPC connection from the API to the runner was dropped mid-execution. Check docker compose logs dotnet-runner for an exception or OOM kill. The API will not automatically detect a runner crash during a step — it waits for timeout_ms before aborting.


type: mock step in a production run records no real measurement

Symptom: A step completes with PASS but the measured value is always the same static value from the YAML target: field.

Cause: The step type is type: mock, which is a development stub that returns the target: value without calling any runner code.

Resolution: Change type: mock to runner: dotnet or runner: python with the correct assembly:/module:, class:/function:, and method: fields. Mock steps must never appear in a Released package.


Measurements stored with correct verdict but wrong limit values

Symptom: The verdict column in the database is correct, but the low_limit / high_limit columns do not match what you expected.

Cause: The YAML used dynamic limits via variable substitution (e.g. low_limit: "{{computed_lower}}"). The stored limit columns reflect the YAML text, not the runtime-resolved value. The verdict is always computed against the resolved runtime value.

Resolution: When dynamic limits are in use, treat the verdict column as authoritative for post-hoc analysis. Do not re-evaluate actual_value against the stored limit columns, because they hold the YAML template text rather than the resolved limits.