Evaluations - workflow details¶
This page contains the detailed workflow for the evaluations (batch processing) entry point.
Command¶
cd src/opensage/evaluations
python cybergym/cybergym_vul_detection.py run \
    --agent-id my_agent \
    --config-path /path/to/config.toml \
    --max_llm_calls 75 \
    --use_multiprocessing \
    --max_workers 3
Step-by-Step Workflow¶
Step 1: Script Initialization¶
- Fire library parses command-line arguments
- Creates an Evaluation class instance with parameters:
    - agent_id: Identifier for the agent
    - config_path: Path to the TOML configuration
    - max_llm_calls: Maximum LLM calls per task
    - use_multiprocessing: Use processes vs threads
    - max_workers: Number of parallel workers
- Sets up logging and instrumentation (Langfuse, OpenTelemetry)
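For orientation, a minimal sketch of how such a Fire-driven entry point is typically wired; the real class in cybergym_vul_detection.py has more parameters and logic, and the exact signatures are assumptions:

import fire

class Evaluation:
    # Sketch only: CLI flags map onto these constructor parameters.
    def __init__(self, agent_id, config_path, max_llm_calls=75,
                 use_multiprocessing=False, max_workers=1):
        self.agent_id = agent_id
        self.config_path = config_path
        self.max_llm_calls = max_llm_calls
        self.use_multiprocessing = use_multiprocessing
        self.max_workers = max_workers

    def run(self):
        # Next steps: load dataset, prepare general env, generate samples, evaluate.
        pass

if __name__ == "__main__":
    fire.Fire(Evaluation)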
Step 2: Load Dataset¶
- Loads benchmark dataset (e.g., HuggingFace datasets, JSON files)
- Dataset contains multiple samples/tasks to evaluate
- Example: CyberGym dataset has vulnerability detection tasks
- Each sample contains:
- Task description
- Expected outputs (ground truth)
- Metadata (file paths, vulnerability info, etc.)
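For illustration, loading a HuggingFace-hosted benchmark could look like the sketch below; the dataset identifier and split are placeholders, not the actual benchmark location:

from datasets import load_dataset

# Placeholder identifier; the real benchmark location differs.
dataset = load_dataset("some-org/cybergym-tasks", split="test")
print(len(dataset), dataset[0].keys())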
Step 3: Prepare General Environment (_prepare_general_env)¶
This sets up shared resources used across all evaluation tasks.
3.1 Create Base Configuration¶
- Loads base configuration from TOML file
- Expands template variables
- Stores in class for later use
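As a sketch, loading the TOML file and expanding template variables might look like this; the ${AGENT_ID} placeholder and the use of string.Template are assumptions about the templating scheme:

import tomllib
from pathlib import Path
from string import Template

raw = Path("/path/to/config.toml").read_text()
# Expand template variables before parsing; the ${AGENT_ID} convention is assumed.
expanded = Template(raw).safe_substitute(AGENT_ID="my_agent")
base_config = tomllib.loads(expanded)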
3.2 Setup Evaluation Directories¶
self.eval_output_dir = Path(f"evals/{self.agent_id}/...")
self.eval_output_dir.mkdir(parents=True, exist_ok=True)
- Creates output directories for results
- Structure: evals/{agent_id}/{benchmark_name}/{timestamp}/
- Stores agent outputs, logs, artifacts
Step 4: Generate Samples (Parallel Execution)¶
The evaluation runs tasks in parallel. Choose one mode:
Mode A: Multiprocessing (generate())¶
with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
futures = {
executor.submit(_run_sample_in_process, self, sample): sample
for sample in self.dataset
}
- Each sample runs in separate process
- True parallelism (bypasses Python GIL)
- Processes are isolated (no shared memory)
- Requires serializable data
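The snippet above only submits the work; collecting the per-sample results usually follows a pattern like this sketch (the error handling is an assumption):

from concurrent.futures import as_completed

results = []
for future in as_completed(futures):
    sample = futures[future]
    try:
        results.append(future.result())
    except Exception as exc:
        # One failed sample should not abort the whole evaluation run.
        print(f"sample failed: {exc}")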
Mode B: Multithreading (generate_threaded())¶
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
futures = {
executor.submit(run_sample_in_thread, sample): sample
for sample in self.dataset
}
- Each sample runs in separate thread
- Shared memory (can share resources)
- Limited by GIL for CPU-bound tasks
- Better for I/O-bound operations
Mode C: Single Thread (generate_single_thread())¶
- Sequential execution, one sample at a time
- Used for debugging, since issues are easier to isolate and reproduce
- Much slower
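Conceptually this mode reduces to a plain loop over the dataset, as in this sketch (method name as referenced in Step 5; the exact signature is assumed):

results = []
for sample in self.dataset:
    # Same per-sample pipeline as the parallel modes, just run inline.
    results.append(self._generate_sample(sample))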
Step 5: Process Each Sample (_generate_sample or _run_sample_in_process)¶
For each sample in the dataset:
5.1 Create Evaluation Task¶
- Extracts sample data
- Creates an EvaluationTask object with:
    - session_id: Unique ID for this task
    - sample: Original sample data
    - session: Will be created next
    - Metadata (task name, description, etc.)
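A dataclass sketch of a plausible shape for this object; the exact definition is an assumption, and prompt is included only because later steps use task.prompt:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvaluationTask:
    session_id: str               # unique ID for this task
    sample: dict[str, Any]        # original sample data
    prompt: str                   # task prompt sent to the agent in step 5.6
    session: Any = None           # OpenSage session, created in step 5.2
    metadata: dict[str, Any] = field(default_factory=dict)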
5.2 Create OpenSage Session¶
import opensage
session = opensage.get_session(
session_id=task.session_id,
config_path=self.config_path,
)
- Creates isolated OpenSage session for this task
- Loads configuration
- Each task gets its own session (isolation)
5.3 Prepare Task-Specific Environment (_prepare_environment)¶
This is benchmark-specific. Example for CyberGym:
- Extract code/data
    - Extracts source code to sandbox
    - Copies test files, build scripts
    - Sets up project structure
- Initialize sandboxes
    session.sandboxes.initialize_shared_volumes()
    await session.sandboxes.launch_all_sandboxes()
    await session.sandboxes.initialize_all_sandboxes()
    - Creates shared volumes
    - Launches required sandbox containers
    - Initializes sandboxes (tools, dependencies)
- Set source directory
    - Tells tools where to find source code
- Git repository setup (if applicable)
    - Finds git repository in sandbox
    - Checks out main/master branch
    - Updates src_dir_in_sandbox to repo path
5.4 Load Agent¶
- Imports agent module
- Calls the mk_agent() function with the session ID
- Agent is configured for this specific session
- Agent has access to task-specific sandboxes and resources
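A sketch of this step; the module path is a placeholder and the mk_agent() keyword argument is an assumption:

import importlib

# Module path is hypothetical; the real one is derived from --agent-id.
agent_module = importlib.import_module(f"agents.{self.agent_id}")
agent = agent_module.mk_agent(session_id=task.session_id)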
5.5 Create ADK Session and Runner¶
inner_session_service = InMemorySessionService()
await inner_session_service.create_session(
app_name=app_name,
user_id=self.user_id + "_" + meta_data,
session_id=task.session_id,
state={"opensage_session_id": task.session_id},
)
runner = Runner(
agent=agent,
app_name=app_name,
session_service=inner_session_service,
)
- Creates ADK session that maps to OpenSage session
- Stores opensage_session_id in session state
- Creates ADK Runner for agent execution
5.6 Run Agent¶
run_config = RunConfig(max_llm_calls=self.max_llm_calls)
async for event in runner.run_async(
user_id=user_id,
session_id=task.session_id,
run_config=run_config,
new_message=types.Content(parts=[types.Part(text=task.prompt)]),
):
...
- Runner starts agent execution
    - Sends prompt to agent
    - Agent enters reason-act loop
- Agent reasoning
    - Calls LLM for reasoning
    - Decides which tools to use
    - Generates function calls
- Tool execution
    - Runner executes tools in sandbox
    - Tools access session resources
    - Results returned to agent
- Iteration
    - Agent processes tool results
    - Decides next action
    - Continues until completion or max calls
- Completion
    - Agent generates final response
    - Runner finishes execution
    - Events collected
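The loop body elided in the snippet above typically accumulates events and pulls out the final text answer, roughly as in this sketch (exact ADK event attributes may differ):

events = []
agent_response = None
async for event in runner.run_async(
    user_id=user_id,
    session_id=task.session_id,
    run_config=run_config,
    new_message=types.Content(parts=[types.Part(text=task.prompt)]),
):
    events.append(event)
    # The final-response event carries the agent's answer text.
    if event.is_final_response() and event.content and event.content.parts:
        agent_response = event.content.parts[0].text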
5.7 Collect Results¶
result = {
"session_id": task.session_id,
"prompt": task.prompt,
"response": agent_response,
"events": events,
"metadata": {...},
}
- Extracts agent response
- Collects execution metadata:
- Number of LLM calls
- Tools used
- Execution time
- Errors (if any)
5.8 Save Results¶
- Saves result to file (JSON)
- Location: evals/{agent_id}/{benchmark}/results/{task_id}.json
- Includes full event history for analysis
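A sketch of the save step following the layout above; the filename key and serialization details are assumptions:

import json

results_dir = self.eval_output_dir / "results"
results_dir.mkdir(parents=True, exist_ok=True)
out_path = results_dir / f"{task.session_id}.json"
# default=str keeps non-JSON-serializable event fields from breaking the dump.
out_path.write_text(json.dumps(result, indent=2, default=str))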
5.9 Cleanup Task Session¶
- Stops sandbox containers
- Removes shared volumes
- Cleans up session resources
- Frees Docker resources
Step 6: Collect All Results¶
After all samples complete:
- Aggregates results from all tasks
- Collects statistics:
- Success rate
- Average execution time
- Tool usage patterns
- Error rates
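A sketch of this aggregation over the saved per-task files; the metadata field names and the results path are assumptions:

import json
from pathlib import Path

results_dir = Path("evals/my_agent/cybergym/results")  # placeholder path
results = [json.loads(p.read_text()) for p in results_dir.glob("*.json")]
# "error" and "execution_time" are assumed metadata fields.
success_rate = sum(1 for r in results if not r["metadata"].get("error")) / len(results)
avg_time = sum(r["metadata"].get("execution_time", 0) for r in results) / len(results)
print(f"success rate: {success_rate:.2%}, avg time: {avg_time:.1f}s")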
Step 7: Evaluate Results (evaluate())¶
- Load ground truth
    - Loads expected outputs from dataset
    - Loads agent results from files
- Compare outputs
    - Compares agent output vs ground truth
    - Calculates metrics:
        - Accuracy
        - Precision/Recall (if applicable)
        - Custom benchmark metrics
- Generate report
    - Creates evaluation report
    - Includes metrics, statistics, examples
    - Saves to evals/{agent_id}/{benchmark}/evaluation_report.json
- Display summary
    - Prints metrics to console
    - Shows top failures/successes
    - Provides analysis
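A minimal sketch of the compare-and-report stage for a binary vulnerability-detection label; ground_truth, extract_label, and the results path are hypothetical placeholders for the dataset lookup and benchmark-specific output parsing:

import json
from pathlib import Path

results_dir = Path("evals/my_agent/cybergym/results")  # placeholder path
ground_truth = {}  # hypothetical mapping: session_id -> expected label

def extract_label(response: str) -> bool:
    # Hypothetical parser; real benchmarks define their own output format.
    return "vulnerable" in response.lower()

report = {"total": 0, "correct": 0}
for path in results_dir.glob("*.json"):
    result = json.loads(path.read_text())
    predicted = extract_label(result["response"])
    expected = ground_truth.get(result["session_id"])
    report["total"] += 1
    report["correct"] += int(predicted == expected)
report["accuracy"] = report["correct"] / max(report["total"], 1)
Path("evaluation_report.json").write_text(json.dumps(report, indent=2))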