WatsonX Tool Test Report

Comprehensive analysis of AI model tool calling capabilities

Generated: 2026-04-04 06:41:31 UTC
Iterations: 50
Client: watsonx
Models Tested: 26

📊 Summary Statistics

Total Models: 26 (models tested)
Tool Support: 10 (models with tool calling, 38.5%)
Response Handling: 9 (models using tool results, 90.0%)
Reliability: 8/10 (models with ≥90% success rate, 80.0%)
Fastest Model: 0.78s (meta-llama/llama-3-2-11b-vision-instruct)
Average Response Time: 1.20s (total processing time)

Full Support: 8
⚠️ Partial Support: 2
No Support: 16
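The percentages in the summary follow from the raw counts. The sketch below recomputes them as a sanity check; the variable names and the `pct` helper are illustrative and not part of the report tooling.

```python
# Counts copied from the summary statistics above.
total_models = 26
with_tool_calling = 10
using_tool_results = 9
reliable = 8  # models with >=90% success rate


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


tool_support_pct = pct(with_tool_calling, total_models)
response_handling_pct = pct(using_tool_results, with_tool_calling)
reliability_pct = pct(reliable, with_tool_calling)
no_support = total_models - with_tool_calling

print(f"Tool support:      {tool_support_pct}%")
print(f"Response handling: {response_handling_pct}%")
print(f"Reliability:       {reliability_pct}%")
print(f"No support:        {no_support} models")
```

Note that only Tool Support is measured against all 26 models. Response Handling and Reliability are measured against the 10 models that support tool calling.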

Model Performance History (Last 30 Days)

[Status grid: one row per model, one column per test date. Cell values were not preserved in this text version.]

Legend: Working Reliably, Tool Calls Only, Inconsistent Results, Previously Worked, Not Tested

Dates: 03/06, 03/09, 03/12, 03/15, 03/18, 03/21, 03/24, 03/27, 03/30, 04/02

Models: gpt-oss-120b, granite-3-2-8b-instruct, granite-3-2b-instruct, granite-3-3-8b-instruct, granite-3-8b-instruct, granite-4-h-small, llama-3-2-11b-vision-instruct, llama-3-2-3b-instruct, llama-3-2-90b-vision-instruct, llama-3-3-70b-instruct, llama-3-405b-instruct, llama-4-maverick-17b-128e-instruct-fp8, mistral-large, mistral-large-2512, mistral-medium-2505, mistral-small-3-1-24b-instruct-2503

🔍 Models with Tool Support

| Model | Tool Support | Response Handling | Call Time | Response Time | Total Time | Details |
|---|---|---|---|---|---|---|
| ibm/granite-3-8b-instruct | ✅ Reliable (50/50) | ⚠️ Partial (48/50) | 0.95s | 1.14s | 2.08s | Consistent success across all iterations |
| ibm/granite-4-h-small | ✅ Reliable (50/50) | ⚠️ Partial (49/50) | 0.47s | 0.37s | 0.85s | Consistent success across all iterations |
| meta-llama/llama-3-2-11b-vision-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.48s | 0.30s | 0.78s | Consistent success across all iterations |
| meta-llama/llama-3-2-90b-vision-instruct | ✅ Reliable (50/50) | ⚠️ Partial (40/50) | 1.31s | 0.70s | 2.01s | Inconsistent results across 50 iterations |
| meta-llama/llama-3-3-70b-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 1.17s | 1.18s | 2.35s | Consistent success across all iterations |
| meta-llama/llama-4-maverick-17b-128e-instruct-fp8 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.44s | 0.36s | 0.81s | Consistent success across all iterations |
| mistral-large-2512 | ✅ Reliable (50/50) | ❌ Never Handles (0/50) | 0.32s | 0.10s | 0.42s | Inconsistent results across 50 iterations |
| mistralai/mistral-medium-2505 | ⚠️ Unreliable (49/50) | ⚠️ Partial (48/49) | 0.55s | 0.33s | 0.87s | Consistent success across all iterations |
| mistralai/mistral-small-3-1-24b-instruct-2503 | ✅ Reliable (50/50) | ⚠️ Partial (45/50) | 0.49s | 0.31s | 0.79s | Consistent success across all iterations |
| openai/gpt-oss-120b | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.52s | 0.50s | 1.02s | Consistent success across all iterations |
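The two timing figures in the summary can be reproduced from the Total Time column. The sketch below averages the ten totals and then picks the fastest model. It assumes the fastest-model figure skips models that never handle tool results, which would explain why mistral-large-2512 (0.42s) is not reported as fastest. The report does not state that rule, so treat it as a guess.

```python
from statistics import mean

# (model, iterations where the tool result was handled, total time in seconds),
# copied from the table above.
rows = [
    ("ibm/granite-3-8b-instruct", 48, 2.08),
    ("ibm/granite-4-h-small", 49, 0.85),
    ("meta-llama/llama-3-2-11b-vision-instruct", 50, 0.78),
    ("meta-llama/llama-3-2-90b-vision-instruct", 40, 2.01),
    ("meta-llama/llama-3-3-70b-instruct", 50, 2.35),
    ("meta-llama/llama-4-maverick-17b-128e-instruct-fp8", 50, 0.81),
    ("mistral-large-2512", 0, 0.42),
    ("mistralai/mistral-medium-2505", 48, 0.87),
    ("mistralai/mistral-small-3-1-24b-instruct-2503", 45, 0.79),
    ("openai/gpt-oss-120b", 50, 1.02),
]

# Mean of the Total Time column across all ten tool-calling models.
average_total = round(mean(total for _, _, total in rows), 2)

# Assumption: only models that handled at least one tool result are eligible.
eligible = [row for row in rows if row[1] > 0]
fastest_model, _, fastest_time = min(eligible, key=lambda row: row[2])

print(f"Average total time: {average_total:.2f}s")
print(f"Fastest model: {fastest_model} ({fastest_time:.2f}s)")
```

Under these assumptions the output matches the summary: an average of 1.20s and meta-llama/llama-3-2-11b-vision-instruct as the fastest model at 0.78s.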

📋 Latest Test Results

Results from the most recent test execution (2026-04-04)

| Date | Model | Status | Time | Tool Success | Details |
|---|---|---|---|---|---|
| 2026-04-04 | gpt-oss-120b | Working | 1.023s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | granite-3-8b-instruct | Working | 2.082s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | granite-4-h-small | Working | 0.847s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | llama-3-2-11b-vision-instruct | Working | 0.778s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | llama-3-3-70b-instruct | Working | 2.347s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | llama-4-maverick-17b-128e-instruct-fp8 | Working | 0.809s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | mistral-medium-2505 | Working | 0.873s | 98.0% | Consistent success across all iterations |
| 2026-04-04 | mistral-small-3-1-24b-instruct-2503 | Working | 0.794s | 100.0% | Consistent success across all iterations |
| 2026-04-04 | llama-3-2-90b-vision-instruct | Unreliable | 2.010s | 100.0% | Inconsistent results across 50 iterations |
| 2026-04-04 | mistral-large-2512 | Partial | 0.416s | 100.0% | Inconsistent results across 50 iterations |
| 2026-04-04 | all-minilm-l6-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-3-1-8b-base | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-3-3-8b-instruct-np | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-8b-code-instruct | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-embedding-278m-multilingual | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-guardian-3-8b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-ttm-1024-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-ttm-1536-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | granite-ttm-512-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | llama-3-1-70b-gptq | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | llama-3-1-8b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | llama-guard-3-11b-vision | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | ms-marco-minilm-l-12-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | multilingual-e5-large | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | slate-125m-english-rtrvr-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2026-04-04 | slate-30m-english-rtrvr-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
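The Status column appears to follow from the per-model success counts, although the report does not spell out the rule. The sketch below is one hypothetical rule that reproduces every status listed for this run. The `classify` function, its parameters, and the 90% threshold (borrowed from the Reliability statistic) are assumptions, not the test harness's actual logic.

```python
def classify(tool_ok, tool_total, handled, handled_total, threshold=0.90):
    """Hypothetical status rule inferred from this report's data.

    tool_ok / tool_total:    iterations where the model issued a tool call
    handled / handled_total: iterations where it then used the tool result
    """
    if tool_ok == 0:
        return "Not Supported"  # every probe iteration failed
    if handled == 0:
        return "Partial"        # calls tools but never uses the result
    if tool_ok / tool_total < threshold or handled / handled_total < threshold:
        return "Unreliable"
    return "Working"


# Counts taken from the tool-support table and the probe results in this report.
samples = {
    "granite-3-8b-instruct": (50, 50, 48, 50),
    "mistral-medium-2505": (49, 50, 48, 49),
    "mistral-small-3-1-24b-instruct-2503": (50, 50, 45, 50),
    "llama-3-2-90b-vision-instruct": (50, 50, 40, 50),
    "mistral-large-2512": (50, 50, 0, 50),
    "granite-3-1-8b-base": (0, 5, 0, 0),
}

statuses = {model: classify(*counts) for model, counts in samples.items()}
for model, status in statuses.items():
    print(f"{model}: {status}")
```

The Tool Success percentage in the table is the tool-call rate alone (for example 49/50 = 98.0% for mistral-medium-2505). That is why a model can show 100.0% tool success and still be listed as Unreliable or Partial.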

❌ Models Without Tool Support (16 models)

These models do not support tool calling and are listed here for reference.

cross-encoder/ms-marco-minilm-l-12-v2: Failed probe iterations (0/5 successes)
ibm/granite-3-1-8b-base: Failed probe iterations (0/5 successes)
ibm/granite-3-3-8b-instruct-np: Failed probe iterations (0/5 successes)
ibm/granite-8b-code-instruct: Failed probe iterations (0/5 successes)
ibm/granite-embedding-278m-multilingual: Failed probe iterations (0/5 successes)
ibm/granite-guardian-3-8b: Failed probe iterations (0/5 successes)
ibm/granite-ttm-1024-96-r2: Failed probe iterations (0/5 successes)
ibm/granite-ttm-1536-96-r2: Failed probe iterations (0/5 successes)
ibm/granite-ttm-512-96-r2: Failed probe iterations (0/5 successes)
ibm/slate-125m-english-rtrvr-v2: Failed probe iterations (0/5 successes)
ibm/slate-30m-english-rtrvr-v2: Failed probe iterations (0/5 successes)
intfloat/multilingual-e5-large: Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-70b-gptq: Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-8b: Failed probe iterations (0/5 successes)
meta-llama/llama-guard-3-11b-vision: Failed probe iterations (0/5 successes)
sentence-transformers/all-minilm-l6-v2: Failed probe iterations (0/5 successes)