WatsonX Tool Test Report

Comprehensive analysis of AI model tool calling capabilities

Generated: 2026-05-19 08:35:21 UTC
Iterations: 50 | Client: watsonx | Models Tested: 26

📊 Summary Statistics

Total Models

26
Models tested

Tool Support

10
Models with tool calling (38.5%)

Response Handling

9
Models using tool results (90.0%)

Reliability

9/10
Models with ≥90% success rate (90.0%)

Fastest Model

0.79s
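mistralai/mistral-small-3-1-24b-instruct-2503 (fastest among the 9 models with full support; mistral-large-2512 recorded 0.40s but has only partial support)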

Average Response Time

8.09s
Mean total time (call + response) across the 10 models with tool support
Full Support: 9
⚠️ Partial Support: 1
No Support: 16
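
The headline percentages and the average time can be re-derived from figures that already appear in this report. The snippet below is a minimal sketch of that arithmetic, not the report generator's code: the variable names are illustrative, and the per-model totals are copied from the Total Time column of the tool-support table. It assumes the reported average is the plain mean of per-model total times across the 10 tool-capable models, which matches the 8.09s shown.

```python
from statistics import median

# Figures copied from this report; names are illustrative, not from the test harness.
models_tested = 26
models_with_tools = 10
models_handling_results = 9

# Total Time (seconds) for the 10 tool-capable models, in table order.
total_times = [70.73, 0.85, 0.89, 1.77, 2.27, 0.81, 0.40, 0.93, 0.79, 1.43]


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as shown in the summary."""
    return round(100 * part / whole, 1)


tool_support_pct = pct(models_with_tools, models_tested)
response_handling_pct = pct(models_handling_results, models_with_tools)
mean_total_time = round(sum(total_times) / len(total_times), 2)
median_total_time = round(median(total_times), 2)

print(f"Tool support: {tool_support_pct}%")            # 38.5%
print(f"Response handling: {response_handling_pct}%")  # 90.0%
print(f"Mean total time: {mean_total_time}s")          # 8.09s
print(f"Median total time: {median_total_time}s")      # 0.91s
```

The mean is dominated by ibm/granite-3-8b-instruct at 70.73s; the median of the same ten values is 0.91s, which is closer to what most of the tool-capable models delivered.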

Model Performance History (Last 30 Days)

Legend: Working Reliably, Tool Calls Only, Inconsistent Results, Previously Worked, Not Tested
Dates: 04/20, 04/23, 04/26, 04/29, 05/02, 05/05, 05/08, 05/11, 05/14, 05/17
Models: gpt-oss-120b, granite-3-2-8b-instruct, granite-3-2b-instruct, granite-3-3-8b-instruct, granite-3-8b-instruct, granite-4-h-small, llama-3-2-11b-vision-instruct, llama-3-2-3b-instruct, llama-3-2-90b-vision-instruct, llama-3-3-70b-instruct, llama-3-405b-instruct, llama-4-maverick-17b-128e-instruct-fp8, mistral-large, mistral-large-2512, mistral-medium-2505, mistral-small-3-1-24b-instruct-2503
[Status grid: per-date cells not recoverable]

🔍 Models with Tool Support

Model | Tool Support | Response Handling | Call Time | Response Time | Total Time | Details
ibm/granite-3-8b-instruct | ✅ Reliable (50/50) | ⚠️ Partial (48/50) | 35.94s | 34.78s | 70.73s | Consistent success across all iterations
ibm/granite-4-h-small | ✅ Reliable (50/50) | ⚠️ Partial (47/50) | 0.49s | 0.36s | 0.85s | Consistent success across all iterations
meta-llama/llama-3-2-11b-vision-instruct | ⚠️ Unreliable (45/50) | ✅ Correct (45/45) | 0.57s | 0.32s | 0.89s | Consistent success across all iterations
meta-llama/llama-3-2-90b-vision-instruct | ✅ Reliable (50/50) | ⚠️ Partial (47/50) | 1.20s | 0.58s | 1.77s | Consistent success across all iterations
meta-llama/llama-3-3-70b-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 1.21s | 1.06s | 2.27s | Consistent success across all iterations
meta-llama/llama-4-maverick-17b-128e-instruct-fp8 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.46s | 0.35s | 0.81s | Consistent success across all iterations
mistral-large-2512 | ✅ Reliable (50/50) | ❌ Never Handles (0/50) | 0.32s | 0.08s | 0.40s | Inconsistent results across 50 iterations
mistralai/mistral-medium-2505 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.58s | 0.36s | 0.93s | Consistent success across all iterations
mistralai/mistral-small-3-1-24b-instruct-2503 | ⚠️ Unreliable (49/50) | ⚠️ Partial (47/49) | 0.48s | 0.31s | 0.79s | Consistent success across all iterations
openai/gpt-oss-120b | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.68s | 0.75s | 1.43s | Consistent success across all iterations
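
The Tool Support and Response Handling labels in the table above appear to follow simple count-based rules. The sketch below is one reading that reproduces every row shown; the thresholds are inferred from the table rather than taken from the test harness, and the function name `classify` is hypothetical.

```python
def classify(tool_calls_ok: int, results_handled: int, iterations: int = 50) -> tuple[str, str]:
    """Return (tool support label, response handling label) for one model.

    Assumed rules, inferred from the table rows and not confirmed by the
    harness: "Reliable" requires a tool call on every iteration, and
    "Correct" requires every successful tool call to be followed by a
    response that uses the tool result.
    """
    tool_label = "Reliable" if tool_calls_ok == iterations else "Unreliable"

    if results_handled == 0:
        handling_label = "Never Handles"
    elif results_handled == tool_calls_ok:
        handling_label = "Correct"
    else:
        handling_label = "Partial"

    return tool_label, handling_label


# (successful tool calls, responses that used the tool result), copied from the table.
rows = {
    "ibm/granite-3-8b-instruct": (50, 48),
    "meta-llama/llama-3-2-11b-vision-instruct": (45, 45),
    "mistral-large-2512": (50, 0),
    "mistralai/mistral-small-3-1-24b-instruct-2503": (49, 47),
}

for model, (calls_ok, handled) in rows.items():
    print(model, classify(calls_ok, handled))
```

Under this reading, a model such as meta-llama/llama-3-2-11b-vision-instruct can be "Unreliable" at calling tools yet "Correct" at handling results, because handling is scored only against the iterations where a tool call actually happened (45/45).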

📋 Latest Test Results

Results from the most recent test execution (2026-05-19)

Date | Model | Status | Time | Tool Success | Details
2026-05-19 | gpt-oss-120b | Working | 1.427s | 100.0% | Consistent success across all iterations
2026-05-19 | granite-3-8b-instruct | Working | 70.733s | 100.0% | Consistent success across all iterations
2026-05-19 | granite-4-h-small | Working | 0.852s | 100.0% | Consistent success across all iterations
2026-05-19 | llama-3-2-11b-vision-instruct | Working | 0.885s | 90.0% | Consistent success across all iterations
2026-05-19 | llama-3-2-90b-vision-instruct | Working | 1.774s | 100.0% | Consistent success across all iterations
2026-05-19 | llama-3-3-70b-instruct | Working | 2.271s | 100.0% | Consistent success across all iterations
2026-05-19 | llama-4-maverick-17b-128e-instruct-fp8 | Working | 0.810s | 100.0% | Consistent success across all iterations
2026-05-19 | mistral-medium-2505 | Working | 0.933s | 100.0% | Consistent success across all iterations
2026-05-19 | mistral-small-3-1-24b-instruct-2503 | Working | 0.793s | 98.0% | Consistent success across all iterations
2026-05-19 | mistral-large-2512 | Partial | 0.401s | 100.0% | Inconsistent results across 50 iterations
2026-05-19 | all-minilm-l6-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-3-1-8b-base | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-3-3-8b-instruct-np | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-8b-code-instruct | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-embedding-278m-multilingual | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-guardian-3-8b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-ttm-1024-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-ttm-1536-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | granite-ttm-512-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | llama-3-1-70b-gptq | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | llama-3-1-8b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | llama-guard-3-11b-vision | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | ms-marco-minilm-l-12-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | multilingual-e5-large | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | slate-125m-english-rtrvr-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2026-05-19 | slate-30m-english-rtrvr-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)

❌ Models Without Tool Support (16 models)

These models do not support tool calling and are listed here for reference.

Model | Details
cross-encoder/ms-marco-minilm-l-12-v2 | Failed probe iterations (0/5 successes)
ibm/granite-3-1-8b-base | Failed probe iterations (0/5 successes)
ibm/granite-3-3-8b-instruct-np | Failed probe iterations (0/5 successes)
ibm/granite-8b-code-instruct | Failed probe iterations (0/5 successes)
ibm/granite-embedding-278m-multilingual | Failed probe iterations (0/5 successes)
ibm/granite-guardian-3-8b | Failed probe iterations (0/5 successes)
ibm/granite-ttm-1024-96-r2 | Failed probe iterations (0/5 successes)
ibm/granite-ttm-1536-96-r2 | Failed probe iterations (0/5 successes)
ibm/granite-ttm-512-96-r2 | Failed probe iterations (0/5 successes)
ibm/slate-125m-english-rtrvr-v2 | Failed probe iterations (0/5 successes)
ibm/slate-30m-english-rtrvr-v2 | Failed probe iterations (0/5 successes)
intfloat/multilingual-e5-large | Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-70b-gptq | Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-8b | Failed probe iterations (0/5 successes)
meta-llama/llama-guard-3-11b-vision | Failed probe iterations (0/5 successes)
sentence-transformers/all-minilm-l6-v2 | Failed probe iterations (0/5 successes)
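
The "Failed probe iterations (0/5 successes)" note suggests that each model is first probed with a small number of tool-call attempts, and that the full 50-iteration run is skipped when none succeed. The sketch below illustrates that gating pattern under those assumptions. The function `run_with_probe`, the callable `attempt_tool_call`, and the returned field names are all hypothetical; the report does not show the harness itself, nor whether probe attempts count toward the 50 iterations.

```python
from typing import Callable

PROBE_ITERATIONS = 5   # from "0/5 successes" in the report
FULL_ITERATIONS = 50   # from "Iterations: 50" in the report


def run_with_probe(attempt_tool_call: Callable[[], bool]) -> dict:
    """Probe a model a few times before committing to the full run.

    `attempt_tool_call` is a hypothetical callable that returns True when the
    model produced a valid tool call. Skipping the full run when no probe
    succeeds is an assumption, not confirmed harness behaviour.
    """
    probe_successes = sum(attempt_tool_call() for _ in range(PROBE_ITERATIONS))
    if probe_successes == 0:
        return {
            "status": "Not Supported",
            "tool_success_pct": 0.0,
            "details": f"Failed probe iterations (0/{PROBE_ITERATIONS} successes)",
        }

    # Assumed here: the full run starts fresh and does not reuse probe attempts.
    successes = sum(attempt_tool_call() for _ in range(FULL_ITERATIONS))
    return {
        "status": "Tested",
        "tool_success_pct": round(100 * successes / FULL_ITERATIONS, 1),
        "details": f"{successes}/{FULL_ITERATIONS} tool calls succeeded",
    }


# Stand-ins for real model calls: one never emits a tool call, one always does.
print(run_with_probe(lambda: False))
print(run_with_probe(lambda: True))
```

A gate like this would explain the 0.000s timings in the latest results: embedding, reranker, time-series, guard, and base models never reach the timed run, so the harness avoids spending 50 iterations on models that cannot emit tool calls.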