WatsonX Tool Test Report

Comprehensive analysis of AI model tool calling capabilities

Generated: 2025-10-22 06:25:39 UTC
Iterations: 50
Client: watsonx
Models Tested: 33

📊 Summary Statistics

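Total Models: 33 models tested
Tool Support: 13 models with tool calling (39.4%)
Response Handling: 13 models using tool results (100.0%)
Reliability: 10/13 models with ≥90% success rate (76.9%)
Fastest Model: 0.70s, meta-llama/llama-3-2-11b-vision-instruct (fastest among models rated Working; ibm/granite-3-3-8b-instruct averaged 0.51s but is rated Unreliable)
Average Response Time: 1.17s (mean total processing time across the 13 models with tool support)
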
Full Support: 10
⚠️ Partial Support: 3
No Support: 20

Model Performance History (Last 30 Days)

Legend: Working Reliably, Tool Calls Only, Inconsistent Results, Previously Worked, Not Tested
Test dates: 09/23, 09/26, 09/29, 10/02, 10/05, 10/08, 10/11, 10/14, 10/17, 10/20
Models tracked: gpt-oss-120b, granite-3-2-8b-instruct, granite-3-2b-instruct, granite-3-3-8b-instruct, granite-3-8b-instruct, granite-4-h-small, llama-3-2-11b-vision-instruct, llama-3-2-3b-instruct, llama-3-2-90b-vision-instruct, llama-3-3-70b-instruct, llama-3-405b-instruct, llama-4-maverick-17b-128e-instruct-fp8, mistral-large, mistral-medium-2505, mistral-small-3-1-24b-instruct-2503
[Status grid: per-date status markers not available in this text version]

🔍 Models with Tool Support

Model | Tool Support | Response Handling | Call Time | Response Time | Total Time | Details
ibm/granite-3-2-8b-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.54s | 0.26s | 0.80s | Consistent success across all iterations
ibm/granite-3-2b-instruct | ⚠️ Unreliable (49/50) | ✅ Correct (49/49) | 0.45s | 0.30s | 0.75s | Consistent success across all iterations
ibm/granite-3-3-8b-instruct | ⚠️ Unreliable (22/50) | ⚠️ Partial (21/22) | 0.41s | 0.10s | 0.51s | Inconsistent results across 50 iterations
ibm/granite-3-8b-instruct | ✅ Reliable (50/50) | ⚠️ Partial (47/50) | 0.43s | 0.52s | 0.95s | Consistent success across all iterations
ibm/granite-4-h-small | ✅ Reliable (50/50) | ⚠️ Partial (49/50) | 0.81s | 0.53s | 1.34s | Consistent success across all iterations
meta-llama/llama-3-2-11b-vision-instruct | ⚠️ Unreliable (48/50) | ✅ Correct (48/48) | 0.44s | 0.26s | 0.70s | Consistent success across all iterations
meta-llama/llama-3-2-90b-vision-instruct | ✅ Reliable (50/50) | ⚠️ Partial (38/50) | 1.04s | 0.52s | 1.56s | Inconsistent results across 50 iterations
meta-llama/llama-3-3-70b-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.73s | 0.92s | 1.65s | Consistent success across all iterations
meta-llama/llama-3-405b-instruct | ⚠️ Unreliable (49/50) | ✅ Correct (49/49) | 0.90s | 0.35s | 1.25s | Consistent success across all iterations
meta-llama/llama-4-maverick-17b-128e-instruct-fp8 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.54s | 0.61s | 1.15s | Consistent success across all iterations
mistralai/mistral-medium-2505 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 1.21s | 0.33s | 1.54s | Consistent success across all iterations
mistralai/mistral-small-3-1-24b-instruct-2503 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.45s | 0.27s | 0.72s | Consistent success across all iterations
openai/gpt-oss-120b | ⚠️ Unreliable (43/50) | ✅ Correct (43/43) | 1.41s | 0.88s | 2.30s | Inconsistent results across 50 iterations
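
The summary figures at the top of the report can be re-derived from the table above. The following Python sketch does so under one assumption that the report does not state explicitly: a model counts toward the "≥90% success rate" group when its correctly handled responses divided by the 50 iterations is at least 0.90. The names `ROWS` and `summarize` are illustrative and not part of the test tool.

```python
from statistics import mean

ITERATIONS = 50      # iterations per model, from the report header
TOTAL_MODELS = 33    # models tested, from the report header

# (model, successful tool calls, correct responses, total time in seconds),
# copied from the "Models with Tool Support" table.
ROWS = [
    ("ibm/granite-3-2-8b-instruct", 50, 50, 0.80),
    ("ibm/granite-3-2b-instruct", 49, 49, 0.75),
    ("ibm/granite-3-3-8b-instruct", 22, 21, 0.51),
    ("ibm/granite-3-8b-instruct", 50, 47, 0.95),
    ("ibm/granite-4-h-small", 50, 49, 1.34),
    ("meta-llama/llama-3-2-11b-vision-instruct", 48, 48, 0.70),
    ("meta-llama/llama-3-2-90b-vision-instruct", 50, 38, 1.56),
    ("meta-llama/llama-3-3-70b-instruct", 50, 50, 1.65),
    ("meta-llama/llama-3-405b-instruct", 49, 49, 1.25),
    ("meta-llama/llama-4-maverick-17b-128e-instruct-fp8", 50, 50, 1.15),
    ("mistralai/mistral-medium-2505", 50, 50, 1.54),
    ("mistralai/mistral-small-3-1-24b-instruct-2503", 50, 50, 0.72),
    ("openai/gpt-oss-120b", 43, 43, 2.30),
]


def summarize(rows, total_models=TOTAL_MODELS, iterations=ITERATIONS, threshold=0.90):
    """Recompute the headline statistics from per-model rows."""
    # Assumption: end-to-end success = correct responses / iterations.
    reliable = [r for r in rows if r[2] / iterations >= threshold]
    # Assumption: "Fastest Model" is picked among the reliable models only.
    fastest = min(reliable, key=lambda r: r[3])
    return {
        "tool_support_pct": round(100 * len(rows) / total_models, 1),
        "reliable": len(reliable),
        "reliable_pct": round(100 * len(reliable) / len(rows), 1),
        "avg_total_time_s": round(mean(r[3] for r in rows), 2),
        "fastest": fastest[0],
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Under these assumptions the sketch reproduces the reported 39.4%, 10/13 (76.9%), 1.17s, and the 0.70s fastest model, which suggests the headline numbers are consistent with the per-model rows.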

📋 Latest Test Results

Results from the most recent test execution (2025-10-22)

Date | Model | Status | Time | Tool Success | Details
2025-10-22 | granite-3-2-8b-instruct | Working | 0.802s | 100.0% | Consistent success across all iterations
2025-10-22 | granite-3-2b-instruct | Working | 0.754s | 98.0% | Consistent success across all iterations
2025-10-22 | granite-3-8b-instruct | Working | 0.951s | 100.0% | Consistent success across all iterations
2025-10-22 | granite-4-h-small | Working | 1.340s | 100.0% | Consistent success across all iterations
2025-10-22 | llama-3-2-11b-vision-instruct | Working | 0.698s | 96.0% | Consistent success across all iterations
2025-10-22 | llama-3-3-70b-instruct | Working | 1.651s | 100.0% | Consistent success across all iterations
2025-10-22 | llama-3-405b-instruct | Working | 1.252s | 98.0% | Consistent success across all iterations
2025-10-22 | llama-4-maverick-17b-128e-instruct-fp8 | Working | 1.152s | 100.0% | Consistent success across all iterations
2025-10-22 | mistral-medium-2505 | Working | 1.543s | 100.0% | Consistent success across all iterations
2025-10-22 | mistral-small-3-1-24b-instruct-2503 | Working | 0.722s | 100.0% | Consistent success across all iterations
2025-10-22 | gpt-oss-120b | Unreliable | 2.298s | 86.0% | Inconsistent results across 50 iterations
2025-10-22 | granite-3-3-8b-instruct | Unreliable | 0.510s | 44.0% | Inconsistent results across 50 iterations
2025-10-22 | llama-3-2-90b-vision-instruct | Unreliable | 1.560s | 100.0% | Inconsistent results across 50 iterations
2025-10-22 | all-minilm-l6-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-3-1-8b-base | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-3-3-8b-instruct-np | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-8b-code-instruct | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-embedding-107m-multilingual | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-embedding-278m-multilingual | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-guardian-3-8b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-ttm-1024-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-ttm-1536-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-ttm-512-96-r2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | granite-vision-3-2-2b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | llama-3-1-70b-gptq | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | llama-3-1-8b | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | llama-guard-3-11b-vision | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | ms-marco-minilm-l-12-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | multilingual-e5-large | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | slate-125m-english-rtrvr | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | slate-125m-english-rtrvr-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | slate-30m-english-rtrvr | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)
2025-10-22 | slate-30m-english-rtrvr-v2 | Not Supported | 0.000s | 0.0% | Failed probe iterations (0/5 successes)

❌ Models Without Tool Support (20 models)

These models produced no successful tool calls in any of their 5 probe iterations and are listed here for reference.

cross-encoder/ms-marco-minilm-l-12-v2 Failed probe iterations (0/5 successes)
ibm/granite-3-1-8b-base Failed probe iterations (0/5 successes)
ibm/granite-3-3-8b-instruct-np Failed probe iterations (0/5 successes)
ibm/granite-8b-code-instruct Failed probe iterations (0/5 successes)
ibm/granite-embedding-107m-multilingual Failed probe iterations (0/5 successes)
ibm/granite-embedding-278m-multilingual Failed probe iterations (0/5 successes)
ibm/granite-guardian-3-8b Failed probe iterations (0/5 successes)
ibm/granite-ttm-1024-96-r2 Failed probe iterations (0/5 successes)
ibm/granite-ttm-1536-96-r2 Failed probe iterations (0/5 successes)
ibm/granite-ttm-512-96-r2 Failed probe iterations (0/5 successes)
ibm/granite-vision-3-2-2b Failed probe iterations (0/5 successes)
ibm/slate-125m-english-rtrvr Failed probe iterations (0/5 successes)
ibm/slate-125m-english-rtrvr-v2 Failed probe iterations (0/5 successes)
ibm/slate-30m-english-rtrvr Failed probe iterations (0/5 successes)
ibm/slate-30m-english-rtrvr-v2 Failed probe iterations (0/5 successes)
intfloat/multilingual-e5-large Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-70b-gptq Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-8b Failed probe iterations (0/5 successes)
meta-llama/llama-guard-3-11b-vision Failed probe iterations (0/5 successes)
sentence-transformers/all-minilm-l6-v2 Failed probe iterations (0/5 successes)
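
The "Failed probe iterations (0/5 successes)" entries, together with the 0.000s timings, suggest that each model is first probed with a handful of requests and only run for the full 50 iterations if at least one probe succeeds. The report does not describe the harness, so the Python sketch below is an assumption about how such a probe-then-run classification could look. `classify_model` and `run_once` are hypothetical names; `run_once` stands in for a single tool-calling round trip that returns True on success.

```python
from typing import Callable, Tuple


def classify_model(
    run_once: Callable[[], bool],
    probe_iterations: int = 5,    # matches the "0/5" in the report
    full_iterations: int = 50,    # matches "Iterations: 50"
    threshold: float = 0.90,      # matches the "≥90% success rate" cutoff
) -> Tuple[str, float]:
    """Return (status, success_rate) for one model.

    Assumption: probe iterations are separate from the full run and a model
    with zero probe successes is skipped without further requests.
    """
    probe_successes = sum(1 for _ in range(probe_iterations) if run_once())
    if probe_successes == 0:
        return "Not Supported", 0.0

    successes = sum(1 for _ in range(full_iterations) if run_once())
    rate = successes / full_iterations
    status = "Working" if rate >= threshold else "Unreliable"
    return status, rate


if __name__ == "__main__":
    # A model that never returns a tool call is rejected after the probe.
    print(classify_model(lambda: False))
    # A model that always succeeds is rated Working.
    print(classify_model(lambda: True))
```

Probing first keeps the run short for embedding, reranking, and other models that cannot emit tool calls at all, at the cost of a few extra requests for models that can.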