WatsonX Tool Test Report

Comprehensive analysis of AI model tool calling capabilities

Generated: 2025-08-02 06:22:34 UTC
Iterations: 50
Client: watsonx
Models Tested: 38

📊 Summary Statistics

Total Models: 38 (models tested)
Tool Support: 13 models with tool calling (34.2%)
Response Handling: 13 models using tool results (100.0%)
Reliability: 5/13 models with 100% consistency (38.5%)
Fastest Model: 0.73s (meta-llama/llama-3-2-11b-vision-instruct)
Average Response Time: 1.09s (total processing time)

Full Support: 5
⚠️ Partial Support: 8
No Support: 25
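
The support buckets and percentages above can be reproduced from per-model counts. The following is a minimal sketch, not the report generator's actual code: the `ModelResult` record, its field names, and the classification rule are assumptions inferred from the counts shown in this report.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Hypothetical per-model record; field names are illustrative."""
    name: str
    iterations: int      # iterations attempted
    tool_calls_ok: int   # iterations in which the model issued a tool call
    responses_ok: int    # iterations in which the tool result was handled correctly


def classify(r: ModelResult) -> str:
    # Assumed rule: "full" requires every iteration to succeed at both stages.
    if r.tool_calls_ok == 0:
        return "none"
    if r.tool_calls_ok == r.iterations and r.responses_ok == r.tool_calls_ok:
        return "full"
    return "partial"


def pct(part: int, whole: int) -> float:
    # Percentage rounded to one decimal place, as displayed in the summary.
    return round(100 * part / whole, 1)


examples = [
    ModelResult("meta-llama/llama-3-3-70b-instruct", 50, 50, 50),
    ModelResult("ibm/granite-3-3-8b-instruct", 50, 25, 25),
    ModelResult("google/flan-t5-xl", 5, 0, 0),
]
for r in examples:
    print(r.name, classify(r))   # full, partial, none

print(pct(13, 38), pct(5, 13))   # 34.2 38.5
```

Under this rule, 5 of the 13 tool-capable models fall into the "full" bucket and 8 into "partial", which matches the counts in the summary.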

Model Performance History (Last 30 Days)

[Status grid: models (rows) by test date (columns)]
Legend: Working Reliably, Tool Calls Only, Inconsistent Results, Previously Worked, Not Tested
Dates: 07/04, 07/07, 07/10, 07/13, 07/16, 07/19, 07/22, 07/25, 07/28, 07/31
Models: granite-3-2-8b-instruct, granite-3-2b-instruct, granite-3-3-8b-instruct, granite-3-8b-instruct, llama-3-2-11b-vision-instruct, llama-3-2-3b-instruct, llama-3-2-90b-vision-instruct, llama-3-3-70b-instruct, llama-3-405b-instruct, llama-4-maverick-17b-128e-instruct-fp8, mistral-large, mistral-medium-2505, mistral-small-3-1-24b-instruct-2503

🔍 Models with Tool Support

| Model | Tool Support | Response Handling | Call Time | Response Time | Total Time | Details |
| --- | --- | --- | --- | --- | --- | --- |
| ibm/granite-3-2-8b-instruct | ✅ Reliable (50/50) | ⚠️ Partial (49/50) | 0.46s | 0.26s | 0.72s | Inconsistent results across 50 iterations |
| ibm/granite-3-2b-instruct | ⚠️ Unreliable (49/50) | ⚠️ Partial (47/49) | 0.49s | 0.34s | 0.82s | Inconsistent results across 50 iterations |
| ibm/granite-3-3-8b-instruct | ⚠️ Unreliable (25/50) | ✅ Correct (25/25) | 0.55s | 0.28s | 0.83s | Inconsistent results across 50 iterations |
| ibm/granite-3-8b-instruct | ✅ Reliable (50/50) | ⚠️ Partial (49/50) | 0.54s | 0.47s | 1.01s | Inconsistent results across 50 iterations |
| meta-llama/llama-3-2-11b-vision-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.46s | 0.27s | 0.73s | Consistent success across all iterations |
| meta-llama/llama-3-2-3b-instruct | ✅ Reliable (50/50) | ⚠️ Partial (28/50) | 0.45s | 0.96s | 1.41s | Inconsistent results across 50 iterations |
| meta-llama/llama-3-2-90b-vision-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 1.10s | 0.69s | 1.79s | Consistent success across all iterations |
| meta-llama/llama-3-3-70b-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.89s | 0.88s | 1.77s | Consistent success across all iterations |
| meta-llama/llama-3-405b-instruct | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.91s | 0.37s | 1.28s | Consistent success across all iterations |
| meta-llama/llama-4-maverick-17b-128e-instruct-fp8 | ✅ Reliable (50/50) | ✅ Correct (50/50) | 0.46s | 0.35s | 0.81s | Consistent success across all iterations |
| mistralai/mistral-large | ⚠️ Unreliable (46/50) | ⚠️ Partial (45/46) | 0.78s | 0.40s | 1.17s | Inconsistent results across 50 iterations |
| mistralai/mistral-medium-2505 | ✅ Reliable (50/50) | ⚠️ Partial (48/50) | 0.65s | 0.37s | 1.02s | Inconsistent results across 50 iterations |
| mistralai/mistral-small-3-1-24b-instruct-2503 | ✅ Reliable (50/50) | ⚠️ Partial (46/50) | 0.46s | 0.30s | 0.76s | Inconsistent results across 50 iterations |
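
The summary's average and fastest-model figures can be cross-checked against the Total Time column above. This is a rough sketch using the rounded values from the table; the `ROWS` layout and helper names are illustrative. One observation it surfaces: ibm/granite-3-2-8b-instruct has the lowest total time (0.72s), so the reported fastest model (0.73s) appears to be chosen among fully consistent models only. That selection rule is an assumption, not something the report states.

```python
from statistics import mean

# (model, total_time_s, fully_consistent) copied from the table above.
ROWS = [
    ("ibm/granite-3-2-8b-instruct", 0.72, False),
    ("ibm/granite-3-2b-instruct", 0.82, False),
    ("ibm/granite-3-3-8b-instruct", 0.83, False),
    ("ibm/granite-3-8b-instruct", 1.01, False),
    ("meta-llama/llama-3-2-11b-vision-instruct", 0.73, True),
    ("meta-llama/llama-3-2-3b-instruct", 1.41, False),
    ("meta-llama/llama-3-2-90b-vision-instruct", 1.79, True),
    ("meta-llama/llama-3-3-70b-instruct", 1.77, True),
    ("meta-llama/llama-3-405b-instruct", 1.28, True),
    ("meta-llama/llama-4-maverick-17b-128e-instruct-fp8", 0.81, True),
    ("mistralai/mistral-large", 1.17, False),
    ("mistralai/mistral-medium-2505", 1.02, False),
    ("mistralai/mistral-small-3-1-24b-instruct-2503", 0.76, False),
]


def average_total(rows) -> float:
    # Mean total time across the tool-capable models, rounded as displayed.
    return round(mean(total for _, total, _ in rows), 2)


def fastest(rows, only_consistent: bool = True):
    # Assumed rule: restrict the pool to fully consistent models by default.
    pool = [r for r in rows if r[2]] if only_consistent else list(rows)
    return min(pool, key=lambda r: r[1])


print(average_total(ROWS))                      # 1.09
print(fastest(ROWS)[0])                         # meta-llama/llama-3-2-11b-vision-instruct
print(fastest(ROWS, only_consistent=False)[0])  # ibm/granite-3-2-8b-instruct
```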

📋 Latest Test Results

Results from the most recent test execution (2025-08-02)

| Date | Model | Status | Time | Tool Success | Details |
| --- | --- | --- | --- | --- | --- |
| 2025-08-02 | llama-3-2-11b-vision-instruct | Working | 0.732s | 100.0% | Consistent success across all iterations |
| 2025-08-02 | llama-3-2-90b-vision-instruct | Working | 1.790s | 100.0% | Consistent success across all iterations |
| 2025-08-02 | llama-3-3-70b-instruct | Working | 1.772s | 100.0% | Consistent success across all iterations |
| 2025-08-02 | llama-3-405b-instruct | Working | 1.285s | 100.0% | Consistent success across all iterations |
| 2025-08-02 | llama-4-maverick-17b-128e-instruct-fp8 | Working | 0.809s | 100.0% | Consistent success across all iterations |
| 2025-08-02 | granite-3-2-8b-instruct | Unreliable | 0.721s | 100.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | granite-3-2b-instruct | Unreliable | 0.823s | 98.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | granite-3-3-8b-instruct | Unreliable | 0.830s | 50.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | granite-3-8b-instruct | Unreliable | 1.007s | 100.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | llama-3-2-3b-instruct | Unreliable | 1.414s | 100.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | mistral-large | Unreliable | 1.174s | 92.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | mistral-medium-2505 | Unreliable | 1.021s | 100.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | mistral-small-3-1-24b-instruct-2503 | Unreliable | 0.764s | 100.0% | Inconsistent results across 50 iterations |
| 2025-08-02 | all-minilm-l12-v2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | all-minilm-l6-v2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | flan-t5-xl | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-13b-instruct-v2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-3-1-8b-base | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-8b-code-instruct | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-embedding-107m-multilingual | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-embedding-278m-multilingual | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-guardian-3-2b | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-guardian-3-8b | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-ttm-1024-96-r2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-ttm-1536-96-r2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-ttm-512-96-r2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | granite-vision-3-2-2b | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | llama-2-13b-chat | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | llama-3-1-8b | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | llama-3-2-1b-instruct | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | llama-guard-3-11b-vision | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | ms-marco-minilm-l-12-v2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | multilingual-e5-large | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | pixtral-12b | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | slate-125m-english-rtrvr | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | slate-125m-english-rtrvr-v2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | slate-30m-english-rtrvr | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |
| 2025-08-02 | slate-30m-english-rtrvr-v2 | Broken | 0.000s | 0.0% | Failed probe iterations (0/5 successes) |

❌ Models Without Tool Support (25 models)

These models do not support tool calling and are listed here for reference.

cross-encoder/ms-marco-minilm-l-12-v2 Failed probe iterations (0/5 successes)
google/flan-t5-xl Failed probe iterations (0/5 successes)
ibm/granite-13b-instruct-v2 Failed probe iterations (0/5 successes)
ibm/granite-3-1-8b-base Failed probe iterations (0/5 successes)
ibm/granite-8b-code-instruct Failed probe iterations (0/5 successes)
ibm/granite-embedding-107m-multilingual Failed probe iterations (0/5 successes)
ibm/granite-embedding-278m-multilingual Failed probe iterations (0/5 successes)
ibm/granite-guardian-3-2b Failed probe iterations (0/5 successes)
ibm/granite-guardian-3-8b Failed probe iterations (0/5 successes)
ibm/granite-ttm-1024-96-r2 Failed probe iterations (0/5 successes)
ibm/granite-ttm-1536-96-r2 Failed probe iterations (0/5 successes)
ibm/granite-ttm-512-96-r2 Failed probe iterations (0/5 successes)
ibm/granite-vision-3-2-2b Failed probe iterations (0/5 successes)
ibm/slate-125m-english-rtrvr Failed probe iterations (0/5 successes)
ibm/slate-125m-english-rtrvr-v2 Failed probe iterations (0/5 successes)
ibm/slate-30m-english-rtrvr Failed probe iterations (0/5 successes)
ibm/slate-30m-english-rtrvr-v2 Failed probe iterations (0/5 successes)
intfloat/multilingual-e5-large Failed probe iterations (0/5 successes)
meta-llama/llama-2-13b-chat Failed probe iterations (0/5 successes)
meta-llama/llama-3-1-8b Failed probe iterations (0/5 successes)
meta-llama/llama-3-2-1b-instruct Failed probe iterations (0/5 successes)
meta-llama/llama-guard-3-11b-vision Failed probe iterations (0/5 successes)
mistralai/pixtral-12b Failed probe iterations (0/5 successes)
sentence-transformers/all-minilm-l12-v2 Failed probe iterations (0/5 successes)
sentence-transformers/all-minilm-l6-v2 Failed probe iterations (0/5 successes)
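
The "Failed probe iterations (0/5 successes)" detail suggests that each model gets a short probe before the full 50-iteration run, and that models with no probe successes are recorded as Broken without further testing. The sketch below illustrates that gating under those assumptions. The `probe` function, its parameters, and the record layout are hypothetical and are not taken from the test harness.

```python
from typing import Callable, Optional


def probe(model_call: Callable[[], bool], probe_n: int = 5) -> Optional[dict]:
    """Run a short probe; return a 'Broken' record if every attempt fails.

    model_call is a hypothetical zero-argument callable that returns True
    when the model produced a tool call for the test prompt.
    """
    successes = sum(1 for _ in range(probe_n) if model_call())
    if successes == 0:
        # Assumed behavior: skip the full run and report zero time and success.
        return {
            "status": "Broken",
            "time_s": 0.0,
            "tool_success_pct": 0.0,
            "details": f"Failed probe iterations (0/{probe_n} successes)",
        }
    return None  # at least one success: proceed to the full iteration run


# Usage with stand-in callables instead of real model requests.
print(probe(lambda: False)["details"])  # Failed probe iterations (0/5 successes)
print(probe(lambda: True))              # None
```

Gating on a probe keeps embedding, reranker, and time-series models from consuming a full run of requests they cannot satisfy.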