Applied AI · Model Evaluation

Pre-Trained Model
Trade-Off Matrix

A comparative analysis of six leading pre-trained models across NLP/Generative AI, Computer Vision, and Tabular Data — evaluating size, accuracy, speed, explainability, and energy cost to guide deployment decisions.

Author: Dayo Egonu
Program: Applied AI — IWU / AIML 501
Models Evaluated: 6 Models · 3 Domains · 7 Criteria

Why Trade-Offs Matter

Selecting a pre-trained model is rarely straightforward. Practitioners must weigh competing pressures — raw accuracy vs. inference latency, model size vs. deployment cost, and predictive power vs. explainability. These tensions become especially acute across domains, where constraints differ dramatically.

Explainability adds a further dimension: regulated industries — finance, healthcare, law — require models whose decisions can be audited. A model that achieves 96% accuracy while remaining a black box may be entirely unsuitable in a compliance-driven context. This report synthesizes benchmark data, deployment considerations, and ethical trade-offs to produce actionable recommendations for each domain.

⬡ NLP / Generative AI ⬡ Computer Vision ⬡ Tabular Data

How Models Were Classified

The classification process followed three deliberate stages. Models were first grouped by data modality (the primary organizing principle), then paired along scale and purpose axes within each domain, then rated across seven directional criteria.

01

Domain Classification

Models were grouped by data modality: language models, image classifiers, and tabular ensembles. Domain is the primary organizing principle because cross-domain accuracy benchmarks are incommensurable — MMLU scores, ImageNet Top-1, and AUC measure fundamentally different capabilities and cannot be directly compared.

02

Scale & Purpose Axis

Within each domain, models were classified along two axes: scale (large/frontier vs. lightweight/efficient) and purpose (generative/detection vs. discriminative/classification). This produced meaningful paired comparisons: GPT-4 vs. BERT-base, EfficientNet-B4 vs. MobileNetV3, and XGBoost vs. LightGBM.

03

Directional Criteria Rating

Each of the seven criteria is rated either directly (High = favorable: Accuracy, Inference Speed, Explainability, Integration, Benchmark Score) or inversely (Low = favorable: Model Size, Energy Cost). Ratings reflect performance relative to the cohort of six models, not against external absolute benchmarks.
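
To make the directional logic concrete, here is a minimal Python sketch of cohort-relative, direction-aware scoring. The criterion names and raw values are hypothetical placeholders; the matrix itself uses qualitative tiers rather than this exact routine.

```python
# Minimal sketch: cohort-relative, direction-aware scoring (hypothetical values only).
# The direction flags mirror the report's direct vs. inverse criteria.

RAW = {  # placeholder measurements, not the report's actual data
    "model_a": {"size_gb": 3.5, "accuracy": 0.91},
    "model_b": {"size_gb": 0.4, "accuracy": 0.82},
}

DIRECTION = {            # True = higher is favorable, False = lower is favorable
    "size_gb": False,    # Model Size is an inverse criterion
    "accuracy": True,    # Accuracy is a direct criterion
}

def cohort_scores(raw: dict, direction: dict) -> dict:
    """Min-max scale each criterion to 0-100 within the cohort, flipping inverse criteria."""
    scores = {m: {} for m in raw}
    for crit, higher_better in direction.items():
        values = [raw[m][crit] for m in raw]
        lo, hi = min(values), max(values)
        for m in raw:
            norm = 0.5 if hi == lo else (raw[m][crit] - lo) / (hi - lo)
            scores[m][crit] = round(100 * (norm if higher_better else 1 - norm), 1)
    return scores

print(cohort_scores(RAW, DIRECTION))
# {'model_a': {'size_gb': 0.0, 'accuracy': 100.0}, 'model_b': {'size_gb': 100.0, 'accuracy': 0.0}}
```

Flipping inverse criteria before scaling puts every column on the same "higher is better" footing, which is what allows a single favorable-to-unfavorable scale in the comparison table.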

Seven Evaluation Criteria

📐

Model Size

Parameter count & memory footprint

↓ Lower = Favorable
🎯

Accuracy

Domain-specific benchmark score

↑ Higher = Favorable

Inference Speed

Latency per unit of output

↑ Faster = Favorable
🔍

Explainability

Availability of interpretation tools

↑ Higher = Favorable
🌱

Energy Cost

Compute & environmental footprint

↓ Lower = Favorable
🔌

Integration

SDK, framework, deployment ease

↑ Higher = Favorable
📊

Benchmark Score

Quantitative reference by domain

↑ Higher = Favorable

Selected Models & Data Sources

Six models spanning three domains were selected for documented production adoption, published benchmark data (Hugging Face, MMLU, GLUE, ImageNet, MLPerf v3.1), and intentional pairing of frontier vs. lightweight alternatives within each domain.

GPT-4
OpenAI · 2023
NLP / Gen AI

Frontier large language model using Mixture-of-Experts architecture (~1.8T params). Excels at complex reasoning, code generation, and multi-modal inputs. Accessed via API.

SIZE
~1.8T params
ACCURACY
MMLU 86.4%
SPEED
~60ms/tok
BERT-base
Google · 2018
NLP

Bidirectional encoder-only transformer (110M params). Industry standard for classification, NER, and Q&A fine-tuning. Open-source on Hugging Face.

SIZE
110M params
ACCURACY
GLUE 79.6
SPEED
~5ms/seq
EfficientNet-B4
Google Brain · 2019
Vision

Compound-scaled CNN (19M params) jointly scaling depth, width, and resolution. Strong accuracy-per-parameter ratio for production image classifiers.

SIZE
19M params
ACCURACY
Top-1 82.9%
SPEED
~380ms/img
MobileNetV3
Google · 2019
Vision

Ultra-lightweight depthwise separable CNN (5.4M params) designed for mobile and edge devices via Neural Architecture Search.

SIZE
5.4M params
ACCURACY
Top-1 75.2%
SPEED
~19ms/img
XGBoost
DMLC · 2016
Tabular

Gradient-boosted decision trees with regularized learning. Native SHAP support for feature-level explainability. Kaggle benchmark champion for structured data.

SIZE
Config-defined
ACCURACY
AUC ≥ 0.97
SPEED
<1ms/row
LightGBM
Microsoft · 2017
Tabular

Histogram-based GBDT with leaf-wise growth. Trains 10–20× faster than XGBoost at scale while maintaining comparable AUC. SHAP compatible.

SIZE
Config-defined
ACCURACY
AUC ≥ 0.96
SPEED
<0.5ms/row

Model Comparison Table

Each model is scored across seven criteria. Color indicates performance tier: High / Favorable · Medium / Partial · Low / Unfavorable. Note: Size and Energy Cost rate inversely — Low = favorable.

Model | Domain | Model Size | Accuracy | Inf. Speed | Explainability | Energy Cost | Integration | Benchmark
GPT-4 (OpenAI) | NLP / Gen AI | Very Large | Very High | Low | Low — Black Box | Very High | High (API) | MMLU 86.4%
BERT-base (Google) | NLP | Medium | Moderate | Fast | Partial — Attn. Viz | Moderate | High (Hugging Face) | GLUE 79.6
EfficientNet-B4 (Google Brain) | Vision | Medium | High | Moderate | Partial — Grad-CAM | Low | High (TF / PyTorch) | Top-1 82.9%
MobileNetV3 (Google) | Vision | Very Small | Moderate | Very Fast | Partial — Grad-CAM | Very Low | Excellent (TF Lite) | Top-1 75.2%
XGBoost (DMLC) | Tabular | Flexible | High | Fast | High — Native SHAP | Very Low | Excellent (pip) | AUC ≥ 0.97
LightGBM (Microsoft) | Tabular | Flexible | High | Fastest | High — SHAP Compatible | Very Low | Excellent (pip) | AUC ≥ 0.96

Benchmark Breakdown

Normalized scores (0–100) grouped by domain. Cross-domain comparison is illustrative only — benchmarks differ by task and are not commensurable across rows.

⬡ NLP / Generative AI — Accuracy Score (normalized)

GPT-4
96 / 100
BERT-base
72 / 100

⬡ Computer Vision — ImageNet Top-1 Accuracy

EfficientNet-B4
82.9%
MobileNetV3
75.2%

⬡ Tabular Data — Inference Speed Score (latency inverted so that higher = faster)

LightGBM
98 / 100
XGBoost
90 / 100

⬡ Explainability Score — All Domains

XGBoost
95 — Native SHAP
LightGBM
90 — SHAP Compat.
EfficientNet-B4
45 — Grad-CAM
MobileNetV3
45 — Grad-CAM
BERT-base
35 — Attn. Viz
GPT-4
10 — Black Box

Strengths, Weaknesses & Use Cases

Each model occupies a distinct niche. The right choice depends on application constraints — not just peak accuracy.

🧠

GPT-4

Best for: Complex reasoning, content generation, and multi-modal applications.

  • Use when accuracy is non-negotiable and cost is secondary
  • Ideal for enterprise chatbots, legal/medical Q&A
  • Avoid for real-time, on-device, or budget-constrained deployments
  • Poor explainability rules it out for regulated industries
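
For API-based deployments of this kind, a minimal call sketch using the openai Python client is shown below. The model name, prompts, and client setup are illustrative, and the exact SDK interface depends on the client version.

```python
# Minimal sketch: calling a hosted GPT-4 endpoint via the openai Python client (v1-style API).
# Assumes OPENAI_API_KEY is set in the environment; prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a contract-review assistant."},
        {"role": "user", "content": "Summarize the termination clause in plain English."},
    ],
    temperature=0.2,  # lower temperature for more consistent, reviewable output
)
print(response.choices[0].message.content)
```
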
📖

BERT-base

Best for: Text classification, NER, semantic search on a tight compute budget.

  • Fine-tune on domain-specific labeled data for strong results
  • Ideal for on-premise NLP in regulated sectors
  • No external API dependency — data sovereignty preserved
  • Not suitable for open-ended text generation
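
A minimal Hugging Face transformers sketch for loading BERT-base with a classification head appears below; the checkpoint, label count, and example text are placeholders, and the head must be fine-tuned on domain-specific labels before its probabilities mean anything.

```python
# Minimal sketch: BERT-base with a sequence-classification head via Hugging Face transformers.
# The classification head is randomly initialized here; fine-tune on labeled data before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The quarterly filing was submitted late.", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (near-uniform until the head is fine-tuned)
```
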
🖼️

EfficientNet-B4

Best for: High-accuracy image classification on cloud or GPU-equipped servers.

  • Medical imaging, quality control, satellite analysis
  • Grad-CAM visualization aids clinical explainability
  • Not suitable for real-time mobile inference (~380 ms per image)
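
A minimal inference sketch with torchvision's pretrained EfficientNet-B4 weights is shown below; the image path is a placeholder, and the weights enum assumes a reasonably recent torchvision release.

```python
# Minimal sketch: ImageNet-pretrained EfficientNet-B4 inference with torchvision (>= 0.13 assumed).
# "sample.jpg" is a placeholder path; the weights object bundles its own preprocessing transforms.
import torch
from torchvision import models
from torchvision.io import read_image

weights = models.EfficientNet_B4_Weights.IMAGENET1K_V1
model = models.efficientnet_b4(weights=weights).eval()
preprocess = weights.transforms()

img = read_image("sample.jpg")          # uint8 tensor, shape [C, H, W]
batch = preprocess(img).unsqueeze(0)    # resize, center-crop, normalize, add batch dim
with torch.no_grad():
    probs = model(batch).softmax(dim=-1)
top = probs[0].argmax().item()
print(weights.meta["categories"][top], round(probs[0, top].item(), 3))
```
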
📱

MobileNetV3

Best for: Real-time inference on mobile and edge devices with power constraints.

  • Retail apps, AR overlays, IoT cameras
  • Quantizable to INT8 for further efficiency
  • Trade-off: 7.7-point Top-1 accuracy gap vs. EfficientNet-B4
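
One possible edge-export path, sketched with Keras MobileNetV3 and the TensorFlow Lite converter; the variant, output file name, and quantization mode are illustrative assumptions (full INT8 conversion would additionally require a representative dataset).

```python
# Minimal sketch: exporting ImageNet-pretrained MobileNetV3 to TensorFlow Lite with
# post-training (dynamic-range) quantization. Full INT8 also needs a representative dataset.
import tensorflow as tf

model = tf.keras.applications.MobileNetV3Small(weights="imagenet")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("mobilenet_v3_quant.tflite", "wb") as f:    # placeholder output path
    f.write(tflite_model)
print(f"exported {len(tflite_model) / 1e6:.1f} MB model")
```
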
📊

XGBoost

Best for: Structured data prediction in regulated industries needing full explainability.

  • Credit scoring, fraud detection, insurance pricing
  • Native SHAP values satisfy regulatory requirements
  • Slower training than LightGBM on very large datasets
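
A minimal sketch pairing an XGBoost classifier with SHAP's TreeExplainer on a stand-in scikit-learn dataset; the data and hyperparameters are illustrative, not a real credit-scoring or fraud setup.

```python
# Minimal sketch: XGBoost classifier with per-prediction SHAP attributions on a toy dataset.
import shap
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # stand-in structured dataset
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

explainer = shap.TreeExplainer(model)            # exact, tree-aware feature attributions
shap_values = explainer.shap_values(X.iloc[:5])  # one attribution vector per prediction
print(shap_values.shape)                         # (5 rows, n_features)
```
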

LightGBM

Best for: High-throughput tabular prediction where training time and speed are critical.

  • Real-time bidding, recommendation engines, telemetry
  • 10–20× faster training than XGBoost at scale
  • Marginally less stable on very small datasets
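
A minimal LightGBM sketch with validation-based early stopping on the same kind of stand-in dataset; the parameters are illustrative defaults, and the callback API assumes a recent lightgbm release.

```python
# Minimal sketch: LightGBM classifier with early stopping on a held-out validation split.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop when validation AUC plateaus
)
print(model.best_iteration_, model.best_score_)
```
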

What This Project Taught Me

This comparative analysis challenged several assumptions I held about AI model selection and revealed new dimensions of the decision-making process I had previously underweighted.

"The right model is not the most accurate model — it is the model that most faithfully satisfies the constraints of the deployment environment, including regulatory, ethical, and operational constraints that performance benchmarks never capture."
🔄

Accuracy as One Variable, Not the Variable

Prior to this project, my default approach to model selection treated accuracy as the primary decision variable. Assembling the seven-criterion matrix challenged this assumption in every domain. When the highest-accuracy model was consistently the slowest, largest, and least explainable, I was compelled to reconceptualize model selection as a multi-objective optimization problem — not a ranking problem.

🧩

The Classification Challenge Across Domains

The hardest analytical task was building a classification scheme applicable across three fundamentally different data modalities. NLP, image, and tabular models don't share benchmarks, deployment infrastructure, or failure modes. The "Partial" explainability tier was not in my original plan — I introduced it specifically to distinguish Grad-CAM spatial visualizations from native SHAP feature attribution, which operate at very different levels of precision. That classification decision reflects genuine learning during the process.

🌳

Tabular Models Were the Surprise

Prior coursework emphasized deep learning architectures. Researching XGBoost and LightGBM in depth revealed a class of models that consistently outperforms neural networks on structured tabular data in both accuracy and interpretability — at a fraction of the compute cost. This directly reinforced a key principle from AIML 501: architectural novelty is not a proxy for fitness for purpose. The right model is context-driven, not trend-driven.

⚖️

Explainability as an Ethical Imperative

Building the explainability column of the matrix deepened my appreciation for responsible AI. The gap between GPT-4 (effectively zero native explainability) and XGBoost (per-prediction SHAP attribution) is not just a technical difference — it represents different levels of accountability to the humans affected by model decisions. In clinical, financial, or legal contexts, choosing a black-box model when a transparent alternative exists is an ethical choice, not just an engineering preference.

📐

Designing for Comprehension, Not Complexity

Building the portfolio visualization required translating abstract trade-off analysis into spatial, color-coded, and typographic decisions that a reader could parse in seconds. The discipline of information design — choosing what to foreground, what to summarize, what to make interactive — is itself an analytical skill. I found that the design process forced me to clarify ideas that felt clear in prose but were actually ambiguous when I tried to represent them visually.

🔭

Areas for Future Development

This project identified two gaps I intend to address in future work. First, the framework does not incorporate bias and fairness benchmarking — demographic parity, equalized odds — despite their importance for responsible deployment. Second, using compute cost as a proxy for energy consumption is imprecise; quantifying actual carbon footprint per inference would make the sustainability analysis actionable rather than directional.


Companion Document

This artifact is the interactive visualization component of a full academic paper. The companion explanatory document — which includes extended methodology, full APA references, and detailed reflection — is submitted alongside this artifact for AIML 501.

Egonu, D. (2026). Pre-trained model trade-off matrix: A comparative analysis across NLP, Computer Vision, and Tabular Data domains [Explanatory document, AIML 501]. Indiana Wesleyan University.

Key Takeaways

No single model wins across all dimensions. Effective AI deployment requires matching the model's strengths to the application's constraints — speed, accuracy, explainability, and cost must all be weighed against one another.

01

Accuracy and speed are inversely correlated at scale — GPT-4 leads accuracy but trails all models in inference speed.

02

Tabular models (XGBoost, LightGBM) offer the best explainability-to-performance ratio for regulated industries.

03

Choosing EfficientNet vs. MobileNet depends almost entirely on deployment context — cloud vs. edge — not raw accuracy alone.

04

BERT-base remains the most practical NLP fine-tuning baseline for organizations avoiding proprietary API dependency.

05

Ethical AI deployment must account for the carbon cost of large model inference — energy scales with model size.

06

Hybrid architectures (e.g., BERT + XGBoost) often outperform single-model approaches in real production pipelines.
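
As a rough illustration of takeaway 06, the sketch below feeds frozen BERT-base embeddings into an XGBoost classifier. The texts, labels, and mean-pooling choice are hypothetical; a production pipeline would typically concatenate these embeddings with tabular features and train on far more data.

```python
# Minimal sketch: hybrid pipeline — frozen BERT-base embeddings as features for XGBoost.
# Texts and labels are hypothetical placeholders; real pipelines add tabular features alongside.
import numpy as np
import torch
import xgboost as xgb
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    """Masked mean-pooled BERT embeddings used as dense features."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # [batch, tokens, 768]
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # [batch, 768]

texts = ["payment reversed after dispute", "routine balance inquiry"]
labels = np.array([1, 0])                                   # hypothetical fraud labels
features = embed(texts)

clf = xgb.XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(features, labels)
print(clf.predict_proba(features))
```
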


Portfolio Reference

Egonu, D. (2026). Pre-trained model trade-off matrix: A comparative analysis across NLP, Computer Vision, and Tabular Data domains [Interactive portfolio artifact, AIML 501]. Indiana Wesleyan University. https://didiegons.github.io/model_evaluation_matrix.html

Portfolio landing page: https://didiegons.github.io