A comparative analysis of six leading pre-trained models across NLP/Generative AI, Computer Vision, and Tabular Data — evaluating size, accuracy, speed, explainability, and energy cost to guide deployment decisions.
Selecting a pre-trained model is rarely straightforward. Practitioners must weigh competing pressures — raw accuracy vs. inference latency, model size vs. deployment cost, and prediction power vs. explainability. This is especially critical across domains where constraints differ dramatically.
Explainability adds a further dimension: regulated industries — finance, healthcare, law — require models whose decisions can be audited. A model that achieves 96% accuracy while remaining a black box may be entirely unsuitable in a compliance-driven context. This report synthesizes benchmark data, deployment considerations, and ethical trade-offs to produce actionable recommendations for each domain.
The classification process followed three deliberate stages. Models were first grouped by data modality (the primary organizing principle), then paired along scale and purpose axes within each domain, then rated across seven criteria, each with a defined favorable direction.
Models were grouped by data modality: language models, image classifiers, and tabular ensembles. Domain is the primary organizing principle because cross-domain accuracy benchmarks are incommensurable — MMLU scores, ImageNet Top-1, and AUC measure fundamentally different capabilities and cannot be directly compared.
Within each domain, models are classified along two axes: scale (large/frontier vs. lightweight/efficient) and purpose (generative/detection vs. discriminative/classification). This produced meaningful paired comparisons: GPT-4 vs. BERT-base, EfficientNet-B4 vs. MobileNetV3, XGBoost vs. LightGBM.
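To make the scheme concrete, here is a minimal Python sketch of the taxonomy as a plain data structure. The pairings mirror those named above; the key names and the encoding itself are illustrative assumptions, not part of the original methodology.

```python
# Illustrative encoding of the classification scheme:
# modality (primary axis) -> scale pairing -> model and purpose.
taxonomy = {
    "NLP / Generative AI": {
        "large":       {"model": "GPT-4",           "purpose": "generative"},
        "lightweight": {"model": "BERT-base",       "purpose": "discriminative"},
    },
    "Computer Vision": {
        "large":       {"model": "EfficientNet-B4", "purpose": "classification"},
        "lightweight": {"model": "MobileNetV3",     "purpose": "classification"},
    },
    "Tabular Data": {
        "large":       {"model": "XGBoost",         "purpose": "classification"},
        "lightweight": {"model": "LightGBM",        "purpose": "classification"},
    },
}

# Each domain yields exactly one paired comparison.
for domain, pair in taxonomy.items():
    print(f"{domain}: {pair['large']['model']} vs. {pair['lightweight']['model']}")
```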
Six of the seven criteria are rated either directly (High = favorable: Accuracy, Inference Speed, Explainability, Integration) or inversely (Low = favorable: Model Size, Energy Cost); the seventh, Benchmark, is a quantitative reference rather than a rating. Ratings reflect performance relative to the cohort of six models, not against external absolute benchmarks.
- **Model Size:** parameter count & memory footprint
- **Accuracy:** domain-specific benchmark score
- **Inference Speed:** latency per unit of output
- **Explainability:** availability of interpretation tools
- **Energy Cost:** compute & environmental footprint
- **Integration:** SDK, framework, and deployment ease
- **Benchmark:** quantitative reference by domain
The six models, spanning three domains, were selected on three grounds: documented production adoption, published benchmark data (Hugging Face, MMLU, GLUE, ImageNet, MLPerf v3.1), and deliberate pairing of a frontier and a lightweight alternative within each domain.
- **GPT-4 (OpenAI):** Frontier large language model using a Mixture-of-Experts architecture (~1.8T parameters). Excels at complex reasoning, code generation, and multi-modal inputs. Accessed via API.
- **BERT-base (Google):** Bidirectional encoder-only transformer (110M parameters). Industry standard for classification, NER, and Q&A fine-tuning. Open-source on Hugging Face.
- **EfficientNet-B4 (Google Brain):** Compound-scaled CNN (19M parameters) that jointly scales depth, width, and resolution. Strong accuracy-per-parameter ratio for production image classifiers.
- **MobileNetV3 (Google):** Ultra-lightweight depthwise-separable CNN (5.4M parameters) designed via Neural Architecture Search for mobile and edge devices.
- **XGBoost (DMLC):** Gradient-boosted decision trees with regularized learning. Native SHAP support for feature-level explainability. Kaggle benchmark champion for structured data.
- **LightGBM (Microsoft):** Histogram-based GBDT with leaf-wise growth. Trains 10–20× faster than XGBoost at scale while maintaining comparable AUC. SHAP compatible.
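As a taste of the integration criterion, here is a minimal sketch of loading BERT-base for classification fine-tuning with Hugging Face Transformers. The checkpoint name `bert-base-uncased` and the two-label head are illustrative assumptions, not prescriptions from this analysis.

```python
# Minimal sketch: load BERT-base for sequence classification fine-tuning.
# Assumes `pip install transformers torch`; the label count is illustrative.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # binary classification head
)

inputs = tokenizer("The model loads in two lines.", return_tensors="pt")
outputs = model(**inputs)       # logits over the two labels
print(outputs.logits.shape)     # torch.Size([1, 2])
```

The low barrier here (two calls, one pip install) is exactly what the "High (Hugging Face)" integration rating in the table below captures.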
Each model is scored across the seven criteria. Performance tiers: High / Favorable · Medium / Partial · Low / Unfavorable. Note: Model Size and Energy Cost are rated inversely (Low = favorable).
| Model | Domain | Model Size | Accuracy | Inf. Speed | Explainability | Energy Cost | Integration | Benchmark |
|---|---|---|---|---|---|---|---|---|
| GPT-4 (OpenAI) | NLP / Gen AI | Very Large | Very High | Low | Low — Black Box | Very High | High (API) | MMLU 86.4% |
| BERT-base (Google) | NLP | Medium | Moderate | Fast | Partial — Attn. Viz | Moderate | High (Hugging Face) | GLUE 79.6 |
| EfficientNet-B4 (Google Brain) | Vision | Medium | High | Moderate | Partial — Grad-CAM | Low | High (TF / PyTorch) | Top-1 82.9% |
| MobileNetV3 (Google) | Vision | Very Small | Moderate | Very Fast | Partial — Grad-CAM | Very Low | Excellent (TF Lite) | Top-1 75.2% |
| XGBoost (DMLC) | Tabular | Flexible | High | Fast | High — Native SHAP | Very Low | Excellent (pip) | AUC ≥ 0.97 |
| LightGBM (Microsoft) | Tabular | Flexible | High | Fastest | High — SHAP Compatible | Very Low | Excellent (pip) | AUC ≥ 0.96 |
Normalized scores (0–100) grouped by domain. Cross-domain comparison is illustrative only — benchmarks differ by task and are not commensurable across rows.
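To make the normalization reproducible, here is a minimal sketch of min-max scaling with the inverse-criterion handling described above. The raw values below are placeholders for illustration, not the actual inputs behind the chart.

```python
# Min-max normalize raw criterion values to 0-100 within the cohort,
# flipping inverse criteria (Model Size, Energy Cost) so that a higher
# normalized score is always the favorable direction.
def normalize(values: dict[str, float], inverse: bool = False) -> dict[str, float]:
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # guard against a constant column
    scores = {m: 100 * (v - lo) / span for m, v in values.items()}
    if inverse:  # e.g., Model Size: smaller is favorable
        scores = {m: 100 - s for m, s in scores.items()}
    return scores

# Placeholder parameter counts in millions; GPT-4's figure reflects the
# publicly reported ~1.8T estimate, not an official disclosure.
sizes = {"GPT-4": 1_800_000, "BERT-base": 110,
         "EfficientNet-B4": 19, "MobileNetV3": 5.4}
print(normalize(sizes, inverse=True))  # MobileNetV3 scores highest
```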
Each model occupies a distinct niche. The right choice depends on application constraints — not just peak accuracy.
- **GPT-4** · Best for: complex reasoning, content generation, and multi-modal applications.
- **BERT-base** · Best for: text classification, NER, and semantic search on a tight compute budget.
- **EfficientNet-B4** · Best for: high-accuracy image classification on cloud or GPU-equipped servers.
- **MobileNetV3** · Best for: real-time inference on mobile and edge devices with power constraints (see the sketch after this list).
- **XGBoost** · Best for: structured-data prediction in regulated industries needing full explainability.
- **LightGBM** · Best for: high-throughput tabular prediction where training time and speed are critical.
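To illustrate the MobileNetV3 edge path, here is a minimal TensorFlow Lite inference sketch. The model file path `mobilenet_v3.tflite` is hypothetical and assumes a model already converted to the TF Lite format; input shapes depend on that exported model.

```python
# Minimal sketch: on-device inference with a converted MobileNetV3 model.
# `mobilenet_v3.tflite` is a hypothetical path to a pre-converted model.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet_v3.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]   # shape/dtype come from the model
out = interpreter.get_output_details()[0]

image = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # stand-in frame
interpreter.set_tensor(inp["index"], image)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).argmax())  # predicted class id
```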
This comparative analysis challenged several assumptions I held about AI model selection and revealed new dimensions of the decision-making process I had previously underweighted.
Prior to this project, my default approach to model selection treated accuracy as the primary decision variable. Assembling the seven-criterion matrix challenged this assumption in every domain. When the highest-accuracy model was consistently the slowest, largest, and least explainable, I was compelled to reconceptualize model selection as a multi-objective optimization problem, not a ranking problem.
The hardest analytical task was building a classification scheme applicable across three fundamentally different data modalities. NLP, image, and tabular models don't share benchmarks, deployment infrastructure, or failure modes. The "Partial" explainability tier was not in my original plan — I introduced it specifically to distinguish Grad-CAM spatial visualizations from native SHAP feature attribution, which operate at very different levels of precision. That classification decision reflects genuine learning during the process.
Prior coursework emphasized deep learning architectures. Researching XGBoost and LightGBM in depth revealed a class of models that consistently outperforms neural networks on structured tabular data in both accuracy and interpretability — at a fraction of the compute cost. This directly reinforced a key principle from AIML 501: architectural novelty is not a proxy for fitness for purpose. The right model is context-driven, not trend-driven.
Building the explainability column of the matrix deepened my appreciation for responsible AI. The gap between GPT-4 (effectively zero native explainability) and XGBoost (per-prediction SHAP attribution) is not just a technical difference — it represents different levels of accountability to the humans affected by model decisions. In clinical, financial, or legal contexts, choosing a black-box model when a transparent alternative exists is an ethical choice, not just an engineering preference.
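As a concrete illustration of that gap, here is a minimal sketch of per-prediction SHAP attribution for an XGBoost classifier. The data is synthetic and the hyperparameters are illustrative; the point is that every individual prediction decomposes into per-feature contributions.

```python
# Minimal sketch: per-prediction SHAP attribution for an XGBoost model.
# Assumes `pip install xgboost shap scikit-learn`; the data is synthetic.
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)       # exact attribution for tree models
shap_values = explainer.shap_values(X[:5])  # one row of values per prediction
print(shap_values.shape)                    # (5, 8): contribution per feature
```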
Building the portfolio visualization required translating abstract trade-off analysis into spatial, color-coded, and typographic decisions that a reader could parse in seconds. The discipline of information design — choosing what to foreground, what to summarize, what to make interactive — is itself an analytical skill. I found that the design process forced me to clarify ideas that felt clear in prose but were actually ambiguous when I tried to represent them visually.
This project identified two gaps I intend to address in future work. First, the framework does not incorporate bias and fairness benchmarking — demographic parity, equalized odds — despite their importance for responsible deployment. Second, using compute cost as a proxy for energy consumption is imprecise; quantifying actual carbon footprint per inference would make the sustainability analysis actionable rather than directional.
Companion Document
This artifact is the interactive visualization component of a full academic paper. The companion explanatory document — which includes extended methodology, full APA references, and detailed reflection — is submitted alongside this artifact for AIML 501.
Egonu, D. (2026). Pre-trained model trade-off matrix: A comparative analysis across NLP, Computer Vision, and Tabular Data domains [Explanatory document, AIML 501]. Indiana Wesleyan University.
No single model wins across all dimensions. Effective AI deployment requires matching the model's strengths to the application's constraints — speed, accuracy, explainability, and cost must all be weighed against one another.
Accuracy and speed are inversely correlated at scale — GPT-4 leads accuracy but trails all models in inference speed.
Tabular models (XGBoost, LightGBM) offer the best explainability-to-performance ratio for regulated industries.
Choosing EfficientNet vs. MobileNet depends almost entirely on deployment context — cloud vs. edge — not raw accuracy alone.
BERT-base remains the most practical NLP fine-tuning baseline for organizations avoiding proprietary API dependency.
Ethical AI deployment must account for the carbon cost of large model inference — energy scales with model size.
Hybrid pipelines (e.g., BERT embeddings feeding an XGBoost classifier) often outperform single-model approaches in real production systems; a minimal sketch follows this list.
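Here is a minimal sketch of that hybrid pattern, with frozen BERT-base embeddings used as features for an XGBoost classifier. The texts, labels, and hyperparameters are placeholders; the wiring, not the tuning, is the point.

```python
# Minimal sketch: frozen BERT-base embeddings as features for XGBoost.
# Texts and labels are placeholders; BERT is not fine-tuned here.
import torch
import xgboost as xgb
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

texts = ["approve the loan", "flag for review", "approve", "flag"]
labels = [1, 0, 1, 0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    # Use the [CLS] token embedding as a fixed-length sentence feature.
    feats = encoder(**batch).last_hidden_state[:, 0, :].numpy()

clf = xgb.XGBClassifier(n_estimators=20).fit(feats, labels)
print(clf.predict(feats))  # sanity check on the training rows
```

One design note: because the downstream model is a GBDT, the classification stage retains SHAP explainability even though the embedding stage does not, which is often an acceptable compromise in regulated settings.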
Portfolio Reference
Egonu, D. (2026). Pre-trained model trade-off matrix: A comparative analysis across NLP, Computer Vision, and Tabular Data domains [Interactive portfolio artifact, AIML 501]. Indiana Wesleyan University. https://didiegons.github.io/model_evaluation_matrix.html
Portfolio landing page: https://didiegons.github.io