A comparative analysis of six leading pre-trained models across NLP/Generative AI, Computer Vision, and Tabular Data — evaluating size, accuracy, speed, explainability, and energy cost to guide deployment decisions.
Selecting a pre-trained model is rarely straightforward. Practitioners must weigh competing pressures — raw accuracy vs. inference latency, model size vs. deployment cost, and prediction power vs. explainability. This is especially critical across domains where constraints differ dramatically.
Explainability adds a further dimension: regulated industries — finance, healthcare, law — require models whose decisions can be audited. A model that achieves 96% accuracy while remaining a black box may be entirely unsuitable in a compliance-driven context. This report synthesizes benchmark data, deployment considerations, and ethical trade-offs to produce actionable recommendations for each domain.
The classification process followed three deliberate stages. Models were first grouped by data modality (the primary organizing principle), then paired along scale and purpose axes within each domain, then rated across seven criteria, each with a defined favorable direction.
Models were grouped by data modality: language models, image classifiers, and tabular ensembles. Domain is the primary organizing principle because cross-domain accuracy benchmarks are incommensurable — MMLU scores, ImageNet Top-1, and AUC measure fundamentally different capabilities and cannot be directly compared.
Within each domain, models are classified along two axes: scale (large/frontier vs. lightweight/efficient) and purpose (generative/detection vs. discriminative/classification). This produced meaningful paired comparisons: GPT-4 vs. BERT-base, EfficientNet-B4 vs. MobileNetV3, XGBoost vs. LightGBM.
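To make the scheme concrete, here is a minimal Python sketch of the taxonomy as a plain data structure. The pairings mirror those named above; the key names and the encoding itself are illustrative assumptions, not part of the original methodology.

```python
# Illustrative encoding of the classification scheme:
# modality (primary axis) -> scale pairing -> model and purpose.
taxonomy = {
    "NLP / Generative AI": {
        "large":       {"model": "GPT-4",           "purpose": "generative"},
        "lightweight": {"model": "BERT-base",       "purpose": "discriminative"},
    },
    "Computer Vision": {
        "large":       {"model": "EfficientNet-B4", "purpose": "classification"},
        "lightweight": {"model": "MobileNetV3",     "purpose": "classification"},
    },
    "Tabular Data": {
        "large":       {"model": "XGBoost",         "purpose": "classification"},
        "lightweight": {"model": "LightGBM",        "purpose": "classification"},
    },
}

# Each domain yields exactly one paired comparison.
for domain, pair in taxonomy.items():
    print(f"{domain}: {pair['large']['model']} vs. {pair['lightweight']['model']}")
```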
Six of the seven criteria are rated either directly (High = favorable: Accuracy, Inference Speed, Explainability, Integration) or inversely (Low = favorable: Model Size, Energy Cost); the seventh, Benchmark, is a quantitative reference rather than a rating. Ratings reflect performance relative to the cohort of six models, not against external absolute benchmarks.
- **Model Size:** parameter count & memory footprint
- **Accuracy:** domain-specific benchmark score
- **Inference Speed:** latency per unit of output
- **Explainability:** availability of interpretation tools
- **Energy Cost:** compute & environmental footprint
- **Integration:** SDK, framework, and deployment ease
- **Benchmark:** quantitative reference by domain
The six models, spanning three domains, were selected on three grounds: documented production adoption, published benchmark data (Hugging Face, MMLU, GLUE, ImageNet, MLPerf v3.1), and deliberate pairing of a frontier and a lightweight alternative within each domain.
- **GPT-4 (OpenAI):** Frontier large language model using a Mixture-of-Experts architecture (~1.8T parameters). Excels at complex reasoning, code generation, and multi-modal inputs. Accessed via API.
- **BERT-base (Google):** Bidirectional encoder-only transformer (110M parameters). Industry standard for classification, NER, and Q&A fine-tuning. Open-source on Hugging Face.
- **EfficientNet-B4 (Google Brain):** Compound-scaled CNN (19M parameters) that jointly scales depth, width, and resolution. Strong accuracy-per-parameter ratio for production image classifiers.
- **MobileNetV3 (Google):** Ultra-lightweight depthwise-separable CNN (5.4M parameters) designed via Neural Architecture Search for mobile and edge devices.
- **XGBoost (DMLC):** Gradient-boosted decision trees with regularized learning. Native SHAP support for feature-level explainability. Kaggle benchmark champion for structured data.
- **LightGBM (Microsoft):** Histogram-based GBDT with leaf-wise growth. Trains 10–20× faster than XGBoost at scale while maintaining comparable AUC. SHAP compatible.
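As a taste of the integration criterion, here is a minimal sketch of loading BERT-base for classification fine-tuning with Hugging Face Transformers. The checkpoint name `bert-base-uncased` and the two-label head are illustrative assumptions, not prescriptions from this analysis.

```python
# Minimal sketch: load BERT-base for sequence classification fine-tuning.
# Assumes `pip install transformers torch`; the label count is illustrative.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # binary classification head
)

inputs = tokenizer("The model loads in two lines.", return_tensors="pt")
outputs = model(**inputs)       # logits over the two labels
print(outputs.logits.shape)     # torch.Size([1, 2])
```

The low barrier here (two calls, one pip install) is exactly what the "High (Hugging Face)" integration rating in the table below captures.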
Each model is scored across the seven criteria. Performance tiers: High / Favorable · Medium / Partial · Low / Unfavorable. Note: Model Size and Energy Cost are rated inversely (Low = favorable).
| Model | Domain | Model Size | Accuracy | Inf. Speed | Explainability | Energy Cost | Integration | Benchmark |
|---|---|---|---|---|---|---|---|---|
| GPT-4 (OpenAI) | NLP / Gen AI | Very Large | Very High | Low | Low — Black Box | Very High | High (API) | MMLU 86.4% |
| BERT-base (Google) | NLP | Medium | Moderate | Fast | Partial — Attn. Viz | Moderate | High (Hugging Face) | GLUE 79.6 |
| EfficientNet-B4 (Google Brain) | Vision | Medium | High | Moderate | Partial — Grad-CAM | Low | High (TF / PyTorch) | Top-1 82.9% |
| MobileNetV3 (Google) | Vision | Very Small | Moderate | Very Fast | Partial — Grad-CAM | Very Low | Excellent (TF Lite) | Top-1 75.2% |
| XGBoost (DMLC) | Tabular | Flexible | High | Fast | High — Native SHAP | Very Low | Excellent (pip) | AUC ≥ 0.97 |
| LightGBM (Microsoft) | Tabular | Flexible | High | Fastest | High — SHAP Compatible | Very Low | Excellent (pip) | AUC ≥ 0.96 |
Normalized scores (0–100) grouped by domain. Cross-domain comparison is illustrative only — benchmarks differ by task and are not commensurable across rows.
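To make the normalization reproducible, here is a minimal sketch of min-max scaling with the inverse-criterion handling described above. The raw values below are placeholders for illustration, not the actual inputs behind the chart.

```python
# Min-max normalize raw criterion values to 0-100 within the cohort,
# flipping inverse criteria (Model Size, Energy Cost) so that a higher
# normalized score is always the favorable direction.
def normalize(values: dict[str, float], inverse: bool = False) -> dict[str, float]:
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # guard against a constant column
    scores = {m: 100 * (v - lo) / span for m, v in values.items()}
    if inverse:  # e.g., Model Size: smaller is favorable
        scores = {m: 100 - s for m, s in scores.items()}
    return scores

# Placeholder parameter counts in millions; GPT-4's figure reflects the
# publicly reported ~1.8T estimate, not an official disclosure.
sizes = {"GPT-4": 1_800_000, "BERT-base": 110,
         "EfficientNet-B4": 19, "MobileNetV3": 5.4}
print(normalize(sizes, inverse=True))  # MobileNetV3 scores highest
```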
Each model occupies a distinct niche. The right choice depends on application constraints — not just peak accuracy.
- **GPT-4** · Best for: complex reasoning, content generation, and multi-modal applications.
- **BERT-base** · Best for: text classification, NER, and semantic search on a tight compute budget.
- **EfficientNet-B4** · Best for: high-accuracy image classification on cloud or GPU-equipped servers.
- **MobileNetV3** · Best for: real-time inference on mobile and edge devices with power constraints (see the sketch after this list).
- **XGBoost** · Best for: structured-data prediction in regulated industries needing full explainability.
- **LightGBM** · Best for: high-throughput tabular prediction where training time and speed are critical.
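To illustrate the MobileNetV3 edge path, here is a minimal TensorFlow Lite inference sketch. The model file path `mobilenet_v3.tflite` is hypothetical and assumes a model already converted to the TF Lite format; input shapes depend on that exported model.

```python
# Minimal sketch: on-device inference with a converted MobileNetV3 model.
# `mobilenet_v3.tflite` is a hypothetical path to a pre-converted model.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet_v3.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]   # shape/dtype come from the model
out = interpreter.get_output_details()[0]

image = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # stand-in frame
interpreter.set_tensor(inp["index"], image)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).argmax())  # predicted class id
```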
This comparative analysis challenged several assumptions I held about AI model selection and revealed new dimensions of the decision-making process I had previously underweighted.
Prior to this project, my default approach to model selection treated accuracy as the primary decision variable. Assembling the seven-criterion matrix challenged this assumption in every domain. When the highest-accuracy model was consistently the slowest, largest, and least explainable, I was compelled to reconceptualize model selection as a multi-objective optimization problem, not a ranking problem.
The hardest analytical task was building a classification scheme applicable across three fundamentally different data modalities. NLP, image, and tabular models don't share benchmarks, deployment infrastructure, or failure modes. The "Partial" explainability tier was not in my original plan — I introduced it specifically to distinguish Grad-CAM spatial visualizations from native SHAP feature attribution, which operate at very different levels of precision. That classification decision reflects genuine learning during the process.
Prior coursework emphasized deep learning architectures. Researching XGBoost and LightGBM in depth revealed a class of models that consistently outperforms neural networks on structured tabular data in both accuracy and interpretability — at a fraction of the compute cost. This directly reinforced a key principle from AIML 501: architectural novelty is not a proxy for fitness for purpose. The right model is context-driven, not trend-driven.
Building the explainability column of the matrix deepened my appreciation for responsible AI. The gap between GPT-4 (effectively zero native explainability) and XGBoost (per-prediction SHAP attribution) is not just a technical difference — it represents different levels of accountability to the humans affected by model decisions. In clinical, financial, or legal contexts, choosing a black-box model when a transparent alternative exists is an ethical choice, not just an engineering preference.
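As a concrete illustration of that gap, here is a minimal sketch of per-prediction SHAP attribution for an XGBoost classifier. The data is synthetic and the hyperparameters are illustrative; the point is that every individual prediction decomposes into per-feature contributions.

```python
# Minimal sketch: per-prediction SHAP attribution for an XGBoost model.
# Assumes `pip install xgboost shap scikit-learn`; the data is synthetic.
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)       # exact attribution for tree models
shap_values = explainer.shap_values(X[:5])  # one row of values per prediction
print(shap_values.shape)                    # (5, 8): contribution per feature
```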
Building the portfolio visualization required translating abstract trade-off analysis into spatial, color-coded, and typographic decisions that a reader could parse in seconds. The discipline of information design — choosing what to foreground, what to summarize, what to make interactive — is itself an analytical skill. I found that the design process forced me to clarify ideas that felt clear in prose but were actually ambiguous when I tried to represent them visually.
This project identified two gaps I intend to address in future work. First, the framework does not incorporate bias and fairness benchmarking — demographic parity, equalized odds — despite their importance for responsible deployment. Second, using compute cost as a proxy for energy consumption is imprecise; quantifying actual carbon footprint per inference would make the sustainability analysis actionable rather than directional.
Companion Document
This artifact is the interactive visualization component of a full academic paper. The companion explanatory document — which includes extended methodology, full APA references, and detailed reflection — is submitted alongside this artifact for AIML 501.
Egonu, D. (2026). Pre-trained model trade-off matrix: A comparative analysis across NLP, Computer Vision, and Tabular Data domains [Explanatory document, AIML 501]. Indiana Wesleyan University.
No single model wins across all dimensions. Effective AI deployment requires matching the model's strengths to the application's constraints — speed, accuracy, explainability, and cost must all be weighed against one another.
Accuracy and speed are inversely correlated at scale — GPT-4 leads accuracy but trails all models in inference speed.
Tabular models (XGBoost, LightGBM) offer the best explainability-to-performance ratio for regulated industries.
Choosing EfficientNet vs. MobileNet depends almost entirely on deployment context — cloud vs. edge — not raw accuracy alone.
BERT-base remains the most practical NLP fine-tuning baseline for organizations avoiding proprietary API dependency.
Ethical AI deployment must account for the carbon cost of large model inference — energy scales with model size.
Hybrid pipelines (e.g., BERT embeddings feeding an XGBoost classifier) often outperform single-model approaches in real production systems; a minimal sketch follows this list.
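Here is a minimal sketch of that hybrid pattern, with frozen BERT-base embeddings used as features for an XGBoost classifier. The texts, labels, and hyperparameters are placeholders; the wiring, not the tuning, is the point.

```python
# Minimal sketch: frozen BERT-base embeddings as features for XGBoost.
# Texts and labels are placeholders; BERT is not fine-tuned here.
import torch
import xgboost as xgb
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

texts = ["approve the loan", "flag for review", "approve", "flag"]
labels = [1, 0, 1, 0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    # Use the [CLS] token embedding as a fixed-length sentence feature.
    feats = encoder(**batch).last_hidden_state[:, 0, :].numpy()

clf = xgb.XGBClassifier(n_estimators=20).fit(feats, labels)
print(clf.predict(feats))  # sanity check on the training rows
```

One design note: because the downstream model is a GBDT, the classification stage retains SHAP explainability even though the embedding stage does not, which is often an acceptable compromise in regulated settings.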
Portfolio Reference
Egonu, D. (2026). Pre-trained model trade-off matrix: A comparative analysis across NLP, Computer Vision, and Tabular Data domains [Interactive portfolio artifact, AIML 501]. Indiana Wesleyan University. https://didiegons.github.io/model_evaluation_matrix.html
Portfolio landing page: https://didiegons.github.io