Text embeddings replace domain knowledge in algorithm selection across seven problem classes
Researchers propose ZeroFolio, a feature-free method that uses pretrained text embeddings to select algorithms without manual feature engineering, outperforming hand-crafted approaches on 10 of 11 tested scenarios.
- ZeroFolio uses pretrained text embeddings instead of hand-crafted features to select algorithms across diverse problem domains including SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems.
- The method outperformed random forest baselines trained on domain-specific features in 10 of 11 test scenarios with a single configuration, and all 11 scenarios with two-seed voting.
- Key design choices, identified through an ablation study, include inverse-distance weighting, line shuffling, and Manhattan distance.
- Combining embeddings with traditional hand-crafted features via soft voting yielded further improvements on competitive scenarios.
A research team led by Stefan Szeider has proposed ZeroFolio, a domain-agnostic approach to algorithm selection that eliminates the need for hand-engineered features. Rather than extracting problem-specific characteristics, the method treats raw instance files as plain text, encodes them with pretrained embeddings, and applies weighted k-nearest neighbors for solver selection.
The core innovation rests on an empirical observation: pretrained language model embeddings capture structural distinctions between problem instances without explicit domain knowledge or task-specific fine-tuning. This permits the same three-step pipeline—serialize, embed, select—to work across unrelated problem classes.
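A minimal sketch of that serialize-embed-select pipeline. The paper uses real pretrained text embeddings; here a hashed character-trigram count vector stands in for the embedding model, and the function names and toy instances are illustrative, not from the paper:

```python
import hashlib

def embed(text, dim=64):
    """Stand-in for a pretrained text embedding: hashed character-trigram
    counts. Only illustrates the pipeline shape; the actual method encodes
    the raw instance text with a pretrained language model."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def manhattan(a, b):
    # Manhattan (L1) distance, the similarity metric the ablation favored.
    return sum(abs(x - y) for x, y in zip(a, b))

def select_solver(instance_text, train):
    """train: list of (instance_text, best_solver) pairs.
    Nearest-neighbor selection over embedded instances (1-NN for brevity;
    the method uses weighted k-NN)."""
    q = embed(instance_text)
    return min(train, key=lambda t: manhattan(q, embed(t[0])))[1]
```

The same code path handles a DIMACS CNF file and a MIP model alike, since both are treated as plain text, which is what makes the pipeline domain-agnostic.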
The authors evaluated ZeroFolio on 11 scenarios spanning seven distinct combinatorial problem domains: satisfiability, maximum satisfiability, quantified Boolean formulas, answer set programming, constraint satisfaction, mixed-integer programming, and graph problems. Against random forest classifiers built on conventional hand-crafted features, ZeroFolio outperformed the baselines in 10 of 11 scenarios using a single fixed hyperparameter set, and in all 11 scenarios when ensemble voting with two random seeds was applied.
Ablation analysis identified three critical design decisions: inverse-distance weighting for neighbor contribution, random line shuffling during text preprocessing, and Manhattan distance as the similarity metric. On datasets where both approaches showed comparable performance, combining embeddings with traditional features through soft voting produced measurable gains.
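The three ablation-identified choices can be sketched together. The helper names and toy vectors below are illustrative, assuming instances have already been embedded:

```python
import random
from collections import defaultdict

def shuffle_lines(text, seed=0):
    """Random line shuffling during preprocessing: makes the embedding
    less sensitive to the order of clauses or constraints in the file."""
    lines = text.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

def manhattan(a, b):
    # Manhattan (L1) distance as the similarity metric.
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_select(query_vec, train, k=5, eps=1e-9):
    """train: list of (embedding_vector, best_solver) pairs.
    Weighted k-NN: each neighbor votes with weight 1/(distance + eps),
    i.e. inverse-distance weighting, so closer instances count more."""
    neighbors = sorted(train, key=lambda t: manhattan(query_vec, t[0]))[:k]
    votes = defaultdict(float)
    for vec, solver in neighbors:
        votes[solver] += 1.0 / (manhattan(query_vec, vec) + eps)
    return max(votes, key=votes.get)
```

With inverse-distance weighting, a single very close neighbor can outvote several distant ones, which matters when only a few training instances resemble the query.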
- Apr 24, 2026 · arXiv cs.AI