Meta AI launches NeuralBench: a unified, open-source framework for benchmarking NeuroAI models across 36 EEG tasks and 94 datasets


Evaluating AI models trained on brain signals has long been messy and inconsistent. Different research groups use different preprocessing pipelines, train on different datasets, and report results on a narrow set of tasks, making it nearly impossible to know which model actually works best, or for what purpose. A new framework from the Meta AI team is designed to fix this.

Meta AI researchers have released NeuralBench, a unified, open-source framework for benchmarking AI models of brain activity. Its first release, NeuralBench-EEG v1.0, is the largest open benchmark of its kind: 36 tasks, 94 datasets, 9,478 subjects, 13,603 hours of EEG data, and 14 deep learning architectures evaluated within a single unified interface.

https://ai.meta.com/research/publications/neuralbench-a-unifying-framework-to-benchmark-neuroai-models/

The problem that NeuralBench solves

The broader field of NeuroAI, where deep learning meets neuroscience, has exploded in recent years. Self-supervised learning techniques originally developed for language, speech, and images are now being adapted to build brain foundation models: large models pre-trained on unlabeled brain recordings and fine-tuned for downstream tasks ranging from clinical seizure detection to decoding what a person sees or hears.

But the evaluation landscape is badly fragmented. Existing benchmarks such as MOABB cover up to 148 brain-computer interface (BCI) datasets but limit evaluation to only 5 tasks. Other efforts, such as EEG-Bench, EEG-FM-Bench, and AdaBrain-Bench, are each limited in their own ways. For modalities such as magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), there is no systematic benchmark at all.

The result is that claims about models being “generalizable” or “foundational” often rest on cherry-picked tasks with no common point of reference.

What is NeuralBench?

NeuralBench is built from three core Python packages that together form a standard pipeline.

  • NeuralFetch handles dataset acquisition, pulling formatted data from public repositories including OpenNeuro, DANDI, and NEMAR.
  • NeuralSet prepares the data as PyTorch-ready data loaders, wrapping existing neuroscience tools such as MNE-Python and nilearn for preprocessing, and HuggingFace models for extracting stimulus embeddings (for tasks involving images, speech, or text).
  • NeuralTrain provides standardized training code built on PyTorch Lightning, Pydantic, and the exca execution-and-caching library.

Once installed via pip install neuralbench, the framework is driven through a command-line interface (CLI). Running a task takes only three commands: download the data, prepare the cache, and execute. Each task is configured with a lightweight YAML file that defines the data source, training/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics (an illustrative config is sketched below).
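For illustration only, such a config could be assembled and serialized as in the sketch below; the field names here (dataset, splits, preprocessing, target, training, metric) are hypothetical placeholders inferred from the description above, not the framework's actual schema.

import yaml  # pip install pyyaml

# Hypothetical task configuration mirroring the fields described above.
# The real NeuralBench YAML schema may differ; this is only a sketch.
task_config = {
    "dataset": {"repository": "openneuro", "id": "ds-XXXX"},  # placeholder source
    "splits": {"strategy": "between_subject", "train": 0.8, "valid": 0.1, "test": 0.1},
    "preprocessing": {"bandpass_hz": [0.5, 40.0], "resample_hz": 128},
    "target": {"type": "classification", "num_classes": 2},
    "training": {"optimizer": "adamw", "lr": 1e-4, "weight_decay": 0.05, "max_epochs": 50},
    "metric": "balanced_accuracy",
}

print(yaml.safe_dump(task_config, sort_keys=False))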


What NeuralBench-EEG v1.0 covers

The first release focuses on electroencephalography (EEG) and covers eight categories of tasks: cognitive decoding (image, sentence, speech, writing, video, and word decoding), brain-computer interfacing (BCI), evoked responses, clinical tasks, internal states, sleep, phenotyping, and miscellaneous.

Three classes of models are compared:

  • Task-specific architectures (~1.5K–4.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, CTNet.
  • EEG foundation models (~3.2M–157.1M parameters, pre-trained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, REVE.
  • Handcrafted-feature baselines: scikit-learn-style pipelines that compute symmetric positive definite (SPD) matrix representations and feed them into logistic regression or ridge models (see the sketch after this list).
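As a rough illustration of what such a handcrafted SPD baseline can look like, the sketch below chains pyriemann covariance estimation with a scikit-learn logistic regression; this is an assumed recipe for demonstration, not NeuralBench's exact baseline code.

import numpy as np
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Dummy EEG epochs: (n_trials, n_channels, n_samples) with binary labels.
X = np.random.randn(100, 32, 256)
y = np.random.randint(0, 2, size=100)

# One SPD covariance matrix per trial -> tangent-space features -> logistic regression.
clf = make_pipeline(
    Covariances(estimator="oas"),
    TangentSpace(),
    LogisticRegression(max_iter=1000),
)
clf.fit(X[:80], y[:80])
print("held-out accuracy:", clf.score(X[80:], y[80:]))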

All models are trained or fine-tuned end-to-end with a common recipe: AdamW optimizer, learning rate 10⁻⁴, weight decay 0.05, cosine annealing with 10% warm-up, and up to 50 epochs with early stopping (patience = 10). The only exception is BENDR, where the learning rate is lowered to 10⁻⁵ and gradient clipping at 0.5 is applied to obtain stable learning curves. This deliberate standardization strips out model-specific optimization tricks, such as layer-wise learning-rate decay, two-stage fine-tuning, or LoRA, so that what is actually being compared is the architecture and the pre-training methodology.
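A minimal PyTorch sketch of that shared recipe (AdamW at 10⁻⁴, weight decay 0.05, 10% linear warm-up followed by cosine annealing, plus the BENDR-specific gradient clipping) is shown below; it merely illustrates the stated hyperparameters and is not NeuralBench's actual PyTorch-Lightning training loop.

import math
import torch

model = torch.nn.Linear(64, 2)  # stand-in for an EEG model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

total_steps = 1_000
warmup_steps = int(0.10 * total_steps)  # 10% warm-up

def lr_lambda(step: int) -> float:
    # Linear warm-up, then cosine annealing down to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 64)).pow(2).mean()  # dummy loss
    loss.backward()
    # BENDR exception: clip the gradient norm at 0.5 for stability.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
    scheduler.step()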

Data splitting is handled differently per task type to reflect real-world generalization constraints: pre-defined splits where the dataset authors provide them, leave-concept-out splits for cognitive decoding tasks (all subjects seen during training, but a held-out set of stimuli used for testing), between-subject splits for most clinical and BCI tasks (sketched below), and within-subject splits for datasets with very few participants. Each model is trained three times per task with three different random seeds.
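For example, a between-subject split can be expressed with scikit-learn's GroupShuffleSplit, which holds out entire subjects so that no participant appears in both training and test data; this is a generic sketch, not necessarily how NeuralBench implements its splits internally.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_trials = 200
X = np.random.randn(n_trials, 32, 256)               # dummy EEG epochs
y = np.random.randint(0, 2, size=n_trials)           # dummy labels
subjects = np.random.randint(0, 20, size=n_trials)   # subject ID per trial

# Hold out ~20% of *subjects*, not trials, for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
print(f"{len(train_idx)} training trials, {len(test_idx)} test trials")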

Evaluation metrics are standardized by task type: balanced accuracy for binary and multi-class classification, F1 score for multi-label classification, Pearson correlation for regression, and top-5 accuracy for retrieval tasks. In addition, all results are reported as normalized scores (s̃), where 0 corresponds to dummy-level performance and 1 to perfect performance, allowing fair comparison across tasks regardless of metric scale.
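The normalized score amounts to a simple min-max rescaling between a dummy baseline and the metric's ideal value; the helper below is an illustrative reconstruction of that definition, not code taken from the framework.

def normalized_score(score: float, dummy_score: float, perfect_score: float = 1.0) -> float:
    """Rescale a raw metric so that 0 = dummy-level and 1 = perfect performance."""
    return (score - dummy_score) / (perfect_score - dummy_score)

# Example: 62% balanced accuracy on a balanced binary task, where a dummy scores 50%.
print(normalized_score(0.62, dummy_score=0.5))  # ~0.24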

One important methodological note: some of the EEG foundation models were pre-trained on datasets that overlap with NeuralBench's evaluation sets. Rather than discarding these results, the benchmark flags them with hatched bars in the result figures so that readers can spot potential pre-training data leakage. No strong trend was observed suggesting that leakage inflates performance, but the transparency is preserved.

The benchmark comes in two variants: NeuralBench-EEG-Core v1.0, which uses a single representative dataset per task for broad coverage, and NeuralBench-EEG-Full v1.0, which expands to up to 24 datasets per task to study within-task variation across recording devices, laboratories, and populations. A Kendall’s τ of 0.926 (p < 0.001) between the Core and Full rankings confirms that the Core variant is a reliable proxy, although some model positions do shift, including CTNet overtaking LUNA when more datasets are included.
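Rank agreement of this kind can be checked with SciPy's kendalltau; the snippet below uses made-up rankings purely to show the computation, not the benchmark's actual numbers.

from scipy.stats import kendalltau

# Hypothetical model rankings (1 = best) under the Core and Full variants.
core_rank = [1, 2, 3, 4, 5, 6, 7, 8]
full_rank = [1, 2, 4, 3, 5, 6, 8, 7]  # a few positions swap

tau, p_value = kendalltau(core_rank, full_rank)
print(f"Kendall's tau = {tau:.3f}, p = {p_value:.4f}")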


Two main findings

Finding 1: Foundation models only marginally outperform task-specific models. The top-ranked models overall are REVE (69.2M parameters, average normalized rank 0.20), LaBraM (5.8M, rank 0.21), and LUNA (40.4M, rank 0.30). But several task-specific models trained from scratch – CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43) – follow closely. CTNet actually overtakes the LUNA foundation model to rank third in the Full variant, despite having roughly 270× fewer parameters. This shows the gap between task-specific and foundation models is narrow enough that simply expanding dataset coverage can reshuffle the global rankings.

Finding 2: Many tasks remain genuinely difficult. Cognitive decoding tasks (recovering dense representations of images, speech, sentences, videos, or words from brain activity) are so challenging that even the best models score well below the ceiling. Tasks such as mental imagery, sleep arousal, psychopathology decoding, cross-subject motor imagery, and P300 classification often land close to chance level. These tasks are the best benchmarks for stress-testing the next generation of EEG models.

Tasks that are approaching saturation include SSVEP classification, disease detection, seizure detection, sleep stage classification, and phenotyping tasks such as age regression and sex classification.

Beyond EEG: MEG and fMRI

Even in this initial EEG-focused release, NeuralBench already supports MEG and fMRI tasks as a proof of concept. Notably, the REVE model, pre-trained exclusively on EEG data, achieves the best performance among all tested models on the MEG writing decoding task. This is a striking early indication that pre-trained EEG representations may transfer usefully across brain recording modalities, a hypothesis the framework is set up to test rigorously in future versions.

The infrastructure is explicitly designed to scale to intracranial EEG (iEEG), functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).

How to start

Installation is a single command: pip install neuralbench. From there, running the EEG audiovisual stimulus classification task looks like this:

neuralbench eeg audiovisual_stimulus --download    # download the data
neuralbench eeg audiovisual_stimulus --prepare     # prepare the cache
neuralbench eeg audiovisual_stimulus               # run the task

To run all 36 tasks against all 14 EEG models, you pass -m all_classic all_fm. The full benchmark's storage requirements are substantial: ~11TB in total (~3.2TB of raw data, ~7.8TB of preprocessed cache, ~333GB of logged results), and each task assumes a single GPU with at least 32GB of VRAM, although the average peak GPU memory usage measured across experiments is only ~1.3GB (maximum ~30.3GB).

A full run of NeuralBench-EEG-Full v1.0 requires approximately 1,751 GPU hours across 4,947 trials.

Key takeaways

  • Meta AI’s NeuralBench-EEG v1.0 is an open EEG benchmark: 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures within one unified interface.
  • Despite having up to ~270× more parameters, EEG foundation models like REVE only marginally outperform lightweight task-specific models like CTNet (150K parameters) across the benchmark.
  • Cognitive decoding tasks (speech, video, sentence, and word decoding from brain activity) and clinical predictions remain very challenging, with most models scoring close to chance level.
  • REVE, which was pre-trained on EEG data only, outperformed all other models on the MEG writing decoding task, an early signal of meaningful cross-modal transfer.
  • NeuralBench is released under the MIT license.

Check out the paper and the GitHub repo.
