
VEUCTOR

Training, Selecting, and Aligning Word Embeddings from European Online Job Advertisements


🚀 What is VEUCTOR?

VEUCTOR is a reproducible methodological framework for training, selecting, and aligning word embedding models built from European Online Job Advertisements (OJAs).

Unlike standard embedding repositories, VEUCTOR does not treat word embeddings as neutral preprocessing tools.
It demonstrates that embedding choice is a methodological decision with measurable empirical consequences for downstream labor market analyses.

Credits

VEUCTOR is partially supported within the research activity of a grant entitled “PILLARS - Pathways to Inclusive Labour Markets” under the call H-2020 TRANSFORMATIONS 18-2020 “Technological transformations, skills, and globalization - future challenges for shared prosperity”, grant agreement number 101004703 - PILLARS. See https://www.h2020-pillars.eu/


🖼 Methodological Workflow

The entire framework is summarized in the following diagram:

VEUCTOR Methodological Workflow

Figure 1 – End‑to‑end VEUCTOR pipeline: from OJA data collection to embedding generation, selection via HSS, multilingual alignment with SeNSe, and downstream validation.

The workflow is structured into four main phases:

  1. Data Collection & Pre‑processing
  2. Embedding Training & Selection
  3. Multilingual Alignment
  4. Intrinsic & Extrinsic Evaluation

📊 Data Sources

Online Job Advertisements (OJAs)

The corpus comes from the Web Intelligence Hub (WIH) initiative developed by Eurostat and Cedefop under the Trusted Smart Statistics framework.

The WIH dataset is described in detail in the VEUCTOR paper (Section 4, Experimental Results).

ESCO Taxonomy

Evaluation is grounded in:

ESCO – European Skills, Competences, Qualifications and Occupations
https://esco.ec.europa.eu/

ESCO provides a multilingual, hierarchically structured classification of skills, competences, qualifications, and occupations.

ESCO is used as a semantic benchmark, not as a normative ground truth.


🧠 Methodology

1️⃣ Embedding Pool Generation

For each country, FastText models (Bojanowski et al., 2017) are trained on preprocessed OJA corpora.

A grid search over training hyperparameters yields 108 candidate models per country.

The OJA corpora are preprocessed before training; the full pipeline is described in the VEUCTOR paper.


2️⃣ Intrinsic Evaluation – Hierarchical Semantic Similarity (HSS)

Embedding quality is assessed using Hierarchical Semantic Similarity (HSS)
(Giabelli et al., 2020).

HSS is based on the hierarchical structure of the ESCO taxonomy.

For each occupation pair:

  1. Compute cosine similarity between vectors
  2. Compute the HSS score
  3. Compute the Spearman rank correlation (ρ)

The best model maximizes:

ρ(cosine similarity, HSS)

This ensures that the embedding geometry respects the ESCO hierarchy.
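The selection criterion can be sketched as follows, assuming SciPy for the rank correlation. The occupation vectors and HSS scores below are illustrative placeholders, not values from ESCO or a trained model:

```python
# Sketch of the HSS-based model selection criterion: rank-correlate
# cosine similarities (from a candidate embedding) with HSS scores
# (from the ESCO taxonomy). All values here are illustrative.
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
# Placeholder occupation vectors from a hypothetical candidate model.
vectors = {f"occ_{i}": rng.normal(size=50) for i in range(6)}
pairs = [("occ_0", "occ_1"), ("occ_2", "occ_3"), ("occ_4", "occ_5"),
         ("occ_0", "occ_2"), ("occ_1", "occ_5")]

cos_sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
hss_scores = [0.9, 0.4, 0.7, 0.2, 0.5]  # placeholder HSS values

rho, _ = spearmanr(cos_sims, hss_scores)
print(f"rho = {rho:.3f}")  # the selected model maximizes this correlation
```

In the actual pipeline this ρ would be computed for every model in the pool, and the model with the highest correlation is selected for that country.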


3️⃣ Multilingual Alignment – SeNSe

Country-specific embeddings are independently trained and therefore not directly comparable.

We apply SeNSe alignment (Malandri et al., 2024) to map the independently trained embedding spaces into a shared semantic space.

Alignment quality is evaluated via the Cross‑Lingual Semantic Fitting Score (CLS).

This step ensures cross-country semantic comparability.
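As a simplified illustration of the alignment step, the sketch below solves the classic orthogonal Procrustes problem over a set of given anchor pairs. SeNSe itself additionally selects the semantic anchors, so this is not the full SeNSe procedure, and all data here are synthetic:

```python
# Anchor-based orthogonal alignment sketch (orthogonal Procrustes via SVD).
# SeNSe selects semantic anchors automatically; here anchors are given.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 10))               # source-space anchor vectors
Q_true, _ = np.linalg.qr(rng.normal(size=(10, 10)))
Y = X @ Q_true                              # target-space anchor vectors

# Solve min_W ||X W - Y||_F subject to W being orthogonal:
# the solution is W = U V^T, where U S V^T is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# The recovered rotation maps source anchors onto their targets.
print(np.allclose(X @ W, Y, atol=1e-8))  # True
```

Because the mapping is orthogonal, it rotates the source space without distorting distances, so within-country neighborhood structure is preserved while vectors become cross-country comparable.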


4️⃣ Extrinsic Evaluation

We validate the impact of embedding choice on downstream labor market tasks; the tasks, metrics, and results are reported in the VEUCTOR paper (Section 4, Experimental Results).

Best-performing embeddings consistently outperform worst-performing configurations, confirming that embedding selection materially affects empirical outcomes.


📂 Repository Structure

```
veuctor/
│
├── data/
│   ├── esco/
│   └── supplementary/
│
├── models/
│   ├── fasttext/
│   └── aligned/
│
├── HSS_eval.py
├── alignment_eval.py
├── demo.py
├── requirements.txt
└── README.md
```

⚙ Installation

```shell
git clone https://gitlab.com/crisp1/veuctor.git
cd veuctor

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Requires Python ≥ 3.12.


📚 References


🔎 Key Message

Embedding selection is not neutral.

VEUCTOR provides a fully reproducible, taxonomy‑driven framework for building robust multilingual labor market intelligence systems grounded in European statistical infrastructure.

Citations

If you use VEUCTOR in your research, please cite:

Emilio Colombo, Simone D’Amico, Fabio Mercorio, and Mario Mezzanzanica.
Training, Selecting, and Aligning Word Embeddings from European Online Job Advertisements.
Information Sciences, 2026, Volume 741.
🔗 https://www.sciencedirect.com/science/article/pii/S0020025526002057

BibTeX

```bibtex
@article{COLOMBO2026123274,
  title    = {VEUCTOR: Training and selecting best vector space models from online job ads for European countries},
  journal  = {Information Sciences},
  volume   = {741},
  pages    = {123274},
  year     = {2026},
  issn     = {0020-0255},
  doi      = {10.1016/j.ins.2026.123274},
  url      = {https://www.sciencedirect.com/science/article/pii/S0020025526002057},
  author   = {Emilio Colombo and Simone D’Amico and Fabio Mercorio and Mario Mezzanzanica},
  keywords = {Word embedding, Machine learning, Labor market, NLP}
}
```