fixtures: vocabularies: add subjects (EuroSciVoc, GEMET, MeSH, NVS) #1191

ptamarit · 2025-06-05T15:47:50Z

This creates subject vocabulary entries when running Zenodo in a local development environment.
The fixtures contain:

EuroSciVoc: 1031 entries
GEMET: 5573 entries
MeSH: 659662 entries (filtered to a subset of the first 5000 entries)
NVS: 448 entries

This also creates 4 VocabularyScheme entries (whereas we have only 1, for EuroSciVoc, in prod), which makes the UI for keywords automatically show a filter per scheme.

⚠️ However, this does NOT work, since we removed this feature from the backend in inveniosoftware/invenio-vocabularies#448 ⚠️

The header of each file mentions how the file was generated.

EuroSciVoc:

# Generated with:
#
# from invenio_vocabularies.contrib.subjects.config import euroscivoc_file_url
# print(euroscivoc_file_url)
#
# invenio vocabularies convert \
#   --vocabulary subjects:euroscivoc \
#   --origin "<url-printed-above>" \
#   --target app_data/vocabularies/subjects_euroscivoc.yaml
#
# sed -i '/^    \(de\|es\|fr\|it\|pl\)\: /d; ' app_data/vocabularies/subjects_euroscivoc.yaml
#

GEMET:

# Generated with:
#
# from invenio_vocabularies.contrib.subjects.config import gemet_file_url
# print(gemet_file_url)
#
# invenio vocabularies convert \
#   --vocabulary subjects:gemet \
#   --origin "<url-printed-above>" \
#   --target app_data/vocabularies/subjects_gemet.yaml
#
# sed -i '/^    \(ar\|az\|bg\|ca\|cs\|da\|de\|el\|es\|et\|eu\|fi\|fr\|ga\|hr\|hu\|hy\|is\|it\|ka\|lt\|lv\|mt\|nl\|'\''no'\''\|pl\|pt\|ro\|ru\|sk\|sl\|sv\|tr\|uk\)\: /d; ' app_data/vocabularies/subjects_gemet.yaml
#

MeSH:

# Generated with:
#
# wget https://github.com/galterlibrary/invenio-subjects-mesh/raw/refs/heads/master/invenio_subjects_mesh/vocabularies/subjects_mesh.csv
#
# head -n5001 subjects_mesh.csv > subjects_mesh_subset.csv
#
# invenio vocabularies convert \
#   --vocabulary subjects \
#   --filepath app_data/vocabularies-future.yaml \
#   --origin subjects_mesh_subset.csv
#
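
As a side note on the `head -n5001` step above: it keeps 5001 physical lines, which equals 5000 entries only if the CSV has exactly one header row and no quoted field spans several lines. A minimal Python sketch of a row-aware equivalent follows. The helper name `first_rows`, the column names, and the sample rows are illustrative assumptions, not taken from the MeSH file.

```python
import csv
import io
from itertools import islice


def first_rows(src, dst, limit):
    """Copy the header row plus the first `limit` data rows from src to dst."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    # +1 accounts for the (assumed) single header row.
    for row in islice(reader, limit + 1):
        writer.writerow(row)


# Illustrative input only; the real file's columns may differ.
src = io.StringIO(
    "id,subject\n"
    "D000001,Calcimycin\n"
    "D000002,Temefos\n"
    "D000003,Abattoirs\n"
)
dst = io.StringIO()
first_rows(src, dst, limit=2)
subset = dst.getvalue()
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.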

NVS:

# Generated with:
#
# from invenio_vocabularies.contrib.subjects.config import nvs_file_url
# print(nvs_file_url)
#
# invenio vocabularies convert \
#   --vocabulary subjects:nvs \
#   --origin "<url-printed-above>" \
#   --target app_data/vocabularies/subjects_nvs.yaml
#

ptamarit · 2025-06-05T16:47:43Z

app_data/vocabularies-future.yaml

+
+# This is used to generate subjects_mesh.yaml
+subjects:
+  readers:
+    - type: csv
+  transformers:
+    - type: mesh-subjects
+  writers:
+    - type: yaml
+      args:
+        filepath: app_data/vocabularies/subjects_mesh.yaml


Not super sure if vocabularies-future.yaml is the right place to put this config (used only to generate the fixture).

ptamarit · 2025-06-06T07:08:02Z

app_data/vocabularies/subjects_euroscivoc.yaml

+# Generated with:
+#
+# from invenio_vocabularies.contrib.subjects.config import euroscivoc_file_url
+# print(euroscivoc_file_url)
+#
+# invenio vocabularies convert \
+#   --vocabulary subjects:euroscivoc \
+#   --origin "<url-printed-above>" \
+#   --target app_data/vocabularies/subjects_euroscivoc.yaml
+#


Would it be better to not have comments at the top of each file and instead have a README.md in the directory explaining how each file is generated?

I like the comments; it's a good change compared to looking at a YAML file and not having any idea how it was generated (see e.g. licenses.csv in rdm-records).

carlinmack · 2025-06-06T08:35:39Z

Any idea why the GEMET file is 10-50x larger than the other vocabularies? Maybe we could check in something smaller.

slint

LGTM, just a suggestion for reducing the file sizes, manually for now + via config in invenio-vocabularies for future automation.

slint · 2025-06-06T08:54:50Z

app_data/vocabularies/subjects_gemet.yaml

Agree with Carlin that GEMET is pretty large compared to the rest (~8MB). I think it's because there are a lot of translations (which we don't actually use in Zenodo)... This might be something that we could actually control in invenio-vocabularies and allow selecting specific languages for titles and other I18N fields (via a global + vocabulary-specific config). I'll open an issue in invenio-vocabularies for that, but does it make sense for now to "manually" filter languages and keep only en in this file? Same for EuroSciVoc.

Added sed commands to delete languages other than 'en'.
The GEMET file went from 8.1MB down to 2.5MB.
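
For readers who prefer not to rely on shell quoting, the language filtering can be sketched in Python. This is a rough equivalent of the sed approach, not the command used in this PR. It assumes, as the sed patterns do, that translated titles sit on lines indented by four spaces in the form `lang: title`, and that a key such as `no` may be quoted in the YAML. The function name `strip_languages` and the sample entry are hypothetical.

```python
import re


def strip_languages(lines, languages):
    """Drop 4-space-indented `lang: title` lines for the given language codes."""
    alternatives = "|".join(re.escape(lang) for lang in languages)
    # The optional quotes cover keys like 'no', which YAML 1.1 would
    # otherwise read as a boolean.
    pattern = re.compile(rf"^    '?(?:{alternatives})'?: ")
    return [line for line in lines if not pattern.match(line)]


# Hypothetical entry, for illustration only.
sample = [
    "- id: euroscivoc:1\n",
    "  title:\n",
    "    de: Naturwissenschaften\n",
    "    en: Natural sciences\n",
    "    'no': Naturvitenskap\n",
]
filtered = strip_languages(sample, ["de", "no"])
```

Like the sed version, this works line by line and never parses the YAML, so a title that wraps onto a continuation line would leave that continuation behind.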

We could also limit the number of entries, for instance a maximum of 100 entries per vocabulary.
It would need to be done manually before committing the file, but I think it's fine.
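
A minimal sketch of what that manual truncation could look like, assuming each fixture is a top-level YAML list whose entries start with `- ` in the first column, preceded by the "Generated with" comment header. The helper `truncate_entries` is hypothetical and not part of invenio-vocabularies.

```python
def truncate_entries(lines, max_entries):
    """Keep the comment header plus the first `max_entries` top-level entries."""
    kept = []
    seen = 0
    for line in lines:
        # A new top-level list item starts with "- " in column 0.
        if line.startswith("- "):
            seen += 1
            if seen > max_entries:
                break
        kept.append(line)
    return kept


# Illustrative fixture content only.
sample = [
    "# Generated with:\n",
    "- id: a\n",
    "  subject: A\n",
    "- id: b\n",
    "  subject: B\n",
    "- id: c\n",
    "  subject: C\n",
]
truncated = truncate_entries(sample, 2)
```

Cutting at entry boundaries keeps the result valid YAML without needing a YAML parser, at the cost of relying on the indentation assumption above.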

fixtures: vocabularies: add subjects (EuroSciVoc, GEMET, MeSH, NVS)

332fec4

ptamarit commented Jun 5, 2025

ptamarit requested a review from slint June 5, 2025 17:02

ptamarit commented Jun 6, 2025

slint approved these changes Jun 6, 2025

ptamarit closed this Jun 6, 2025

ptamarit reopened this Jun 6, 2025

fixtures: vocabularies: subjects with English titles only

e84ec9e
