fixtures: vocabularies: add subjects (EuroSciVoc, GEMET, MeSH, NVS) #1191
base: master
    # This is used to generate subjects_mesh.yaml
    subjects:
      readers:
        - type: csv
      transformers:
        - type: mesh-subjects
      writers:
        - type: yaml
          args:
            filepath: app_data/vocabularies/subjects_mesh.yaml
Not super sure if vocabularies-future.yaml is the right place to put this config (used only to generate the fixture).
    # Generated with:
    #
    #   from invenio_vocabularies.contrib.subjects.config import euroscivoc_file_url
    #   print(euroscivoc_file_url)
    #
    #   invenio vocabularies convert \
    #     --vocabulary subjects:euroscivoc \
    #     --origin "<url-printed-above>" \
    #     --target app_data/vocabularies/subjects_euroscivoc.yaml
    #
Would it be better to not have comments at the top of each file and instead have a README.md in the directory explaining how each file is generated?
I like the comments; it's a good change compared to looking at a YAML file and not having any idea how it was generated (see e.g. licenses.csv in rdm-records).
Any idea why the GEMET file is 10-50x larger than the other vocabularies? Maybe we could check in something smaller.
LGTM, just a suggestion for reducing the file sizes, manually for now + via config in invenio-vocabularies for future automation.
Agree with Carlin that GEMET is pretty large compared to the rest (~8MB). I think it's because there are a lot of translations (which we don't actually use in Zenodo)... This might be something that we could actually control in invenio-vocabularies and allow selecting specific languages for titles and other I18N fields (via a global + vocabulary-specific config). I'll open an issue in invenio-vocabularies for that, but does it make sense for now to "manually" filter languages and keep only en in this file? Same for EuroSciVoc.
Added sed commands to delete languages other than 'en'.
The GEMET file went from 8.1MB down to 2.5MB.
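For illustration only, the same language filtering could be sketched in Python rather than sed. This is not what the PR does: it assumes the fixture has already been parsed into a list of dicts whose `title` field maps language codes to strings. The function name `keep_languages` and the sample entry (including its id) are hypothetical.

```python
def keep_languages(entries, keep=("en",), i18n_fields=("title",)):
    """Return copies of the entries with only the kept languages in I18N fields.

    Assumption: each I18N field is a dict mapping language code -> text.
    Fields that are missing or not dicts are left untouched.
    """
    filtered = []
    for entry in entries:
        entry = dict(entry)  # shallow copy so the caller's data is not mutated
        for field in i18n_fields:
            value = entry.get(field)
            if isinstance(value, dict):
                entry[field] = {
                    lang: text for lang, text in value.items() if lang in keep
                }
        filtered.append(entry)
    return filtered


# Hypothetical entry, shaped like a subjects fixture record.
sample = [
    {
        "id": "gemet:example",
        "scheme": "GEMET",
        "subject": "air",
        "title": {"en": "air", "de": "Luft", "fr": "air"},
    }
]
result = keep_languages(sample)
```

Reading and writing the YAML file is left out on purpose; any YAML library could supply the parsed list and serialize the result.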
We could also limit the number of entries, for instance a maximum of 100 entries per vocabulary.
It would need to be done manually before committing the file, but I think it's fine.
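A minimal sketch of that manual truncation, assuming the entries are already loaded as a Python list (or any iterable). The helper name `limit_entries` and the generated sample ids are hypothetical; the default of 100 comes from the suggestion above.

```python
from itertools import islice


def limit_entries(entries, max_entries=100):
    """Keep at most `max_entries` items, preserving their original order."""
    return list(islice(entries, max_entries))


# Hypothetical vocabulary with 250 entries.
sample_entries = [{"id": f"example:{i}"} for i in range(250)]
limited = limit_entries(sample_entries)
```

Taking the first N entries keeps the result deterministic between regenerations, at the cost of not being a representative sample of the vocabulary.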
This will create subject vocabulary entries when running Zenodo on a local development environment.
There are:
This also creates 4 VocabularyScheme entries (while we have only 1 for EuroSciVoc in prod), which makes the UI for keywords automatically show a filter per scheme.

⚠️ However, this does NOT work, since we removed this feature from the backend in inveniosoftware/invenio-vocabularies#448 ⚠️

The header of each file mentions how the file was generated.