added _variants to get_process_pron so that the function can output a… #572

arianacodes · 2025-06-30T19:44:51Z

… list or a single string

Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

… list or a single string

kylebgorman · 2025-07-03T22:26:09Z

Hi Ariana, thanks for this. You're pretty close here. A bunch of small notes below.

Testing

I see the flake8 and black tests are currently failing. To fix:

Run black -l79 over config.py. This works in place and will reflow the files.
Run flake8 --extend-ignore E203 over config.py. Address any issues it raises; this doesn't fix the issues itself (nor could it). You may have to run it a few times until it returns no issues (if you run it and it prints nothing, you're good to go).
Repeat step 1 because your fixes may have broken things that black can fix.
Check in (with git's add, commit and push "verbs") any changed files.

That should at least allow those two tests to clear. You can see on this PR if they're passing or not. If they work for you locally and you check in your changes with those three "verbs", then they ought to work here too.

There of course are other tests that will run once these all pass, including unit tests.

These steps will also fix some minor style issues; we can address the rest once you have this taken care of. Note that you will probably need to repeat this step later on because subsequent changes you'll make...

Typing and design

I didn't mention the mypy test, which checks the type signatures (and more type consistency more generally). There is a genuine issue here that this test is uncovering. As I think you noticed, the various processors are all functions from str to str, but variants is a function from str to List[str], and this is not properly handled yet. That is, if you enable generation of variants, you'll get a list of strings that none of the code is set up to handle.

Here is what I think the simplest possibilty is: you could just lightly redesign the processor interface, on two dimensions:

Right now we have a function (self.process_pron) which calls all the processors, but it is just a relatively simple loop that gets stashed into a "wrapper". I probably wouldn't do it this way knowing what we know now. Instead what I'd do is keep all the booleans indicating what kind of processing we're doing as attributes of the configuration object, and then just do it usnig an ordinary if/else. I'd also allow it to return a list, and if we're not enabling variants, I'd have it just return a list with one element. (Having two different types of returned values, like a list vs. a string, is the other possibility, but it strikes me as overly complex.) This might look a bit like the following:

@staticmethod
def _variants(pron: str) -> List[str]:
    # FIXME: put your `variants` function here.
    ... 

def process_pron(self, pron) -> List[str]:
   if not self.stress:
      pron = re.sub(r"[ˈˌ]", "", pron)
   # FIXME: put the other cases here.
   ...
   if self.variants:
      return self._variants(pron)
   else:
     # Just removes parentheses. I moved this here from below...
     pron = re.sub(_PARENS_REGEX, "", pron)
     return [pron]

Note the trick here: if we're not generating variants, we just remove parentheses and make a singleton list!

Then you have to lightly modify this function to deal with the fact that process_pron now returns a list. What I'd do is change those lines to something like:

        pron_string = m.group(1)
        try:
            # All pronunciation processing is done in NFD-space.
            pron_string = unicodedata.normalize("NFD", pron_string)
            for pron in config.process_pron(pron_string):
                if pron:
                    # The segments package inserts a # in-between spaces.
                    if not config.skip_spaces_pron:
                        pron = pron.replace(" #", "")
                    yield pron
         except IndexError:
           ...

Then get rid of _handle_parens and all references to it, including the test__handle_parens in [test_extract.py] (tests/test_wikipron/test_extract.py) unit (and the associated import).

I see you modified some of the type signatures in Config; I think you will want to reverse some of those changes if you take this advice.

Testing

You can then make sure the mypy test is running (cf. what you did earlier): the command is mypy --install-types --ignore-missing-imports --non-interactive and you can run it over any files you changed (or the whole source tree for this project if you'd rather; it takes directories as arguments). It doesn't fix anything itself, so you have to do it manually.

You should need to add a unit test under test_config.py. You can just adapt your assertion tests there. You will then want to make sure the other tests run and that nothing has broken.

pytest -vvv is the command to run a test. It too can be run over a single file or a directory (all the tests are in the subdirectory tests). It's faster to run only the tests you're working on while you're actively developing, but the PR testing system will run them all.

Last thing

Make sure to check changed files every time. Doing so will cause the tests to run again, and once everything is green (i.e., once all tests pass) I can take another look. If you get blocked send me a message.

kylebgorman · 2025-07-03T22:28:07Z

Oh one other thing: you see in the description of the PR (at the top of the page) where there's a checkbox?

Add a brief description of the change being made (just follow the style used there) to the CHANGELOG.md file, check in that file, and then "click on" the check box to indicate you've done so. You can do that before you finish the change since it should just be a high-level description of what you did.

added _variants to get_process_pron so that the function can output a…

f7ddd95

… list or a single string

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

added _variants to get_process_pron so that the function can output a… #572

added _variants to get_process_pron so that the function can output a… #572

Uh oh!

arianacodes commented Jun 30, 2025

Uh oh!

kylebgorman commented Jul 3, 2025

Uh oh!

kylebgorman commented Jul 3, 2025

Uh oh!

Uh oh!

added _variants to get_process_pron so that the function can output a… #572

Are you sure you want to change the base?

added _variants to get_process_pron so that the function can output a… #572

Uh oh!

Conversation

arianacodes commented Jun 30, 2025

Uh oh!

kylebgorman commented Jul 3, 2025

Testing

Typing and design

Testing

Last thing

Uh oh!

kylebgorman commented Jul 3, 2025

Uh oh!

Uh oh!