Commit 4efeced

style: fixed typos
1 parent 375ee11 commit 4efeced

File tree

4 files changed: +9 additions, −4 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,10 @@
  ## Changelog 🔄
  All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [3.2.1] - 2025-03-27
+ ### Fixed
+ - Fixed minor typos in the README and docstrings.
+
  ## [3.2.0] - 2025-03-20
  ### Changed
  - Significantly improved the quality of chunks produced when chunking with low chunk sizes or documents with minimal varying levels of whitespace by adding a new rule to the `semchunk` algorithm that prioritizes splitting at the occurrence of single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at all single whitespace characters in general ([#17](https://github.com/isaacus-dev/semchunk/issues/17)).

@@ -145,6 +149,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
  ### Added
  - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.

+ [3.2.1]: https://github.com/isaacus-dev/semchunk/compare/v3.2.0...v3.2.1
  [3.2.0]: https://github.com/isaacus-dev/semchunk/compare/v3.1.3...v3.2.0
  [3.1.3]: https://github.com/isaacus-dev/semchunk/compare/v3.1.2...v3.1.3
  [3.1.2]: https://github.com/isaacus-dev/semchunk/compare/v3.1.1...v3.1.2
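As an aside, the 3.2.0 splitting rule above can be illustrated with a toy splitter that, among candidate single-space split points, prefers spaces preceded by sentence-level punctuation (a simple stand-in for "hierarchically meaningful non-whitespace characters"). This is a sketch of the idea only, not semchunk's actual algorithm; the punctuation set chosen here is an assumption.

```python
import re

def best_split_point(text: str) -> int:
    """Return the index of the space to split at, preferring spaces
    preceded by sentence-level punctuation over arbitrary single spaces.
    Toy illustration of the rule described in the 3.2.0 changelog entry."""
    # Candidate spaces preceded by "meaningful" punctuation (assumed set).
    meaningful = [m.start() for m in re.finditer(r"(?<=[.!?;:]) ", text)]
    if meaningful:
        # Pick the candidate closest to the middle to keep halves balanced.
        return min(meaningful, key=lambda i: abs(i - len(text) // 2))
    # Fall back to any single space at all.
    spaces = [m.start() for m in re.finditer(r" ", text)]
    return min(spaces, key=lambda i: abs(i - len(text) // 2)) if spaces else -1

text = "First sentence. Second sentence continues here"
i = best_split_point(text)
print(text[:i], "|", text[i + 1:])
# Splits after "First sentence." rather than at the space nearest the middle.
```

With no punctuation present, the splitter degrades to the generic rule of splitting at the single space nearest the midpoint.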

README.md

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@ assert chunker([text], progress = True) == [['The quick brown fox', 'jumps over
  # setting `processes` to a number greater than 1.
  assert chunker([text], processes = 2) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

- # You can also pass a `offsets` argument to return the offsets of chunks, as well as an `overlap`
+ # You can also pass an `offsets` argument to return the offsets of chunks, as well as an `overlap`
  # argument to overlap chunks by a ratio (if < 1) or an absolute number of tokens (if >= 1).
  chunks, offsets = chunker(text, offsets = True, overlap = 0.5)
  ```

@@ -85,7 +85,7 @@ def chunkerify(

  `chunk_size` is the maximum number of tokens a chunk may contain. It defaults to `None` in which case it will be set to the same value as the tokenizer's `model_max_length` attribute (deducted by the number of tokens returned by attempting to tokenize an empty string) if possible, otherwise a `ValueError` will be raised.

- `max_token_chars` is the maximum numbers of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to `None` in which case it will either not be used or will, if possible, be set to the numbers of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
+ `max_token_chars` is the maximum number of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to `None` in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.

  `memoize` flags whether to memoize the token counter. It defaults to `True`.
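The `overlap` semantics the README describes (a ratio of the chunk size if < 1, an absolute token count if >= 1) can be sketched with a toy whitespace chunker. This is an illustrative reimplementation of the interpretation only, not semchunk's actual chunking logic.

```python
def overlap_tokens(chunk_size: int, overlap: float) -> int:
    """Resolve an `overlap` value as the README describes: a ratio of the
    chunk size if < 1, an absolute number of tokens if >= 1."""
    if overlap < 1:
        return int(chunk_size * overlap)
    return int(overlap)

def toy_overlapping_chunks(tokens: list, chunk_size: int, overlap: float) -> list:
    """Toy sketch: slide a window of `chunk_size` tokens, stepping by
    `chunk_size` minus the resolved overlap."""
    step = max(chunk_size - overlap_tokens(chunk_size, overlap), 1)
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
# 4-token chunks overlapping by half (overlap = 0.5 resolves to 2 tokens).
chunks = toy_overlapping_chunks(tokens, 4, 0.5)
print(chunks[0])  # ['the', 'quick', 'brown', 'fox']
print(chunks[1])  # ['brown', 'fox', 'jumps', 'over']
```

Passing `overlap = 2` instead of `0.5` would produce the same result here, since half of a 4-token chunk is 2 tokens.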

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

  [project]
  name = "semchunk"
- version = "3.2.0"
+ version = "3.2.1"
  authors = [
      {name="Isaacus", email="support@isaacus.com"},
      {name="Umar Butler", email="umar@umar.au"},

src/semchunk/semchunk.py

Lines changed: 1 addition & 1 deletion
@@ -400,7 +400,7 @@ def chunkerify(
  Args:
      tokenizer_or_token_counter (str | tiktoken.Encoding | transformers.PreTrainedTokenizer | tokenizers.Tokenizer | Callable[[str], int]): Either: the name of a `tiktoken` or `transformers` tokenizer (with priority given to the former); a tokenizer that possesses an `encode` attribute (e.g., a `tiktoken`, `transformers` or `tokenizers` tokenizer); or a token counter that returns the number of tokens in a input.
      chunk_size (int, optional): The maximum number of tokens a chunk may contain. Defaults to `None` in which case it will be set to the same value as the tokenizer's `model_max_length` attribute (deducted by the number of tokens returned by attempting to tokenize an empty string) if possible otherwise a `ValueError` will be raised.
-     max_token_chars (int, optional): The maximum numbers of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None` in which case it will either not be used or will, if possible, be set to the numbers of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
+     max_token_chars (int, optional): The maximum number of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None` in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
      memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
      cache_maxsize (int, optional): The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
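The kind of speedup `max_token_chars` enables can be sketched as follows: if every token spans at most that many characters, then deciding whether a long text exceeds `chunk_size` tokens only requires tokenizing a bounded prefix, never the whole text. This is a rough sketch of the idea under that assumption, not semchunk's actual implementation, and `capped_token_count` is a hypothetical helper name.

```python
from typing import Callable

def capped_token_count(
    text: str,
    token_counter: Callable[[str], int],
    chunk_size: int,
    max_token_chars: int,
) -> int:
    """Sketch: if each token is at most `max_token_chars` characters long,
    a prefix of (chunk_size + 1) * max_token_chars characters is enough to
    decide whether `text` exceeds `chunk_size` tokens, so long inputs need
    not be tokenized in full. Illustrative only."""
    limit = (chunk_size + 1) * max_token_chars
    if len(text) > limit:
        prefix_count = token_counter(text[:limit])
        if prefix_count > chunk_size:
            # Already over budget; the exact total is not needed to split.
            return prefix_count
    return token_counter(text)

# A toy whitespace token counter stands in for a real tokenizer here.
count = lambda s: len(s.split())
print(capped_token_count("one two three four five", count, 2, 6))
```

For texts far longer than a chunk, this avoids tokenizing anything beyond the first `(chunk_size + 1) * max_token_chars` characters, which is where the speedup on long inputs comes from.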
