Commit 4efeced

style: fixed typos
1 parent 375ee11 commit 4efeced

File tree

4 files changed: +9 additions, −4 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,10 @@
  ## Changelog 🔄
  All notable changes to `semchunk` will be documented here. This project adheres to [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [3.2.1] - 2025-03-27
+ ### Fixed
+ - Fixed minor typos in the README and docstrings.
+
  ## [3.2.0] - 2025-03-20
  ### Changed
  - Significantly improved the quality of chunks produced when chunking with low chunk sizes or documents with minimal varying levels of whitespace by adding a new rule to the `semchunk` algorithm that prioritizes splitting at the occurrence of single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at all single whitespace characters in general ([#17](https://github.com/isaacus-dev/semchunk/issues/17)).

@@ -145,6 +149,7 @@ All notable changes to `semchunk` will be documented here. This project adheres
  ### Added
  - Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.

+ [3.2.1]: https://github.com/isaacus-dev/semchunk/compare/v3.2.0...v3.2.1
  [3.2.0]: https://github.com/isaacus-dev/semchunk/compare/v3.1.3...v3.2.0
  [3.1.3]: https://github.com/isaacus-dev/semchunk/compare/v3.1.2...v3.1.3
  [3.1.2]: https://github.com/isaacus-dev/semchunk/compare/v3.1.1...v3.1.2
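As an aside, the 3.2.0 splitting rule above can be illustrated with a toy splitter that, among candidate single-space split points, prefers spaces preceded by sentence-level punctuation (a simple stand-in for "hierarchically meaningful non-whitespace characters"). This is a sketch of the idea only, not semchunk's actual algorithm; the punctuation set chosen here is an assumption.

```python
import re

def best_split_point(text: str) -> int:
    """Return the index of the space to split at, preferring spaces
    preceded by sentence-level punctuation over arbitrary single spaces.
    Toy illustration of the rule described in the 3.2.0 changelog entry."""
    # Candidate spaces preceded by "meaningful" punctuation (assumed set).
    meaningful = [m.start() for m in re.finditer(r"(?<=[.!?;:]) ", text)]
    if meaningful:
        # Pick the candidate closest to the middle to keep halves balanced.
        return min(meaningful, key=lambda i: abs(i - len(text) // 2))
    # Fall back to any single space at all.
    spaces = [m.start() for m in re.finditer(r" ", text)]
    return min(spaces, key=lambda i: abs(i - len(text) // 2)) if spaces else -1

text = "First sentence. Second sentence continues here"
i = best_split_point(text)
print(text[:i], "|", text[i + 1:])
# Splits after "First sentence." rather than at the space nearest the middle.
```

With no punctuation present, the splitter degrades to the generic rule of splitting at the single space nearest the midpoint.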

README.md

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@ assert chunker([text], progress = True) == [['The quick brown fox', 'jumps over
  # setting `processes` to a number greater than 1.
  assert chunker([text], processes = 2) == [['The quick brown fox', 'jumps over the', 'lazy dog.']]

- # You can also pass a `offsets` argument to return the offsets of chunks, as well as an `overlap`
+ # You can also pass an `offsets` argument to return the offsets of chunks, as well as an `overlap`
  # argument to overlap chunks by a ratio (if < 1) or an absolute number of tokens (if >= 1).
  chunks, offsets = chunker(text, offsets = True, overlap = 0.5)
  ```

@@ -85,7 +85,7 @@ def chunkerify(

  `chunk_size` is the maximum number of tokens a chunk may contain. It defaults to `None` in which case it will be set to the same value as the tokenizer's `model_max_length` attribute (deducted by the number of tokens returned by attempting to tokenize an empty string) if possible, otherwise a `ValueError` will be raised.

- `max_token_chars` is the maximum numbers of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to `None` in which case it will either not be used or will, if possible, be set to the numbers of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
+ `max_token_chars` is the maximum number of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to `None` in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.

  `memoize` flags whether to memoize the token counter. It defaults to `True`.
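The `overlap` semantics the README describes (a ratio of the chunk size if < 1, an absolute token count if >= 1) can be sketched with a toy whitespace chunker. This is an illustrative reimplementation of the interpretation only, not semchunk's actual chunking logic.

```python
def overlap_tokens(chunk_size: int, overlap: float) -> int:
    """Resolve an `overlap` value as the README describes: a ratio of the
    chunk size if < 1, an absolute number of tokens if >= 1."""
    if overlap < 1:
        return int(chunk_size * overlap)
    return int(overlap)

def toy_overlapping_chunks(tokens: list, chunk_size: int, overlap: float) -> list:
    """Toy sketch: slide a window of `chunk_size` tokens, stepping by
    `chunk_size` minus the resolved overlap."""
    step = max(chunk_size - overlap_tokens(chunk_size, overlap), 1)
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
# 4-token chunks overlapping by half (overlap = 0.5 resolves to 2 tokens).
chunks = toy_overlapping_chunks(tokens, 4, 0.5)
print(chunks[0])  # ['the', 'quick', 'brown', 'fox']
print(chunks[1])  # ['brown', 'fox', 'jumps', 'over']
```

Passing `overlap = 2` instead of `0.5` would produce the same result here, since half of a 4-token chunk is 2 tokens.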

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

  [project]
  name = "semchunk"
- version = "3.2.0"
+ version = "3.2.1"
  authors = [
      {name="Isaacus", email="support@isaacus.com"},
      {name="Umar Butler", email="umar@umar.au"},

src/semchunk/semchunk.py

Lines changed: 1 addition & 1 deletion
@@ -400,7 +400,7 @@ def chunkerify(
  Args:
      tokenizer_or_token_counter (str | tiktoken.Encoding | transformers.PreTrainedTokenizer | tokenizers.Tokenizer | Callable[[str], int]): Either: the name of a `tiktoken` or `transformers` tokenizer (with priority given to the former); a tokenizer that possesses an `encode` attribute (e.g., a `tiktoken`, `transformers` or `tokenizers` tokenizer); or a token counter that returns the number of tokens in a input.
      chunk_size (int, optional): The maximum number of tokens a chunk may contain. Defaults to `None` in which case it will be set to the same value as the tokenizer's `model_max_length` attribute (deducted by the number of tokens returned by attempting to tokenize an empty string) if possible otherwise a `ValueError` will be raised.
-     max_token_chars (int, optional): The maximum numbers of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None` in which case it will either not be used or will, if possible, be set to the numbers of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
+     max_token_chars (int, optional): The maximum number of characters a token may contain. Used to significantly speed up the token counting of long inputs. Defaults to `None` in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary as determined by the `token_byte_values` or `get_vocab` methods.
      memoize (bool, optional): Whether to memoize the token counter. Defaults to `True`.
      cache_maxsize (int, optional): The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to `None`, which makes the cache unbounded. This argument is only used if `memoize` is `True`.
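The kind of speedup `max_token_chars` enables can be sketched as follows: if every token spans at most that many characters, then deciding whether a long text exceeds `chunk_size` tokens only requires tokenizing a bounded prefix, never the whole text. This is a rough sketch of the idea under that assumption, not semchunk's actual implementation, and `capped_token_count` is a hypothetical helper name.

```python
from typing import Callable

def capped_token_count(
    text: str,
    token_counter: Callable[[str], int],
    chunk_size: int,
    max_token_chars: int,
) -> int:
    """Sketch: if each token is at most `max_token_chars` characters long,
    a prefix of (chunk_size + 1) * max_token_chars characters is enough to
    decide whether `text` exceeds `chunk_size` tokens, so long inputs need
    not be tokenized in full. Illustrative only."""
    limit = (chunk_size + 1) * max_token_chars
    if len(text) > limit:
        prefix_count = token_counter(text[:limit])
        if prefix_count > chunk_size:
            # Already over budget; the exact total is not needed to split.
            return prefix_count
    return token_counter(text)

# A toy whitespace token counter stands in for a real tokenizer here.
count = lambda s: len(s.split())
print(capped_token_count("one two three four five", count, 2, 6))
```

For texts far longer than a chunk, this avoids tokenizing anything beyond the first `(chunk_size + 1) * max_token_chars` characters, which is where the speedup on long inputs comes from.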
