Closed
Labels
detection: Related to the charset detection mechanism, chaos/mess/coherence
Description
charset_normalizer returns None
$ chardetect star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: Windows-1252 with confidence 0.73
$ file -i star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: application/x-subrip; charset=iso-8859-1
$ python -c "import charset_normalizer; print(charset_normalizer.from_path('star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt').best())"
None
Who is right? chardetect is right! The expected encoding is Windows-1252.
iso-8859-1 produces an ugly <U+0085> (UTF-16 hex bytes), or c285 as UTF-8 hex bytes, when piped to less.
U+0085: The "Next Line" (NEL) control character was used in the 1970s for controlling printers and displays (e.g. VT100). Moves to the first position of the next line.
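The difference can be reproduced with the standard library alone. This is a minimal sketch that assumes the byte in question is 0x85, which follows from iso-8859-1 mapping every byte to the code point of the same value. The variable names are illustrative only.

```python
# The single byte that the two encodings interpret differently.
raw = b"\x85"

# iso-8859-1 maps 0x85 to the C1 control character U+0085 (NEL).
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 (Python codec name: cp1252) maps 0x85 to U+2026, the ellipsis.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))                       # '\x85'
print(as_cp1252)                             # …
print(as_latin1.encode("utf-16-be").hex())   # 0085
print(as_latin1.encode("utf-8").hex())       # c285
```

So the `c285` seen in the UTF-8 output is simply NEL re-encoded, not a byte sequence present in the original file.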
--- star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.iso-8859-1
+++ star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.Windows-1252
@@ -2242,7 +2242,7 @@
505
00:43:04,098 --> 00:43:05,428
-Adil davranmaktan bahsetmiþken<U+0085>
+Adil davranmaktan bahsetmiþken…
506
00:43:06,771 --> 00:43:09,777
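Since the two decodings differ only for bytes in the 0x80-0x9F range, one possible workaround while detection returns None is a small tie-breaker between the two candidates. The sketch below is an assumption of mine, not how charset_normalizer or chardet decide; `guess_single_byte_encoding` is a hypothetical helper name.

```python
def guess_single_byte_encoding(data: bytes) -> str:
    """Hypothetical tie-breaker between cp1252 and iso-8859-1.

    Outside 0x80-0x9F both encodings decode every byte identically,
    so only bytes in that range can tell them apart.
    """
    has_c1_range = any(0x80 <= b <= 0x9F for b in data)
    if not has_c1_range:
        # Both codecs give the same text; report the narrower standard.
        return "iso-8859-1"
    try:
        # cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined,
        # so decoding fails if any of them occur.
        data.decode("cp1252")
    except UnicodeDecodeError:
        return "iso-8859-1"
    # Printable cp1252 characters are far more likely in a subtitle
    # file than C1 control characters.
    return "cp1252"


# Bytes for the subtitle line in the diff above, ending in 0x85.
line = b"Adil davranmaktan bahsetmi\xfeken\x85"
encoding = guess_single_byte_encoding(line)
print(encoding, line.decode(encoding))
```

This only chooses between these two encodings; it does not consider other single-byte code pages.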
input file
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt