[BUG] Windows-1252 encoding is not detected in Turkish text #407

@milahu

charset_normalizer returns None for this file, while chardetect and file each report an encoding:

$ chardetect star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: Windows-1252 with confidence 0.73

$ file -i star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: application/x-subrip; charset=iso-8859-1

$ python -c "import charset_normalizer; print(charset_normalizer.from_path('star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt').best())"
None
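
When a detector returns None as above, a caller could fall back to trying a fixed list of encodings in order. The following is a minimal sketch of that idea using only the standard library; the helper name `decode_with_fallback` and the candidate order are assumptions for illustration, not part of charset_normalizer.

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes without error.

    Hypothetical helper: the candidate list and its order are assumptions.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the byte sequence; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes taken from the subtitle line quoted in this report (0xFE and 0x85).
text, encoding = decode_with_fallback(b"bahsetmi\xfeken\x85")
print(encoding, text)
```

Since iso-8859-1 maps every byte to a character, it never fails, so it only makes sense as the last candidate. Python's cp1252 codec rejects the five bytes that Windows-1252 leaves undefined (for example 0x81), which is why it can be tried before iso-8859-1.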

Who is right? chardetect is right! The expected encoding is Windows-1252.

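Decoding as iso-8859-1 turns byte 0x85 into the control character U+0085, which shows up as an ugly <U+0085> when piped to less (0085 as UTF-16 hex, or c285 as UTF-8 hex bytes).
Decoding the same byte as Windows-1252 gives the ellipsis character … instead.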
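
The difference between the two encodings for this byte can be checked with Python's built-in codecs. This is a small illustration, assuming the byte in question is 0x85 (the byte that iso-8859-1 maps to U+0085); the variable names are made up for the example.

```python
raw = b"\x85"  # assumed source byte behind the <U+0085> shown by less

as_latin1 = raw.decode("iso-8859-1")  # C1 control character NEL
as_cp1252 = raw.decode("cp1252")      # horizontal ellipsis

print(f"iso-8859-1:   U+{ord(as_latin1):04X}, UTF-8 bytes {as_latin1.encode('utf-8').hex()}")
print(f"Windows-1252: U+{ord(as_cp1252):04X}, UTF-8 bytes {as_cp1252.encode('utf-8').hex()}")
# iso-8859-1:   U+0085, UTF-8 bytes c285
# Windows-1252: U+2026, UTF-8 bytes e280a6
```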

unicode-explorer.com/c/0085

U+0085: The "Next Line" (NEL) control character was used in the 1970s for controlling printers and displays (e.g. VT100). Moves to the first position of the next line.

--- star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.iso-8859-1
+++ star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.Windows-1252
@@ -2242,7 +2242,7 @@
 
 505
 00:43:04,098 --> 00:43:05,428
-Adil davranmaktan bahsetmiþken<U+0085>
+Adil davranmaktan bahsetmiþken…
 
 506
 00:43:06,771 --> 00:43:09,777
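
Lines like the one in the diff above can be located by decoding the file as iso-8859-1 and looking for C1 control characters (U+0080 to U+009F), which rarely occur in real text and hint that the file is actually Windows-1252. The sketch below is a heuristic under that assumption; `find_c1_controls` is a hypothetical helper name.

```python
def find_c1_controls(data: bytes):
    """Return (line_number, line) pairs whose iso-8859-1 decoding contains C1 controls."""
    text = data.decode("iso-8859-1")  # never fails: every byte maps to a character
    hits = []
    # Split on "\n" only: str.splitlines() would also break lines at U+0085.
    for number, line in enumerate(text.split("\n"), start=1):
        if any("\u0080" <= ch <= "\u009f" for ch in line):
            hits.append((number, line.rstrip("\r")))
    return hits


sample = b"505\r\nAdil davranmaktan bahsetmi\xfeken\x85\r\n"
for number, line in find_c1_controls(sample):
    print(number, repr(line))
```

To scan the real file, pass the result of `open(path, "rb").read()` instead of `sample`.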

input file

star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt

Labels: detection (Related to the charset detection mechanism, chaos/mess/coherence)