Port new Tokeniser from Linguist #193

Part of #155.

Right now enry tokenizes content with the approach [based on regexps](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) that linguist used before [v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2).

This issue is about enry supporting the new flex-based scanner introduced in [github/linguist#3846](https://github.com/github/linguist/pull/3846) and producing the same results as it does.

This is important because it affects Bayesian classifier accuracy, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by the content classifier alone.

