TypeScript Content Based Recommender

This is a TypeScript-based content-based recommender with enhanced multilingual support, forked from stanleyfok/content-based-recommender.

Credits

This package is forked from stanleyfok/content-based-recommender by Stanley Fok.

Original Author

Stanley Fok - Original implementation and concept

Enhancements in this fork

Full TypeScript support with comprehensive type definitions
Japanese language support using kuromoji morphological analyzer
Enhanced multilingual text processing capabilities
Improved testing coverage with better error handling
Updated dependencies and modern build system with ESLint v9
Performance optimizations in similarity calculations
Modular architecture with separated tokenizers and filters
Factory pattern implementation for easy component creation

What's New

Latest Version

Modular Architecture: Separated tokenizers and filters into independent classes
Factory Pattern: Introduced ProcessingPipelineFactory for easy component creation
Enhanced Testing: Moved all tests to test/ directory with improved coverage
Improved Japanese Support: Advanced morphological analysis with part-of-speech filtering
Better TypeScript Support: Comprehensive type definitions for all components

1.5.0

Added trainBidirectional(collectionA, collectionB) to allow recommendations between two different datasets

1.4.0

Upgrade dependencies to fix security alerts

1.3.0

Introduce the use of unigram, bigrams and trigrams when constructing the word vector

1.2.0

Simplify the implementation by not using sorted set data structure to store the similar documents data. Also support the maxSimilarDocuments and minScore options to save memory used by the recommender.

1.1.0

Update to newer version of vector-object

Installation

npm install ts-content-based-recommender

And then import the ContentBasedRecommender class

const ContentBasedRecommender = require('ts-content-based-recommender')

For TypeScript projects:

import ContentBasedRecommender from 'ts-content-based-recommender'
// or import individual components
import {
  ProcessingPipelineFactory,
  EnglishTokenizer,
  JapaneseTokenizer,
  EnglishTokenFilter,
  JapaneseTokenFilter
} from 'ts-content-based-recommender'

Overview

This is a content-based recommender implemented in TypeScript to illustrate the concept of content-based recommendation. Content-based recommender is a popular recommendation technique to show similar items to users, especially useful to websites for e-commerce, news content, etc.

After the recommender is trained by an array of documents, it can tell the list of documents which are more similar to the input document.

The training process involves 3 main steps:

content pre-processing, such as html tag stripping, stopwords removal and stemming
document vectors formation using tf-idf
find the cosine similarity scores between all document vectors

Special thanks to the library natural helps a lot by providing a lot of NLP functionalities, such as tf-idf and word stemming.

⚠️ Note:

I haven't tested how this recommender is performing with a large dataset. I will share more results after some more testing.

Language Support

English

Tokenization using natural.WordTokenizer
Porter Stemmer for word stemming
Stopword removal
N-gram support (unigram, bigram, trigram)

Japanese

Morphological analysis using kuromoji
Part-of-speech filtering (nouns, verbs, adjectives)
Japanese-specific text processing

Usage

Single collection

import ContentBasedRecommender from 'ts-content-based-recommender'

const recommender = new ContentBasedRecommender({
  minScore: 0.1,
  maxSimilarDocuments: 100
});

// prepare documents data
const documents = [
  { id: '1000001', content: 'Why studying javascript is fun?' },
  { id: '1000002', content: 'The trend for javascript in machine learning' },
  { id: '1000003', content: 'The most insightful stories about JavaScript' },
  { id: '1000004', content: 'Introduction to Machine Learning' },
  { id: '1000005', content: 'Machine learning and its application' },
  { id: '1000006', content: 'Python vs Javascript, which is better?' },
  { id: '1000007', content: 'How Python saved my life?' },
  { id: '1000008', content: 'The future of Bitcoin technology' },
  { id: '1000009', content: 'Is it possible to use javascript for machine learning?' }
];

// start training (now async)
await recommender.train(documents);

//get top 10 similar items to document 1000002
const similarDocuments = recommender.getSimilarDocuments('1000002', 0, 10);

console.log(similarDocuments);
/*
  the higher the score, the more similar the item is
  documents with score < 0.1 are filtered because options minScore is set to 0.1
  [
    { id: '1000004', score: 0.5114304586412038 },
    { id: '1000009', score: 0.45056313558918837 },
    { id: '1000005', score: 0.37039308109283564 },
    { id: '1000003', score: 0.10896767690747626 }
  ]
*/

Multi collection

This example shows how to automatically match posts with related tags

import ContentBasedRecommender from 'ts-content-based-recommender'

const posts = [
  {
    id: '1000001',
    content: 'Why studying javascript is fun?',
  },
  {
    id: '1000002',
    content: 'The trend for javascript in machine learning',
  },
  {
    id: '1000003',
    content: 'The most insightful stories about JavaScript',
  },
  {
    id: '1000004',
    content: 'Introduction to Machine Learning',
  },
  {
    id: '1000005',
    content: 'Machine learning and its application',
  },
  {
    id: '1000006',
    content: 'Python vs Javascript, which is better?',
  },
  {
    id: '1000007',
    content: 'How Python saved my life?',
  },
  {
    id: '1000008',
    content: 'The future of Bitcoin technology',
  },
  {
    id: '1000009',
    content: 'Is it possible to use javascript for machine learning?',
  },
];

const tags = [
               {
                 id: '1',
                 content: 'Javascript',
               },
               {
                 id: '2',
                 content: 'machine learning',
               },
               {
                 id: '3',
                 content: 'application',
               },
               {
                 id: '4',
                 content: 'introduction',
               },
               {
                 id: '5',
                 content: 'future',
               },
               {
                 id: '6',
                 content: 'Python',
               },
               {
                 id: '7',
                 content: 'Bitcoin',
               },
             ];

const tagMap = tags.reduce((acc, tag) => {
  acc[tag.id] = tag;
  return acc;
}, {});

const recommender = new ContentBasedRecommender();

// Training is now async
await recommender.trainBidirectional(posts, tags);

for (const post of posts) {
  const relatedTags = recommender.getSimilarDocuments(post.id);
  const tagNames = relatedTags.map(t => tagMap[t.id].content);
  console.log(post.content, 'related tags:', tagNames);
}


/*
Why studying javascript is fun? related tags: [ 'Javascript' ]
The trend for javascript in machine learning related tags: [ 'machine learning', 'Javascript' ]
The most insightful stories about JavaScript related tags: [ 'Javascript' ]
Introduction to Machine Learning related tags: [ 'machine learning', 'introduction' ]
Machine learning and its application related tags: [ 'machine learning', 'application' ]
Python vs Javascript, which is better? related tags: [ 'Python', 'Javascript' ]
How Python saved my life? related tags: [ 'Python' ]
The future of Bitcoin technology related tags: [ 'future', 'Bitcoin' ]
Is it possible to use javascript for machine learning? related tags: [ 'machine learning', 'Javascript' ]
*/

Japanese Language Example

import ContentBasedRecommender from 'ts-content-based-recommender'

const recommender = new ContentBasedRecommender({
  language: 'ja', // 日本語サポートを有効化
  minScore: 0.1,
  maxSimilarDocuments: 100
});

// 日本語文書データの準備
const japaneseDocuments = [
  { id: '1', content: 'JavaScriptプログラミングは楽しいです。フロントエンドの開発に最適です。' },
  { id: '2', content: 'プログラミング言語の比較検討。PythonとJavaScriptの違いについて。' },
  { id: '3', content: '機械学習の基礎知識。データサイエンスへの応用。' },
  { id: '4', content: 'ウェブ開発のベストプラクティス。モダンなJavaScript技術。' },
  { id: '5', content: 'データ分析とビジュアライゼーション。統計学の活用。' }
];

// 学習開始（非同期処理）
await recommender.train(japaneseDocuments);

// 文書IDが'1'に類似した上位5件を取得
const similarDocuments = recommender.getSimilarDocuments('1', 0, 5);

console.log(similarDocuments);
/*
  日本語の形態素解析により、より精密な類似度計算が可能
  [
    { id: '4', score: 0.45123456789 },
    { id: '2', score: 0.32456789012 }
  ]
*/

Using Individual Components

The library now provides modular components that can be used independently:

import {
  ProcessingPipelineFactory,
  EnglishTokenizer,
  JapaneseTokenizer,
  EnglishTokenFilter,
  JapaneseTokenFilter
} from 'ts-content-based-recommender'

// Using factory pattern to create processing pipelines
const englishPipeline = ProcessingPipelineFactory.createPipeline('en', {
  minTokenLength: 2,
  removeStopwords: true,
  customStopWords: ['custom', 'words']
});

const japanesePipeline = ProcessingPipelineFactory.createPipeline('ja', {
  allowedPos: ['名詞', '動詞', '形容詞'],  // part-of-speech filtering
  minTokenLength: 1
});

// Using tokenizers directly
const englishTokenizer = ProcessingPipelineFactory.createTokenizer('en');
const japaneseTokenizer = ProcessingPipelineFactory.createTokenizer('ja');

const englishTokens = await englishTokenizer.tokenize('machine learning algorithm');
const japaneseTokens = await japaneseTokenizer.tokenize('機械学習アルゴリズム');

// Using filters directly
const englishFilter = new EnglishTokenFilter({
  removeDuplicates: true,
  removeStopwords: true,
  minTokenLength: 2
});

const japaneseFilter = new JapaneseTokenFilter({
  allowedPos: ['名詞', '動詞'],
  removeDuplicates: false
});

const filteredEnglishTokens = englishFilter.filter(englishTokens);
const filteredJapaneseTokens = japaneseFilter.filter(japaneseTokens);

Advanced Configuration Example

import ContentBasedRecommender from 'ts-content-based-recommender'

// Example with advanced token filtering options
const recommender = new ContentBasedRecommender({
  language: 'ja',
  minScore: 0.1,
  maxSimilarDocuments: 50,
  tokenFilterOptions: {
    removeDuplicates: false,           // Keep duplicate tokens for frequency analysis
    removeStopwords: true,             // Remove Japanese stopwords
    minTokenLength: 2,                 // Exclude tokens shorter than 2 characters
    allowedPos: ['名詞', '動詞'],       // Only extract nouns and verbs
    customStopWords: ['です', 'ます']   // Additional custom stopwords
  }
});

const documents = [
  { id: '1', content: 'JavaScriptプログラミングはとても楽しいです' },
  { id: '2', content: 'Pythonによる機械学習の勉強をします' },
  { id: '3', content: 'ウェブ開発の最新技術トレンド' }
];

await recommender.train(documents);
const similar = recommender.getSimilarDocuments('1');

API Reference

ContentBasedRecommender

The main class for content-based recommendations.

constructor([options])

To create the recommender instance

options (optional): an object to configure the recommender

Supported options:

language - the language to use for text processing. Supported values: 'en'（English）, 'ja'（Japanese）. Default is 'en'.
maxVectorSize - to control the max size of word vector after tf-idf processing. A smaller vector size will help training performance while not affecting recommendation quality. Defaults to be 100.
minScore - the minimum score required to meet to consider it is a similar document. It will save more memory by filtering out documents having low scores. Allowed values range from 0 to 1. Default is 0.
maxSimilarDocuments - the maximum number of similar documents to keep for each document. Default is the max safe integer in javascript.
debug - show progress messages so can monitor the training progress
tokenFilterOptions - advanced filtering options for token processing:
- removeDuplicates - remove duplicate tokens（default: true）
- removeStopwords - remove stopwords（default: true）
- customStopWords - additional custom stopwords（default: empty array）
- minTokenLength - minimum token length（default: 1）
- allowedPos - for Japanese: allowed part-of-speech tags（default: 名詞、動詞、形容詞）

train(documents)

To tell the recommender about your documents and then it will start training itself.

documents - an array of object, with fields id and content

Note: This method is now asynchronous and returns a Promise. Use await or .then() to handle the async operation.

trainBidirectional(collectionA, collectionB)

Works like the normal train function, but it creates recommendations between two different collections instead of within one collection.

Note: This method is now asynchronous and returns a Promise. Use await or .then() to handle the async operation.

getSimilarDocuments(id, [start], [size])

To get an array of similar items with document id

id - the id of the document
start - the start index, inclusive. Default to be 0
size - the max number of similar documents to obtain. If it is omitted, the whole list after start index will be returned

It returns an array of objects, with fields id and score (ranging from 0 to 1)

export()

To export the recommender as json object.

const recommender = new ContentBasedRecommender();
await recommender.train(documents);

const object = recommender.export();
//can save the object to disk, database or otherwise

import(object)

To update the recommender by importing from a json object, exported by the export() method

const recommender = new ContentBasedRecommender();
recommender.import(object); // object can be loaded from disk, database or otherwise

ProcessingPipelineFactory

Factory class for creating processing pipelines and individual components.

ProcessingPipelineFactory.createPipeline(language, options)

Creates a complete processing pipeline with tokenizer and filter.

language - 'en' for English or 'ja' for Japanese
options - filter options (optional)

ProcessingPipelineFactory.createTokenizer(language)

Creates a tokenizer for the specified language.

language - 'en' for English or 'ja' for Japanese

ProcessingPipelineFactory.createEnglishPipeline(options)

Creates an English-specific processing pipeline.

options - filter options (optional)

ProcessingPipelineFactory.createJapanesePipeline(options)

Creates a Japanese-specific processing pipeline.

options - filter options (optional)

Tokenizers

EnglishTokenizer

Tokenizes English text with stemming and N-gram support.

const tokenizer = new EnglishTokenizer();
const tokens = await tokenizer.tokenize('machine learning algorithm');

JapaneseTokenizer

Tokenizes Japanese text using kuromoji morphological analyzer.

const tokenizer = new JapaneseTokenizer();
const tokens = await tokenizer.tokenize('機械学習アルゴリズム');
const detailedTokens = await tokenizer.getDetailedTokens('機械学習アルゴリズム');

Filters

EnglishTokenFilter

Filters English tokens with stopword removal, N-gram support, and more.

const filter = new EnglishTokenFilter({
  removeDuplicates: true,
  removeStopwords: true,
  minTokenLength: 2,
  customStopWords: ['custom', 'words']
});
const filtered = filter.filter(tokens);
const ngramFiltered = filter.filterWithNgrams(tokens);

JapaneseTokenFilter

Filters Japanese tokens with part-of-speech filtering and Japanese-specific processing.

const filter = new JapaneseTokenFilter({
  allowedPos: ['名詞', '動詞', '形容詞'],
  removeDuplicates: true,
  removeStopwords: true,
  minTokenLength: 1
});
const filtered = filter.filter(tokens);
const posFiltered = filter.filterWithPos(detailedTokens);

Filter Options

Common filter options for both English and Japanese:

removeDuplicates - remove duplicate tokens（default: true）
removeStopwords - remove stopwords（default: true）
customStopWords - additional custom stopwords（default: empty array）
minTokenLength - minimum token length（default: 1）

Japanese-specific options:

allowedPos - allowed part-of-speech tags（default: 名詞、動詞、形容詞）

Development

Project Structure

├── src/                    # Source code
│   ├── lib/               # Main library code
│   │   ├── tokenizers/    # Tokenizer implementations
│   │   │   ├── EnglishTokenizer.ts
│   │   │   └── JapaneseTokenizer.ts
│   │   ├── filters/       # Token filter implementations
│   │   │   ├── EnglishTokenFilter.ts
│   │   │   └── JapaneseTokenFilter.ts
│   │   ├── factories/     # Factory classes
│   │   │   └── ProcessingPipelineFactory.ts
│   │   ├── ContentBasedRecommender.ts  # Main recommender class
│   │   └── index.ts       # Library exports
│   ├── types/             # TypeScript type definitions
│   │   └── index.ts
│   └── index.ts           # Main export file
├── test/                  # Test files
│   ├── tokenizers/        # Tokenizer tests
│   ├── filters/           # Filter tests
│   ├── factories/         # Factory tests
│   └── *.ts              # Integration and main tests
├── fixtures/              # Test data
│   ├── sample-documents.ts
│   ├── sample-document-tags.ts
│   ├── sample-target-documents.ts
│   └── sample-japanese-documents.ts
├── example/               # Usage examples
│   └── example.ts
├── index.ts               # Package entry point
├── tsconfig.json          # TypeScript configuration
└── eslint.config.js       # ESLint configuration

Running Tests

The test suite includes comprehensive unit tests and integration tests for all components:

# Install dependencies
npm install

# Run all tests
npm test

# Run specific test categories
npm test -- --grep "EnglishTokenizer"
npm test -- --grep "JapaneseTokenizer"
npm test -- --grep "EnglishTokenFilter"
npm test -- --grep "JapaneseTokenFilter"
npm test -- --grep "ProcessingPipelineFactory"
npm test -- --grep "ContentBasedRecommender"

# Run example
npm run example

# Run development mode with ts-node
npm run dev

Building

# Build TypeScript
npm run build

# Run linting
npm run lint

# Fix linting issues
npm run lint:fix

Authors

Current Maintainer

Ken Sakurai - TypeScript migration and Japanese language support

Original Author

Stanley Fok - Original implementation

Contributors

Marian Klühspies

License

MIT

Historical Changes (from upstream)

This package is based on the original work by Stanley Fok. For historical changes before the fork, see: https://github.com/stanleyfok/content-based-recommender

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
.github		.github
.vscode		.vscode
example		example
fixtures		fixtures
src		src
test		test
.gitignore		.gitignore
.mocharc.json		.mocharc.json
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
debug-japanese.ts		debug-japanese.ts
eslint.config.mjs		eslint.config.mjs
index.ts		index.ts
package-lock.json		package-lock.json
package.json		package.json
test-japanese.ts		test-japanese.ts
test-kuromoji.ts		test-kuromoji.ts
tsconfig.json		tsconfig.json
tsconfig.node.json		tsconfig.node.json

License

kemsakurai/ts-content-based-recommender

Folders and files

Latest commit

History

Repository files navigation

TypeScript Content Based Recommender

Credits

Original Author

Enhancements in this fork

What's New

Latest Version

1.5.0

1.4.0

1.3.0

1.2.0

1.1.0

Installation

Overview

Language Support

English

Japanese

Usage

Single collection

Multi collection

Japanese Language Example

Using Individual Components

Advanced Configuration Example

API Reference

ContentBasedRecommender

constructor([options])

train(documents)

trainBidirectional(collectionA, collectionB)

getSimilarDocuments(id, [start], [size])

export()

import(object)

ProcessingPipelineFactory

ProcessingPipelineFactory.createPipeline(language, options)

ProcessingPipelineFactory.createTokenizer(language)

ProcessingPipelineFactory.createEnglishPipeline(options)

ProcessingPipelineFactory.createJapanesePipeline(options)

Tokenizers

EnglishTokenizer

JapaneseTokenizer

Filters

EnglishTokenFilter

JapaneseTokenFilter

Filter Options

Development

Project Structure

Running Tests

Building

Authors

Current Maintainer

Original Author

Contributors

License

Historical Changes (from upstream)

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages