diff --git a/README.md b/README.md
index f9fefe2..a4fae47 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,8 @@ If you have a use case you would like to show off or even a quick tutorial for a
**Notebooks**:
* [Criando e analisando dados com AQL](https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/community_notebooks/BD_g01_ArangoDB.ipynb) - Submitted by [janiosl](https://github.com/janiosl)
+* [PageRank with MovieLens and ArangoDB](https://colab.research.google.com/github/Vinizx17/interactive_tutorials/blob/master/community_notebooks/Page_Rank_Movielens.ipynb) - Submitted by [Vinizx17](https://github.com/Vinizx17)
+
### Workshop Repositories
The following is a list of workshops given that cover topics related to ArangoDB.
diff --git a/community_notebooks/Page_Rank_Movielens.ipynb b/community_notebooks/Page_Rank_Movielens.ipynb
new file mode 100644
index 0000000..113d140
--- /dev/null
+++ b/community_notebooks/Page_Rank_Movielens.ipynb
@@ -0,0 +1,762 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yUfwGMXs3JWx"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gafVPB_N3JW3"
+ },
+ "source": [
+ "# Analyzing Movie Popularity Through Graph Algorithms on ArangoDB\n",
+ "\n",
+ "---\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lu-8LTPO3JW4"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "frXF0GXe3JW4"
+ },
+ "source": [
+ "**Movie Influence and Popularity Analysis Using Graphs and ArangoDB**\n",
+ "\n",
+ "---\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "This project demonstrates how to integrate the MovieLens dataset with the multimodel database ArangoDB to perform advanced graph data analysis. Using data on movies, users, ratings, and tags, we build a directed graph representing connections between users and movies through weighted ratings.\n",
+ "\n",
+ "The pipeline begins with extracting and cleaning the MovieLens CSV files, followed by batch inserting vertex collections (movies and users) and edge collections (ratings and tags) into ArangoDB. After loading the data, a graph is created in ArangoDB defining the relationships between users and movies.\n",
+ "\n",
+ "Using the NetworkX library, the graph is reconstructed in Python from ArangoDB data, with edges weighted by rating values. We then apply the PageRank algorithm to measure the relative influence of movies in the graph, considering both connection structure and rating weights.\n",
+ "\n",
+ "To enhance the analysis, we combine PageRank with additional metrics computed directly in ArangoDB, such as total number of ratings and average rating per movie. After normalizing these indicators, we generate a weighted final score ranking movies by popularity and relevance within the network.\n",
+ "\n",
+ "The final output is a list of the top 10 most influential movies, taking into account both the graph structural influence (PageRank) and the volume and quality of ratings received. This approach showcases the power of distributed graph processing with ArangoDB and the application of analytical algorithms to extract valuable insights from large relational datasets."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2qfdO22s3JW4"
+ },
+ "source": [
+ "# Setup"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tYODkDx33JW5"
+ },
+ "source": [
+ "This project is designed to run in a Python environment, such as Jupyter Notebook or Google Colab, and requires several libraries and dependencies for data processing, graph construction, and database interaction.\n",
+ "\n",
+ "The setup process starts by upgrading the Python package manager (pip) and installing key libraries including:\n",
+ "\n",
+ "- python-arango and pyarango;\n",
+ "\n",
+ "- pandas;\n",
+ "\n",
+ "- networkx;\n",
+ "\n",
+ "- numpy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "id": "vsOFaPNJ3JW5"
+ },
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "# Update\n",
+ "!pip3 install --upgrade pip\n",
+ "\n",
+ "# Install pandas, networkx, arango client libs\n",
+ "!pip3 install --upgrade python-arango pyarango pandas networkx\n",
+ "\n",
+ "\n",
+ "# Clone repo\n",
+ "!git clone https://github.com/arangodb/interactive_tutorials.git -b oasis_connector --single-branch\n",
+ "!rsync -av interactive_tutorials/ ./ --exclude=.git\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "id": "fKfqjUrL3JW6"
+ },
+ "outputs": [],
+ "source": [
+ "import oasis\n",
+ "from pyArango.connection import *\n",
+ "import pandas as pd\n",
+ "import zipfile\n",
+ "import requests\n",
+ "import os\n",
+ "from tqdm import tqdm\n",
+ "from arango import ArangoClient\n",
+ "import networkx as nx\n",
+ "import numpy as np\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jYk7qd3T3JW6"
+ },
+ "source": [
+ "Create the temporary database:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "id": "c_6RZVqu3JW6",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "a2877e36-aa9c-49a2-ac56-609d64f7511f"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Requesting new temp credentials.\n",
+ "Temp database ready to use.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Retrieve tmp credentials from ArangoDB Tutorial Service\n",
+ "login = oasis.getTempCredentials(tutorialName=\"Movielens\", credentialProvider='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')\n",
+ "\n",
+ "# Connect to the temp database\n",
+ "db = oasis.connect_python_arango(login)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "id": "uGptRNz93JW7",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "ffa7fb6d-4e90-4bc9-93b8-03017a62a41a"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "https://tutorials.arangodb.cloud:8529\n",
+ "Username: TUTpgayvplrllcuf8n46zhz3i\n",
+ "Password: TUT5mpnyqh8fn880lgt1ihq2v\n",
+ "Database: TUTg7c78306nb9p0eeovgcyu\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"https://{}:{}\".format(login[\"hostname\"], login[\"port\"]))\n",
+ "print(\"Username: \" + login[\"username\"])\n",
+ "print(\"Password: \" + login[\"password\"])\n",
+ "print(\"Database: \" + login[\"dbName\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CHh-4ED-3JW7"
+ },
+ "source": [
+ "Feel free to use to above URL to checkout the UI!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "j2YJ88SJ3JW7"
+ },
+ "source": [
+ "## Import Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5q-sYHDqvMAu"
+ },
+ "source": [
+ "import movielens dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "id": "FCTkPLNF9-vS",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "e4dff13c-c0c2-476e-ca72-4cdbfbb1fa34"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Downloading dataset...\n",
+ "Extracting files...\n",
+ "Reading CSV files...\n",
+ "Total movies: 9742\n",
+ "Total ratings: 100836\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Small dataset URL\n",
+ "url = \"https://files.grouplens.org/datasets/movielens/ml-latest-small.zip\"\n",
+ "\n",
+ "# Zip file path\n",
+ "zip_path = \"/content/ml-latest-small.zip\"\n",
+ "if not os.path.exists(zip_path):\n",
+ " print(\"Downloading dataset...\")\n",
+ " r = requests.get(url)\n",
+ " with open(zip_path, \"wb\") as f:\n",
+ " f.write(r.content)\n",
+ "else:\n",
+ " print(\"Dataset already downloaded.\")\n",
+ "\n",
+ "# Extract files (ml-latest-small/)\n",
+ "print(\"Extracting files...\")\n",
+ "with zipfile.ZipFile(zip_path, \"r\") as zip_ref:\n",
+ " zip_ref.extractall(\"/content\")\n",
+ "\n",
+ "# Correct paths inside the subfolder\n",
+ "extract_folder = \"/content/ml-latest-small\"\n",
+ "movies_path = os.path.join(extract_folder, \"movies.csv\")\n",
+ "ratings_path = os.path.join(extract_folder, \"ratings.csv\")\n",
+ "tags_path = os.path.join(extract_folder, \"tags.csv\")\n",
+ "\n",
+ "# Read complete datasets\n",
+ "print(\"Reading CSV files...\")\n",
+ "movies_df = pd.read_csv(movies_path)\n",
+ "ratings_df = pd.read_csv(ratings_path)\n",
+ "tags_df = pd.read_csv(tags_path)\n",
+ "\n",
+ "print(f\"Total movies: {movies_df.shape[0]}\")\n",
+ "print(f\"Total ratings: {ratings_df.shape[0]}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Insert the dataset on Arangodb"
+ ],
+ "metadata": {
+ "id": "7GD9yDXPFQuI"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Fixed connection data (replace with your own)\n",
+ "HOST = \"https://tutorials.arangodb.cloud:8529\"\n",
+ "DB_NAME = \"TUTg7c78306nb9p0eeovgcyu\" #Your DB name\n",
+ "USERNAME = \"TUTpgayvplrllcuf8n46zhz3i\" #Your Username\n",
+ "PASSWORD = \"TUT5mpnyqh8fn880lgt1ihq2v\" #Your Password\n",
+ "\n",
+ "# Set up connection\n",
+ "client = ArangoClient(hosts=HOST)\n",
+ "db = client.db(DB_NAME, username=USERNAME, password=PASSWORD)\n",
+ "\n",
+ "# Create collections if they don't exist\n",
+ "if not db.has_collection('movies'):\n",
+ " db.create_collection('movies')\n",
+ "if not db.has_collection('users'):\n",
+ " db.create_collection('users')\n",
+ "if not db.has_collection('ratings'):\n",
+ " db.create_collection('ratings', edge=True)\n",
+ "if not db.has_collection('tags'):\n",
+ " db.create_collection('tags', edge=True)\n",
+ "\n",
+ "movies_collection = db.collection('movies')\n",
+ "users_collection = db.collection('users')\n",
+ "ratings_edge_collection = db.collection('ratings')\n",
+ "tags_edge_collection = db.collection('tags')\n",
+ "\n",
+ "# Function to split lists into chunks\n",
+ "def chunks(lst, n):\n",
+ " for i in range(0, len(lst), n):\n",
+ " yield lst[i:i + n]\n",
+ "\n",
+ "# Insert movies in batches\n",
+ "print(\"Inserting movies in batches...\")\n",
+ "movies = [{\n",
+ " \"_key\": str(row['movieId']),\n",
+ " \"title\": row['title'],\n",
+ " \"genres\": row['genres']\n",
+ "} for _, row in movies_df.iterrows()]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(movies, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO movies OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "# Insert users in batches\n",
+ "print(\"Inserting users in batches...\")\n",
+ "user_ids = list(set(ratings_df['userId'].astype(str)))\n",
+ "users = [{'_key': uid} for uid in user_ids]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(users, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO users OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "# Insert ratings (edges) in batches\n",
+ "print(\"Inserting ratings in batches...\")\n",
+ "edges_ratings = [{\n",
+ " \"_from\": f\"users/{row['userId']}\",\n",
+ " \"_to\": f\"movies/{row['movieId']}\",\n",
+ " \"rating\": float(row['rating']),\n",
+ " \"timestamp\": int(row['timestamp'])\n",
+ "} for _, row in ratings_df.iterrows()]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(edges_ratings, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO ratings OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "# Insert tags (edges) in batches\n",
+ "print(\"Inserting tags in batches...\")\n",
+ "edges_tags = [{\n",
+ " \"_from\": f\"users/{row['userId']}\",\n",
+ " \"_to\": f\"movies/{row['movieId']}\",\n",
+ " \"tag\": row['tag'],\n",
+ " \"timestamp\": int(row['timestamp'])\n",
+ "} for _, row in tags_df.iterrows()]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(edges_tags, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO tags OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "print(\"Insertion complete!\")\n"
+ ],
+ "metadata": {
+ "id": "uSIKpwnAEmSl",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "4552040a-cab7-4670-8e1d-4f4492a696c7"
+ },
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting movies in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 10/10 [00:01<00:00, 9.59it/s]\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting users in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 1/1 [00:00<00:00, 10.91it/s]\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting ratings in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 101/101 [00:10<00:00, 9.45it/s]\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting tags in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 4/4 [00:00<00:00, 9.51it/s]"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Insertion complete!\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "W5R-S4V7vR8z"
+ },
+ "source": [
+ "## Create the graph"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "id": "siA4yJGd8HvE",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "76417168-bd12-40b5-e617-a489beb1c6be"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Graph 'movies_graph' created with vertex and edge collections.\n"
+ ]
+ }
+ ],
+ "source": [
+ "graph_name = 'movies_graph'\n",
+ "\n",
+ "# If the graph exists, delete it (without dropping the collections)\n",
+ "if db.has_graph(graph_name):\n",
+ " db.delete_graph(graph_name, drop_collections=False)\n",
+ "\n",
+ "# Create the graph\n",
+ "graph = db.create_graph(graph_name)\n",
+ "\n",
+ "# Define the ratings edge (user -> movie)\n",
+ "graph.create_edge_definition(\n",
+ " edge_collection='ratings',\n",
+ " from_vertex_collections=['users'],\n",
+ " to_vertex_collections=['movies']\n",
+ ")\n",
+ "\n",
+ "# Define the tags edge (user -> movie)\n",
+ "graph.create_edge_definition(\n",
+ " edge_collection='tags',\n",
+ " from_vertex_collections=['users'],\n",
+ " to_vertex_collections=['movies']\n",
+ ")\n",
+ "\n",
+ "print(f\"Graph '{graph_name}' created with vertex and edge collections.\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9SCu7-HB6K1r"
+ },
+ "source": [
+ "# Page Rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q0-fDZgK6Qkt"
+ },
+ "source": [
+ "This code loads movie rating data into ArangoDB, builds a graph of users and movies, and runs the PageRank algorithm to rank movies based on user ratings. It then combines PageRank scores with rating counts and averages to identify the top movies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "id": "naA2n93k6_W9",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "63c58833-66a3-4399-b5b2-8303d05d6e8a"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Graph created with 10334 nodes and 100836 edges.\n",
+ "\n",
+ "Top 10 movies combining PageRank, count, and average rating:\n",
+ "\n",
+ "Shawshank Redemption, The (1994)\n",
+ "Forrest Gump (1994)\n",
+ "Pulp Fiction (1994)\n",
+ "Silence of the Lambs, The (1991)\n",
+ "Matrix, The (1999)\n",
+ "Braveheart (1995)\n",
+ "Star Wars: Episode IV - A New Hope (1977)\n",
+ "Schindler's List (1993)\n",
+ "Jurassic Park (1993)\n",
+ "Terminator 2: Judgment Day (1991)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import networkx as nx\n",
+ "from pyArango.connection import *\n",
+ "import numpy as np\n",
+ "\n",
+ "# ArangoDB settings\n",
+ "DB_NAME = \"TUTg7c78306nb9p0eeovgcyu\" #Your DB name\n",
+ "USERNAME = \"TUTpgayvplrllcuf8n46zhz3i\" #Your Username\n",
+ "PASSWORD = \"TUT5mpnyqh8fn880lgt1ihq2v\" #Your Password\n",
+ "HOST = \"https://tutorials.arangodb.cloud:8529\"\n",
+ "\n",
+ "# Function to clean the trailing \".0\" from keys if it exists\n",
+ "def clean_key(s):\n",
+ " if s.endswith('.0'):\n",
+ " return s[:-2]\n",
+ " return s\n",
+ "\n",
+ "# Connect to ArangoDB\n",
+ "conn = Connection(username=USERNAME, password=PASSWORD, arangoURL=HOST)\n",
+ "db = conn[DB_NAME]\n",
+ "\n",
+ "# Fetch ratings (edges) to build the graph\n",
+ "query = \"\"\"\n",
+ "FOR r IN ratings\n",
+ " RETURN {user: r._from, movie: r._to, weight: r.rating}\n",
+ "\"\"\"\n",
+ "edges = db.AQLQuery(query, rawResults=True)\n",
+ "\n",
+ "# Create directed graph\n",
+ "G = nx.DiGraph()\n",
+ "\n",
+ "# Add edges with weights\n",
+ "for e in edges:\n",
+ " G.add_edge(e['user'], e['movie'], weight=e['weight'])\n",
+ "\n",
+ "print(f\"Graph created with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.\")\n",
+ "\n",
+ "# Run PageRank\n",
+ "pr = nx.pagerank(G, alpha=0.85, weight='weight')\n",
+ "\n",
+ "# Filter only movies in the PageRank result\n",
+ "movie_ranks = {k: v for k, v in pr.items() if k.startswith(\"movies/\")}\n",
+ "\n",
+ "# Extract movie _keys applying cleaning\n",
+ "movie_keys = [clean_key(m.split('/')[1]) for m in movie_ranks.keys()]\n",
+ "\n",
+ "# Fetch movie titles\n",
+ "query_movies = \"\"\"\n",
+ "FOR m IN movies\n",
+ " FILTER m._key IN @keys\n",
+ " RETURN {key: m._key, title: m.title}\n",
+ "\"\"\"\n",
+ "movies_data = db.AQLQuery(query_movies, bindVars={\"keys\": movie_keys}, rawResults=True)\n",
+ "id_to_title = {m['key']: m['title'] for m in movies_data}\n",
+ "\n",
+ "# Fetch stats: count and average rating per movie\n",
+ "query_stats = \"\"\"\n",
+ "FOR r IN ratings\n",
+ " COLLECT movie = r._to INTO group = r\n",
+ " LET count = LENGTH(group)\n",
+ " LET avg_rating = AVERAGE(group[*].rating)\n",
+ " RETURN {movie: movie, count: count, avg_rating: avg_rating}\n",
+ "\"\"\"\n",
+ "stats = db.AQLQuery(query_stats, rawResults=True)\n",
+ "\n",
+ "# Create dictionaries for quick access\n",
+ "movie_counts = {s['movie']: s['count'] for s in stats}\n",
+ "movie_avg_ratings = {s['movie']: s['avg_rating'] for s in stats}\n",
+ "\n",
+ "# Prepare arrays for normalization and final score calculation\n",
+ "pr_values = np.array([movie_ranks.get(k, 0) for k in movie_ranks.keys()])\n",
+ "count_values = np.array([movie_counts.get(k, 0) for k in movie_ranks.keys()])\n",
+ "avg_rating_values = np.array([movie_avg_ratings.get(k, 0) for k in movie_ranks.keys()])\n",
+ "\n",
+ "# Simple min-max normalization function\n",
+ "def min_max_normalize(arr):\n",
+ " if arr.max() == arr.min():\n",
+ " return np.zeros_like(arr)\n",
+ " return (arr - arr.min()) / (arr.max() - arr.min())\n",
+ "\n",
+ "pr_norm = min_max_normalize(pr_values)\n",
+ "count_norm = min_max_normalize(count_values)\n",
+ "avg_norm = min_max_normalize(avg_rating_values)\n",
+ "\n",
+ "# Combine scores with weights: 50% PageRank, 30% count, 20% average rating\n",
+ "final_score = 0.5 * pr_norm + 0.3 * count_norm + 0.2 * avg_norm\n",
+ "\n",
+ "# Create a dict with final scores\n",
+ "combined_scores = dict(zip(movie_ranks.keys(), final_score))\n",
+ "\n",
+ "# Sort top 10 movies by combined score\n",
+ "top_10_combined = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:10]\n",
+ "\n",
+ "print(\"\\nTop 10 movies combining PageRank, count, and average rating:\\n\")\n",
+ "for movie_id_str, score in top_10_combined:\n",
+ " movie_key = clean_key(movie_id_str.split('/')[1])\n",
+ " title = id_to_title.get(movie_key, \"Title not found\")\n",
+ " print(title)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eBD63hEf3JXD"
+ },
+ "source": [
+ "# Next Steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fIF4PVuluT9m"
+ },
+ "source": [
+ "Be sure to check out the community detection tutorial to explore more graph analytics applications using ArangoDB.\n",
+ "\n",
+ "To keep experimenting and working with ArangoDB beyond this temporary setup, you can:\n",
+ "\n",
+ "Get a 2-week free trial on ArangoDB Cloud\n",
+ "\n",
+ "Take the free Graph Course\n",
+ "\n",
+ "Download and install ArangoDB locally\n",
+ "\n",
+ "Keep learning at https://www.arangodb.com/arangodb-training-center/\n",
+ "\n",
+ "Useful resources:\n",
+ "https://www.arangodb.com/docs/stable/aql/tutorial.html\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "#Further links\n"
+ ],
+ "metadata": {
+ "id": "a8NlJY_tIBnJ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Keep learning at https://www.arangodb.com/arangodb-training-center/\n",
+ "\n",
+ "Useful resources: https://www.arangodb.com/docs/stable/aql/tutorial.html"
+ ],
+ "metadata": {
+ "id": "VCEcEfkwIM9Q"
+ }
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "toc_visible": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file
diff --git a/notebooks/example_output/Page_Rank_Movielens_output.ipynb b/notebooks/example_output/Page_Rank_Movielens_output.ipynb
new file mode 100644
index 0000000..14c6485
--- /dev/null
+++ b/notebooks/example_output/Page_Rank_Movielens_output.ipynb
@@ -0,0 +1,751 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yUfwGMXs3JWx"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gafVPB_N3JW3"
+ },
+ "source": [
+ "# Analyzing Movie Popularity Through Graph Algorithms on ArangoDB\n",
+ "\n",
+ "---\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "lu-8LTPO3JW4"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "frXF0GXe3JW4"
+ },
+ "source": [
+ "**Movie Influence and Popularity Analysis Using Graphs and ArangoDB**\n",
+ "\n",
+ "---\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "This project demonstrates how to integrate the MovieLens dataset with the multimodel database ArangoDB to perform advanced graph data analysis. Using data on movies, users, ratings, and tags, we build a directed graph representing connections between users and movies through weighted ratings.\n",
+ "\n",
+ "The pipeline begins with extracting and cleaning the MovieLens CSV files, followed by batch inserting vertex collections (movies and users) and edge collections (ratings and tags) into ArangoDB. After loading the data, a graph is created in ArangoDB defining the relationships between users and movies.\n",
+ "\n",
+ "Using the NetworkX library, the graph is reconstructed in Python from ArangoDB data, with edges weighted by rating values. We then apply the PageRank algorithm to measure the relative influence of movies in the graph, considering both connection structure and rating weights.\n",
+ "\n",
+ "To enhance the analysis, we combine PageRank with additional metrics computed directly in ArangoDB, such as total number of ratings and average rating per movie. After normalizing these indicators, we generate a weighted final score ranking movies by popularity and relevance within the network.\n",
+ "\n",
+ "The final output is a list of the top 10 most influential movies, taking into account both the graph structural influence (PageRank) and the volume and quality of ratings received. This approach showcases the power of distributed graph processing with ArangoDB and the application of analytical algorithms to extract valuable insights from large relational datasets."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2qfdO22s3JW4"
+ },
+ "source": [
+ "# Setup"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tYODkDx33JW5"
+ },
+ "source": [
+ "This project is designed to run in a Python environment, such as Jupyter Notebook or Google Colab, and requires several libraries and dependencies for data processing, graph construction, and database interaction.\n",
+ "\n",
+ "The setup process starts by upgrading the Python package manager (pip) and installing key libraries including:\n",
+ "\n",
+ "- python-arango and pyarango;\n",
+ "\n",
+ "- pandas;\n",
+ "\n",
+ "- networkx;\n",
+ "\n",
+ "- numpy"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "id": "vsOFaPNJ3JW5"
+ },
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "# Update\n",
+ "!pip3 install --upgrade pip\n",
+ "\n",
+ "# Install pandas, networkx, arango client libs\n",
+ "!pip3 install --upgrade python-arango pyarango pandas networkx\n",
+ "\n",
+ "\n",
+ "# Clone repo\n",
+ "!git clone https://github.com/arangodb/interactive_tutorials.git -b oasis_connector --single-branch\n",
+ "!rsync -av interactive_tutorials/ ./ --exclude=.git\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "id": "fKfqjUrL3JW6"
+ },
+ "outputs": [],
+ "source": [
+ "import oasis\n",
+ "from pyArango.connection import *\n",
+ "import pandas as pd\n",
+ "import zipfile\n",
+ "import requests\n",
+ "import os\n",
+ "from tqdm import tqdm\n",
+ "from arango import ArangoClient\n",
+ "import networkx as nx\n",
+ "import numpy as np\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "jYk7qd3T3JW6"
+ },
+ "source": [
+ "Create the temporary database:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "id": "c_6RZVqu3JW6",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "a2877e36-aa9c-49a2-ac56-609d64f7511f"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Requesting new temp credentials.\n",
+ "Temp database ready to use.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Retrieve tmp credentials from ArangoDB Tutorial Service\n",
+ "login = oasis.getTempCredentials(tutorialName=\"Movielens\", credentialProvider='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')\n",
+ "\n",
+ "# Connect to the temp database\n",
+ "db = oasis.connect_python_arango(login)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "id": "uGptRNz93JW7",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "ffa7fb6d-4e90-4bc9-93b8-03017a62a41a"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "https://tutorials.arangodb.cloud:8529\n",
+ "Username: TUTpgayvplrllcuf8n46zhz3i\n",
+ "Password: TUT5mpnyqh8fn880lgt1ihq2v\n",
+ "Database: TUTg7c78306nb9p0eeovgcyu\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"https://{}:{}\".format(login[\"hostname\"], login[\"port\"]))\n",
+ "print(\"Username: \" + login[\"username\"])\n",
+ "print(\"Password: \" + login[\"password\"])\n",
+ "print(\"Database: \" + login[\"dbName\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CHh-4ED-3JW7"
+ },
+ "source": [
+ "Feel free to use to above URL to checkout the UI!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "j2YJ88SJ3JW7"
+ },
+ "source": [
+ "## Import Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5q-sYHDqvMAu"
+ },
+ "source": [
+ "import movielens dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "id": "FCTkPLNF9-vS",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "e4dff13c-c0c2-476e-ca72-4cdbfbb1fa34"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Downloading dataset...\n",
+ "Extracting files...\n",
+ "Reading CSV files...\n",
+ "Total movies: 9742\n",
+ "Total ratings: 100836\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Small dataset URL\n",
+ "url = \"https://files.grouplens.org/datasets/movielens/ml-latest-small.zip\"\n",
+ "\n",
+ "# Zip file path\n",
+ "zip_path = \"/content/ml-latest-small.zip\"\n",
+ "if not os.path.exists(zip_path):\n",
+ " print(\"Downloading dataset...\")\n",
+ " r = requests.get(url)\n",
+ " with open(zip_path, \"wb\") as f:\n",
+ " f.write(r.content)\n",
+ "else:\n",
+ " print(\"Dataset already downloaded.\")\n",
+ "\n",
+ "# Extract files (ml-latest-small/)\n",
+ "print(\"Extracting files...\")\n",
+ "with zipfile.ZipFile(zip_path, \"r\") as zip_ref:\n",
+ " zip_ref.extractall(\"/content\")\n",
+ "\n",
+ "# Correct paths inside the subfolder\n",
+ "extract_folder = \"/content/ml-latest-small\"\n",
+ "movies_path = os.path.join(extract_folder, \"movies.csv\")\n",
+ "ratings_path = os.path.join(extract_folder, \"ratings.csv\")\n",
+ "tags_path = os.path.join(extract_folder, \"tags.csv\")\n",
+ "\n",
+ "# Read complete datasets\n",
+ "print(\"Reading CSV files...\")\n",
+ "movies_df = pd.read_csv(movies_path)\n",
+ "ratings_df = pd.read_csv(ratings_path)\n",
+ "tags_df = pd.read_csv(tags_path)\n",
+ "\n",
+ "print(f\"Total movies: {movies_df.shape[0]}\")\n",
+ "print(f\"Total ratings: {ratings_df.shape[0]}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Insert the dataset on Arangodb"
+ ],
+ "metadata": {
+ "id": "7GD9yDXPFQuI"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Fixed connection data (replace with your own)\n",
+ "HOST = \"https://tutorials.arangodb.cloud:8529\"\n",
+ "DB_NAME = \"TUTg7c78306nb9p0eeovgcyu\" #Your DB name\n",
+ "USERNAME = \"TUTpgayvplrllcuf8n46zhz3i\" #Your Username\n",
+ "PASSWORD = \"TUT5mpnyqh8fn880lgt1ihq2v\" #Your Password\n",
+ "\n",
+ "# Set up connection\n",
+ "client = ArangoClient(hosts=HOST)\n",
+ "db = client.db(DB_NAME, username=USERNAME, password=PASSWORD)\n",
+ "\n",
+ "# Create collections if they don't exist\n",
+ "if not db.has_collection('movies'):\n",
+ " db.create_collection('movies')\n",
+ "if not db.has_collection('users'):\n",
+ " db.create_collection('users')\n",
+ "if not db.has_collection('ratings'):\n",
+ " db.create_collection('ratings', edge=True)\n",
+ "if not db.has_collection('tags'):\n",
+ " db.create_collection('tags', edge=True)\n",
+ "\n",
+ "movies_collection = db.collection('movies')\n",
+ "users_collection = db.collection('users')\n",
+ "ratings_edge_collection = db.collection('ratings')\n",
+ "tags_edge_collection = db.collection('tags')\n",
+ "\n",
+ "# Function to split lists into chunks\n",
+ "def chunks(lst, n):\n",
+ " for i in range(0, len(lst), n):\n",
+ " yield lst[i:i + n]\n",
+ "\n",
+ "# Insert movies in batches\n",
+ "print(\"Inserting movies in batches...\")\n",
+ "movies = [{\n",
+ " \"_key\": str(row['movieId']),\n",
+ " \"title\": row['title'],\n",
+ " \"genres\": row['genres']\n",
+ "} for _, row in movies_df.iterrows()]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(movies, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO movies OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "# Insert users in batches\n",
+ "print(\"Inserting users in batches...\")\n",
+ "user_ids = list(set(ratings_df['userId'].astype(str)))\n",
+ "users = [{'_key': uid} for uid in user_ids]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(users, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO users OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "# Insert ratings (edges) in batches\n",
+ "print(\"Inserting ratings in batches...\")\n",
+ "edges_ratings = [{\n",
+ " \"_from\": f\"users/{row['userId']}\",\n",
+ " \"_to\": f\"movies/{row['movieId']}\",\n",
+ " \"rating\": float(row['rating']),\n",
+ " \"timestamp\": int(row['timestamp'])\n",
+ "} for _, row in ratings_df.iterrows()]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(edges_ratings, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO ratings OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "# Insert tags (edges) in batches\n",
+ "print(\"Inserting tags in batches...\")\n",
+ "edges_tags = [{\n",
+ " \"_from\": f\"users/{row['userId']}\",\n",
+ " \"_to\": f\"movies/{row['movieId']}\",\n",
+ " \"tag\": row['tag'],\n",
+ " \"timestamp\": int(row['timestamp'])\n",
+ "} for _, row in tags_df.iterrows()]\n",
+ "\n",
+ "for batch in tqdm(list(chunks(edges_tags, 1000))):\n",
+ " db.aql.execute(\n",
+ " \"\"\"\n",
+ " FOR doc IN @batch\n",
+ " INSERT doc INTO tags OPTIONS { ignoreErrors: true }\n",
+ " \"\"\",\n",
+ " bind_vars={\"batch\": batch}\n",
+ " )\n",
+ "\n",
+ "print(\"Insertion complete!\")\n"
+ ],
+ "metadata": {
+ "id": "uSIKpwnAEmSl",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "4552040a-cab7-4670-8e1d-4f4492a696c7"
+ },
+ "execution_count": 6,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting movies in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 10/10 [00:01<00:00, 9.59it/s]\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting users in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 1/1 [00:00<00:00, 10.91it/s]\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting ratings in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 101/101 [00:10<00:00, 9.45it/s]\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Inserting tags in batches...\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "100%|██████████| 4/4 [00:00<00:00, 9.51it/s]"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Insertion complete!\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "W5R-S4V7vR8z"
+ },
+ "source": [
+ "## Create the graph"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "id": "siA4yJGd8HvE",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "76417168-bd12-40b5-e617-a489beb1c6be"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Graph 'movies_graph' created with vertex and edge collections.\n"
+ ]
+ }
+ ],
+ "source": [
+ "graph_name = 'movies_graph'\n",
+ "\n",
+ "# If the graph exists, delete it (without dropping the collections)\n",
+ "if db.has_graph(graph_name):\n",
+ " db.delete_graph(graph_name, drop_collections=False)\n",
+ "\n",
+ "# Create the graph\n",
+ "graph = db.create_graph(graph_name)\n",
+ "\n",
+ "# Define the ratings edge (user -> movie)\n",
+ "graph.create_edge_definition(\n",
+ " edge_collection='ratings',\n",
+ " from_vertex_collections=['users'],\n",
+ " to_vertex_collections=['movies']\n",
+ ")\n",
+ "\n",
+ "# Define the tags edge (user -> movie)\n",
+ "graph.create_edge_definition(\n",
+ " edge_collection='tags',\n",
+ " from_vertex_collections=['users'],\n",
+ " to_vertex_collections=['movies']\n",
+ ")\n",
+ "\n",
+ "print(f\"Graph '{graph_name}' created with vertex and edge collections.\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9SCu7-HB6K1r"
+ },
+ "source": [
+ "# Page Rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q0-fDZgK6Qkt"
+ },
+ "source": [
+ "This code loads movie rating data into ArangoDB, builds a graph of users and movies, and runs the PageRank algorithm to rank movies based on user ratings. It then combines PageRank scores with rating counts and averages to identify the top movies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "id": "naA2n93k6_W9",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "63c58833-66a3-4399-b5b2-8303d05d6e8a"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Graph created with 10334 nodes and 100836 edges.\n",
+ "\n",
+ "Top 10 movies combining PageRank, count, and average rating:\n",
+ "\n",
+ "Shawshank Redemption, The (1994)\n",
+ "Forrest Gump (1994)\n",
+ "Pulp Fiction (1994)\n",
+ "Silence of the Lambs, The (1991)\n",
+ "Matrix, The (1999)\n",
+ "Braveheart (1995)\n",
+ "Star Wars: Episode IV - A New Hope (1977)\n",
+ "Schindler's List (1993)\n",
+ "Jurassic Park (1993)\n",
+ "Terminator 2: Judgment Day (1991)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import networkx as nx\n",
+ "from pyArango.connection import *\n",
+ "import numpy as np\n",
+ "\n",
+ "# ArangoDB settings\n",
+ "DB_NAME = \"TUTg7c78306nb9p0eeovgcyu\" #Your DB name\n",
+ "USERNAME = \"TUTpgayvplrllcuf8n46zhz3i\" #Your Username\n",
+ "PASSWORD = \"TUT5mpnyqh8fn880lgt1ihq2v\" #Your Password\n",
+ "HOST = \"https://tutorials.arangodb.cloud:8529\"\n",
+ "\n",
+ "# Function to clean the trailing \".0\" from keys if it exists\n",
+ "def clean_key(s):\n",
+ " if s.endswith('.0'):\n",
+ " return s[:-2]\n",
+ " return s\n",
+ "\n",
+ "# Connect to ArangoDB\n",
+ "conn = Connection(username=USERNAME, password=PASSWORD, arangoURL=HOST)\n",
+ "db = conn[DB_NAME]\n",
+ "\n",
+ "# Fetch ratings (edges) to build the graph\n",
+ "query = \"\"\"\n",
+ "FOR r IN ratings\n",
+ " RETURN {user: r._from, movie: r._to, weight: r.rating}\n",
+ "\"\"\"\n",
+ "edges = db.AQLQuery(query, rawResults=True)\n",
+ "\n",
+ "# Create directed graph\n",
+ "G = nx.DiGraph()\n",
+ "\n",
+ "# Add edges with weights\n",
+ "for e in edges:\n",
+ " G.add_edge(e['user'], e['movie'], weight=e['weight'])\n",
+ "\n",
+ "print(f\"Graph created with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.\")\n",
+ "\n",
+ "# Run PageRank\n",
+ "pr = nx.pagerank(G, alpha=0.85, weight='weight')\n",
+ "\n",
+ "# Filter only movies in the PageRank result\n",
+ "movie_ranks = {k: v for k, v in pr.items() if k.startswith(\"movies/\")}\n",
+ "\n",
+ "# Extract movie _keys applying cleaning\n",
+ "movie_keys = [clean_key(m.split('/')[1]) for m in movie_ranks.keys()]\n",
+ "\n",
+ "# Fetch movie titles\n",
+ "query_movies = \"\"\"\n",
+ "FOR m IN movies\n",
+ " FILTER m._key IN @keys\n",
+ " RETURN {key: m._key, title: m.title}\n",
+ "\"\"\"\n",
+ "movies_data = db.AQLQuery(query_movies, bindVars={\"keys\": movie_keys}, rawResults=True)\n",
+ "id_to_title = {m['key']: m['title'] for m in movies_data}\n",
+ "\n",
+ "# Fetch stats: count and average rating per movie\n",
+ "query_stats = \"\"\"\n",
+ "FOR r IN ratings\n",
+ " COLLECT movie = r._to INTO group = r\n",
+ " LET count = LENGTH(group)\n",
+ " LET avg_rating = AVERAGE(group[*].rating)\n",
+ " RETURN {movie: movie, count: count, avg_rating: avg_rating}\n",
+ "\"\"\"\n",
+ "stats = db.AQLQuery(query_stats, rawResults=True)\n",
+ "\n",
+ "# Create dictionaries for quick access\n",
+ "movie_counts = {s['movie']: s['count'] for s in stats}\n",
+ "movie_avg_ratings = {s['movie']: s['avg_rating'] for s in stats}\n",
+ "\n",
+ "# Prepare arrays for normalization and final score calculation\n",
+ "pr_values = np.array([movie_ranks.get(k, 0) for k in movie_ranks.keys()])\n",
+ "count_values = np.array([movie_counts.get(k, 0) for k in movie_ranks.keys()])\n",
+ "avg_rating_values = np.array([movie_avg_ratings.get(k, 0) for k in movie_ranks.keys()])\n",
+ "\n",
+ "# Simple min-max normalization function\n",
+ "def min_max_normalize(arr):\n",
+ " if arr.max() == arr.min():\n",
+ " return np.zeros_like(arr)\n",
+ " return (arr - arr.min()) / (arr.max() - arr.min())\n",
+ "\n",
+ "pr_norm = min_max_normalize(pr_values)\n",
+ "count_norm = min_max_normalize(count_values)\n",
+ "avg_norm = min_max_normalize(avg_rating_values)\n",
+ "\n",
+ "# Combine scores with weights: 50% PageRank, 30% count, 20% average rating\n",
+ "final_score = 0.5 * pr_norm + 0.3 * count_norm + 0.2 * avg_norm\n",
+ "\n",
+ "# Create a dict with final scores\n",
+ "combined_scores = dict(zip(movie_ranks.keys(), final_score))\n",
+ "\n",
+ "# Sort top 10 movies by combined score\n",
+ "top_10_combined = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:10]\n",
+ "\n",
+ "print(\"\\nTop 10 movies combining PageRank, count, and average rating:\\n\")\n",
+ "for movie_id_str, score in top_10_combined:\n",
+ " movie_key = clean_key(movie_id_str.split('/')[1])\n",
+ " title = id_to_title.get(movie_key, \"Title not found\")\n",
+ " print(title)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "eBD63hEf3JXD"
+ },
+ "source": [
+ "# Next Steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fIF4PVuluT9m"
+ },
+ "source": [
+ "Be sure to check out the community detection tutorial to explore more graph analytics applications using ArangoDB.\n",
+ "\n",
+ "To keep experimenting and working with ArangoDB beyond this temporary setup, you can:\n",
+ "\n",
+ "Get a 2-week free trial on ArangoDB Cloud\n",
+ "\n",
+ "Take the free Graph Course\n",
+ "\n",
+ "Download and install ArangoDB locally\n",
+ "\n",
+ "Keep learning at https://www.arangodb.com/arangodb-training-center/\n",
+ "\n",
+ "Useful resources:\n",
+ "https://www.arangodb.com/docs/stable/aql/tutorial.html\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "#Further links\n"
+ ],
+ "metadata": {
+ "id": "a8NlJY_tIBnJ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Keep learning at https://www.arangodb.com/arangodb-training-center/\n",
+ "\n",
+ "Useful resources: https://www.arangodb.com/docs/stable/aql/tutorial.html"
+ ],
+ "metadata": {
+ "id": "VCEcEfkwIM9Q"
+ }
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file