willian@home:~$

Term frequency–inverse document frequency in PHP

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It’s widely used in information retrieval and text mining. In this article, we will explore how to implement a TF-IDF search in PHP.

What is TF-IDF?

TF-IDF combines two metrics:

  • Term Frequency (TF): Measures how frequently a term appears in a document. The more frequent, the higher the TF value.
  • Inverse Document Frequency (IDF): Measures the importance of a term in the entire corpus. The more documents contain the term, the lower its IDF value.

The formula for TF-IDF is:

\[\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)\]

Where:

  • \(t\) is the term
  • \(d\) is the document
  • \(TF(t, d)\) is the term frequency of \(t\) in \(d\)
  • \(IDF(t)\) is the inverse document frequency of \(t\)

For more details, visit the Wikipedia page on TF-IDF.

Implementing TF-IDF in PHP

Below is a PHP script that implements TF-IDF search on text files stored in a dataset folder. The folder contains 10 example files generated by ChatGPT.

PHP Code

<?php

declare(strict_types=1);

/* Helpers */
function dd(...$vars): void
{
    foreach ($vars as $var) {
        var_dump($var);
        echo PHP_EOL;
    }

    die;
}

function get_documents_with_terms(string $path): array
{
    $documents = array_filter(scandir($path), fn($name) => strpos($name, '.txt') !== false);
    $doc_terms = [];

    foreach ($documents as $document_name) {
        $file_path = "$path/$document_name";
        $content = file_get_contents($file_path);
        $terms = preg_split("/[\s,]+/", $content);
        $terms = array_map(fn (string $val) => strtolower($val), $terms);
        $doc_terms[$document_name] = $terms;
    }

    return $doc_terms;
}

function get_tf_table(array $documents_with_terms): array
{
    return array_map(fn ($el) => array_count_values($el), $documents_with_terms);
}

function get_idf_table(array $documents_with_terms, array $tf_table): array
{
    $terms_table = [];
    $documents_list = array_keys($documents_with_terms);

    foreach ($documents_with_terms as $document_with_terms) {
        $terms_table = array_merge($terms_table, $document_with_terms);
    }

    $terms_in_documents = [];

    foreach ($terms_table as $term) {
        $term = strtolower($term);

        if (empty($terms_in_documents[$term])) {
            $terms_in_documents[$term] = [];
        }

        foreach ($tf_table as $document_name => $terms) {
            if (!in_array($document_name, $terms_in_documents[$term])
                && in_array($term, array_keys($terms))) {
                array_push($terms_in_documents[$term], $document_name);
                break;
            }
        }
    }

    $idf_table = [];
    $documents_count = count($documents_with_terms);

    foreach ($terms_table as $term) {
        $idf_table[$term] = log($documents_count / count($terms_in_documents[$term]));
    }

    return $idf_table;
}

function get_ranking(array $search, array $tf_table, array $idf_table): array
{
    $ranking = [];

    foreach ($search as $s) {
        foreach ($tf_table as $document => $tf) {
            if (!isset($ranking[$document])) {
                $ranking[$document] = [];
            }

            if (!isset($tf[$s]) || !isset($idf_table[$s])) {
                $ranking[$document][] = 0;
                continue;
            }

            $x = $tf[$s] * $idf_table[$s];
            $ranking[$document][] = $x;
        }
    }

    $ranking = array_map(fn ($r) => array_reduce(
        $r,
        fn ($carry, $item) => $carry += $item) / count($search), $ranking
    );

    asort($ranking);
    return array_reverse($ranking);
}

function parse_search(string $search): array
{
    $search = preg_split("/[\s,]+/", $search);
    return array_map(fn ($val) => strtolower($val), $search);
}

function main(int $argc, array $argv): void
{
    if ($argc < 2) {
        echo 'No term to search';
        exit(0);
    }

    $input = parse_search($argv[1]);
    $documents_with_terms = get_documents_with_terms('./dataset');
    $tf_table = get_tf_table($documents_with_terms);
    $idf_table = get_idf_table($documents_with_terms, $tf_table);
    $ranking = get_ranking($input, $tf_table, $idf_table);

    dd($ranking);
}

main($argc, $argv);

Explanation

Helper Functions:

  • dd: Dumps variables and stops execution.

Main Functions:

  • get_documents_with_terms: Reads text files from the dataset folder and splits them into terms.
  • get_tf_table: Calculates the term frequency for each document.
  • get_idf_table: Calculates the inverse document frequency for each term across all documents.
  • get_ranking: Calculates the TF-IDF score for the search terms and ranks the documents.
  • parse_search: Parses and normalizes the search input.
  • main: Orchestrates the TF-IDF calculation and outputs the ranking.

Running the Script

Clone the repository and run the script:

git clone https://github.com/bit-willi/tf-idf-php
cd tf-idf-php
php index.php 'search terms'

For example, to search for “plays childhood market”:

php index.php 'plays childhood market'

Output:

array(10) {
["Finance.txt"]=>
float(3.0701134573253945)
["Education.txt"]=>
float(2.839064397138746)
["Literature.txt"]=>
float(0.5364793041447001)
["Travel.txt"]=>
int(0)
["Technology.txt"]=>
int(0)
["Sports.txt"]=>
int(0)
["Science.txt"]=>
int(0)
["History.txt"]=>
int(0)
["Health.txt"]=>
int(0)
["Environment.txt"]=>
int(0)
}

Conclusion

This PHP script demonstrates how to implement a basic TF-IDF search to rank documents based on term relevance. Although not optimized, it provides a practical example of leveraging TF-IDF for information retrieval. For more details and the complete code, visit the GitHub repository.