Term frequency–inverse document frequency in PHP
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It’s widely used in information retrieval and text mining. In this article, we will explore how to implement a TF-IDF search in PHP.
What is TF-IDF?
TF-IDF combines two metrics:
- Term Frequency (TF): Measures how frequently a term appears in a document. The more frequent, the higher the TF value.
- Inverse Document Frequency (IDF): Measures the importance of a term in the entire corpus. The more documents contain the term, the lower its IDF value.
The formula for TF-IDF is:
\[\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)\]Where:
- \(t\) is the term
- \(d\) is the document
- \(TF(t, d)\) is the term frequency of \(t\) in \(d\)
- \(IDF(t)\) is the inverse document frequency of \(t\)
For more details, visit the Wikipedia page on TF-IDF.
Implementing TF-IDF in PHP
Below is a PHP script that implements TF-IDF search on text files stored in a dataset
folder. The folder contains 10 example files generated by ChatGPT.
PHP Code
<?php
declare(strict_types=1);
/* Helpers */
function dd(...$vars): void
{
foreach ($vars as $var) {
var_dump($var);
echo PHP_EOL;
}
die;
}
function get_documents_with_terms(string $path): array
{
$documents = array_filter(scandir($path), fn($name) => strpos($name, '.txt') !== false);
$doc_terms = [];
foreach ($documents as $document_name) {
$file_path = "$path/$document_name";
$content = file_get_contents($file_path);
$terms = preg_split("/[\s,]+/", $content);
$terms = array_map(fn (string $val) => strtolower($val), $terms);
$doc_terms[$document_name] = $terms;
}
return $doc_terms;
}
function get_tf_table(array $documents_with_terms): array
{
return array_map(fn ($el) => array_count_values($el), $documents_with_terms);
}
function get_idf_table(array $documents_with_terms, array $tf_table): array
{
$terms_table = [];
$documents_list = array_keys($documents_with_terms);
foreach ($documents_with_terms as $document_with_terms) {
$terms_table = array_merge($terms_table, $document_with_terms);
}
$terms_in_documents = [];
foreach ($terms_table as $term) {
$term = strtolower($term);
if (empty($terms_in_documents[$term])) {
$terms_in_documents[$term] = [];
}
foreach ($tf_table as $document_name => $terms) {
if (!in_array($document_name, $terms_in_documents[$term])
&& in_array($term, array_keys($terms))) {
array_push($terms_in_documents[$term], $document_name);
break;
}
}
}
$idf_table = [];
$documents_count = count($documents_with_terms);
foreach ($terms_table as $term) {
$idf_table[$term] = log($documents_count / count($terms_in_documents[$term]));
}
return $idf_table;
}
function get_ranking(array $search, array $tf_table, array $idf_table): array
{
$ranking = [];
foreach ($search as $s) {
foreach ($tf_table as $document => $tf) {
if (!isset($ranking[$document])) {
$ranking[$document] = [];
}
if (!isset($tf[$s]) || !isset($idf_table[$s])) {
$ranking[$document][] = 0;
continue;
}
$x = $tf[$s] * $idf_table[$s];
$ranking[$document][] = $x;
}
}
$ranking = array_map(fn ($r) => array_reduce(
$r,
fn ($carry, $item) => $carry += $item) / count($search), $ranking
);
asort($ranking);
return array_reverse($ranking);
}
function parse_search(string $search): array
{
$search = preg_split("/[\s,]+/", $search);
return array_map(fn ($val) => strtolower($val), $search);
}
function main(int $argc, array $argv): void
{
if ($argc < 2) {
echo 'No term to search';
exit(0);
}
$input = parse_search($argv[1]);
$documents_with_terms = get_documents_with_terms('./dataset');
$tf_table = get_tf_table($documents_with_terms);
$idf_table = get_idf_table($documents_with_terms, $tf_table);
$ranking = get_ranking($input, $tf_table, $idf_table);
dd($ranking);
}
main($argc, $argv);
Explanation
Helper Functions:
- dd: Dumps variables and stops execution.
Main Functions:
- get_documents_with_terms: Reads text files from the dataset folder and splits them into terms.
- get_tf_table: Calculates the term frequency for each document.
- get_idf_table: Calculates the inverse document frequency for each term across all documents.
- get_ranking: Calculates the TF-IDF score for the search terms and ranks the documents.
- parse_search: Parses and normalizes the search input.
- main: Orchestrates the TF-IDF calculation and outputs the ranking.
Running the Script
Clone the repository and run the script:
git clone https://github.com/bit-willi/tf-idf-php
cd tf-idf-php
php index.php 'search terms'
For example, to search for “plays childhood market”:
php index.php 'plays childhood market'
Output:
array(10) {
["Finance.txt"]=>
float(3.0701134573253945)
["Education.txt"]=>
float(2.839064397138746)
["Literature.txt"]=>
float(0.5364793041447001)
["Travel.txt"]=>
int(0)
["Technology.txt"]=>
int(0)
["Sports.txt"]=>
int(0)
["Science.txt"]=>
int(0)
["History.txt"]=>
int(0)
["Health.txt"]=>
int(0)
["Environment.txt"]=>
int(0)
}
Conclusion
This PHP script demonstrates how to implement a basic TF-IDF search to rank documents based on term relevance. Although not optimized, it provides a practical example of leveraging TF-IDF for information retrieval. For more details and the complete code, visit the GitHub repository.