Term frequency–inverse document frequency in PHP
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It’s widely used in information retrieval and text mining. In this article, we will explore how to implement a TF-IDF search in PHP.
What is TF-IDF?
TF-IDF combines two metrics:
- Term Frequency (TF): Measures how frequently a term appears in a document. The more frequent, the higher the TF value.
- Inverse Document Frequency (IDF): Measures how rare a term is across the entire corpus. The more documents contain the term, the lower its IDF value, so common words carry little weight.
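One common concrete choice, and the one the script later in this article appears to follow (it counts raw occurrences and calls PHP's natural-log `log()`), is shown here. The symbols \(f_{t,d}\) (the number of times term \(t\) occurs in document \(d\)) and \(N\) (the number of documents in the corpus \(D\)) are notation introduced for this sketch:
\[\mathrm{tf}(t,d) = f_{t,d}, \qquad \mathrm{idf}(t,D) = \ln\frac{N}{|\{d \in D : t \in d\}|}\]
Under these definitions, a term that appears in every document gets \(\mathrm{idf} = \ln 1 = 0\) and therefore contributes nothing to a score.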
The formula for TF-IDF is:
\[\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)\]
Where:
- \(t\) is the term
- \(d\) is the document
- \(D\) is the corpus (the collection of documents)
- \(\mathrm{tf}(t, d)\) is the term frequency of \(t\) in \(d\)
- \(\mathrm{idf}(t, D)\) is the inverse document frequency of \(t\) across \(D\)
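To make the formula concrete, here is a minimal sketch that evaluates it on a tiny in-memory corpus. The corpus contents and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are assumptions made for illustration, not part of the article's script. It assumes raw counts for TF and a natural logarithm for IDF.

```php
<?php

// Toy corpus (hypothetical data): document name => list of lowercase tokens.
$corpus = [
    'a.txt' => ['the', 'cat', 'sat'],
    'b.txt' => ['the', 'dog', 'sat', 'sat'],
    'c.txt' => ['the', 'bird'],
];

// Raw-count term frequency: how many times $term occurs in $tokens.
function term_frequency(string $term, array $tokens): int
{
    return count(array_keys($tokens, $term, true));
}

// Natural-log IDF; returns 0.0 when no document contains the term.
function inverse_document_frequency(string $term, array $corpus): float
{
    $containing = count(array_filter(
        $corpus,
        fn (array $tokens) => in_array($term, $tokens, true)
    ));

    return $containing === 0 ? 0.0 : log(count($corpus) / $containing);
}

// TF-IDF is the product of the two measures.
function tf_idf(string $term, array $tokens, array $corpus): float
{
    return term_frequency($term, $tokens) * inverse_document_frequency($term, $corpus);
}

foreach ($corpus as $name => $tokens) {
    printf(
        "%s: sat=%.3f the=%.3f\n",
        $name,
        tf_idf('sat', $tokens, $corpus),
        tf_idf('the', $tokens, $corpus)
    );
}
// Prints:
// a.txt: sat=0.405 the=0.000
// b.txt: sat=0.811 the=0.000
// c.txt: sat=0.000 the=0.000
```

In this toy corpus "the" occurs in all three documents, so its score is zero everywhere, while "sat" scores highest in the document that repeats it.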
For more details, visit the Wikipedia page on TF-IDF.
Implementing TF-IDF in PHP
Below is a PHP script that implements TF-IDF search on text files stored in a dataset folder. The folder contains 10 example files generated by ChatGPT.
PHP Code
<?php

/* Helpers */

// Dump each variable, then stop the script.
function dd(...$vars): void
{
    foreach ($vars as $var) {
        var_dump($var);
        echo PHP_EOL;
    }
    die();
}

// Read every .txt file in $path and split it into lowercase terms.
function get_documents_with_terms(string $path): array
{
    $documents = array_filter(scandir($path), fn($name) => strpos($name, '.txt') !== false);
    $doc_terms = [];
    foreach ($documents as $document_name) {
        $file_path = "$path/$document_name";
        $content = file_get_contents($file_path);
        // Split on whitespace and commas, dropping empty tokens.
        $terms = preg_split("/[\s,]+/", $content, -1, PREG_SPLIT_NO_EMPTY);
        $terms = array_map(fn (string $val) => strtolower($val), $terms);
        $doc_terms[$document_name] = $terms;
    }
    return $doc_terms;
}

// Count how often each term occurs in each document (raw term frequency).
function get_tf_table(array $documents_with_terms): array
{
    return array_map(fn ($el) => array_count_values($el), $documents_with_terms);
}

// Compute log(N / number of documents containing the term) for every term.
function get_idf_table(array $documents_with_terms, array $tf_table): array
{
    // Flatten all documents into a single list of terms.
    $terms_table = [];
    foreach ($documents_with_terms as $document_with_terms) {
        $terms_table = array_merge($terms_table, $document_with_terms);
    }

    // For each term, collect the names of the documents that contain it.
    $terms_in_documents = [];
    foreach ($terms_table as $term) {
        $term = strtolower($term);
        if (empty($terms_in_documents[$term])) {
            $terms_in_documents[$term] = [];
        }
        foreach ($tf_table as $document_name => $terms) {
            if (!in_array($document_name, $terms_in_documents[$term], true)
                && isset($terms[$term])) {
                array_push($terms_in_documents[$term], $document_name);
            }
        }
    }

    $idf_table = [];
    $documents_count = count($documents_with_terms);
    foreach ($terms_table as $term) {
        $idf_table[$term] = log($documents_count / count($terms_in_documents[$term]));
    }
    return $idf_table;
}

// Score each document as the mean TF-IDF of the search terms, best match first.
function get_ranking(array $search, array $tf_table, array $idf_table): array
{
    $ranking = [];
    foreach ($search as $s) {
        foreach ($tf_table as $document => $tf) {
            if (!isset($ranking[$document])) {
                $ranking[$document] = [];
            }
            // A term missing from the document or the corpus contributes 0.
            if (!isset($tf[$s]) || !isset($idf_table[$s])) {
                $ranking[$document][] = 0;
                continue;
            }
            $ranking[$document][] = $tf[$s] * $idf_table[$s];
        }
    }
    // Average the per-term scores for each document.
    $ranking = array_map(fn ($r) => array_sum($r) / count($search), $ranking);
    arsort($ranking); // highest score first, document names kept as keys
    return $ranking;
}
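
// Split the search string into lowercase terms, dropping empty tokens.
function parse_search(string $search): array
{
    $search = preg_split("/[\s,]+/", $search, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(fn ($val) => strtolower($val), $search);
}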

function main(int $argc, array $argv): void
{
    if ($argc < 2) {
        echo 'No term to search' . PHP_EOL;
        return;
    }

    $input = parse_search($argv[1]);
    $documents_with_terms = get_documents_with_terms('./dataset');
    $tf_table = get_tf_table($documents_with_terms);
    $idf_table = get_idf_table($documents_with_terms, $tf_table);
    $ranking = get_ranking($input, $tf_table, $idf_table);

    dd($ranking);
}

main($argc, $argv);
Helper Functions:
- dd: Dumps variables and stops execution.
Main Functions:
- get_documents_with_terms: Reads text files from the dataset folder and splits them into terms.
- get_tf_table: Calculates the term frequency for each document.
- get_idf_table: Calculates the inverse document frequency for each term across all documents.
- get_ranking: Calculates the TF-IDF score for the search terms and ranks the documents.
- parse_search: Parses and normalizes the search input.
- main: Orchestrates the TF-IDF calculation and outputs the ranking.
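The first three steps in the list above (tokenizing, counting term frequencies, computing IDF) can be sketched on in-memory strings instead of files. This is an illustrative sketch, not the script's own code: the sample texts, the `tokenize` helper, and the single-pass document-frequency count are assumptions chosen to keep the example short.

```php
<?php

// Hypothetical in-memory documents standing in for the files in the dataset folder.
$raw_documents = [
    'a.txt' => "The market opens early,\nthe market closes late",
    'b.txt' => 'The childhood plays, childhood games',
];

// Split on whitespace and commas, drop empty tokens, lowercase everything.
function tokenize(string $text): array
{
    $tokens = preg_split('/[\s,]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    return array_map('strtolower', $tokens);
}

// Term frequency: document name => [term => raw count].
$tf_table = array_map(
    fn (string $text) => array_count_values(tokenize($text)),
    $raw_documents
);

// Document frequency: term => number of documents containing it.
$document_frequency = [];
foreach ($tf_table as $counts) {
    foreach (array_keys($counts) as $term) {
        $document_frequency[$term] = ($document_frequency[$term] ?? 0) + 1;
    }
}

// Inverse document frequency with a natural logarithm.
$documents_count = count($raw_documents);
$idf_table = array_map(
    fn (int $df) => log($documents_count / $df),
    $document_frequency
);

print_r($tf_table['a.txt']);
print_r($idf_table);
```

Because "the" occurs in both sample documents its IDF is 0, while every other term occurs in only one document and gets `log(2)`.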
Running the Script
Clone the repository and run the script:
git clone https://github.com/bit-willi/tf-idf-php
cd tf-idf-php
php index.php 'search terms'
For example, to search for “plays childhood market”:
php index.php 'plays childhood market'
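The script prints the ranking with var_dump, one entry per document in the dataset. The output begins as follows (truncated here):
array(10) {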
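To see how a query like this one is scored, here is a minimal sketch that averages each document's TF-IDF over the query terms and sorts the result. The two-document tables and the `score_documents` name are hypothetical; the values are illustrative and not taken from the article's dataset.

```php
<?php

// Hypothetical precomputed term-frequency table for two documents.
$tf_table = [
    'a.txt' => ['market' => 1],
    'b.txt' => ['childhood' => 2, 'plays' => 1],
];

// Each term occurs in one of the two documents, so every IDF is log(2 / 1).
$idf_table = [
    'market' => log(2),
    'childhood' => log(2),
    'plays' => log(2),
];

// Mean TF-IDF over the query terms; a term missing from a document adds 0.
function score_documents(array $query, array $tf_table, array $idf_table): array
{
    $scores = [];
    foreach ($tf_table as $document => $tf) {
        $total = 0.0;
        foreach ($query as $term) {
            $total += ($tf[$term] ?? 0) * ($idf_table[$term] ?? 0.0);
        }
        $scores[$document] = count($query) > 0 ? $total / count($query) : 0.0;
    }
    arsort($scores); // highest score first, document names kept as keys

    return $scores;
}

$ranking = score_documents(['plays', 'childhood', 'market'], $tf_table, $idf_table);
var_dump($ranking);
```

Here b.txt matches two of the three query terms (three occurrences in total) and ranks first with a score of `log(2)`, while a.txt matches only "market" and scores `log(2) / 3`.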
This PHP script demonstrates how to implement a basic TF-IDF search to rank documents based on term relevance. Although not optimized, it provides a practical example of leveraging TF-IDF for information retrieval. For more details and the complete code, visit the GitHub repository.