Extracting Keywords from PDFs with Yahoo Term Extraction

Yahoo!'s Term Extraction Service extracts significant words or phrases from a larger body of text. It has many uses, not the least of which is providing keywords, or tags in Web 2.0 jargon, to help classify and organize a library of content. The following PHP script uses the Term Extraction service to analyze a PDF file. With a little more work, it could be expanded to handle Microsoft Word, Excel, and PowerPoint files. Automatic keyword extraction would be a helpful feature to build into your blog or CMS, and modules that do this already exist for Drupal and WordPress.
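
As a rough sketch of how that expansion to other formats might start, the function below picks a text-extraction command from the file extension. Everything in it is an assumption rather than part of the original script: `build_extract_command` is a hypothetical helper, and the `catdoc`, `xls2csv`, and `catppt` tools (from the catdoc package, which reads the legacy binary Office formats) would have to be installed separately.

```php
<?php
/**
 * Hypothetical helper: builds a shell command that writes the plain text
 * of $input to $output, chosen by the extension of $input.
 * Returns null when the extension is not supported.
 */
function build_extract_command($input, $output)
{
    // Assumed converters; none of these ship with PHP.
    $converters = array(
        'pdf' => 'pdftotext %s %s', // pdftotext takes the output file as an argument
        'doc' => 'catdoc %s > %s',  // the catdoc tools write to stdout
        'xls' => 'xls2csv %s > %s',
        'ppt' => 'catppt %s > %s',
    );

    $ext = strtolower(pathinfo($input, PATHINFO_EXTENSION));
    if (!isset($converters[$ext])) {
        return null;
    }

    return sprintf($converters[$ext], escapeshellarg($input), escapeshellarg($output));
}

echo build_extract_command('report.doc', '/tmp/report.txt'), "\n";
```

Files created with `tempnam()` carry no extension, so a script built this way would need to take the extension from the source URL instead of from the temporary file.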

<?php
// discover where the pdftotext tool is
$catpdf = trim((string) shell_exec('which pdftotext'));
if ($catpdf === '')
{
    die("pdftotext was not found on this system\n");
}

// the PDF file to analyze
$source = 'http://example.com/my_file.pdf';

// will copy file to a local temporary file
$temp_pdf_file = tempnam(sys_get_temp_dir(), "ek");

// see below
if (!download_file($source, $temp_pdf_file))
{
    unlink($temp_pdf_file);
    die("Could not download $source\n");
}

// save text contents of pdf source to another temp file
$extract_file = tempnam(sys_get_temp_dir(), "ek");
exec(escapeshellarg($catpdf) . ' ' . escapeshellarg($temp_pdf_file) . ' ' . escapeshellarg($extract_file));

// fetch and output terms
$contents = file_get_contents($extract_file);
if ($terms = get_yahoo_terms($contents))
{
    echo "\nYahoo terms for the file $source";
    foreach ($terms as $term)
    {
        echo "\n$term";
    }
    echo "\n";
}

// hide our footsteps
unlink($temp_pdf_file);
unlink($extract_file);

/**
 * Uses curl to copy $source to a local file $dest
 * @param string $source URL of the file to download
 * @param string $dest path of the local file to write
 * @return bool true on success, false on failure
 */
function download_file($source, $dest)
{
    $out = fopen($dest, 'wb');
    if ($out === false)
    {
        return false;
    }

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_FILE, $out);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, $source);

    $ok = curl_exec($ch);

    curl_close($ch);

    // flush and release the handle so pdftotext sees the complete file
    fclose($out);

    return $ok !== false;
}

/**
 * Uses curl to query the Yahoo term extraction service for meaningful terms
 * @param string $content text to analyze
 * @return mixed array on success or null on failure
 */
function get_yahoo_terms($content)
{
    $SERVICE_URL = 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction';
    $app_id = 'F1_Testing';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $SERVICE_URL);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

    curl_setopt($ch, CURLOPT_POSTFIELDS, 'appid=' . urlencode($app_id) . '&context=' . urlencode($content) . '&output=php');
    $raw = curl_exec($ch);
    curl_close($ch);

    // curl_exec returns false when the request itself failed
    if ($raw === false)
    {
        return null;
    }

    // output=php asks the service for a serialized PHP array
    if ($raw = unserialize($raw))
    {
        if (isset($raw['ResultSet']['Result']))
        {
            return $raw['ResultSet']['Result'];
        }
    }

    return null;
}
?>
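
The response handling in the script can be tried without touching the network. The sketch below is a minimal, offline illustration: `parse_terms` is a hypothetical helper, the sample payload is made up, and its shape is inferred only from the `['ResultSet']['Result']` lookup the script performs. It also passes `allowed_classes => false` (available from PHP 7) so that unserializing remote data cannot instantiate objects, and it assumes a lone result might arrive as a string rather than a list.

```php
<?php
/**
 * Hypothetical helper: pulls the term list out of a serialized response.
 * Returns an array of terms, or an empty array if the payload is unusable.
 */
function parse_terms($raw)
{
    // allowed_classes => false keeps unserialize() from creating objects
    $data = @unserialize($raw, array('allowed_classes' => false));
    if (!is_array($data) || !isset($data['ResultSet']['Result'])) {
        return array();
    }

    // Assumption: a single result may be a plain string, so normalize to an array
    return (array) $data['ResultSet']['Result'];
}

// Made-up payload in the shape the script expects
$sample = serialize(array(
    'ResultSet' => array('Result' => array('power plants', 'carbon dioxide')),
));

print_r(parse_terms($sample));
var_dump(parse_terms('not serialized data'));
```

Returning an empty array instead of null lets the caller loop over the result without a separate failure check, which is a design preference and not something the original script does.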

As a sample of what to expect, I ran the script against "Calculating CARMA: Global Estimation of CO2 Emissions from the Power Sector - Working Paper 145"; the terms it returned are listed below. The list is fairly accurate and even includes the name of one of the authors.

global estimation
geographical scales
carbon emissions
co2 emissions
global citizens
global poverty
david wheeler
rigorous research
power plants
power sector
fossil energy
poverty and inequality
solar wind
energy sources
monitoring system
keystrokes
groundwork
strengths and weaknesses
carbon dioxide
aggregation
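
To use terms like these as tags in a blog or CMS, they usually need to be normalized first. The sketch below is one possible approach and is not part of the original script: `terms_to_tags` is a hypothetical helper that lowercases each term, replaces runs of non-alphanumeric characters with hyphens, and drops duplicates.

```php
<?php
/**
 * Hypothetical helper: turns extracted terms into unique, URL-friendly tags.
 */
function terms_to_tags(array $terms)
{
    $tags = array();
    foreach ($terms as $term) {
        // lowercase, then collapse anything that is not a-z or 0-9 into a hyphen
        $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($term)), '-');
        if ($slug !== '') {
            $tags[$slug] = true; // array keys de-duplicate the slugs
        }
    }

    // PHP turns numeric string keys into integers, so cast them back
    return array_map('strval', array_keys($tags));
}

print_r(terms_to_tags(array('co2 emissions', 'David Wheeler', 'CO2 Emissions')));
```

For the three inputs shown, this yields the two tags `co2-emissions` and `david-wheeler`.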