Extracting Keywords from PDFs with Yahoo Term Extraction
Yahoo!'s Term Extraction Service extracts significant words or phrases from a larger body of text. It has many uses, not the least of which is providing keywords, or tags in Web 2.0 jargon, to help classify and organize a library of content. The following PHP script uses the Term Extraction service to analyze a PDF file. With a little more work, it could be expanded to handle Microsoft Word, Excel, and PowerPoint files. Automatic keyword extraction would be a helpful feature to build into your blog or CMS, and modules that do this already exist for Drupal and WordPress.
<?php
// discover where the pdftotext tool is
$catpdf = trim((string) shell_exec('which pdftotext'));
if ($catpdf === '')
{
    exit("pdftotext was not found on this system\n");
}

// the PDF file to analyze
$source = 'http://example.com/my_file.pdf';

// will copy file to a local temporary file
$temp_pdf_file = tempnam(sys_get_temp_dir(), "ek");

// see below
if (!download_file($source, $temp_pdf_file))
{
    unlink($temp_pdf_file);
    exit("Could not download $source\n");
}

// save text contents of pdf source to another temp file
$extract_file = tempnam(sys_get_temp_dir(), "ek");
exec(escapeshellarg($catpdf) . ' ' . escapeshellarg($temp_pdf_file) . ' ' . escapeshellarg($extract_file));

// fetch and output terms
$contents = file_get_contents($extract_file);
if ($terms = get_yahoo_terms($contents))
{
    echo "\nYahoo terms for the file $source";
    foreach ($terms as $term)
    {
        echo "\n$term";
    }
    echo "\n";
}

// hide our footsteps
unlink($temp_pdf_file);
unlink($extract_file);

/**
 * Uses curl to copy $source to a local file $dest
 * @param string $source URL to download
 * @param string $dest local path to write to
 * @return bool true on success, false on failure
 */
function download_file($source, $dest)
{
    $out = fopen($dest, 'wb');
    if ($out === false)
    {
        return false;
    }

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_FILE, $out);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, $source);

    $ok = curl_exec($ch);

    curl_close($ch);

    // close the handle so the download is flushed before pdftotext reads it
    fclose($out);

    return $ok !== false;
}

/**
 * Uses curl to query yahoo term extraction service for meaningful terms
 * @param string $content text to analyze
 * @return mixed, array on success or null on failure
 */
function get_yahoo_terms($content)
{
    $SERVICE_URL = 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction';
    $app_id = 'F1_Testing';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $SERVICE_URL);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    curl_setopt($ch, CURLOPT_POSTFIELDS, 'appid=' . urlencode($app_id) . '&context=' . urlencode($content) . '&output=php');
    $raw = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns false when the request fails
    if (!is_string($raw))
    {
        return null;
    }

    // the response is untrusted, so never let it instantiate objects
    $data = @unserialize($raw, array('allowed_classes' => false));
    if (is_array($data) && isset($data['ResultSet']['Result']))
    {
        return $data['ResultSet']['Result'];
    }

    return null;
}
?>
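To see how the response handling behaves without calling the service, the sketch below builds a serialized payload by hand and parses it. The ResultSet/Result layout is taken from the isset() check in the script; the helper name parse_terms_response and the sample terms are made up for illustration, and the allowed_classes option assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: pulls the term list out of a serialized response.
// The ResultSet/Result layout mirrors the check in the script above;
// everything else is an assumption for illustration.
function parse_terms_response($raw)
{
    // A failed request yields false rather than a string.
    if (!is_string($raw) || $raw === '') {
        return null;
    }

    // Refuse to instantiate objects from untrusted data (PHP 7.0+).
    $data = @unserialize($raw, ['allowed_classes' => false]);
    if (!is_array($data) || !isset($data['ResultSet']['Result'])) {
        return null;
    }

    // Cast to array so callers can always foreach over the result.
    return (array) $data['ResultSet']['Result'];
}

// Hand-built stand-in for a service response.
$fake = serialize(['ResultSet' => ['Result' => ['power plants', 'co2 emissions']]]);
print_r(parse_terms_response($fake));

// Anything that is not valid serialized data comes back as null.
var_dump(parse_terms_response('not serialized data'));
```

Keeping the parsing in its own function makes it possible to exercise the failure paths (empty body, malformed data, missing keys) without network access.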
As a sample of what to expect, I ran the script against "Calculating CARMA: Global Estimation of CO2 Emissions from the Power Sector - Working Paper 145"; the list of terms returned is below. The list is fairly accurate, and even includes the name of one of the authors.
global estimation
geographical scales
carbon emissions
co2 emissions
global citizens
global poverty
david wheeler
rigorous research
power plants
power sector
fossil energy
poverty and inequality
solar wind
energy sources
monitoring system
keystrokes
groundwork
strengths and weaknesses
carbon dioxide
aggregation
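Since the terms are meant to serve as tags, here is a minimal sketch of turning the returned phrases into URL-friendly tag slugs. The function name terms_to_tags and the slug rules (lowercase, runs of non-alphanumerics collapsed to a hyphen) are assumptions for illustration, not part of the original script or of any CMS API.

```php
<?php
// Hypothetical helper: converts extracted phrases into unique tag slugs.
function terms_to_tags(array $terms)
{
    $tags = [];
    foreach ($terms as $term) {
        // Lowercase, then collapse runs of non-alphanumerics into one hyphen.
        $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower(trim($term)));
        $slug = trim($slug, '-');

        // Skip empty slugs and ones already collected.
        if ($slug !== '' && !in_array($slug, $tags, true)) {
            $tags[] = $slug;
        }
    }
    return $tags;
}

print_r(terms_to_tags(['co2 emissions', 'David Wheeler', 'strengths and weaknesses']));
```

The slugs keep the order in which the terms were returned, so any ranking implied by that order carries over to the tags.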




