Scrape the First Paragraph & Image from a Wikipedia Entry

Posted on July 26, 2010

by Kannan Ramakrishnan

One day, while you’re organizing the content for your latest and greatest website, you may find yourself wishing for an automated way to fetch a description for some of that content.  Perhaps an image, too, to pull into a header or elsewhere on your page.  And since you’ve got 100 other things on your plate, you’d like this content to be implemented automatically, dynamically appearing on certain pages based on a few keywords.  For novice Web developers, this sounds impossible, I know.  But… it’s totally doable.

So off you go in search of good, usable, public content.  Before long you’ll probably realize it yourself, but I’ll save you an hour of digging with this quick pro tip: Wikipedia is your friend. The bottomless content at Wikipedia is a perfect match for what we want to do.  The caveat, though, is that it’s so vast we might scrape the wrong entry.  And if you’re automating a task like this, constantly looking over your shoulder to confirm the content would defeat the entire purpose.  So relying solely on the Wiki gremlins is not the best way to go.

Take, for example, a Wiki search for a popular technology company, say, Apple.  A Web-savvy person like yourself can probably guess what a Wiki search for ‘Apple’ turns up: the entry for a scrumptious kind of fruit, not the company.  As Web developers we should be more astute with our search terms, but sometimes, well, there are those gremlins.

So… to cut the risk of unwanted fruit creeping onto your page, we’re going to combine a Wiki search with a Google search.
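
As a rough sketch of what that combined search looks like, the query can be limited to Wikipedia with Google’s site: operator.  The helper name build_wiki_search_url is made up for this illustration, and the URL layout simply mirrors the one used in the full listing further down; treat it as an assumption rather than a guaranteed Google interface.

```ruby
require 'uri'

# Hypothetical helper (not part of the original listing): builds a Google
# search URL that only returns results from the English Wikipedia.
def build_wiki_search_url(query_item)
  # Collapse runs of whitespace so "  Apple   Inc " becomes "Apple Inc"
  keywords = query_item.strip.gsub(/\s+/, ' ')
  # Form-encode the whole query: spaces become "+", ":" becomes "%3A"
  encoded = URI.encode_www_form_component("#{keywords} site:en.wikipedia.org")
  "http://www.google.com/search?q=#{encoded}"
end

puts build_wiki_search_url("Apple Inc")
```

Adding a word or two to the keywords (for example "Apple Inc" rather than "Apple") is what steers the search toward the company instead of the fruit.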

The code that follows scrapes the first paragraph and image from a Wikipedia entry.  As long as your keywords aren’t really crazy, this should get the job done!

Finally, make sure you don’t forget to credit Wikipedia on your page… and if you have any questions, drop us a line here any time!!

Here’s the code:

require 'hpricot'
require 'open-uri'

def fetch_description(query_item)
    page_title, uri_title = get_wiki_name(query_item)
    return get_wiki_description(page_title, uri_title)
end

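def upload_photo(wiki_photo)
    begin
      base_uri = URI.parse(wiki_photo)
      uploaded_data = open(base_uri)
      #A method defined with def cannot see local variables, so work out the
      #file name first and attach it with a block instead
      filename = base_uri.path.split('/').last
      uploaded_data.define_singleton_method(:original_filename) { filename }
      return (filename.nil? || filename.empty?) ? nil : uploaded_data
    rescue
      #Covers a missing or malformed URL as well as a failed download
      return nil
    end
end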

#Method to fetch wiki page and strip first two <p> Tags
#Returns the cleaned-up text and the URL of the lead image (nil if none found)
def get_wiki_description(page_title, uri_title)
    url =  uri_title
    final_content = ""
    photo_url = nil
    if url.size > 10
      buffer = Hpricot(open(url, "User-Agent" => "reader"+rand(10000).to_s).read)
      body = buffer.search("//div[@id='content']").search("//div[@id='bodyContent']")
      #Capture first two paragraphs of text
      content = body.search("//p")[0..1]

      #Remove the extra spaces and strip html tags from the fetched content
      content.each do |c|
        final_content+=c.inner_html.gsub(/<\/?[^>]*>/, '').gsub(/&#\d+;/,'').gsub(/\([^\)]+\)/,'').gsub(/\[[^\]]+\]/,'').gsub(/ +/,' ')+"\n"
      end

      #Assumption: the first <img> inside the article body is the image we want
      image = body.search("//img").first
      photo_url = image.attributes["src"] if image
    end
    return final_content, photo_url
end

#Method to get the link for wikipedia from google search results
def get_wiki_name(query_item)
    search_keywords = query_item.strip.gsub(/\s+/,'+')
    url = "http://www.google.com/search?q=#{search_keywords}+site%3Aen.wikipedia.org"
    begin
      doc = Hpricot(open(url, "User-Agent" => "reader"+rand(10000).to_s).read)
      #Take the first link of the first search result
      result = doc.search("//div[@id='ires']").search("//li[@class='g']").first.search("//a").first if doc
    rescue
      return '',''
    end
    if result
      #Strip html tags and periods from the result title
      return result.inner_html.gsub(/<\/?[^>]*>/,"").gsub(/\./,""),result.attributes["href"]
    else
      return '',''
    end
end

wiki_description, wiki_photo = fetch_description("Apple")
upload_photo(wiki_photo)
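
To see what the clean-up step does without hitting the network, here is a small sketch that runs the same chain of substitutions on a made-up snippet of paragraph HTML.  The method name clean_paragraph and the sample markup are assumptions for illustration only; real Wikipedia markup is messier, so expect to tweak the patterns.

```ruby
# Hypothetical helper: applies the same substitutions as the listing above
# to a single paragraph of HTML.
def clean_paragraph(html)
  text = html.gsub(/<\/?[^>]*>/, '')   # strip html tags
  text = text.gsub(/&#\d+;/, '')       # drop numeric entities such as &#160;
  text = text.gsub(/\([^\)]+\)/, '')   # drop parenthesised asides
  text = text.gsub(/\[[^\]]+\]/, '')   # drop footnote markers like [1]
  text.gsub(/ +/, ' ')                 # squeeze runs of spaces into one
end

sample = '<b>Apple Inc.</b> (formerly Apple Computer) is a company<sup>[1]</sup> ' \
         'based in <a href="/wiki/Cupertino">Cupertino</a>.'

puts clean_paragraph(sample)
# prints: Apple Inc. is a company based in Cupertino.
```

Note that the parentheses pattern throws away everything in brackets, which is handy for pronunciation guides but will also drop asides you might have wanted to keep.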

And this is how it looks implemented live, in context:
