N-gram analysis using Ruby

I dusted off some old code today to look at common word pairs in some customer data. N-gram analysis counts the most common bi-grams (two-word sequences), tri-grams (three-word sequences), and so on. The code is simple, and I share it here in case you ever need to do the same thing:

# Note: this is the pre-1.0 rubyzip API; rubyzip 1.0 and later use require 'zip' and Zip::File
require 'zip/zipfilesystem'

def words text
text.downcase.scan(/[a-z]+/)
end

Zip::ZipFile.open('../text.txt.zip') { |zipFile| # training data
$words = words(zipFile.read('text.txt')) # is in a ZIP file
}

bi_grams = Hash.new(0)
tri_grams = Hash.new(0)

num = $words.length - 2
num.times {|i|
bi = $words[i] + ' ' + $words[i+1]
tri = bi + ' ' + $words[i+2]
bi_grams[bi] += 1
tri_grams[tri] += 1
}

puts "bi-grams:"
bb = bi_grams.sort{|a,b| b[1] <=> a[1]} # sort by count, most frequent first
(num / 10).times {|i| puts "#{bb[i][0]} : #{bb[i][1]}"}
puts "tri-grams:"
tt = tri_grams.sort{|a,b| b[1] <=> a[1]} # sort by count, most frequent first
(num / 10).times {|i| puts "#{tt[i][0]} : #{tt[i][1]}"}

Output might look like this:

bi-grams:
in the : 561
in java : 213
...
tri-grams:
in the code : 119
java source code : 78
...
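
If you need 4-grams or longer, the same counting idea can probably be generalized with Ruby's Enumerable#each_cons, which yields every run of n consecutive elements. The following is only a minimal sketch, not the script above: the helper names ngram_counts and top_ngrams and the sample sentence are made up for illustration, and it reads from a string rather than a ZIP file. One difference worth noting is that each_cons also counts the final bi-gram, which the index loop above skips because it stops at length - 2.

```ruby
# Illustrative sketch; ngram_counts and top_ngrams are hypothetical helper names.

# Same tokenizer as the script above: lowercase, letters only.
def words(text)
  text.downcase.scan(/[a-z]+/)
end

# Count every run of n consecutive tokens, keyed by the space-joined phrase.
def ngram_counts(tokens, n)
  counts = Hash.new(0)
  tokens.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

# Return the `limit` most frequent n-grams as [phrase, count] pairs.
# The order among phrases with equal counts is not guaranteed.
def top_ngrams(tokens, n, limit)
  ngram_counts(tokens, n).sort_by { |_phrase, count| -count }.first(limit)
end

# Made-up sample text, standing in for the real training data.
tokens = words("The cat sat on the mat and the cat slept")

top_ngrams(tokens, 2, 3).each do |phrase, count|
  puts "#{phrase} : #{count}"
end
```

Using first(limit) instead of indexing a fixed number of times also avoids running off the end of the sorted list when the text has fewer distinct n-grams than requested.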

Cool stuff. Ruby is my favorite language for tool building.
