{"id":395,"date":"2009-11-08T05:11:46","date_gmt":"2009-11-08T05:11:46","guid":{"rendered":"http:\/\/euvolution.com\/futurist-transhuman-news-blog\/?p=395"},"modified":"2009-11-08T05:11:46","modified_gmt":"2009-11-08T05:11:46","slug":"n-gram-analysis-using-ruby","status":"publish","type":"post","link":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/artificial-intelligence\/n-gram-analysis-using-ruby.php","title":{"rendered":"N-GRAM analysis using Ruby"},"content":{"rendered":"<p>I dusted off some old code today to look at common word pairs in some customer data. N-gram analysis finds the most common bi-grams (two-word combinations), tri-grams (three-word combinations), etc. The code is simple, and I share it here in case you ever need to do the same thing:<\/p><pre>require 'zip\/zipfilesystem'<br><br>def words text<br>  text.downcase.scan(\/[a-z]+\/)<br>end<br><br>Zip::ZipFile.open('..\/text.txt.zip') { |zipFile| # training data<br>  $words = words(zipFile.read('text.txt'))       # is in a ZIP file<br>}<br><br>bi_grams = Hash.new(0)<br>tri_grams = Hash.new(0)<br><br>num = $words.length - 2<br>num.times {|i|<br>  bi = $words[i] + ' ' + $words[i+1]<br>  tri = bi + ' ' + $words[i+2]<br>  bi_grams[bi] += 1<br>  tri_grams[tri] += 1<br>}<br><br>puts \"bi-grams:\"<br>bb = bi_grams.sort{|a,b| b[1] &lt;=&gt; a[1]}  # highest count first<br>(num \/ 10).times {|i|  puts \"#{bb[i][0]} : #{bb[i][1]}\"}<br>puts \"tri-grams:\"<br>tt = tri_grams.sort{|a,b| b[1] &lt;=&gt; a[1]}<br>(num \/ 10).times {|i|  puts \"#{tt[i][0]} : #{tt[i][1]}\"}<\/pre><p>Output might look like this:<\/p><pre>bi-grams:<br>in the : 561<br>in java : 213<br>...<br>tri-grams:<br>in the code : 119<br>java source code : 78<br>...<\/pre><p>Cool stuff. 
Ruby is my favorite language for tool building.<\/p>","protected":false},"excerpt":{"rendered":"<p>I dusted off some old code today to look at common word pairs in some customer data. N-gram analysis finds the most common bi-grams (two-word combinations), tri-grams (three-word combinations), etc. The code is simple, and I share it &hellip; <a href=\"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/artificial-intelligence\/n-gram-analysis-using-ruby.php\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"limit_modified_date":"","last_modified_date":"","_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[13],"tags":[],"class_list":["post-395","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence"],"modified_by":null,"_links":{"self":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts\/395"}],"collection":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/comments?post=395"}],"version-history":[{"count":0,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-jso
n\/wp\/v2\/posts\/395\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/media?parent=395"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/categories?post=395"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.euvolution.com\/futurist-transhuman-news-blog\/wp-json\/wp\/v2\/tags?post=395"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}