Counting only specific phrases using Apache Pig and a JRuby UDF
07 May 2015

Recently, I had to count specific phrases across all reviews on Gogobot.
This was only a small part of a bigger data science project that tags places using the review text that users wrote, which is really what you want to know about a place. Knowing that people like you say a place has great breakfast usually means it's good for you as well, and we will bump that place up in your lists.
There's a lot more you can do with this kind of analysis; your imagination is the limit.
Get Pigging
The internet is basically swamped with "word count" Pig applications that take a text file (or some "big" text files) and count the words in them, but I had to do something different: I needed to count only the specific phrases we cared about.
I chose Apache Pig running on Amazon Elastic MapReduce; this way, I can throw as many machines as I want at the job and let it process.
If we simplify the object, it looks something like this:
class Review
  def place_id
    1234
  end

  def description
    "This is my review"
  end
end
Obviously, this data is usually stored in the database, so the first thing we need to do is export it and clean out any irregularities that might break Pig processing (line breaks, for example).
mysql -B -u USERNAME_HERE -pPASSWORD_HERE gogobot -h HOST_HERE -e "YOUR SELECT STATEMENT HERE;" | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > recommendations.csv
tr -c ' a-zA-Z0-9,\n' ' ' < recommendations.csv > clean_recs.csv
split -l 2000 clean_recs.csv
Let's break this script up a bit and understand its different parts.

First, the mysql -B part exports the table into a CSV.

Second, the tr statement cleans out anything that is not a readable character, making sure what we get back is a clean CSV with only the parts we want.

Third, in order to take advantage of parallelism in the file load, we split the main file into smaller chunks. I used 2000 lines here as an example; usually I break it into chunks of around 10,000 lines.
After we finish, the file looks like this:
4000000132125, A great place to hang out and watch New Orleans go by Good coffee and cakes
4000000132125, Beignets, duh
4000000132125, the right place for a classic beignet
Basically, each row is recommendation_id, recommendation_text. Perfect!
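Before loading this into Pig, it can be worth a quick sanity check that every row really is an id followed by text. This is a hypothetical helper, not part of the original pipeline:

# Hypothetical sanity check: verify each row is a numeric id plus text.
# Stray commas inside the review text would shift columns when Pig
# loads the file with PigStorage(',').
File.foreach("clean_recs.csv").with_index(1) do |line, lineno|
  id, text = line.chomp.split(",", 2) # split once: id, then the rest
  warn "line #{lineno}: bad id" unless id =~ /\A\d+\z/
  warn "line #{lineno}: empty text" if text.nil? || text.strip.empty?
end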
Getting to it
In version 0.10.0 of Pig, they added a way for you to write User Defined Functions (UDFs) in almost any language you desire.
I chose JRuby so that most engineers at Gogobot would find the code understandable and readable.
Here's the UDF:
require 'pigudf'
require 'java' # Magic line for JRuby - Java interworking

class JRubyUdf < PigUdf
  # The phrases we care about.
  def allowed_words
    ["Breakfast", "Steak Dinner", "Vegan Dish"]
  end

  # outputSchema binds to the method defined right after it.
  outputSchema "word:chararray"

  # Return only the allowed phrases found in the text, joined by ";".
  def tokenize(words)
    allowed_words.select { |x| words.scan(/\b#{x}\b/i).length > 0 }.join(";")
  end
end
This is obviously a slimmed-down version for the sake of example (the real one reads the word list from a file and contains a lot more phrases).
It takes the words from the allowed_words array, throws everything else away, and returns only the words you care about.
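Outside of Pig you can sanity-test the matching logic with plain Ruby. Here's a minimal standalone sketch (no Pig runtime needed; the phrase list is just the example one above):

# Standalone sketch of the same matching logic, for testing outside Pig.
ALLOWED_WORDS = ["Breakfast", "Steak Dinner", "Vegan Dish"]

def tokenize(words)
  # Keep phrases that appear as whole words, case-insensitively,
  # and join the matches with ";" so Pig can split them apart later.
  ALLOWED_WORDS.select { |x| words =~ /\b#{Regexp.escape(x)}\b/i }.join(";")
end

puts tokenize("Great breakfast and an amazing steak dinner")
# => Breakfast;Steak Dinner
puts tokenize("Nothing relevant here").inspect
# => ""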
If you just count words, you quickly realize that the most popular words in recommendations are The or It, and we obviously could not care less about those words; we are here to learn what our users are doing and why other users should care.

Now, let's get to the Pig script:
register 'gogobot_udf.rb' using org.apache.pig.scripting.jruby.JrubyScriptEngine as Gogobot;

CLEAN_REVIEWS = LOAD 'clean.csv' USING PigStorage(',') AS (place_id:chararray, review:chararray);
ONLY_ALLOWED_WORDS = FOREACH CLEAN_REVIEWS GENERATE place_id, Gogobot.tokenize(review) AS words;
FLAT_ALLOWED_WORDS = FOREACH ONLY_ALLOWED_WORDS GENERATE place_id, FLATTEN(TOKENIZE(words, ';')) AS word;
GROUP_BY_WORD = GROUP FLAT_ALLOWED_WORDS BY (place_id, word);
WITH_WORD_COUNT = FOREACH GROUP_BY_WORD GENERATE group, COUNT(FLAT_ALLOWED_WORDS) AS w_count;

-- Each output tuple is ((place_id, word), w_count).
STORE WITH_WORD_COUNT INTO 'phrase_counts' USING PigStorage(',');
The most important thing to understand here with regard to UDFs is really this line:
register 'gogobot_udf.rb' using org.apache.pig.scripting.jruby.JrubyScriptEngine as Gogobot;
This registers the UDF in Pig, so you can use the methods inside it just like you use Pig's built-in functions.
For example, instead of running the built-in TOKENIZE over the raw review text, we call Gogobot.tokenize(review), which invokes the tokenize method in the UDF and returns only the words we care about, separated by ;.
Summing up
Using JRuby with Pig gives you a lot of power; it can really level up your analytics workflows, and we are using it for a lot of other things. (More posts on that in the future.)