Ruby Programming for Humanities Scholars
In late 2010 I wrote a post on how I learned to code. I was, at the time, enrolled in ENGL4/878: Electronic Text with Professor Stephen Ramsay, where part of the course was about learning the Ruby programming language and how we could apply programming to humanistic data. The course, it turns out, was something of a pivotal moment for me. Although I had always been something of a computer geek, programming was something I had not touched since high school. But after the course I became captivated by the power that programming can offer humanities scholars. In an age of Big Data, from Google Books to ever-growing cultural heritage digitized by libraries, museums, and centers, we have at our hands a vast array of material that can be manipulated, queried, browsed, and visualized through computational methods. When the course was finished, I decided to write a series of blog posts for others who might be interested in applying Ruby to humanistic questions. The result was the seven write-ups below.
The original post promised that the series would be released as an electronic book. At the time I wrote the series I was running WordPress and the plugin Anthologize had recently been released. Shortly after, however, I switched blog platforms to Jekyll and, as other projects demanded my attention, I never got around to pushing the material into a format beyond my blog posts.
But I had a new idea. Instead of pushing out a static project as a PDF or epub format I decided to host things dynamically. Starting today (2012-05-21) I am pushing all posts to Github and hosting the series on Github pages. Anyone interested can now download a copy of The Rubyist Historian to use.
I want more, however. As a bit of an experiment in open publishing, I've decided to open source The Rubyist Historian for public contributions. People are free to fork The Rubyist Historian and offer corrections, clearer (or better, or more) examples, and overall contribute to what I want to become a collaborative project and reference for humanities scholars looking to get started with Ruby programming.
As before, copies of the example code (and now the full text) can be found at the Rubyist Historian Github repository.
The structure, examples, and topics that comprise this blog series are directly inspired by and drawn from Prof. Stephen Ramsay's course ENGL 4/878: Electronic Text, which I took during the Fall 2010 term at the University of Nebraska-Lincoln. Thanks, Steve, for encouraging the hacker in all of us. Any mistakes, errors, or lousy explanations are my responsibility alone.
Many thanks to additional resources I consulted for example ideas and help with explanations. These resources include Dave Thomas, Programming Ruby 1.9: The Pragmatic Programmer's Guide, The Unofficial Ruby Usage Guide, and Ruby Inside. Other resources are included with each section.
The purpose of this ebook is to provide a brief overview of the Ruby programming language and consider ways Ruby (or any other programming language) can be applied to the day-to-day operations of humanities scholars. Once you complete this book, you should have a good understanding of Ruby basics, be able to complete basic tasks with Ruby, and hopefully leave with a solid basis that will allow you to continue learning.
The best way to learn Ruby is not by reading this book. The best way to learn any programming language is by hands-on interaction. As you read through the lessons and exercises, I encourage you to write the programs in your own text editor and run them; figure out how things fit together, try changing things in the program, learn what those changes break or improve and understand the reason behind it. Some exercises in this book may seem trivial, others quite complex. My goal is to provide a foundation to help those new to programming (or even those with basic or advanced experience) become comfortable with programming. And don't stop here. I'm barely touching the surface of what can be done with Ruby. I'll point out some additional resources to encourage the burgeoning Ruby enthusiast inside of you as we go along.
Before going any further, I want to thank Prof. Stephen Ramsay at the University of Nebraska for being the inspiration for this series. The structure of these posts, the topics of discussion, and some of the examples are directly correlated with his course I took in the Fall of 2010, ENGL 4/878: Electronic Text. Thanks, Steve, for encouraging the hacker in all of us.
So why am I writing about Ruby? Why not some of the other languages I know, such as Python? Or a web language like PHP? I'm not suggesting here that Ruby is "the best" language; rather, I hope to briefly sketch out the reasons why I think Ruby works as a beginner programming language.
All programming languages, like any foreign language, come with a learning curve. For example, we could compare PHP with Ruby: they have similar structures, syntaxes, and the like, but PHP sometimes throws in syntaxes that require careful distinctions (the difference between sprintf and printf). I believe that simplicity in the syntax of a language makes a huge difference in helping beginning programmers grasp concepts, and I greatly appreciate Ruby's simplicity. I'm going to jump slightly ahead for the sake of making a comparison. Let us say we wanted to keep a list of authors, each with a count, for a bibliographic program and pull out the first three names in alphabetical order. In PHP, you might write:
$authors = array("Hemingway" => 3, "Dickinson" => 1, "Whitman" => 2);
$keys = array_keys($authors);
sort($keys);
$sorted = array_slice($keys, 0, 3);
We can achieve the same thing in Ruby much more simply:
authors = { "Hemingway" => 3, "Dickinson" => 1, "Whitman" => 2 }
sorted = authors.keys().sort().slice(0,3)
Don't worry so much here about what exactly is going on; we'll get to that later. But notice how much easier this is to read. This has something to do with Ruby being a pure OOP (object-oriented programming) language versus PHP's bolt-on functionality. The result is Ruby code that is much more readable. But we're getting ahead of ourselves. The point here is to illustrate the simplicity of the Ruby language.
Ruby also handles blocks well. Once again, let's compare PHP and Ruby. Imagine we wanted to sort a list of authors. In PHP, we would write:
function sort_authors_by_count($a, $b)
{
    if ($a->counts == $b->counts)
    {
        return 0;
    }
    return ($a->counts > $b->counts) ? +1 : -1;
}
usort($authors, "sort_authors_by_count");
Ruby blocks are chunks of code between do . . . end. The Ruby syntax would look like this:
authors.sort do |a, b|
  a.counts <=> b.counts
end
Once again, Ruby is much simpler. Even if you're not exactly sure what is happening, it is much easier to look up the Ruby syntax of <=> than to try to decipher ? +1 : -1.
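To see the block form run from start to finish, here is a minimal sketch. The Author struct and the sample counts are made up for illustration; they are not part of the original bibliographic example.

```ruby
# Hypothetical data: each Author has a name and a count of works.
Author = Struct.new(:name, :counts)

authors = [
  Author.new("Hemingway", 3),
  Author.new("Dickinson", 1),
  Author.new("Whitman", 2)
]

# sort hands the block two elements at a time; <=> answers -1, 0, or 1
# depending on how the two counts compare.
by_count = authors.sort do |a, b|
  a.counts <=> b.counts
end

by_count.each { |author| puts "#{author.name}: #{author.counts}" }
```

Run as a file, this should list Dickinson first and Hemingway last. Swapping a and b inside the block reverses the order.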
Finally, everything in Ruby is an object. Ruby was designed as an object-oriented language, which makes programs much easier to write. Having everything as an object also makes code easier to handle. There's no need to check whether something is an object before executing methods upon it; you can simply execute a method. Just as everything is an object, the results of manipulations on an object are also objects. There will be more on this later.
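A quick way to convince yourself of this is to ask a few values what class they belong to. The sketch below uses only built-in methods; the sample string is my own.

```ruby
# Even bare literals respond to methods, because each one is an object.
puts 42.class          # Integer (Fixnum on Rubies older than 2.4)
puts "Whitman".class   # String
puts nil.class         # NilClass

# The result of a method call is itself an object, so calls can be chained.
shouted = "leaves of grass".upcase.reverse
puts shouted           # SSARG FO SEVAEL
puts shouted.class     # String
```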
We could also ask a broader question, related to the first: why program? Why should historians take the time to learn to program? My answer is in line with Douglas Rushkoff's general warning: program or be programmed. Using tools developed by others puts you at their mercy. Much of our scholarly lives have already become digital: our sources are in digital form, we write in word processors, we communicate through e-mail and Twitter, we place lecture notes on Blackboard, we extend classrooms with blogs. We use these tools without really understanding how they do what they do. I'm offering a glimpse into this world and hopefully equipping you with a set of tools that will be readily useful in your scholarly work.
Wayne Graham has an entire list of why Ruby makes a great beginner language that I would also recommend checking out.
I'm writing this for people who have access to a UNIX environment. If you are on Linux or Mac, you have this accessible to you already: simply fire up the terminal and you're ready to go. Ruby comes preinstalled on most Linux distributions and on Mac OSX 10.5+. On Windows, you'll want to download Cygwin, a UNIX-like environment for Microsoft Windows. UPDATE: Reader Gordon Thiesfeld recommends Windows users check out RubyInstaller over Cygwin.
You'll also need a good text editor that you know your way around in. I work almost entirely in vim (or mvim). You might check out emacs or nano, or do your programming outside the terminal using TextMate (Mac), gEdit (Linux), or Notepad++ (Windows), or any other number of text editors. I would encourage you to find an editor that handles syntax highlighting, if only for making the code easier to read. And get ready for some battles.
You could also set up an IDE, or integrated development environment. I would follow the steps in William Turkel's The Programming Historian to install Komodo Edit (but ignore the extensions for Firefox), with a few changes for the appropriate programming language. I can also highly recommend NetBeans as a really useful IDE system if you prefer this route. I won't be going through that setup here -- if you really want the instructions, drop me an email.
Let's get started! It is traditional to start programming in a new language by writing something that says "hello world" and terminates. The language we are using is interpreted (as opposed to compiled), meaning that a special computer program known as an interpreter reads the instructions from Ruby and then runs the program. There are two ways to run Ruby. The first is by running Ruby interactively in the shell prompt. Simply type irb
into the command line to open the Ruby shell. Simply type in Ruby code and it will return the value of expressions under evaluation. Exit irb
by typing exit
or using the end-of-file character on your OS (normally Ctrl+D or Ctrl+Z). Alternatively, you can write these programs as files to your local disk or to a server and run them through the terminal. This is the preferred method for writing Ruby programs. In my case, I'll be running these programs locally through the terminal. I'll demonstrate briefly how irb
works and looks, but all subsequent examples and programs will be written as files.
Continuing with our comparative approach, generating "hello world" is a fairly straightforward process in many languages. In PHP, it looks like this:
print("Hello world");
Ruby operates similarly:
puts "Hello world"
If you're running this in the interactive Ruby shell, you should see something like this:
irb(main):001:0> puts "Hello world"
Hello world
=> nil
If you're running Ruby files off a server or local disk, save the file as hello.rb
and in the terminal run:
ruby hello.rb
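Note the lack of parens in my puts call. Parentheses are perfectly acceptable Ruby syntax, but you should choose between parens and a space rather than using both. puts("Hello world") and puts "Hello world" are the same thing, but avoid puts ("Hello world"): with the extra space Ruby reads the parentheses as grouping a single expression, which triggers a warning in some versions and can break as soon as you pass more than one argument. I tend to leave out parentheses unless I'm passing variables through a method.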
It is common practice to also include the "shebang" notation (#!
) in the first line of the program, followed by introductory comments that usually include the name of the file, a description of what the program does, who wrote it and for what, and when it was last modified. Commented text is marked by #
. For example, a "hello world" program might look like this:
#!/usr/bin/ruby -w
# helloworld.rb
#
# Basic "hello world" program
#
# Written by Jason A. Heppler for
# The Rubyist Historian ebook project
#
# Last modified: Tue Dec 28 21:21:43 -0600 2010
puts "Hello, world!"
puts "I became a Ruby programmer on #{Time.now}"
Running ruby helloworld.rb
in the terminal will return:
Hello, world!
I became a Ruby programmer on Tue Dec 28 21:21:43 -0600 2010
And there you have it, your first Ruby program! But let's make things a little more interesting. Instead of just pushing static data, let's have Ruby work with data we give it through what's known as standard streams. For this we're going to use the methods gets() and chomp():
puts "Please enter your name: "
name = gets().chomp()
puts "I, #{name}, began learning Ruby code on #{Time.now}."
This will print to the screen:
I, Jason, began learning Ruby code on Tue Dec 28 21:21:43 -0600 2010.
Note the new notation #{}. This is called interpolation: inserting the value of a variable into a string. Variables are enclosed in #{var}. Take note that strings can be marked off by single or double quotes, but there is a distinction between their use. In order to interpolate, you must use double quotes. Single quotes will not allow interpolation, which has to do with Ruby attempting to optimize the code and [redacted boring technical jargon].
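The difference between the two kinds of quotes is easiest to see side by side. This is a small sketch; the name variable is hard-coded here as a stand-in for whatever gets would have read.

```ruby
name = "Jason"   # stand-in for the value gets.chomp would return

double = "Hello, #{name}!"   # double quotes: #{name} is replaced
single = 'Hello, #{name}!'   # single quotes: the text is kept literally

puts double   # Hello, Jason!
puts single   # Hello, #{name}!

# Any expression can go between the braces, not only a variable.
math = "2 + 2 = #{2 + 2}"
puts math     # 2 + 2 = 4
```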
There you go! Your first Ruby program that works with user data. Up next, we're tackling methods and classes.
In our last section I introduced some Ruby programming basics. Now we're moving on to methods and classes.
Notations like gets()
and chomp()
are called methods. In our above example, gets()
accepts a single line of data from the user and assigns the string to name
. So how can we know what methods are available to us as programmers? The all-knowing, all-powerful Ruby Docs. gets
and chomp
barely scratch the surface. We can do things like count the number of characters or lines in a string or file, reverse the lines of a string or file, cut a string apart and join it in alphabetical order, or chain several methods together at once. For example:
puts "I am a Rubyist Historian".length() #=> 24
puts "Learning some Ruby-fu".reverse() #=> uf-ybuR emos gninraeL
puts "Ruby is fracking awesomesauce.".split("").sort().join() #=>    .Raaabcceeefgiikmnorsssuuwy (the three spaces sort first)
Run this in the terminal and you should get the commented results. For the first example, we would technically say that we are invoking the length method on object "I am a Rubyist Historian." (Or, even more abstractly, "I am a Rubyist Historian" is an object of type string.) Everything before the period is the receiver while everything after the period indicates the method(s) you wish to invoke upon the object.
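For a slightly more humanities-flavored sketch of chaining, the snippet below counts the words in a line of verse. The sample line and the variable names are my own choices, and the small block handed to map uses a feature that is covered in a later section.

```ruby
line = "O Captain! my Captain! our fearful trip is done"

words = line.split          # with no argument, split breaks on whitespace
word_count = words.length

# Lowercase each word, strip the exclamation marks, drop repeats, and sort.
unique = words.map { |w| w.downcase.delete("!") }.uniq.sort

puts word_count             # 9
puts unique.join(", ")      # captain, done, fearful, is, my, o, our, trip
```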
We can also create methods. We'll return to our original "hello, name" program. But this time we're going to write our own method and invoke it. Methods are defined with the keyword def followed by the method name and the method's parameters between parentheses (parentheses here are optional, but I use them for readability's sake. Remember, if you do not use parentheses you need to have a single space in their place, e.g., hello name is the same as hello(name)). So to build a new "hello, name" program using a defined method, we could write:
# defining function 'hello' to ask
# for parameter 'name'
def hello(name)
  'Hello ' + name
end
puts "Please enter your name: "
name = gets.chomp
puts hello(name)
NB: Indentation does not matter to Ruby, but for readability's sake, we include it.
Become very familiar with defining functions. This allows us to define functions for later use and set code apart to keep the program organized (as Prof. Steve Ramsay persistently reminded us in his course, programs are designed for people to read, not just computers). You also want to avoid redundancy. As a final note, what happens inside of functions is not visible outside the function. We can make things visible by using the global variable (adding $ to the beginning of the variable you want visible, e.g., $result
) but, as Prof. Ramsay warned us: global variables are evil. No, really, they are. The last thing you'll want is a global variable to plague parts of your code without your knowledge.
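Rather than reaching for a global, the usual way to get a value out of a method is to return it. Here is a minimal sketch; count_words and its sample sentence are hypothetical, chosen only to illustrate scope.

```ruby
def count_words(text)
  total = text.split.length   # 'total' exists only inside this method
  total                       # the last expression is the return value
end

result = count_words("program or be programmed")
puts result                   # 4

# Outside the method, 'total' was never defined.
puts defined?(total).inspect  # nil
```

The caller sees only what the method hands back, which is exactly what keeps one part of a program from quietly interfering with another.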
Similarly to methods, we can define classes in Ruby. Recall that Ruby is an object oriented programming language. By using classes, we're fully entering the realm of OOP and learning how to create our own objects. We know that objects are closely allied with a type (object of type string, for example) and that certain behaviors go with certain objects. Objects are a data structure and a state, and also have behaviors that we call methods.
Ruby classes are templates for creating new kinds of objects. Classes are created by using the class keyword, and take note that class names are capitalized and method names are lowercase. By using OOP we are making data central, in contrast to procedural programming, and defining relationships between and among objects.
Let's say you wanted a program that allowed you to input author names and ISBNs. First we define the class starting off the definition with class
followed by the class name, capitalized:
class Books
# . . .
end
We'll use the initialize method here, which allows programmers to set the state of constructed objects. We store that state as instance variables inside the object, which we indicate through the use of the @ symbol. An instance variable is visible throughout its object -- this is not a global variable -- and it means each object can have its own unique state. initialize is a special method in Ruby. When you call new, Ruby allocates memory to hold an uninitialized object and then calls that object's initialize method, passing along any parameters that were given to new.
Enough talk, let's write the code and explain things further:
class Books
  attr_accessor :fname, :lname, :isbn
  def initialize( fname, lname, isbn )
    @fname = fname
    @lname = lname
    @isbn = isbn
  end
  def to_s
    @lname + ", " + @fname + ", ISBN: " + @isbn
  end
end
author = Books.new("Walt", "Whitman", "1234567890")
puts author
What we've done here is give each of the instance variables @fname, @lname, and @isbn a string by calling the class constructor (Books.new("Walt", "Whitman", "1234567890")). We could just as easily have said Books.new("William", "Shakespeare", "1234567890"). Note that attr_accessor is not declaring an instance variable, it's only creating the accessor methods. Ruby decouples instance variables and accessor methods.
The initialize method of Books takes three parameters: fname, lname, and isbn. These parameters act like local variables within the method and follow the same lowercase naming convention. Yet, if we kept them only as local variables they would vanish once initialize returned. So, we copy them into instance variables, which last as long as the object does, and attr_accessor gives code outside the class a way to read and change them.
Note also that we redefined the to_s ("to string") method as well. By default, when Ruby uses puts it calls to_s to convert the object into a string. But we want to_s to be more useful. We can override the default implementation to display whatever we'd like it to display.
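To make the accessor methods and the to_s override concrete, here is a short sketch that repeats the Books class so it can run on its own. The to_s body uses interpolation instead of +, which is my variation rather than the original code.

```ruby
class Books
  attr_accessor :fname, :lname, :isbn

  def initialize(fname, lname, isbn)
    @fname = fname
    @lname = lname
    @isbn  = isbn
  end

  def to_s
    "#{@lname}, #{@fname}, ISBN: #{@isbn}"
  end
end

book = Books.new("Walt", "Whitman", "1234567890")

puts book.lname          # reader method created by attr_accessor
book.fname = "Walter"    # writer method created by attr_accessor
puts book                # puts calls our to_s

summary = "On my shelf: #{book}"   # interpolation calls to_s as well
puts summary
```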
Our last segment introduced us to Ruby methods and classes. This section will introduce you to expressions and loops. Loops are, put simply, driven by a test of whether an expression is true or false. This is the basic way that computers operate: continue following a set of instructions until the test says to stop, then end or move on to the next set of instructions.
Let's say we needed a program that printed numbers until it reached five. In this case, we want the program to print a number, evaluate whether that number is equal to five, if not add one and run the program again. Once the number is equal to five, the program terminates. We achieve this through the use of the while . . . end
loop:
num = 0
while num < 5
  puts num
  num += 1
end
Running ruby loop.rb
in the terminal will produce:
0
1
2
3
4
Note the += above. This symbol is called an operator. += adds the value on its right to the variable on its left, while other operators allow us to compare values.
Example Operators
== | Test for equal value. |
<, <=, >=, > | Comparison operators for less than, less than or equal, greater than or equal, and greater than. |
<=> | Returns -1, 0, or +1 depending on whether the receiver is less than, equal to, or greater than its argument. |
-= | Subtraction operator (subtract, then assign). |
*= | Multiplication operator (multiply, then assign). |
!= | Not equal to operator. |
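The operators in the table can be tried out in a few lines. This is only a sketch with arbitrary numbers of my choosing.

```ruby
count = 10
count += 5    # same as count = count + 5, giving 15
count -= 3    # same as count = count - 3, giving 12
count *= 2    # same as count = count * 2, giving 24
puts count    # 24

puts(3 == 3)  # true
puts(3 != 4)  # true
puts(2 <= 5)  # true

# <=> compares its receiver with its argument.
comparisons = [1 <=> 2, 2 <=> 2, 3 <=> 2]
puts comparisons.inspect   # [-1, 0, 1]
```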
The if . . . else construct allows us to evaluate several branches of code in the order we write them. If the first condition is false, the program moves on to the next and the next and so on until one is true, runs that branch, and skips the rest. We could write a program that evaluates what a user thinks about the quality of a book, for example:
puts "Enter a rating between one and five: "
# we use .to_i to convert the string to an integer
rank = gets.chomp.to_i
if rank >= 4
  puts "The book was good!"
elsif rank == 3
  puts "The book was so-so."
elsif rank <= 2
  puts "The book stinks."
end
Pro Tip: If you get stuck in a loop and the terminal won't quit printing to the screen, hit CTRL+C. CTRL+C tells the terminal to stop whatever it's working on.
So far we've looked at some pretty primitive versions of loop constructs. Unlike Java, C, and C++, Ruby doesn't have a C-style for loop (it does have a for . . . in form, but you will rarely see it). Instead, Rubyists use a less error-prone, built-in functionality called iterators. Let's say you just finished writing a section of a chapter and wanted some applause for your effort. We could write a program to do that for you:
3.times do
  print "Clap! "
end
Run the program from the terminal and it will produce:
Clap! Clap! Clap!
There, now you and your computer just shared a special moment. A pretty simple block of code, right? You could read what the program is doing even if you didn't understand a single line of Ruby: print "Clap!" three times, no more, no less. Simplicity.
We can also use iterators to loop through ranges. Let's return to our number counter example above and write an iterator to print the numbers from zero through five:
0.upto(5) do |x|
  print x, " "
end
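There is usually more than one way to walk the same run of numbers. The sketch below collects one through five in three ways; the array names are mine, and the for . . . in form is included only for comparison, since each is generally the more idiomatic choice.

```ruby
# upto counts from the receiver up to the argument, inclusive.
with_upto = []
1.upto(5) { |x| with_upto << x }

# A range written with two dots also includes both ends.
with_range = []
(1..5).each { |x| with_range << x }

# Ruby does accept a for ... in form over a range.
with_for = []
for n in 1..5
  with_for << n
end

puts with_upto.inspect    # [1, 2, 3, 4, 5]
puts with_range.inspect   # [1, 2, 3, 4, 5]
puts with_for.inspect     # [1, 2, 3, 4, 5]
```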
The most basic iterator in Ruby is simply loop, which will run the block forever until you break out of the loop:
loop do
  print "85098357-198058903028340jj23u0280234itj3"
  # it's just like the Matrix!
end
Hit CTRL+C to break the loop.
A block contains a chunk of code normally enclosed between braces or within do and end. The prevailing style is to use the braces for blocks that fit on a single line and do . . . end for multiple lines. Blocks are called only after the invocation of some method. We could, for example, write a program that sums the squares of numbers inside of an array:
sum = 0
[2, 4, 6, 8].each do |value|
  # use = to assign; == would only compare
  square = value * value
  sum += square
end
puts sum
In this example, the block is being called by the each method once for each element in the array. The element passed as the parameter is value. Note also that although sum is defined outside of the block, it is also being modified within the block and then passed on to puts. If a variable inside a block has the same name as a variable defined before the block, the two are the same, but if a variable appears only inside a block then the variable is local to the block.
As extra reading, I would check out Steve Ramsay's guide to regular expressions. I won't be covering regular expressions, but they will eventually show up and be useful as part of your programmer toolkit. It's good to get familiar with them.
To review, we've learned how to create functions, call upon methods, create classes, and generate basic programs in Ruby. We'll now be moving into creating arrays and hashes.
Ruby arrays and hashes are indexed collections that store objects. Arrays and hashes can contain different object types: strings, integers, and floating-point numbers. Arrays tend to be more efficient in accessing elements, but hashes provide greater flexibility.
Arrays are created with square brackets. Inside of an array you can access individual elements by calling upon their index. Note that Ruby begins its index with zero.
my_array = ["Ambrose", "White", "Worster"]
# print the items in the array by calling
# their index
puts my_array[0] # => Ambrose
puts my_array[1] # => White
puts my_array[2] # => Worster
There is no practical limit as to how many things an array can hold. And, as mentioned above, there is no problem with mixing types of array elements. You could just as easily write:
my_array = [ 42, "books", 3.14 ]
puts my_array[0] # => 42
puts my_array[1] # => books
puts my_array[2] # => 3.14
Note that strings must be enclosed in single or double quotes.
Arrays also include two handy methods for removing items. The first is pop, which removes the item on the right side (or, the end of the array). The other is shift, which removes the item on the left side (beginning) of the array.
my_array = ["Ambrose", "White", "Worster"]
# pop and shift return the item they remove
array_change = my_array.pop
array_change2 = my_array.shift
puts array_change # => Worster
puts array_change2 # => Ambrose
We can also add things to the array by using the push
and unshift
methods. The push
method adds items to the end (or right side of) the array while the unshift
method adds things to the beginning of (or left side of) the array.
my_array = ["Ambrose", "White", "Worster"]
array_new = my_array.push("Ulrich")
array_new = my_array.unshift("West")
puts "The array contains #{array_new.inspect}"
Arrays can also be created much more simply by using the shortcut %w
, which removes the need for all those quotes and commas:
# array-shortcut.rb
a = [ 'Apple', 'Microsoft', 'Linux', 'Solaris' ]
a[0] # => "Apple"
a[1] # => "Microsoft"
# we can achieve the same thing by using:
a = %w{ Apple Microsoft Linux Solaris }
a[0] # => "Apple"
a[1] # => "Microsoft"
We can create an empty array by calling Array.new. The Array class defines what array objects look like. Remember that every class has a special method called new, and new is a special method called a constructor. By calling Array.new, we're asking Ruby to create an empty array object. So, let's create an empty array and populate it with data:
authors = Array.new
authors.push("Hemingway")
authors.push("Faulkner")
authors.push("Whitman")
puts authors
The program should print the contents of the array to the screen.
Ruby hashes share similarities with arrays but operate differently and have different syntaxes. Hashes use braces rather than brackets. Most importantly hashes must provide two objects for every entry, one for the key and one for the value. The key and the value are separated by a =>
.
So, let's say we wanted to map author ratings. The hash setup would look like this:
authors = {
  'Hemingway' => 'five_stars',
  'Stephenson' => 'four_stars',
  'Heppler' => 'one_star',
  'Whitman' => 'five_stars',
  # key => value
}
The item to the left of => is the key while the item on the right is its corresponding value. Keys in a hash must be unique (we cannot have two "Stephenson," for example) but values can repeat. Hashes are indexed with the same square brackets as arrays. To print results from the above hash, we could write:
puts authors['Whitman'] # => five_stars
puts authors['Stephenson'] # => four_stars
Also, like arrays, you can create an empty hash with Hash.new
and inject data into it. For example:
my_hash = Hash.new()
my_hash['Dickinson'] = 'three_stars'
Hashes have one significant advantage over arrays: they can use any object as an index. And, if you're using Ruby 1.9 or later, Ruby also remembers the order in which you add items to the hash. When you iterate over entries, Ruby will return them in that same order. Hashes are a frequently used data structure in Ruby.
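Here is a small sketch of iterating over a hash and of using keys that are not strings. The ratings are stored as integers here, and the mixed hash is invented purely to show a symbol and a number serving as keys.

```ruby
ratings = {}                       # an empty hash, like Hash.new
ratings["Hemingway"]  = 5
ratings["Stephenson"] = 4
ratings["Whitman"]    = 5

# each yields every key/value pair, in insertion order on Ruby 1.9 and later.
ratings.each do |author, stars|
  puts "#{author}: #{stars}"
end

# Keys need not be strings; here a symbol and an integer serve as keys.
mixed = { :year => 1855, 1855 => "Leaves of Grass" }
puts mixed[mixed[:year]]           # Leaves of Grass

# select keeps only the pairs for which the block is true.
five_star = ratings.select { |_author, stars| stars == 5 }.keys
puts five_star.inspect             # ["Hemingway", "Whitman"]
```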
In this section, a whole new world of programming is about to open before your eyes. So far, we've been working with simple data inputs through the use of gets(), but that method only reads a single line of input at a time. What would be infinitely more useful would be the ability to read files outside the program. We can do this with Ruby's File class.
When working with files, you have some options about how you want to access them. These are called mode specifiers and describe the read/write characteristics of the file: r (read only, the default), r+ (read/write, starting at the beginning of an existing file), w (write only, which creates the file or empties an existing one), w+ (read/write, but destructive because it destroys whatever existed in the file previously), and a (append, which writes to the end of the file).
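The destructive nature of w is worth seeing once. The sketch below works in a throwaway temporary directory so nothing on your disk is touched; the file name notes.txt is hypothetical, and it also uses a, Ruby's append mode, for contrast.

```ruby
require "tmpdir"

after_append  = nil
after_rewrite = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "notes.txt")   # hypothetical file name

  File.open(path, "w") { |f| f.puts "first draft" }     # create (or empty) the file
  File.open(path, "a") { |f| f.puts "second thought" }  # add to the end
  after_append = File.read(path)

  File.open(path, "w") { |f| f.puts "clean slate" }     # empties the file again
  after_rewrite = File.read(path)
end

puts after_append    # two lines: first draft, second thought
puts after_rewrite   # one line: clean slate
```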
Imagine we have a primary source document we would like to read. You can download this file from Github to your local disk for something to work with (this is a letter from a Civil War soldier). We're going to ask Ruby to read the file and print the results to the screen:
File.open("letter.txt", "r") do |file|
  lines = file.readlines
  puts lines
end
We could also print out specific lines of the array:
File.open("letter.txt", "r") do |file| # open the file and assign to variable 'file'
  line_array = file.readlines
  puts line_array[3]
  puts line_array[5]
  puts line_array[9]
end
# Since we passed a block to File.open, Ruby will automatically close the
# file when the block ends. Otherwise, you will need to close the file
# yourself using file.close().
The program will print the specified lines to the screen.
We can also write to files using File.new
:
file = File.new("my_file.txt", "w")
file.write("Hello, world!")
file.write("\n")
file.write("I'm learning Ruby!")
file.close()
After you run the program, it will create the file my_file.txt
in the directory you are working in. The file should contain the contents we wrote.
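Writing and reading can be combined in one short round trip. This sketch again uses a temporary directory (my choice, so the example cleans up after itself) and the block form of File.open, which closes the file for us.

```ruby
require "tmpdir"

numbered = []

Dir.mktmpdir do |dir|
  path = File.join(dir, "my_file.txt")

  # Block form of File.open: the file is closed when the block ends.
  File.open(path, "w") do |file|
    file.puts "Hello, world!"
    file.puts "I'm learning Ruby!"
  end

  # Read the lines back and number them, starting from one.
  File.readlines(path).each_with_index do |line, index|
    numbered << "#{index + 1}: #{line.chomp}"
  end
end

puts numbered
```

The same numbering trick is handy for citing a specific line of a transcribed letter or other primary source.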
You now know how to read and write files. A whole new world of programming should be opening before you.
We're entering the final leg of our journey. We've covered a lot of topics in the last few sections, but I just have a couple of things to touch on before we move on to writing our first full program together.
A variable name that begins with a capital letter (by convention, written in all caps) is a constant, which is not meant to be reassigned anywhere in the program. For example, if you were writing a program that used Pi in its calculations, you wouldn't want the program (or yourself or another programmer) to accidentally override the value of Pi. To guard against this, Ruby allows for constants and prints a warning if one is reassigned. We would simply write this in all caps:
PI = 3.141592
We can now use the constant anywhere in the program, and Ruby will warn us if something tries to override it. For example, we could use Pi to calculate the area and circumference of a circle:
PI = 3.141592
puts "Enter a radius to calculate: "
radius = gets.chomp.to_f
area = PI * (radius**2)
area = "%.4f" % area
puts "The area of the circle is: #{area}"
circ = 2 * radius * PI
circ = "%.4f" % circ
puts "The circumference of the circle is: #{circ}"
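As it happens, Ruby already ships with a constant for Pi, Math::PI, so you rarely need to type the digits yourself. The sketch below uses it inside a small method; circle_area and PI_APPROX are hypothetical names of mine, the latter holding the hand-typed value for comparison.

```ruby
PI_APPROX = 3.141592   # the hand-typed value, under a hypothetical name

def circle_area(radius)
  Math::PI * radius**2   # constants are visible inside methods
end

area = circle_area(2.0)
formatted = "%.4f" % area
puts formatted           # 12.5664

# The hand-typed value agrees with Math::PI to six decimal places.
close_enough = (Math::PI - PI_APPROX).abs < 0.000001
puts close_enough        # true
```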
In the last section we talked about modules and the ability to avoid namespace conflicts. The other great thing about modules is there are literally thousands of modules that exist outside the Ruby system, written and (theoretically) tested by other programmers, but available for your use. You probably saw an early version of this when we first talked about modules and the use of Trig.rb
and Morals.rb
. To use a library, we write require followed by the name of the library we want included:
require 'rubygems'
How do we know what's available to us as programmers? By consulting either RubyForge or the Ruby Application Archive (see Ruby-Lang.org for more). To use the libraries, you'll need to have a copy on your local system. Many Ruby libraries are conveniently packaged as Ruby Gems, which provides a standard format for distributing Ruby programs and libraries. Follow the instructions on Ruby-Lang.org on how to download and install Ruby Gems.
Perhaps one of the most useful libraries that Prof. Ramsay pointed our class to was the linguistics
library:
# linguistics.rb
require 'rubygems'
require 'linguistics'
# tell linguistics to use English
Linguistics::use( :en )
puts 185934538450.en.numwords
# => one hundred and eighty-five billion, nine hundred and thirty-four million, five hundred and thirty-eight thousand, four hundred and fifty
Or maybe you want to know what the plural of "goose" is:
Linguistics::use( :en )
"goose".en.plural
# => "geese"
Or maybe we have an array of farm animals:
Linguistics::use( :en )
animals = %w{dog cow ox chicken goose goat cow dog rooster llama pig goat dog cat cat dog cow goat goose goose ox alpaca}
puts "The farm has: " + animals.en.conjunction
This will print:
The farm has: four dogs, three cows, three goats, two oxen, two geese, two cats, a chicken, a rooster, a llama, a pig, and an alpaca
You can do a lot with linguistics.
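If you are curious about the bookkeeping behind a result like the farm example, the counting half can be sketched in plain Ruby. This is only an illustration of the idea, not how the linguistics library is implemented, and the method name count_items is made up:

```ruby
# Count how many times each item appears, most frequent first.
# This is not the linguistics library's code; it only shows the counting idea.
def count_items(items)
  counts = Hash.new(0)
  items.each { |item| counts[item] += 1 }
  counts.sort_by { |_item, count| -count }
end

animals = %w{dog cow ox chicken goose goat cow dog}
count_items(animals).each { |animal, count| puts "#{count} #{animal}" }
```

What the library adds on top of this is the hard part: turning the counts into number words, plurals, and a grammatical sentence.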
Choose your external libraries carefully, but also don't reinvent the wheel if you can avoid it. Don't be afraid of scrapping an entire program or salvaging good code and throwing away the rest. In my experience, programming contains its headaches -- there will be failure, but there's always a learning opportunity in failure.
Comment your code and comment it well. We've already seen some of this in Ruby. A single-line comment in Ruby starts with a #. But we can also write multi-line comments by putting our text between =begin and =end, each of which must sit at the very start of its own line.
# this is a single line comment
=begin
Multi-line comment
And another
Yet another
=end
I frequently use single-line comments for explaining what chunks of code are doing, while multi-line comments are often useful for switching off parts of the code without actually deleting them. This makes debugging much easier. Be sure to use your commenting wisely by explaining what the code doesn't tell you. When you define functions or classes or variables, it should be fairly clear what's going on. But a comment on why you made the choices you made, one that will help you or another programmer better understand the code, is worth including. Remember: programming should be as much about readability as it is about functionality. Comment even if it's code only you will be seeing.
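As a rough illustration of the difference between restating the code and explaining a choice, compare the two comments below. The method name normalize is made up for this sketch:

```ruby
# A weak comment restates the code:
#   downcase the word
#
# A better comment explains the choice:
#   Lower-case every word so that "The" and "the" are counted together.
def normalize(word)
  word.downcase
end

puts normalize("The")
```

The second comment still makes sense months later, when you have forgotten why the downcase call is there at all.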
I read somewhere recently that code is the crystallization of human thought (if I can find the comment, I'll attribute it). Plan ahead in the programs you write, make sure the intent is clear, explain how you expect the code to work. Diagram! Design mockups! Some of my best tools aren't digital: I keep a permanent marker and stack of paper handy for sketching out ideas.
Keep backups of multiple versions. Better yet, place your work under version control with a system like Subversion or Git (GitHub can host your Git repositories).
Now that you have the basics, you might want to learn more and start creating awesome stuff for the Internet. Here are some additional resources to learn more about Ruby and other languages.
Next up, we're writing a program together. We're going to build a word frequency generator and begin working with the web.
"Okay, Jason," you're asking yourself, "I'm tired of saying hello and counting numbers and doing mathematics. How can Ruby be applied to my work as a humanities scholar?" I'm thrilled you asked! Because today, we're writing our first full program together. I'll warn you, this might be a long read and a lot of writing. But I'm hoping by doing this we experience the process of designing, planning, writing code, optimizing code, debugging, and finally using the program.
We're going to write a program based on a homework example we completed in Prof. Steve Ramsay's class (To Steve's future students: don't copy this program. Your professor will know). We're going to take a word frequency generator and have it read a file off the Internet, strip the HTML or XML tagging out of the file, generate a word frequency list, and print the frequencies to a new HTML file. A lot will be happening, so I hope I can carefully and concisely explain the details of our program as we go along.
One potential way to write our word frequency program is as such:
# frequency.rb
def separation(string)
  # downcase, then pull out each run of word characters and apostrophes
  string.downcase.scan(/[\w']+/)
end
def word_count(elements)
  number = Hash.new(0)
  for word in elements
    number[word] += 1
  end
  number
end
text = File.read("text.txt")
elements = separation(text)
number = word_count(elements)
sorted_list = number.sort_by { |word, count| count }
most_to_fewest = sorted_list.reverse # largest counts first
most_to_fewest.each { |word, count| puts "#{count} #{word}" }
Our program reads in a file (text.txt) and sends its contents to our separation method, which downcases the text for normalization and scans it for words. The regex /[\w']+/ matches each run of letters, digits, underscores, and apostrophes, so whitespace and other punctuation are left behind. Once the text has been broken into individual words, the program sends the list to our word_count method. Inside word_count, each word becomes a key in a hash, and every time the word appears its count goes up by one until the whole list has been processed. We return number, call sort_by with two block values (word and count) to order the list by count, and print our results from most frequent to least.
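To see what the scan step produces on its own, here is a short sketch using a made-up sentence in place of a file:

```ruby
# Same idea as the separation method above, tried on an invented sentence.
def separation(string)
  string.downcase.scan(/[\w']+/)
end

p separation("Don't stop. Don't STOP!")
# => ["don't", "stop", "don't", "stop"]
```

Note that the apostrophe survives because it is listed in the character class, while the period and exclamation mark are simply never matched.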
There are certainly several ways to achieve the results we're after. If you have your own word frequency generator that you're comfortable working with, go ahead and use it. I'll be using my own code:
# frequency.rb
filename = File.new("text.txt", "r").read().downcase().scan(/[\w']+/)
frequency = Hash.new(0)
filename.each { |word| frequency[word] += 1 }
frequency.sort_by { |x,y| y }.reverse().each { |w,f| puts "#{f}, #{w}" }
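The Hash.new(0) call is what keeps the counting line so short: asking for a word the hash has not seen yet returns 0 instead of nil, so += 1 works on the very first sighting. A minimal sketch on a made-up word list (the method name frequencies is hypothetical):

```ruby
def frequencies(words)
  frequency = Hash.new(0)          # unseen keys read as 0 rather than nil
  words.each { |word| frequency[word] += 1 }
  frequency
end

counts = frequencies(%w{the cat saw the other cat and the dog})
p counts["the"]      # => 3
p counts["zebra"]    # => 0, with no error, because of the default
```

With a plain Hash.new, the first += 1 would fail because nil has no + method.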
You should now have a working word frequency generator. However, we want to be able to read HTML files from the web; this will make the program much more useful. To do this we're going to import a Ruby library called open-uri and use its methods to fetch web data. Let's first look at how we achieve the ability to have Ruby read web files before we integrate it into our frequency program. I'll be using an XML newspaper file from one of my digital history projects -- feel free to use the same or select your own file:
require 'open-uri'
# URI.open is open-uri's entry point; a bare open() no longer fetches URLs in Ruby 3
uri_file = URI.open("http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml").read()
puts uri_file
The above program will read the URL and print its contents to the screen. But you'll notice something that will inconvenience us if we try to generate a frequency: the output includes the HTML tags. We need to get rid of all that junk. There are a couple of ways to do that, but we're going to return to our good friend regex to look for HTML tags and strip out everything we don't want. We'll use the gsub method and regular expressions to replace HTML tags with empty strings. We'll also use it to strip out punctuation marks and other HTML formatting (such as &quot;). Make a small edit to your file:
require 'open-uri'
# read a URL, strip out HTML tags, and assign the file to a variable
uri_file = URI.open("http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml").read().gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "")
puts uri_file
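To see what the tag-stripping pattern does without fetching anything, you can try it on a short snippet of markup. The snippet and the method name strip_tags are made up for this example:

```ruby
# Remove anything that looks like an opening or closing tag.
def strip_tags(text)
  text.gsub(/<\/?[^>]*>/, "")
end

puts strip_tags("<p>Hello, <b>digital</b> history!</p>")
# => Hello, digital history!
```

The pattern reads as: a <, an optional /, any run of characters that are not >, and then the closing >.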
You should now be seeing just the text of the webpage we are having Ruby read. Pretty cool, huh? But we're not quite where we want to be yet. Let's also get rid of punctuation and numbers as well as downcase all the text so we have a consistent word base:
uri_file = URI.open("http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml").read().gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase
puts uri_file
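A long chain like this is easier to follow when each step sits on its own line. Here is a sketch of the same sequence of steps applied to a made-up sentence rather than a web page (the method name clean is hypothetical):

```ruby
def clean(text)
  text.gsub(/<\/?[^>]*>/, "")      # remove tags
      .gsub(/[0-9]/, "")           # remove digits
      .gsub(/[(,?!'":.)]/, "")     # remove punctuation
      .downcase                    # normalize case
end

sample = '<p>In 1972, the "Trail" began!</p>'
p clean(sample).split(' ')
# => ["in", "the", "trail", "began"]
```

Each gsub hands its result to the next call, so the order matters: the tags have to go before the punctuation step, or the angle brackets would be left stranded.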
Now let's add this to our frequency generator.
require 'open-uri'
uri_file = URI.open("http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml").read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase
filename = File.new("#{uri_file}","r").read.downcase.scan(/[\w']+/)
frequency = Hash.new(0)
filename.each { |word| frequency[word] += 1 }
frequency.sort_by { |x,y| y }.reverse.each { |w,f| puts "#{f}, #{w}" }
Ok, run ruby frequency.rb and we should . . . wait, what happened? If you run this, you should get an error. Time to debug!
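A useful first step when debugging is to reproduce the error in the smallest program you can. As a sketch, with made-up sample text and a hypothetical method name, this is what happens when File.new is handed a string of words instead of the name of a file:

```ruby
# Try to open the given string as a file; return the error message if that fails.
def try_to_open(text)
  File.new(text, "r").read
rescue SystemCallError => e
  e.message
end

puts try_to_open("these are words from a web page, not a file name")
```

File.new treats whatever string it receives as a path on disk, so it goes looking for a file with that name and complains when it cannot find one.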
The issue is we're not reading a file, we're reading the contents of a variable. So, there's no need for File.new. We can get rid of that. We also need to update the loop to read our URL variable:
require 'open-uri'
uri_file = URI.open("http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml").read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase
frequency = Hash.new(0)
uri_file.each_line { |word| frequency[word] += 1 } # each_line walks the string one line at a time
frequency.sort_by { |x,y| y }.reverse.each { |w,f| puts "#{f}, #{w}" }
All right, now we can run this. Type in ruby frequency.rb and . . . whoa. Something still isn't right. You should be outputting some sort of frequency counter, but the program is counting lines rather than individual words. We forgot to split the words apart. So, we'll add the split method:
require 'open-uri'
uri_file = URI.open("http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml").read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ')
frequency = Hash.new(0)
uri_file.each { |word| frequency[word] += 1 } # uri_file is now an array of words
frequency.sort_by { |x,y| y }.reverse().each { |w,f| puts "#{f}, #{w}" }
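The difference between walking over lines and walking over words is easy to see on a small made-up string. The method names here are hypothetical:

```ruby
def count_lines(text)
  text.each_line.count
end

def count_words(text)
  text.split(' ').count
end

sample = "the trail of\nbroken treaties"   # two lines, five words
p count_lines(sample)   # => 2
p count_words(sample)   # => 5
```

A string only knows about characters and lines; it is split that turns it into an array with one element per word.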
Before we move on, let's clean things up a bit. Let's move our URL reader into a method and rewrite some code. The method should look like this:
def readFile(url)
  # Strip out HTML tags, numbers, and punctuation, lower-case
  # all words, and split the words apart
  uri_file = URI.open(url).read().gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ')
  return uri_file
end
Now we can rewrite the URL input as:
url = "http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml"
uri_file = readFile(url)
Your file should now look similar to this:
require 'open-uri'
def readFile(url)
  # Strip out HTML tags, numbers, and punctuation, lower-case
  # all words, and split the words apart
  uri_file = URI.open(url).read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ')
  return uri_file
end
# read the URL and build the list of words
url = "http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml"
uri_file = readFile(url)
#print uri_file
frequency = Hash.new(0)
uri_file.each { |word| frequency[word] += 1 }
frequency.sort_by { |x,y| y }.reverse.each { |w,f| puts "#{f}, #{w}" }
We're also going to add a new way of inputting files by using Ruby's ARGV array. ARGV is a global array that holds the command-line arguments typed after the filename. So, we'll rewrite the code above a bit:
require 'open-uri'
def readFile(url)
  # Strip out HTML tags, numbers, and punctuation, lower-case
  # all words, and split the words apart
  uri_file = URI.open(url).read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ')
  return uri_file
end
# read the URL given on the command line and build the list of words
url = ARGV[0]
uri_file = readFile(url)
#print uri_file
frequency = Hash.new(0)
uri_file.each { |word| frequency[word] += 1 }
frequency.sort_by { |x,y| y }.reverse().each { |w,f| puts "#{f}, #{w}" }
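One thing to keep in mind is that ARGV[0] is nil when the program is run with no arguments. As an optional sketch, not part of the program above, a small helper can give a friendlier message in that case. The method name url_from and the example.com address are placeholders:

```ruby
# Hypothetical helper: pull the URL out of an argument list, or explain usage.
def url_from(args)
  raise ArgumentError, "Usage: ruby frequency.rb URL" if args.empty?
  args[0]
end

# In a real program you would pass ARGV; literal arrays are used here
# so the example behaves the same way every time it runs.
puts url_from(["http://example.com/sample.xml"])

begin
  url_from([])
rescue ArgumentError => e
  puts e.message
end
```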
You should now be able to run ruby frequency.rb http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml in the command line. And there we have it! A working word frequency generator that can read HTML files or local files. This may be as far as you want to go, but if you're like me, you'd love to have a program that not only generates frequencies but will also output a file that you can use. In my case, when doing digital scholarship, I want files that can be read by a browser. So, we're going to have the frequency list export as HTML. For this, we'll be bringing back in our File I/O method:
File.open("output.html", "w") do |output|
  frequency = Hash.new(0)
  uri_file.each { |word| frequency[word] += 1 }
  frequency.sort_by { |x,y| y }.reverse().each do |w,f|
    output.write "<p>#{f}, #{w}</p>\n"
  end
end
Let's also let the user know where the file was exported. Add to the end of the file:
puts "\nFile exported to #{Dir.pwd}.\n"
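If you want to experiment with the file-writing step by itself, here is a sketch that writes a tiny made-up frequency table to a temporary folder and reads it back. The method name write_report is hypothetical, and tmpdir is part of Ruby's standard library:

```ruby
require 'tmpdir'

# Write each word and its count as an HTML paragraph, largest count first.
def write_report(frequency, path)
  File.open(path, "w") do |output|
    frequency.sort_by { |_word, count| -count }.each do |word, count|
      output.write "<p>#{count}, #{word}</p>\n"
    end
  end
  path
end

Dir.mktmpdir do |dir|
  path = write_report({ "treaty" => 3, "trail" => 5 }, File.join(dir, "output.html"))
  puts File.read(path)
end
```

Because File.open is given a block, Ruby closes the file for us when the block ends, which is why there is no explicit close call.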
So, your program should now look like:
require 'open-uri'
def readFile(url)
  # Strip out HTML tags, numbers, and punctuation, lower-case
  # all words, and split the words apart
  uri_file = URI.open(url).read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ')
  return uri_file
end
# read the URL given on the command line and build the list of words
url = ARGV[0]
uri_file = readFile(url)
# print output to HTML file
File.open("output.html", "w") do |output|
  frequency = Hash.new(0)
  uri_file.each { |word| frequency[word] += 1 }
  frequency.sort_by { |x,y| y }.reverse().each do |w,f|
    output.write "<p>#{f}, #{w}</p>\n"
  end
end
puts "\nFile exported to #{Dir.pwd}.\n"
You should now be set to type ruby frequency.rb http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml at the command line, which will compute the frequencies and output the results to an HTML file.
Neat, huh? Except . . . well, perhaps it isn't that useful yet. I mean, is it really useful for us to know that the word "the" shows up 35 times? Not really. In fact, you've probably noticed that the highest frequencies in the list mostly belong to common words. (This is the pattern described by Zipf's law: a small number of very common words account for a large share of any text.) Let's get rid of those.
We'll start by creating an array of common words. Let's also make it a constant so we don't have to worry about override problems. Remember that we stripped out punctuation, so we need to list the words without apostrophes:
STOPWORDS = %w{a about above across after again against all am an and any are arent as at be because been before being below between both but by cant cannot could couldnt did didnt do does doesnt doing dont down during each few for from further had hadnt has hasnt have havent having he her here heres hers herself him himself his how i id ill im ive if in into is isnt it its itself lets me more most mustnt my myself no nor not of off on once only or other ought our ours ourselves out over own same shant she should shouldnt so some such than that the their theirs them themselves then there these they this those through to too under until up very was we were what when where which while who why with would you your yours yourself yourselves}
Now we'll add this to our readFile method and tell Ruby to remove words that appear in the array:
require 'open-uri'
STOPWORDS = %w{a about above across after again against all am an and any are arent as at be because been before being below between both but by cant cannot could couldnt did didnt do does doesnt doing dont down during each few for from further had hadnt has hasnt have havent having he her here heres hers herself him himself his how i id ill im ive if in into is isnt it its itself lets me more most mustnt my myself no nor not of off on once only or other ought our ours ourselves out over own same shant she should shouldnt so some such than that the their theirs them themselves then there these they this those through to too under until up very was we were what when where which while who why with would you your yours yourself yourselves}
def readFile(url)
  # Strip out HTML tags, numbers, and punctuation, lower-case
  # all words, split the words apart, and remove stopwords
  uri_file = URI.open(url).read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ') - STOPWORDS
  return uri_file
end
The program should now remove words that appear in the STOPWORDS array. Now we have something a little more useful to us.
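The removal itself is done by the minus sign: subtracting one array from another drops every occurrence of each listed word. A short sketch with a tiny made-up stopword list (the method name remove_stopwords is hypothetical):

```ruby
# Array subtraction removes every occurrence of each word in the second array.
def remove_stopwords(words, stopwords)
  words - stopwords
end

p remove_stopwords(%w{the trail of broken treaties and the march}, %w{the and of})
# => ["trail", "broken", "treaties", "march"]
```

Notice that both copies of "the" disappear, and the remaining words keep their original order.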
So, the program in its entirety should now look like:
#!/usr/bin/ruby -w
# FREQr.rb
#
# Written by Jason A. Heppler
#
# This program is free software.
# You can distribute/modify this program under the terms of
# the GNU Lesser General Public License version 2.1.
#
# Last Modified: Mon Jan 10 23:15:08 CST 2011
require 'open-uri'
STOPWORDS = %w{a about above across after again against all am an and any are arent as at be because been before being below between both but by cant cannot could couldnt did didnt do does doesnt doing dont down during each few for from further had hadnt has hasnt have havent having he her here heres hers herself him himself his how i id ill im ive if in into is isnt it its itself lets me more most mustnt my myself no nor not of off on once only or other ought our ours ourselves out over own same shant she should shouldnt so some such than that the their theirs them themselves then there these they this those through to too under until up very was we were what when where which while who why with would you your yours yourself yourselves}
# Strip out HTML tags, numbers, and punctuation, then
# lower-case all words, split the words apart, and remove stopwords
def readFile(url)
  uri_file = URI.open(url).read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;/, "").gsub(/[0-9]/, "").gsub(/[(,?!'":.)]/, "").downcase.split(' ') - STOPWORDS
  return uri_file
end
# Read the URL given on the command line and build the list of words
url = ARGV[0]
uri_file = readFile(url)
# Save output to HTML
File.open("output.html", "w") do |output|
  frequency = Hash.new(0)
  uri_file.each { |word| frequency[word] += 1 }
  frequency.sort_by { |x,y| y }.reverse().each do |w,f|
    output.write "<p>#{f}, #{w}</p>\n"
  end
end
# Give the user an exported-to message
puts "\nFile exported to #{Dir.pwd}.\n"
Simply type in ruby frequency.rb http://www.framingredpower.org/archive/newspapers/frp.wapo.19721102.xml and the program will output an HTML file and confirm the successful completion of the program. Congrats! You now have your first full Ruby program. Do some hacking on this program. Add a function or feature to it or optimize the code and see what you can accomplish. Perhaps, for example, you want another method so you can output an HTML file that generates keywords in context or a word cloud. Or, if you're really ambitious, maybe you can learn about Ruby on Rails and make this program run as a webpage rather than the command line.
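As one possible starting point for the keywords-in-context idea, here is a rough sketch. The method name kwic, the default window of two words, and the sample word list are all assumptions made for illustration, and you would want to run it on the word list before stopwords are removed so the context reads naturally:

```ruby
# Keyword-in-context sketch: for each occurrence of the keyword, collect the
# words on either side of it (up to `window` words in each direction).
def kwic(words, keyword, window = 2)
  lines = []
  words.each_with_index do |word, i|
    next unless word == keyword
    first = [i - window, 0].max          # don't run off the front of the list
    lines << words[first..(i + window)].join(' ')
  end
  lines
end

words = %w{the trail of broken treaties ended the long trail west}
puts kwic(words, "trail")
```

For the sample list this prints "the trail of broken" and "the long trail west", one per line. From there, writing the lines to an HTML file works the same way as the frequency output above.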
If you've stuck through reading The Rubyist Historian to the end, you should now have a working knowledge of the Ruby programming language. I hope that I've been able to competently explain key concepts and ideas of Ruby. But we've only touched the surface of Ruby. There are several resources out there to continue learning about Ruby. I would start with these:
In about ten years you can call yourself a programmer.
Visit the Rubyist Historian Table of Contents for more sections, and check out the Github repository for an archive of all the code examples.
See something that's wrong? Examples that don't work? Explanations that are unclear or confusing? Embarrassing typographic errors? Drop me an email at jason.heppler+feedback at gmail and I'll fix things right up!
Topic structure, examples, and explanations for the Rubyist Historian are inspired by, credited to, and drawn from Stephen Ramsay and his course Electronic Text.