Debugging delayed_job “Job NilClass#” errors on Heroku

TLDR: Make sure you specify the correct version of Ruby in your Gemfile! Our live server was running 1.9.2 instead of 1.9.3, and this was causing the NilClass# error on delayed jobs.
Recently I was working on enhancing one of our client’s webapps by adding the ability to dump data directly into their Microsoft Dynamics GP ERP system via FTP. This data was previously exported as a CSV and sent via email, then entered manually – so while the process was semi-automated, it was not as efficient as it could be.

Developing the initial FTP solution was shockingly simple – I was astounded by the productivity Ruby on Rails offers. After this code was deployed to production, however, we ran into a bit of a problem: FTP does not fail as gracefully as the previous email setup. The FTP / ERP server was on-premises, without backup power or an internet connection, and it just so happened that there was an internet outage at their facility during the week we deployed. With email this wasn’t a problem, as email systems will ‘spool’ messages until they can be delivered, but with FTP the inability to connect resulted in an ugly error.
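To give a feel for the kind of upload step involved – this is a hypothetical sketch, not the client app’s actual code; the class and method names (OrderExporter, to_csv, upload_csv) are illustrative:

```ruby
require 'net/ftp'
require 'csv'
require 'stringio'

# Hypothetical sketch of a CSV-over-FTP export step. If the remote
# server is unreachable, Net::FTP.open raises (e.g. SocketError or a
# timeout) -- the "ugly error" described above, with no spooling.
class OrderExporter
  def initialize(host, user, password)
    @host, @user, @password = host, user, password
  end

  # Build CSV text from an array of row arrays.
  def to_csv(rows)
    CSV.generate do |csv|
      rows.each { |row| csv << row }
    end
  end

  # Push the generated CSV to the remote server in one shot.
  def upload_csv(rows, remote_name)
    Net::FTP.open(@host, @user, @password) do |ftp|
      io = StringIO.new(to_csv(rows))
      ftp.storbinary("STOR #{remote_name}", io, Net::FTP::DEFAULT_BLOCKSIZE)
    end
  end
end
```

Unlike the email flow, nothing here retries on failure – which is exactly the gap delayed_job ended up filling.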

That is where Delayed Job came to the rescue! Fortunately I have the benefit of sharing a small town with Collective Idea – the team that maintains delayed_job, and generally awesome Ruby builders. I asked them for advice on making the FTP process more robust, and they recommended delayed_job: not only would it pull the FTP transfer out of the main front-end process, it also has built-in retry functionality in case an error (like the inability to connect) occurs.
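The retry behaviour is worth a quick illustration. By default, delayed_job backs off between attempts using a simple polynomial formula – treat the exact numbers below as an assumption about the gem version in use:

```ruby
# Sketch of delayed_job's default retry backoff: a failed job is
# rescheduled (attempts**4) + 5 seconds into the future, up to a
# configurable max_attempts. Pure Ruby, no gem required.
def reschedule_delay(attempts)
  (attempts**4) + 5 # seconds until the next retry
end

(1..5).map { |n| reschedule_delay(n) }
# => [6, 21, 86, 261, 630]
```

So a transient outage like the one at the client’s facility would simply push the job further out, attempt after attempt, until the connection came back.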

There are plenty of tutorials out there on delayed_job so I won’t rehash the setup; I will say that this process was also quite easy. The plan was to deploy this update on Heroku and then use the ‘free’ Heroku Scheduler to work our jobs. You can configure Scheduler to run your jobs every 10 minutes using a task like “rake jobs:workoff”.
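One detail of the setup is worth showing, because it explains a name that turns up in the logs later. delayed_job’s handle_asynchronously keeps the original method around under a _without_delay suffix. Here is a pure-Ruby sketch of that aliasing pattern (a simplification of what the gem does – no queueing shown):

```ruby
# A simplified, gem-free sketch of what handle_asynchronously :upload
# does to a model: the real method is preserved as upload_without_delay,
# and the public name becomes a stub that enqueues a background job.
class Order
  def upload
    "uploading CSV via FTP"
  end

  # Roughly the aliasing handle_asynchronously performs:
  alias_method :upload_without_delay, :upload

  def upload
    # the real gem enqueues a Delayed::Job here; the worker later
    # invokes upload_without_delay on the deserialized object
    "enqueued; worker will call upload_without_delay"
  end
end
```

That is why a healthy worker log shows jobs named like Order#upload_without_delay: the worker deserializes the object and calls the preserved original method.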

The deploy itself went smoothly and the database migrated, but once we ran our first test on the live server, the results were not as expected. There were no errors. There were no CSV files on the remote server. There was no explanation. Checking the logs, it appeared that the task was successfully running and working off the queued jobs – the CSV files were just not making it to the remote server somehow.

At this point we decided to roll back the deploy and regroup, as this is a live, customer-facing application. Fortunately, rolling back a deploy on Heroku is also very simple: ‘heroku rollback’ and you are there. It was time to start working backwards to identify the problem. We started with the staging server.

Our staging server also runs on Heroku – it is where we test new deploys before changing the live environment, thus avoiding breaking stuff. Unfortunately it didn’t catch the problem this time: the process, background queuing and all, worked as intended there. The only differences between the staging and live environments were in the config vars we use to set the FTP credentials and other application configuration info. We tested that too – the FTP credentials for the remote server were correct.

What’s next? Tests. Of course we ran our test suite before deploying the code, but when dealing with external resources like FTP, tests can only be so good. We had good coverage on our app, and the tests were passing.

Without much to go on, I started reviewing the logs from the production and staging environments with a fine-tooth comb to see if there were any inconsistencies that were not immediately apparent. Here is a link to the raw logs in a GitHub Gist, as our blog is not set up to handle this content well at this time: https://gist.github.com/jafoca/9858884

Note that in both cases the resulting output from running the jobs was ‘0 failed jobs, process exited with status 0’. Nothing out of the ordinary – the jobs on the live server were not ‘failing’, exactly, just somehow not completing. Looking at the logs even closer, I found a clue: the live environment was running “Job NilClass# (id=5) RUNNING” whereas the staging server was running “Job Order#upload_without_delay (id=8) RUNNING” – and the second, of course, looks more correct.
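To make the clue concrete: delayed_job builds a job’s display name from the deserialized payload object. A simplified sketch of that idea (an assumption loosely modeled on the gem’s internals, not its exact code) shows how the name degrades when deserialization silently yields nil:

```ruby
# Simplified sketch of how a delayed_job display name is formed:
# "<payload class>#<method>". If the serialized payload comes back
# as nil, the name collapses to "NilClass#" -- matching the live logs.
def job_display_name(payload_object, method_name)
  "#{payload_object.class}##{method_name}"
end

class Order; end

job_display_name(Order.new, :upload_without_delay) # => "Order#upload_without_delay"
job_display_name(nil, nil)                         # => "NilClass#"
```

In other words, “Job NilClass#” suggests the worker was running jobs whose payload object had deserialized to nothing at all – which is consistent with the jobs “succeeding” while producing no files.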

Busting out the old Google-fu, I started digging around for the culprit and eventually found a thread on StackOverflow discussing the Job NilClass# error. The thread didn’t directly solve my issue, but it did clue me into checking which Ruby web server was being used, as several commenters pointed to using Unicorn instead of Thin to fix this type of error.

I restarted both servers again to verify which Ruby web server was being used – both were running WEBrick 1.3.1, so I didn’t think that was the culprit. On closer examination I saw something else…

2014-03-29T15:20:13.203488+00:00 app[web.1]: [2014-03-29 15:20:13] INFO ruby 1.9.2 (2014-01-23) [x86_64-linux]
2014-03-29T15:20:13.204031+00:00 app[web.1]: [2014-03-29 15:20:13] INFO WEBrick::HTTPServer#start: pid=2 port=33789
2014-03-29T15:20:13.203488+00:00 app[web.1]: [2014-03-29 15:20:13] INFO WEBrick 1.3.1

The important bit is that the live environment was running on Ruby 1.9.2 – the staging server and my dev setup had both been set up after the live environment, so they were running Ruby 1.9.3.

All that remained was specifying the Ruby version in the Gemfile to tell Heroku to use 1.9.3. Sure enough, this cleared up the vague Job NilClass# problem in the job description and cleared the way for the FTP transfer to work as desired. Do I know why Ruby 1.9.2 was causing the jobs to fail? No. I’m just happy to have it resolved at this point!
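For completeness, the fix is a one-line addition to the Gemfile. The gems listed below are illustrative, not the app’s actual dependency list – the relevant line is the ruby declaration, which Heroku reads at build time:

```ruby
# Gemfile -- the 'ruby' directive pins the interpreter Heroku builds with.
source 'https://rubygems.org'

ruby '1.9.3'

# illustrative dependencies, not the real app's Gemfile:
gem 'rails'
gem 'delayed_job_active_record'
```

After adding the line, commit and push; Heroku will rebuild the slug on the declared Ruby version instead of its old default.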