An Incident Postmortem of a LAMP Stack Server Running a WordPress Site

Debugging a 500 server error

I recently debugged a LAMP stack server as part of my projects in the ALX SE program. The server returned a 500 internal server error on every request, and the log file gave no indication of the problem. I'm writing this blog post as my first postmortem of an incident, and I've kept it short and to the point.

Incident Summary

A LAMP stack running WordPress on an Ubuntu 14.04 machine experienced a service outage from October 9, 2023 (6 a.m. WAT) through October 15, 2023 (10 p.m. WAT), which made the site totally inaccessible. Specifically, the server returned a 500 internal server error, which points to a problem in the server's backend setup rather than on the user's end.

Timeline

  • I detected the problem around 6 a.m. (WAT) on October 9, 2023.

  • The detection came from users' inability to access the site.

  • After detection, I took steps to troubleshoot the different components of the server. My initial assumption was that the Apache server's config file contained an error.

  • I checked the Apache server's config file, but there were no errors. Everything was as it should be.

  • With the initial suspicion eliminated and the problem persisting, my next assumption was an incompatibility between the versions of the Apache server, PHP, and WordPress. This suspicion also proved false: the problem lingered even after I updated and upgraded all packages on the Ubuntu machine.

  • After more detailed troubleshooting of the server with strace, I found the problem to be a typo in the name of a PHP script referenced from a settings file.

Root Cause Analysis And Resolution

Having eliminated all previous suspicions, I used strace to track the server's operation while it served an HTTP request. The trace summary showed several errors, including I/O and socket connection errors. Most of them needed no action, however, because they were routine failed lookups rather than faults in the site (e.g., the No such file or directory error on "/etc/ld.so.nohwcap", an optional file that the dynamic loader checks for and that is normally absent).

To get a more detailed view of what was going on with the server, I traced the Apache server's process directly with the command: strace -p <apache server pid>. I used two tmux panes for this: the trace ran in one pane while I made a request to the server from the other. I got the server's pid with the command ps -auxf | grep apache2.
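The two-pane workflow above can be sketched as a pair of small shell helpers. This is a minimal sketch, not the exact commands from the incident: the helper names pid_args and missing_files and the /tmp/apache.trace path are made up for illustration. It assumes the goal is to attach to every apache2 process (the worker processes, not only the parent, serve requests) and then filter the saved trace for file lookups that failed.

```shell
#!/bin/sh
# Build the "-p PID -p PID ..." argument list that strace needs in order
# to attach to every process whose PID is passed as an argument.
pid_args() {
    args=""
    for pid in "$@"; do
        args="$args -p $pid"
    done
    # Drop the leading space before printing.
    echo "${args# }"
}

# Keep only the file-related syscalls in a saved trace that failed
# with ENOENT (No such file or directory).
missing_files() {
    grep -E '(^|[^a-z_])(open|openat|stat|lstat|access)\(.*ENOENT' "$1"
}

# Intended use on the server (needs root and a running Apache):
#   Pane 1: sudo strace -f -e trace=file -o /tmp/apache.trace $(pid_args $(pgrep apache2))
#   Pane 2: curl -sI 127.0.0.1
#   Then stop strace with Ctrl-C and run: missing_files /tmp/apache.trace
```

Restricting the trace to file-related syscalls with -e trace=file keeps the output short enough to read, which matters when the interesting line is a single failed lookup among many harmless ones.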

💡
Terminal Multiplexer, tmux for short, allows you to open multiple independent terminal windows in one view.

The latest trace showed that, apart from the errors seen in the previous trace, there were file I/O errors associated with the file class-wp-locale.php. After a few moments, I noticed that the file suffix in the trace was unusual: it read .phpp instead of .php, the normal suffix for PHP files.

Now I just needed to find the file(s) containing the line with that erroneous suffix and correct them. I did that by using grep to search recursively through all files in /var/www/html, the folder containing all the site's files.
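A recursive search of this kind might look like the sketch below. The function name find_bad_suffix is hypothetical; the grep flags are standard (-r recurses into subdirectories, -n prints line numbers, -F treats the pattern as a literal string so the dot is not a wildcard).

```shell
#!/bin/sh
# List every line under the given directory that contains the misspelled
# ".phpp" extension, prefixed with its file name and line number.
find_bad_suffix() {
    grep -rnF '.phpp' "$1"
}

# On the server:
#   find_bad_suffix /var/www/html
```

Each match is printed as file:line:text, so the output points straight at the line that needs correcting.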

After fixing the error in the wp-settings.php file, I restarted the server and tested again with a curl request. This time, it worked as expected and returned a 200 HTTP response code.
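The correction itself can be made by hand in an editor or scripted. The sketch below shows one scripted option with sed, which is an assumption rather than a record of how the fix was applied here. The function name fix_bad_suffix is hypothetical, and the file path and restart command in the comments assume a default Ubuntu Apache setup.

```shell
#!/bin/sh
# Replace every ".phpp" with ".php" in one file, keeping the original
# as a backup with a .bak extension.
fix_bad_suffix() {
    sed -i.bak 's/\.phpp/.php/g' "$1"
}

# On the server, as a user with write access to the site files:
#   fix_bad_suffix /var/www/html/wp-settings.php
#   sudo service apache2 restart
#   curl -sI 127.0.0.1    # the first line of the reply carries the status code
```

Keeping the .bak copy makes the change easy to undo if the substitution touches more than intended.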

Future Preventive Measure

Testing each component of the web stack before and after deployment would likely have caught this error much earlier and avoided the prolonged downtime.
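As one illustration of such a test, a post-deployment smoke check could request the site and fail loudly unless it answers with a 200. This is a minimal sketch under stated assumptions: the function names is_healthy and smoke_test and the URL in the comment are hypothetical, and a real check would probably cover more than the home page.

```shell
#!/bin/sh
# Succeed only when the given HTTP status code is 200.
is_healthy() {
    [ "$1" = "200" ]
}

# Request a URL, print a one-line verdict, and return non-zero on failure
# so that a deployment script can stop when the site is broken.
smoke_test() {
    status=$(curl -s -o /dev/null -w '%{http_code}' "$1")
    if is_healthy "$status"; then
        echo "OK $1"
    else
        echo "FAIL $1 (status $status)"
        return 1
    fi
}

# After each deployment:
#   smoke_test http://127.0.0.1/ || echo "roll back or investigate"
```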