Diagnosing Production Problems: Zeroth Law
August 14, 2010
Stephen's first law of diagnosing problems in production should have been: make sure you actually know the scope of the problem. We have a process that checks for duplicates in an inbound file. Records marked as duplicates are not moved into production. A refinement of the process was installed this week. All the sudden, e-mails showed that thousands of records were being marked as duplicates. I came over to help investigate, and found people looking at code, trying to figure out what was going on, because they knew without doubt that these records were not in fact duplicates.
But we needed to step back and ask what is the scope of the problem? We looked at the e-mail with the duplicates, picked a name or two from the list, and looked in the original input files and confirmed that they were not in the files. So how/why were they reported? But then let's set that aside, and ask: did all of today's records make it into production? If yes, then we have a problem but not a crisis. The answer was yes: we could see a 1-1 match between inbound file and outbound production data. Therefore, we have a minor reporting problem, but the core of the system was working just fine. Panic averted.
So what was the cause? A staging table that had not been truncated after a previous file was processed. All those records were being reported as duplicates.