The second one is a step where there are two errors:

Job with two steps that fail

There also is a job where the job and the step always fail.

Single step job that always fails

And, of course, a job that just works:

Job always succeeds

For most of these steps, they are simple "select 1" for success or "select 1/0" for a failure. This ensures the steps always succeed or fail.

These are four scenarios, and not completely comprehensive, but give me a good set of situations in which to test my query. Now I want to check and see if my code is working. I'll do that in an automated way. I'll use code to run sp_start_job, and then other code to check if my query returns what I expect.

Automated Testing

I wanted to test this, but not by starting the job manually in SSMS and then running my query. It's easy to forget to run a particular test or not properly reset things when there's a problem. To me, this is best fixed with automated testing. I decided to write some tSQLt tests to cover my cases.

The Action

The first step was to put my query in a stored procedure. This just means I'm not copy/pasting code into multiple tests. Plus, if I really want to add this as an alert, I'd want a procedure. The code is shown here:

CREATE OR ALTER PROC CheckForFailedJobSteps AS; WITH cteActivity (job_id, start_execution_date) AS ( -- get the latest job execution for all jobs     SELECT job_id,            MAX(start_execution_date) AS start_execution_date     FROM msdb.dbo.sysjobactivity     GROUP BY job_id) , cteJobStep (Job_ID, Step_Name, run_date, run_time)     AS (SELECT jh.job_id,                jh.step_name AS JobStepName,                jh.run_date,                jh.run_time         FROM msdb.dbo.sysjobhistory jh             INNER JOIN cteActivity ja                 ON jh.job_id = ja.job_id         WHERE jh.run_status = 0 --step failed               AND jh.step_id != 0               --         and jh.job_id = CONVERT(uniqueidentifier, '8C673935-F8C1-4E7D-94D3-1F3CAE50D7DC')               AND --this block ensures that we just pull information from the most recent job run             (             -- The start time of the step, converted to datetime             CONVERT(DATETIME, RTRIM(jh.run_date))             + (jh.run_time * 9 + jh.run_time % 10000 * 6 + jh.run_time % 100                * 10               ) / 216e4 >= ja.start_execution_date -- the time the job last started             )) SELECT COUNT(*) FROM cteJobStep c     INNER JOIN msdb.dbo.sysjobhistory     AS     h         ON h.job_id = c.Job_ID            AND h.run_date = c.run_date            AND h.run_time = c.run_time            AND h.step_id = 0            AND h.run_status = 1; RETURN; GO

The Tests

I have a template for tests that does an outline of what I need. I've written some articles on tSQLt, and the basic one shows a simple test with the Assemble, Act, Asset pattern. That's what I'll use here, with a simple test. The code is below for one test. This starts with a new test class and then the test.

EXEC tsqlt.NewTestClass @ClassName = N'AdminChecks' -- nvarchar(max) GO CREATE PROCEDURE [AdminChecks].[test JobFailureProc with No Failures] AS BEGIN     -------------------------------------------- ----- Assemble --------------------------------------------     DECLARE @expected INT       , @actual INT SELECT @expected = 0 CREATE TABLE #ProcResult (actual INT)          -------------------------------------------- ----- Act -------------------------------------------- -- Clear jobs     EXEC msdb.dbo.sp_start_job @job_name = 'Second Job with Two Errors', @step_name='Fourth Step'  WAITFOR DELAY '00:00:02'     EXEC msdb.dbo.sp_start_job @job_name = 'A Job with a Step Failure', @step_name='Export Data'  WAITFOR DELAY '00:00:02'     EXEC msdb.dbo.sp_start_job @job_name = 'Job that always fails', @step_name='Success Step'  WAITFOR DELAY '00:00:02' -- run the job EXEC msdb.dbo.sp_start_job @job_name = 'A Job that Always Succeeds'  WAITFOR DELAY '00:00:02'  -- GET results INSERT #ProcResult EXEC CheckForFailedJobSteps SELECT @actual = actual FROM #ProcResult AS pr     -------------------------------------------- ----- Assert -------------------------------------------- EXEC tsqlt.AssertEquals @Expected = @expected,                          @Actual = @actual,                            @Message = N'Incorrect result'  END     GO

I start in the assemble area by creating a table for the results and then a variable with my expected result. For this test, there are no failures with this job, so I expect a zero for the number of failed job steps.

In the Act part, I first need to be sure we don't have any failures. I could clear history, but since I know the state of this system. I could run the purge job, but that requires sysadmin, and potentially that's a problem here. In a team environment, this would be a complex system to set up. In a dev area where we might use containers, this is simpler. I'd likely leave a note in the proc itself that there need to be no job failures before running tests. I introduce a delay, to be sure the job completes before running the next line.

In any case, this is a minor flaw in the repeatable testing process. Hopefully someone finding failing tests would be able to determine a problem is the result of either bad code or previously existing failed steps. Since this requires some jobs to exist, this already is a more complex testing scenario than I'd expect.

For me, I run all my test jobs to ensure that the latest execution of all jobs has no failures. This means I'd return a zero by default. Next, I run the job I want to run, which is the "always succeeds" job for this test. I take the result and store it in the @actual variable.

In the asset area, I check if the expected and actual values are the same.

The other tests look just like this, expect the @expected value is set to a different number, depending on what is required. I also call a different job in the "act" section. I've attached the script to this article for the tests if you want to duplicate this. I end up with tests for these cases:

  • No job failure steps
  • A single job failure step
  • A single job with two failure steps
  • A job that has a failure step, but then runs successfully

To execute the tests, you can use the tsqlt.run or tsqlt.runtestclass commands with the test name or the test class. For me, I can run the tests in SQL Test, which I do when I'm doing things over and over. I like that I can start this and then still check queries in my query window. All four of my tests succeeded (eventually).

Test results in SQL Test

The command behind the scenes would be:

EXEC tsqlt.RunTestClass @TestClassName = N'AdminChecks'        

This gives me these results:

Manual test class execution results

Summary

One of the edge cases in monitoring jobs is the places where a job step might fail, but the job succeeds. There are reasons why a particular job might be set up in this manner, but often we still want to know if a job step failed. This article presents a way to get that data for the last run only. I chose that method, as often failures are transient and I don't want to be querying for failures that might have resolved themselves.

We also set up a testing framework to automate testing of our code, which allows us to add new cases or ensure regressions do not occur as we might enhance our code in the future.