Database Programming: Closing The Door On The "Discredited" Prime Number Tangent


This post will close the loop on the “discredited” prime number syntax tangent, the code for which was first posted here and subsequently modified here (and briefly referenced here).


When we last left this discussion, I’d abandoned this approach as a contender for first place in Denis’ Prime Number Challenge because the approaches Denis and Hugo used, both based on the Sieve of Eratosthenes, were fundamentally less computation-intensive.  However, I had a feeling that this code should run way faster than the couple of hours I was anticipating for the last version, and I’ve indeed got two interesting versions of this syntax that, under our current rules of engagement, run in under 30 seconds.


Since the last post on this topic, I’ve made the following modifications:



  • DataTypes changed back to int
  • Dropped the INSTEAD OF INSERT trigger from the design
  • Re-introduced the SetBuilder table
  • In the loop/insert, restored comparison with SetBuilder, but changed the WHERE clause to stop looking at the square root of the CandidateValue rather than the CandidateValue itself

This last point bears a little discussion, and was the primary performance issue with the previous version of the code.  There’s an old fable about a group of people running away from a hungry bear.  The key to survival in this scenario is not to be the fastest human in the pack, but rather to not be the slowest individual.


Similarly, we don’t need to build a list of factors for the numbers that aren’t prime numbers; we just need to diagnose their lack of primacy.  So, if a particular number has any factors greater than one but less than or equal to its square root, it’s not a prime number.  All of this hopefully makes good sense.


Now, To The Interesting Part..


I then coded up two alternatives for populating ComparisonMatrix: a cursor loop and a straight insert (in order to render this post easier to deal with, I’ve placed the code that builds and populates all the tables except ComparisonMatrix in a comment to this post here):



  • Cursor Loop INSERT Code


  keep some house for what comes..


DECLARE @Now DATETIME,


        @PreviousInterval DECIMAL(12,2),


        @Interval DECIMAL(12,2),


        @Counter int,


        @CandidateValue int


 


  log a status message


SELECT GETDATE() AS [Processing: INSERTing into ComparisonMatrix..]


 


  set timer


SET @Now = GETDATE()


 


  initalize a counter


SET     @Counter = 1


 


  declare a cursor to traverse the Candidate table


DECLARE TraverseCandidateTable CURSOR FAST_FORWARD READ_ONLY FOR


SELECT  CandidateValue


FROM    Candidate


ORDER BY CandidateValue


 


  open the cursor and get the first record


OPEN    TraverseCandidateTable


 


FETCH NEXT


FROM  TraverseCandidateTable


INTO  @CandidateValue


 


  log a status message


SELECT GETDATE() AS [Cursor Loop Starts..]


 


BEGIN TRAN


 


  start a loop which will run as long as the cursor finds a record


WHILE @@FETCH_STATUS = 0


BEGIN


 


        INSERT  ComparisonMatrix (


            Base,


            Factor


        )


        SELECT  @CandidateValue,


                Id


        FROM    SetBuilder


        WHERE   Id <= SQRT(@CandidateValue)


        AND     @CandidateValue % Id  = 0


        UNION ALL


        SELECT  @CandidateValue,


                @CandidateValue


 


  log a database commit and a status message and start a new transaction


  if it’s been 20000 transactions since we did so


    (I experimented with batch sizes of 5K, 10K, 20K, 25K, and 50K;


          20K provided the best performance on my sandbox)


    IF  @Counter % 20000 = 0


    BEGIN


        COMMIT TRAN


 


        SET @Interval = DATEDIFF (ss, @now, GETDATE())


 


        SELECT  GETDATE()               AS [Processing: Status Datetime..],


                @Counter                AS [Current Count..],


                @CandidateValue         AS [Current Candidate..],


                @Interval               AS [ElapsedSecondsThisBatch..],


                CASE


                    WHEN @PreviousInterval = 0 THEN 100


                    ELSE (@Interval/@PreviousInterval) * 100


                END                     AS [PercentPerformanceOfPreviousBatch]


 


  initialize loop variables for the next batch


 


        SET @Now = GETDATE()


        SET @PreviousInterval = @Interval


 


        BEGIN TRAN


    END


 


  get the next candidate value


    FETCH NEXT


    FROM  TraverseCandidateTable


    INTO  @CandidateValue


 


  increment the final transaction


    SET   @Counter = @Counter + 1


END


 


  we’re out of the loop;


  we still need to log one more status record and commit the final transaction


 


SET @Interval = DATEDIFF (ss, @now, GETDATE())


 


SELECT  GETDATE()               AS [Processing: Status Datetime..],


        @Counter                AS [Current Count..],


        @CandidateValue         AS [Current Candidate..],


        @Interval               AS [ElapsedSecondsThisBatch..],


        CASE


            WHEN @PreviousInterval = 0 THEN 100


            ELSE (@Interval/@PreviousInterval) * 100


        END                     AS [PercentPerformanceOfPreviousBatch]


 


COMMIT TRAN


 


  clost and deallocate the cursor


CLOSE TraverseCandidateTable


 


DEALLOCATE TraverseCandidateTable


 


  log a status message


SELECT GETDATE() AS [Cursor Loop Complete..]



  • “Straight” INSERT Code


  keep some house for what comes..


DECLARE @Now DATETIME,


        @Interval DECIMAL(12,2)


 


  log a status message


SELECT GETDATE() AS [Processing: INSERTing into ComparisonMatrix..]


 


  set timer


SET @Now = GETDATE()


 


  thia INSERT/SELECT will get the factors from 1 to the SQRT


      (which ia the largest possible factor except the candidate itself)


INSERT  ComparisonMatrix (


    Base,


    Factor


)


SELECT  c.CandidateValue,


        sb.Id


FROM    SetBuilder sb


JOIN    Candidate c


  use of a rendered column for the square root results in (essentially)


      a doubling of execution time!  try it!


ON      sb.Id <= SQRT(c.CandidateValue) –c.SQRTCandidateValue


AND     c.CandidateValue % sb.Id  = 0


 


  log a status message


SELECT  GETDATE()               AS [ComparisonMatrix First Insert: Completed At..],


        @@ROWCOUNT              AS [Rows Inserted..],


        CAST(DATEDIFF (ms, @now, GETDATE()) AS DECIMAL (12,3))/1000      AS [ElapsedSeconds..]


GO


Once the ComparisonMatrix table is populated, all we need to do is run a SELECT against it to get the list of prime numbers (more on that in a minute).


What’s really fascinating to me about this is that the cursor-based solution outperforms the full-set-based solution, every time.  I get the best performance with 20K row batchs (20 seconds); I tested 5K, 10K, 20K, 25K, and 50K batch sizes and never got worse than 28 second response time.  The set-based solution never ran under 31 seconds.


As Adam notes here, sometimes cursors help, and this is one of those times.


Note also that the SELECT statements are different for each alternative due to a slight difference in rendering style (that did not impact the results):



  • Cursor Loop INSERT Code


  generate the list of prime numbers


      if the SUM(IsQuotientInteger)=2 (where IsQuotientInteger=1 when a particular member of the set


      is a precise integer factor of the candidate value), then that candidate is a prime number,


      because IsQuotientInteger will equal 1 for 1 and the candidate itself.  All candidates with


      additional factors will thus be filtered out by the HAVING cluase.


SELECT  Base


FROM    ComparisonMatrix


GROUP BY Base


HAVING SUM(IsQuotientInteger) = 2


ORDER BY Base


  • “Straight” INSERT Code

      generate the list of prime numbers


          if the SUM(IsQuotientInteger)=1 (where IsQuotientInteger=1 when a particular member of the set


          is a precise integer factor of the candidate value), then that candidate is a prime number,


          because IsQuotientInteger will equal 1 for 1 and any factor other than candidate itself (the candidate itself is


          assumed, but not INSERTed, for performance reasons (ALL numbers are factors of themselves, so I don’t feel


         the need to render that)).  All candidates with additional factors will thus be filtered out by the HAVING cluase.


    SELECT  Base


    FROM    ComparisonMatrix


    GROUP BY Base


    HAVING SUM(IsQuotientInteger) = 1


    ORDER BY Base


  • Conclusion


    When I started on this project, I didn’t know about the Sieve of Eratosthenes, so I approached it with something of a blunt stick, and my early results betrayed this state of affairs.  Through careful tuning of the code, I was able to bring a process whose first draft would’ve taken two weeks to run down to a 20 second execution time.


    Even though the Sieve of Eratosthenes-based solutions are obviously preferable for a production implementation of this functionality, tuning this code was a worthwhile exercise from an educational standpoint.


    “Love your mistakes”, indeed.


         -wp

    Comments (4)

    1. Anonymous says:

      Denis’ Prime Number Challenge just won’t die.&amp;nbsp; I think this topic has spurred more dialog than any&amp;nbsp;other…

    2. Anonymous says:

      What’s this, you say? Useful T-SQL making a return to this blog? Yep. The first T-SQL I’ve posted since

    3. Ward Pond says:

      Here’s the DDL for the three tables used by the discredited syntax:

      USE PrimeNumberTestBeta

      GO

      —  log a status message

      SELECT GETDATE() AS [Starting: Building and Populating Schema..]

      SET NOCOUNT ON

      GO

      —  remove the three tables we’re going to use

      —  if they’re present from a previous run

      IF EXISTS (SELECT 1 FROM sys.objects WHERE name = ‘SetBuilder’)

         DROP TABLE SetBuilder

      —   SELECT GETDATE() AS [SchemaCreate: SetBuilder Dropped..]

      GO

      IF EXISTS (SELECT 1 FROM sys.objects WHERE name = ‘Candidate’)

      BEGIN

         DROP TABLE Candidate

      —    SELECT GETDATE() AS [SchemaCreate: Candidate Dropped..]

      END

      GO

      IF EXISTS (SELECT 1 FROM sys.objects WHERE NAME = ‘ComparisonMatrix’)

      BEGIN

         DROP TABLE ComparisonMatrix

      —    SELECT GETDATE() AS [SchemaCreate: ComparisonMatrix Dropped..]

      END

      GO

      —  create the table to hold the set of values between 1 and 1M

      CREATE TABLE SetBuilder (

         Id int PRIMARY KEY

      )

      –SELECT GETDATE() AS [SchemaCreate: SetBuilder Created..]

      GO

      —  create the table to hold the set of values to test from primacy

      —  (we are going to remove some low-hanging fruit for performance reasons,

      —      so we need a distinct table

      CREATE TABLE Candidate (

         CandidateValue  int,

         SQRTCandidateValue AS SQRT(CandidateValue)  PERSISTED –CONSTRAINT Candidate_PK PRIMARY KEY CLUSTERED WITH FILLFACTOR = 100

      )

      –SELECT GETDATE() AS [SchemaCreate: Candidate Created..]

      GO

      —  create the table where the calculations will be performed

      —      Base is the number we’re testing for primacy

      —      Factor is the number we’re dividing it by

      —      IsQuotientInteger is a bit flag indicating whether the Quotient is an integer

      CREATE TABLE ComparisonMatrix (

         Base int,

         Factor int,

         IsQuotientInteger AS CASE

             WHEN Base % Factor = 0 THEN CAST(1 AS int)

             ELSE CAST(0 AS int)

         END           PERSISTED

      )

      –SELECT GETDATE() AS [SchemaCreate: ComparisonMatrix Created..]

      GO

      —  build indexes on the ComparisonMatrix table to improve perfomance

      –CREATE UNIQUE /*CLUSTERED*/ INDEX ComparisonMatrix_PK ON ComparisonMatrix(Base, Factor) WITH FILLFACTOR = 90

      –SELECT GETDATE() AS [SchemaCreate: Clustered Index On ComparisonMatrix Created..]

      GO

      — CREATE INDEX ComparisonMatrix_IDX1 ON ComparisonMatrix (IsQuotientInteger) WITH FILLFACTOR = 100

      –SELECT GETDATE() AS [SchemaCreate: Index On ComparisonMatrix Created..]

      GO

      —  log a status message

      –SELECT GETDATE() AS [SchemaCreate: Tables Created..]

      GO

      —  using Shaun’s syntax (because it doesn’t have a database footprint),

      —  populate SetBuilder

      ;WITH digits(i) AS (

         SELECT 1 AS I UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL

         SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL

         SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL

         SELECT 0

      ),

      —      generate 1M rows each with a unique row number, i

      sequence(i) AS (

         SELECT d1.i + (10*d2.i) + (100*d3.i) + (1000*d4.i) + (10000*d5.i) + (100000*d6.i)

           FROM digits AS d1,

                digits AS d2,

                digits AS d3,

                digits AS d4,

                digits AS d5,

                digits AS d6

      )

      INSERT  SetBuilder (Id)

      SELECT  i+1 — CTE gives us 0-999999; we want 1-1000000

      FROM    sequence

      —  SELECT GETDATE() AS [SetBuilder Populated..]

      GO

      —  using Shaun’s syntax (because it doesn’t have a database footprint),

      —  populate Candidate

      ;WITH digits(i) AS (

         SELECT 1 AS I UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL

         SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL

         SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL

         SELECT 0

      ),

      —      generate 1M rows each with a unique row number, i

      sequence(i) AS (

         SELECT d1.i + (10*d2.i) + (100*d3.i) + (1000*d4.i) + (10000*d5.i) + (100000*d6.i)

           FROM digits AS d1,

                digits AS d2,

                digits AS d3,

                digits AS d4,

                digits AS d5,

                digits AS d6

      )

      INSERT  Candidate (CandidateValue)

      SELECT  i+1 — CTE gives us 0-999999; we want 1-1000000

      FROM    sequence

      WHERE   i+1 > 1  — exclude 1 right away

      —  SELECT GETDATE() AS [Candidate Populated..]

      GO

      —  now delete the low-hanging fruit from Candidate

      —    WE ARE ONLY DONG THIS FOR PERFORMANCE; the algorhitm can handle them

      —    NOTE that these fifteen DELETE statements, which encompass the first 15 prime numbers and

      —      thier multiples, together remove 86.13% of the candidate values

      DELETE  Candidate WHERE   (CandidateValue % 2 = 0  AND CandidateValue >= 2+1)

      DELETE  Candidate WHERE   (CandidateValue % 3 = 0  AND CandidateValue >= 3+1)

      DELETE  Candidate WHERE   (CandidateValue % 5 = 0  AND CandidateValue >= 5+1)

      DELETE  Candidate WHERE   (CandidateValue % 7 = 0  AND CandidateValue >= 7+1)

      DELETE  Candidate WHERE   (CandidateValue % 11 = 0 AND CandidateValue >= 11+1)

      DELETE  Candidate WHERE   (CandidateValue % 13 = 0 AND CandidateValue >= 13+1)

      DELETE  Candidate WHERE   (CandidateValue % 17 = 0 AND CandidateValue >= 17+1)

      DELETE  Candidate WHERE   (CandidateValue % 19 = 0 AND CandidateValue >= 19+1)

      DELETE  Candidate WHERE   (CandidateValue % 23 = 0 AND CandidateValue >= 23+1)

      DELETE  Candidate WHERE   (CandidateValue % 29 = 0 AND CandidateValue >= 29+1)

      DELETE  Candidate WHERE   (CandidateValue % 31 = 0 AND CandidateValue >= 31+1)

      DELETE  Candidate WHERE   (CandidateValue % 37 = 0 AND CandidateValue >= 37+1)

      DELETE  Candidate WHERE   (CandidateValue % 41 = 0 AND CandidateValue >= 41+1)

      DELETE  Candidate WHERE   (CandidateValue % 43 = 0 AND CandidateValue >= 43+1)

      DELETE  Candidate WHERE   (CandidateValue % 47 = 0 AND CandidateValue >= 47+1)

      —  these bear less fruit, but are included because they’ll still

      —  help performance; I continued to add a line as long as 1% of remaining candidates

      —  was deleted

      DELETE  Candidate WHERE   (CandidateValue % 53  = 0 AND CandidateValue >= 53+1)

      DELETE  Candidate WHERE   (CandidateValue % 59  = 0 AND CandidateValue >= 59+1)

      DELETE  Candidate WHERE   (CandidateValue % 61  = 0 AND CandidateValue >= 61+1)

      DELETE  Candidate WHERE   (CandidateValue % 67  = 0 AND CandidateValue >= 67+1)

      DELETE  Candidate WHERE   (CandidateValue % 71  = 0 AND CandidateValue >= 71+1)

      DELETE  Candidate WHERE   (CandidateValue % 73  = 0 AND CandidateValue >= 73+1)

      DELETE  Candidate WHERE   (CandidateValue % 79  = 0 AND CandidateValue >= 79+1)

      DELETE  Candidate WHERE   (CandidateValue % 83  = 0 AND CandidateValue >= 83+1)

      DELETE  Candidate WHERE   (CandidateValue % 89  = 0 AND CandidateValue >= 89+1)

      DELETE  Candidate WHERE   (CandidateValue % 97  = 0 AND CandidateValue >= 97+1)

      DELETE  Candidate WHERE   (CandidateValue % 101 = 0 AND CandidateValue >= 101+1)

      —  SELECT GETDATE() AS [Candidate DePopulated For Performance..]

      –CREATE CLUSTERED INDEX Candidate_PK ON Candidate(CandidateValue)

      –CREATE INDEX Candidate_IDX1 ON Candidate(SQRTCandidateValue)

      —  uncomment this SELECT statement to confirm that we have cut 1M candidates down to 119,590

      –SELECT COUNT(1) AS [Candidate Depopulated Count] FROM Candidate

      GO

      —  log a status message

      –SELECT GETDATE() AS [Tables Populated..]

      GO