Extract numeric values from text in SQL Server

I have this paragraph:

Speeding ticket is 210.99USD. Aggravated DUI could add up 1 year jail time.

This is standard text that follows this pattern:

Speeding ticket is [Amount]. Aggravated DUI could add up [Term] year jail time.

The ask is to extract Amount and Term from this text. The approach is to split the paragraph on spaces and use SQL Server's ISNUMERIC function to pick out the values:

Here is a sample prototype:

DECLARE @ParagraphText NVARCHAR(MAX) = N'Speeding ticket is 210.99USD. Aggravated DUI could add up 1 year jail time.'

--table variable
DECLARE @Test TABLE (ValueColumn VARCHAR(8000))
INSERT @Test
--I am using a custom split function, but you can use the built-in STRING_SPLIT() on SQL Server 2016 and later
SELECT * FROM dbo.fnSplitString(@ParagraphText, ' ')

--using a window ranking function to number the extracted values
SELECT ROW_NUMBER() OVER(ORDER BY ValueColumn) [ROW_NUMBER],*
FROM
(
    SELECT
        CONVERT(DECIMAL(20,8),
        CASE
            WHEN ISNUMERIC(ValueColumn) = 1 THEN CONVERT(FLOAT, ValueColumn)
            ELSE CONVERT(FLOAT, '0' + LEFT(ValueColumn, PATINDEX('%[^0-9.]%', ValueColumn) - 1))
        END) AS ExtractedColumn
    ,ValueColumn
    FROM @Test
) x
WHERE x.ExtractedColumn > 0
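
If you are on SQL Server 2016 or later (database compatibility level 130+), the custom split function can be swapped for the built-in STRING_SPLIT(). A minimal sketch of the insert step:

--STRING_SPLIT returns a single column named "value"
INSERT @Test
SELECT value FROM STRING_SPLIT(@ParagraphText, ' ')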

SQL Server window functions

One of the most obvious and useful sets of window functions is the ranking functions, which assign a rank to each row in the result set. There are three ranking functions:

ROW_NUMBER()
RANK()
DENSE_RANK()

The difference is easy to remember. For the examples, let’s assume we have this stocks data set.

IF OBJECT_ID('tempdb..#stocks') IS NOT NULL DROP TABLE #stocks;
;With Stocks AS
(
    SELECT 'MSFT' Symbol UNION ALL
    SELECT 'MSFT' Symbol UNION ALL
    SELECT 'MSFT' Symbol UNION ALL
    SELECT 'AAPL' Symbol UNION ALL
    SELECT 'GOOG' Symbol UNION ALL
    SELECT 'GOOG' Symbol UNION ALL
    SELECT 'YHOO' Symbol UNION ALL
    SELECT 'T' Symbol
)
SELECT * INTO #stocks FROM Stocks;
--SELECT * FROM #stocks

ROW_NUMBER()

This assigns a unique, sequential number to each row within a partition, in the order given by the ORDER BY clause. SQL Server requires an explicit ORDER BY inside the OVER() clause for ROW_NUMBER(), whatever the column's data type; if there is no natural ordering column, you can order by a constant expression, as shown later.

SELECT Symbol, ROW_NUMBER() OVER(ORDER BY Symbol) [ROW_NUMBER]
FROM #stocks
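
The OVER() clause also accepts PARTITION BY, which restarts the numbering for each group. A quick sketch against the same table; the numbering resets for each symbol:

SELECT Symbol, ROW_NUMBER() OVER(PARTITION BY Symbol ORDER BY Symbol) [ROW_NUMBER]
FROM #stocks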

RANK()

This behaves like ROW_NUMBER(), except that “equal” rows are ranked the same. If we substitute RANK() into the previous query:

SELECT Symbol, RANK() OVER(ORDER BY Symbol) [RANK]
FROM #stocks

As you can see, there are gaps between the ranks, because tied rows consume rank values. We can avoid those gaps by using the following:

DENSE_RANK()

DENSE_RANK() is a rank with no gaps, i.e. it is “dense”. We can write:

SELECT Symbol, DENSE_RANK() OVER(ORDER BY Symbol) [DENSE_RANK]
FROM #stocks

The best way to get a good understanding of these three ranking functions is to see them all in action side by side. Run this query:

SELECT
    SYMBOL,
    ROW_NUMBER() OVER(ORDER BY Symbol) [ROW_NUMBER],
    RANK() OVER(ORDER BY Symbol) [RANK],
    DENSE_RANK() OVER(ORDER BY Symbol) [DENSE_RANK]
FROM #stocks

Sometimes we don’t have a column to order by and we simply want to return row numbers using the ROW_NUMBER() function. Here is the same query, changed to order by a constant:

SELECT
	SYMBOL,
    ROW_NUMBER() OVER(ORDER BY (SELECT 1)) [ROW_NUMBER],
    RANK() OVER(ORDER BY (SELECT 1)) [RANK],
    DENSE_RANK() OVER(ORDER BY (SELECT 1)) [DENSE_RANK]
FROM #stocks

If you compare this result with the earlier one, you can see that RANK and DENSE_RANK treat the constant as one big tie, so the value is 1 in their respective columns, while ROW_NUMBER still numbers every row.

You can use any constant expression wrapped in a subquery in the ORDER BY clause; a bare constant such as ORDER BY 1 is rejected by SQL Server, which is why the subquery wrapper is needed:

order by (select 0)
order by (select 1)
order by (select null)
order by (select 'test')

The above works because, when you order by a constant, the query optimizer recognizes that the order does not matter and performs no actual sort.

Resources

https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15

https://stackoverflow.com/questions/44105691/row-number-without-order-by

Word searching/matching in SQL Server

This is a sample of how to match strings that are not exact and whose words appear in a different order. Typically the strings share similar digit patterns, but the words are ordered differently.

https://stackoverflow.com/questions/48380545/fuzzy-string-matching-sql-words-in-different-order

Another approach is TF*IDF, short for Term Frequency * Inverse Document Frequency. Here is a reference:

TF*IDF in C# example
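
For context, TF*IDF scores a term t in a document d as tf(t, d) * log(N / df(t)), where tf(t, d) is how often t occurs in d, N is the total number of documents (strings), and df(t) is the number of documents containing t. Words that are frequent in one string but rare across the whole set get the highest weight, which is what makes the scores useful for matching.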

The SSIS Fuzzy Lookup transformation has good support for this:

Fuzzy lookup using SSIS

The transaction log for database ‘SampleDb’ is full due to ‘LOG_BACKUP’.

The ETL process worked fine for the last three days. Today it started failing. The reason: LOG_BACKUP. The database was in the FULL recovery model and the transaction log got full.

As a matter of fact, a staging database like this should be in the SIMPLE recovery model.

First, to view the disk space occupied by the database, run this:

sp_helpdb SampleDb

To inspect the database files and change the recovery model, run this:

USE SampleDb
GO
SELECT * FROM sys.database_files

--Truncate the log by changing the database recovery model to SIMPLE
ALTER DATABASE SampleDb
SET RECOVERY SIMPLE
GO

--Shrink the truncated log file to 1MB
--(pass the log file's logical name from sys.database_files; SampleDb_log is typical)
DBCC SHRINKFILE (SampleDb_log, 1)
GO

--Reset the database recovery model, if required
/*
ALTER DATABASE SampleDb
SET RECOVERY FULL
GO
*/

If DBCC SHRINKFILE takes a long time, we can use the following command to see its progress:

select * from sys.dm_exec_requests

It exposes percent_complete and estimated_completion_time columns. These columns are not populated for every operation, but they are for a shrink. You can find the row for your connection while the shrink is running and inspect the values to get an estimate of the completion time. If the values are not changing, you’ll need to investigate whether the process is blocked or blocking something else.
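
To narrow the output to requests that actually report progress, a small filter helps (these are real columns of sys.dm_exec_requests):

SELECT session_id, command, percent_complete, estimated_completion_time
FROM sys.dm_exec_requests
WHERE percent_complete > 0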

Upon checking the database’s log file growth settings, I found the log file was limited to 1GB of growth. When the job ran and asked SQL Server to allocate more log space, the growth limit refused it, which caused the job to fail. I changed the log file to grow by 50MB with unlimited growth, and the error went away.
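
The same change can be scripted. This assumes the log file's logical name is SampleDb_log; check sys.database_files for the actual name:

--grow the log by 50MB with no upper limit
ALTER DATABASE SampleDb
MODIFY FILE (NAME = SampleDb_log, FILEGROWTH = 50MB, MAXSIZE = UNLIMITED)
GO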

Resources

https://stackoverflow.com/questions/21228688/the-transaction-log-for-database-is-full-due-to-log-backup-in-a-shared-host/21235001

Using SSIS to pull data in chunks from remote server

Recently I was hit by a server memory issue while running an SSIS package. I was pulling binary data from a remote server, but the production server was unable to process it because of limited memory.

There are two choices to resolve this. The first is to increase server memory. That wouldn’t really solve the problem, because the data will keep growing day by day.

The second is to split the load into multiple batches of, say, 300 records per batch. If there are 2,000 records, that means 7 round trips to the remote server to load the data: six full batches of 300 and a seventh batch of 200.

Here is the design:

I will be using the SQL OFFSET/FETCH feature and an SSIS Script Component for this. For an OLEDB example, see the link under Resources.

Declare five variables as follows:

1) vRowCount (Int32): stores the total number of rows in the source table
2) vRC_IncrementValue (Int32): stores the OFFSET value used in the chunk query
3) vRC_Increment (Int32): the running row counter that the For Loop advances by one chunk per iteration
4) vRC_ChunkValue (Int32): the number of rows in each chunk of data
5) vRC_BatchValue (Int32): the number of rows to FETCH in the current batch (trimmed to the remainder on the last iteration)

After declaring the variables, we assign a default value to the vRC_ChunkValue variable; in this example, we will set it to 100.

Our SELECT query inside the Script Component source looks like this:

string vSqlStatement = $@"
    SELECT *
    FROM [dbo].[tblSample]
    WHERE 1=1
    -- get chunks
    ORDER BY SampleID
    OFFSET {Variables.vRCIncrementValue} ROWS
    FETCH NEXT {Variables.vRCBatchValue} ROWS ONLY";

Next, add an Execute SQL Task to get the total number of rows from the source table, and change its ResultSet property to Single row.
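
The statement can be a simple count; assuming the same sample table as the chunk query:

SELECT COUNT(*) AS RowCnt FROM [dbo].[tblSample]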

Assign the return value to the vRowCount variable.

Next, add two Expression Tasks to copy values from the operating variables to the query variables:
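
The exact expressions depend on your setup; based on how the script below uses the variables, my assumption is that one task seeds the offset and the other seeds the batch size:

@[User::vRC_IncrementValue] = @[User::vRC_Increment]
@[User::vRC_BatchValue] = @[User::vRC_ChunkValue]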

Next, add a For Loop Container, with the following configuration:
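
The AssignExpression is the one referenced later in the Script Task; the InitExpression and EvalExpression here are my assumption of a typical way to drive the loop:

InitExpression:   @vRC_Increment = 0
EvalExpression:   @vRC_Increment < @vRowCount
AssignExpression: @vRC_Increment = @vRC_Increment + @vRC_ChunkValue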

Add a Data Flow Task inside the For Loop Container. Add a Script Component and configure it as a source.

Configure the data output on the Inputs and Outputs tab.

Configure the connection on the Connection Managers tab.

Click “Edit Script” and make these changes:

//requires "using System.Data.SqlClient;" at the top of the script
IDTSConnectionManager100 connMgr;   //SSIS connection manager
SqlConnection sqlConn;              //ADO.NET connection acquired from it
SqlDataReader sqlReader;            //streams the current chunk of rows

public override void AcquireConnections(object Transaction)
{
    //base.AcquireConnections(Transaction);
    connMgr = this.Connections.GoldenConn;
    sqlConn = (SqlConnection)connMgr.AcquireConnection(null);

}

public override void ReleaseConnections()
{
    //base.ReleaseConnections();
    connMgr.ReleaseConnection(sqlConn);
}

public override void PreExecute()
{
    base.PreExecute();

    //create sql statement
    string vSqlStatement = $@"
    SELECT *
    FROM [dbo].[tblSample] 
    WHERE 1=1
    -- get chunks
    ORDER BY SampleID
    OFFSET {Variables.vRCIncrementValue} ROWS
    FETCH NEXT {Variables.vRCBatchValue} ROWS ONLY";

    //MessageBox.Show(vSqlStatement);

    SqlCommand cmd = new SqlCommand(vSqlStatement, sqlConn);
    /*
    7200 sec = 120 min = 2 hours. This can be set to 0 for no timeout at all.
    Will this work? It also depends on server timeout settings. In most SQL Server
    installs, the default timeout for remote queries is 600 seconds (10 minutes).
    */
    cmd.CommandTimeout = 7200;
    sqlReader = cmd.ExecuteReader();
}

public override void PostExecute()
{
    base.PostExecute();
    sqlReader.Close();
}

public override void CreateNewOutputRows()
{
    try
    {
        while (sqlReader.Read())
        {
            //copy one source row into the output buffer
            SampleDataBuffer.AddRow();
            SampleDataBuffer.SampleID = sqlReader.GetString(0);
            SampleDataBuffer.AddDate = sqlReader.IsDBNull(7) ? null : sqlReader.GetString(7);
        }
    }
    catch (Exception ex)
    {
        //set to true to cause execution to abort
        bool cancel = false;
        //raise the error event to SSIS
        ComponentMetaData.FireError(-1, "CreateNewOutputRows()", ex.Message, "", -1, out cancel);
    }
}

Next, add a Script Task inside the For Loop Container to calculate the remaining rows.

Edit the script and add this:

// make sure all rows are accounted for
int rowCount = (int)Dts.Variables["User::vRowCount"].Value;
int rowIncrement = (int)Dts.Variables["User::vRC_Increment"].Value;
int rowChunkValue = (int)Dts.Variables["User::vRC_ChunkValue"].Value;

//this is our new offset value
rowIncrement = rowIncrement + rowChunkValue;
Dts.Variables["User::vRC_IncrementValue"].Value = rowIncrement;

//calculate remaining rows
int remainingRows = rowCount - rowIncrement;
//MessageBox.Show($"RowCount: {rowCount}\nRowIncrmenet: {rowIncrement}\nRowChunkValue{rowChunkValue}\nRemainingRows{remainingRows}");
if ((remainingRows <= rowChunkValue))
{
      //short circuit
      Dts.Variables["User::vRC_BatchValue"].Value = remainingRows;
      //for loop assign expression is [@vRC_Increment = @vRC_Increment + @vRC_ChunkValue], let's reverse this for last loop iteration
      Dts.Variables["User::vRC_Increment"].Value = rowIncrement - rowChunkValue;
}

Hope this will help.

Resources

Getting Data Chunks using OLEDB in SSIS