Training, Validation and Test Data in Machine Learning

Training Data

  • Purpose: Used to teach (train) the model.
  • Contents: Contains both input features and corresponding output labels (in supervised learning).
  • Usage: The model learns patterns, relationships, and parameters from this data.
  • Size: Typically the largest portion of the dataset (e.g., 70–80%).

Example:
If you’re training a model to recognize handwritten digits:

  • Input: Images of digits
  • Label: The digit (0–9)

Test Data

  • Purpose: Used to evaluate how well the model performs on unseen data.
  • Contents: Same format as training data (features + labels), but not used during training.
  • Usage: Helps assess model accuracy, generalization, and potential overfitting.
  • Size: Smaller portion of the dataset (e.g., 20–30%, or roughly 10–15% each for validation and test when a separate validation set is also held out).

Key Point: It simulates real-world data the model will encounter in production.

Validation Data

  • Purpose: Used to tune the model’s hyperparameters and monitor performance during training.
  • Contents: Same format as training/test data — includes input features and labels.
  • Usage:
    • Helps choose the best version of the model (e.g., best number of layers, learning rate).
    • Detects overfitting early by evaluating on data not seen during weight updates.
  • Not used to directly train the model (no weight updates from validation data).
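
The three roles above can be sketched as a simple shuffled split. This is a minimal illustration using only the Python standard library; the function name split_dataset and the 70/15/15 default fractions are assumptions chosen for the example, not a recommendation from any particular library.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle samples and split them into train, validation and test lists.

    The test set receives whatever remains after the train and validation
    portions are taken. All names and defaults here are illustrative.
    """
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and leave room for a test set")

    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test
```

Only the train list would be passed to the training loop; the validation list guides hyperparameter choices, and the test list is touched once at the end.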

Summary Table

Aspect | Training Data | Validation Data | Test Data
Used for | Training model | Tuning model | Final evaluation
Used during | Model training | Model training | After model training
Updates model? | Yes | No | No
Known to model | Yes | Seen during training | Never seen before

Tip:

In practice, for small datasets, we often use cross-validation, where the validation set rotates among the data to make the most of limited samples.
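
As a rough sketch of how the validation set rotates, the generator below yields train/validation index lists for each fold. It uses only the standard library; the name k_fold_indices is hypothetical, and the sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds.

    Every sample lands in the validation fold exactly once. When n_samples
    is not divisible by k, the first folds get one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size
```

A model would be trained once per fold on the train indices and scored on the validation indices; averaging the k scores gives a less split-dependent estimate than a single hold-out set.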

Typical Size Ranges for Small Datasets

Dataset Type | Number of Samples (Roughly)
Very Small | < 500 samples
Small | 500 – 10,000 samples
Medium | 10,000 – 100,000 samples
Large | 100,000+ samples

Why Size Matters

  • Small datasets are more prone to:
    • Overfitting – model memorizes data instead of learning general patterns.
    • High variance in performance depending on the data split.
  • Big models (e.g., deep neural networks) usually need large datasets to perform well.

💡 Common Examples

  • Medical diagnosis: Often < 5,000 patient records → small dataset.
  • NLP for niche domains: < 10,000 labeled texts → small.
  • Handwritten digit dataset (MNIST): 60,000 training images → medium-sized.

🔁 Tip for Small Datasets

If your dataset is small:

  1. Use cross-validation (like 5-fold or 10-fold).
  2. Consider simpler models (e.g., logistic regression, decision trees).
  3. Use data augmentation (e.g., rotate/scale images, reword texts).
  4. Apply transfer learning if using deep learning (e.g., pre-trained models like BERT, ResNet).

Export SonarQube issues (community edition) in Excel File

SonarQube Community Edition has no built-in way to export issues to an Excel file. Here are the steps to export them:

  1. Install Python from https://www.python.org/downloads. Go through the custom installation, specify a manual path (e.g. c:\Python313), and check all checkboxes.
  2. Verify the installation by running “python --version” in a command prompt.
  3. Clone the following repository from GitHub: https://github.com/talha2k/sonarqube-issues-export-to-excel
  4. Open the script (sonar-export.py) from the cloned “sonarqube-issues-export-to-excel” folder in a Python IDE (e.g. IDLE), edit the SonarQube URL, Project_Key and Token, and save the file.
  5. Run the script:

python sonar-export.py

The script will fetch the issues and save them to an Excel file named sonarqube_issues.xlsx.
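
The repository's script is not reproduced here. For readers curious about what such an export involves, the following is a minimal sketch under stated assumptions: it calls the SonarQube Web API endpoint api/issues/search with the componentKeys, p and ps parameters (newer SonarQube releases prefer the components parameter), sends the token as the basic-auth username, and writes a CSV file that Excel can open instead of a real .xlsx, so that only the standard library is needed. The function names, field selection and placeholder values are hypothetical.

```python
import base64
import csv
import json
import urllib.parse
import urllib.request

# Issue fields copied into the export (an illustrative selection).
FIELDS = ["key", "severity", "type", "rule", "component", "line", "message"]


def make_fetcher(base_url, token):
    """Return a function that fetches one page of issues from a SonarQube server."""
    auth = base64.b64encode(f"{token}:".encode()).decode()

    def fetch(project_key, page, page_size):
        query = urllib.parse.urlencode(
            {"componentKeys": project_key, "p": page, "ps": page_size}
        )
        request = urllib.request.Request(
            f"{base_url}/api/issues/search?{query}",
            headers={"Authorization": f"Basic {auth}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch


def collect_issues(fetch, project_key, page_size=500):
    """Page through the results until every reported issue has been collected."""
    rows, page = [], 1
    while True:
        data = fetch(project_key, page, page_size)
        issues = data.get("issues", [])
        rows.extend({field: issue.get(field, "") for field in FIELDS} for issue in issues)
        total = data.get("paging", {}).get("total", len(rows))
        if not issues or len(rows) >= total:
            return rows
        page += 1


def write_csv(rows, path):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Example usage (placeholder URL, token and project key):
# fetch = make_fetcher("http://localhost:9000", "your-token")
# write_csv(collect_issues(fetch, "your-project-key"), "sonarqube_issues.csv")
```

Passing the fetch function in as a parameter keeps the paging logic testable without a running server.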

Hope this will help someone.

Install SonarQube using Docker

Sonar is a popular open-source platform for continuous inspection of code quality. One of the easiest ways to install and use Sonar is to use Docker, a containerization platform that makes it easy to deploy and manage applications.

Prerequisites

Before getting started, you will need to have Docker installed on your machine. If you do not have Docker installed, you can download and install it from the Docker website.

Step 1: Pull the Sonar Docker Image

The first step in installing Sonar with Docker is to pull the Sonar Docker image from the Docker Hub repository. To do this, open a terminal or command prompt and run the following command:

docker pull sonarqube

This will download the latest version of the Sonar Docker image to your machine.

Step 2: Create a Docker Network

Next, we need to create a Docker network that will allow the Sonar container to communicate with the database container. To create a Docker network, run the following command:

docker network create sonar-network

Step 3: Start a Database Container

Sonar requires a database to store its data. In this example, we will use a PostgreSQL database, but you can also use Microsoft SQL Server or Oracle if you prefer (MySQL is no longer supported since SonarQube 7.9). To start a PostgreSQL database container, run the following command:

docker run -d --name sonar-db --network sonar-network -e POSTGRES_USER=sonar -e POSTGRES_PASSWORD=sonar -e POSTGRES_DB=sonar postgres:15

Step 4: Start the Sonar Container

Once the database container is running, we can start the Sonar container. To do this, run the following command:

docker run -d --name sonar -p 9000:9000 --network sonar-network -e SONAR_JDBC_URL=jdbc:postgresql://sonar-db:5432/sonar -e SONAR_JDBC_USERNAME=sonar -e SONAR_JDBC_PASSWORD=sonar sonarqube

Step 5: Access the Sonar Dashboard

Once the Sonar container is running, you can access the Sonar dashboard by opening a web browser and navigating to http://localhost:9000. The default username and password are admin and admin, respectively.
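
The container can take a minute or more to finish starting, so the dashboard may not respond immediately. The sketch below polls the api/system/status endpoint until it reports UP. It is only an illustration: the function names, retry count and delay are assumptions, and the URL matches the port mapping used in Step 4.

```python
import json
import time
import urllib.request


def read_status(base_url):
    """Return the status string reported by api/system/status (e.g. "UP")."""
    with urllib.request.urlopen(f"{base_url}/api/system/status", timeout=5) as response:
        return json.load(response).get("status")


def wait_until_up(base_url="http://localhost:9000", attempts=30, delay=5.0,
                  probe=read_status, sleep=time.sleep):
    """Poll the server until it reports UP or the attempts run out."""
    for _ in range(attempts):
        try:
            if probe(base_url) == "UP":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body while starting.
            pass
        sleep(delay)
    return False
```

Calling wait_until_up() after Step 4 returns True once the server is ready to accept logins, or False if it never came up within the allotted attempts.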

When to use ConfigureAwait(true/false)

ConfigureAwait(false) is a C# feature used with async and await to control how the continuation of an async method is scheduled after an awaited operation completes.

Here’s a breakdown of when and why you might need to call ConfigureAwait(false):

Understanding Synchronization Context

  • What it is: A synchronization context (System.Threading.SynchronizationContext) represents a way to queue work (continuations) to a specific thread or threads.
    • UI Applications (WinForms, WPF, Xamarin) and classic ASP.NET on .NET Framework: UI frameworks have a single UI thread, and operations that update the UI must run on it. The synchronization context ensures that continuations after an await (unless configured otherwise) are posted back to this UI thread. Classic ASP.NET has no UI thread, but it installs a request-bound context that lets only one thread run at a time, so it is prone to the same deadlocks.
    • ASP.NET Core: By default, ASP.NET Core does not have a synchronization context that behaves like the UI one. It uses a more efficient internal mechanism for managing requests.
    • Console Apps / Library Code (typically): Often have no special synchronization context, or they use the default thread pool context.
  • How await uses it: When you await a Task, by default (without ConfigureAwait(false)), the runtime captures the current SynchronizationContext or, if there is none, the current TaskScheduler. When the awaited task completes, it attempts to post the remainder of the async method (the continuation) back to that captured context or scheduler.

When You SHOULD GENERALLY Use ConfigureAwait(false)

  1. Library Code (Most Common and Important Case):
    • If you are writing a general-purpose library (e.g., a NuGet package, a shared business logic layer, a data access layer) that is not specific to any UI framework or ASP.NET Core.
    • Reason: Your library code doesn’t know what kind of application will call it. If it’s called from an application with a restrictive synchronization context (like a UI app), and your library’s async methods always try to resume on that captured context, it can lead to:
      • Performance Issues: Unnecessary context switching back to the UI thread for non-UI work.
      • Deadlocks: Especially if the calling code is blocking on the async method (e.g., using .Result or .Wait(), which is an anti-pattern but can happen). The UI thread might be blocked waiting for your library method to complete, while your library method is waiting to get back onto the UI thread to complete. This is a classic deadlock.
    • Solution: Use ConfigureAwait(false) on all (or almost all) await calls within your library. This tells the runtime, “I don’t need to resume on the original context; any available thread pool thread is fine.”
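// In a library
using System.Net.Http;
using System.Threading.Tasks;

// DataClient and Process are hypothetical names used for illustration.
public class DataClient
{
    private readonly HttpClient _httpClient = new HttpClient();

    public async Task<string> GetDataAsync()
    {
        // "some_api_endpoint" is a placeholder; no need to resume on the original context
        var response = await _httpClient.GetStringAsync("some_api_endpoint")
                                        .ConfigureAwait(false);

        // Process the response (doesn't need original context)
        var processedData = Process(response);

        await Task.Delay(100).ConfigureAwait(false); // Another example

        return processedData;
    }

    // Stand-in for context-free processing work
    private static string Process(string response) => response.Trim();
}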

When You MIGHT NOT NEED ConfigureAwait(false) (or when ConfigureAwait(true) is implied/desired)

  1. Application-Level Code in UI Applications (e.g., event handlers, view models directly interacting with UI):
    • If the code after an await needs to interact directly with UI elements (e.g., update a label, change a button’s state).
    • Reason: You want the continuation to run on the UI thread’s synchronization context. This is the default behavior, so explicitly using ConfigureAwait(true) is redundant but not harmful. Omitting ConfigureAwait achieves the same.
// In a UI event handler (WPF shown; WinForms is analogous)
private async void Button_Click(object sender, RoutedEventArgs e)
{
    MyButton.IsEnabled = false; // UI update
    var data = await _myService.FetchDataAsync(); // This service call might use ConfigureAwait(false) internally
    // The continuation here will be back on the UI thread by default
    MyLabel.Content = data; // UI update (a WPF Label exposes Content, not Text)
    MyButton.IsEnabled = true; // UI update
}

  2. Application-Level Code in ASP.NET Core (e.g., Controllers, Razor Pages, Middleware):
    • Generally, not strictly needed: ASP.NET Core doesn’t have a SynchronizationContext that causes the same deadlock problems as UI frameworks or older ASP.NET. HttpContext and related services are accessible without needing to be on a specific “request thread” in the same way UI elements need the UI thread.
    • However, some developers still apply ConfigureAwait(false) out of habit or for consistency if their codebase also includes library projects where it is critical. It typically doesn’t hurt in ASP.NET Core and might offer a micro-optimization by avoiding an unnecessary check for a context.
    • If you do rely on HttpContext (e.g., HttpContext.User) after an await, it remains available to the continuation regardless of ConfigureAwait: ASP.NET Core has no SynchronizationContext to capture, and IHttpContextAccessor flows with the ExecutionContext, which ConfigureAwait(false) does not suppress.
  3. Console Applications:
    • Usually, console applications have no SynchronizationContext at all (continuations simply run on the thread pool). So, ConfigureAwait(false) is often not strictly necessary for preventing deadlocks.
    • However, if the console app uses libraries that might install a custom SynchronizationContext, or if you are writing code that might be reused in other contexts, using ConfigureAwait(false) can still be a good defensive measure.

Summary Table:

Context | Recommendation for ConfigureAwait(false) | Reason
General-Purpose Library Code | Strongly Recommended (Use it) | Prevent deadlocks, improve performance when called from context-sensitive environments (e.g., UI).
UI Application Code (Event Handlers, VMs) | Generally Not Needed (Default is fine) | You often need to return to the UI thread for UI updates.
ASP.NET Core Application Code | Optional / Good Practice | No UI-like SynchronizationContext causing deadlocks. Can be a micro-optimization or for consistency.
Console Application Code | Optional / Good Practice | Usually no restrictive context, but good for reusability or if custom contexts are involved.

Key Takeaway:

The most critical place to use ConfigureAwait(false) is in library code to make it robust and performant regardless of the calling application’s environment. In application-level code, the necessity depends on whether you need to return to a specific context (like the UI thread).

As of May 16, 2025, this guidance remains standard practice in the .NET ecosystem.

.NET Code Analysis with Roslyn Analyzers

.NET Compiler Platform (Roslyn) analyzers inspect your C# or Visual Basic code for style, quality, maintainability, design, and other issues. This analysis happens at design time in all open files.

Here are the key takeaways:

Maintainability index range and meaning

The 0-100 range is split 80-20 to keep the noise level low, so that only suspicious code is flagged. Visual Studio uses the following thresholds:

Index value | Color | Meaning
0-9 | Red | Low maintainability of code
10-19 | Yellow | Moderate maintainability of code
20-100 | Green | Good maintainability of code
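
To make the thresholds concrete, here is a small sketch that computes the index and maps it to a color. It assumes the formula documented for Visual Studio's metric, MAX(0, (171 - 5.2 * ln(Halstead Volume) - 0.23 * Cyclomatic Complexity - 16.2 * ln(Lines of Code)) * 100 / 171). The function names are hypothetical, and the inputs would normally come from the analyzer rather than being typed in by hand.

```python
import math


def maintainability_index(halstead_volume, cyclomatic_complexity, lines_of_code):
    """Compute the 0-100 maintainability index from three raw metrics.

    Assumes the rescaled formula described in the Visual Studio documentation;
    halstead_volume and lines_of_code must be positive.
    """
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(lines_of_code))
    return max(0.0, raw * 100 / 171)


def rating(index):
    """Map an index value to the color bands from the table above."""
    if index < 10:
        return "Red"
    if index < 20:
        return "Yellow"
    return "Green"
```

Because both logarithmic terms grow slowly, a method has to be quite large and complex before it leaves the green band, which matches the intent of keeping the noise level low.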

Code metrics – Class coupling

Class coupling, also known as Coupling Between Objects (CBO), measures how many classes a single class uses. It is usually discussed together with cohesion, since highly coupled code tends to have low cohesion:

“Module cohesion was introduced by Yourdon and Constantine as ‘how tightly bound or related the internal elements of a module are to one another’ [YC79]. A module has a strong cohesion if it represents exactly one task […], and all its elements contribute to this single task. They describe cohesion as an attribute of design, rather than code, and an attribute that can be used to predict reusability, maintainability, and changeability.”

The Magic Number
As with cyclomatic complexity, there is no limit that fits all organizations. However, S2010 does indicate that a limit of 9 is optimal:

“Therefore, we consider the threshold values […] as the most effective. These threshold values (for a single member) are CBO = 9[…].” (emphasis added)

Code metrics – Cyclomatic complexity

https://learn.microsoft.com/en-us/visualstudio/code-quality/code-metrics-cyclomatic-complexity?view=vs-2022

Cyclomatic complexity is defined as measuring “the amount of decision logic in a source code function” NIST235. Simply put, the more decisions that have to be made in code, the more complex it is.

The Magic Number
As with many metrics in this industry, there’s no exact cyclomatic complexity limit that fits all organizations. However, NIST235 does indicate that a limit of 10 is a good starting point:

“The precise number to use as a limit, however, remains somewhat controversial. The original limit of 10 as proposed by McCabe has significant supporting evidence, but limits as high as 15 have been used successfully as well. Limits over 10 should be reserved for projects that have several operational advantages over typical projects, for example experienced staff, formal design, a modern programming language, structured programming, code walkthroughs, and a comprehensive test plan. In other words, an organization can pick a complexity limit greater than 10, but only if it’s sure it knows what it’s doing and is willing to devote the additional testing effort required by more complex modules.” NIST235

As described by the Software Assurance Technology Center (SATC) at NASA:

“The SATC has found the most effective evaluation is a combination of size and (Cyclomatic) complexity. The modules with both a high complexity and a large size tend to have the lowest reliability. Modules with low size and high complexity are also a reliability risk because they tend to be very terse code, which is difficult to change or modify.” SATC

Putting It All Together
The bottom line is that a high complexity number means greater probability of errors with increased time to maintain and troubleshoot. Take a closer look at any functions that have a high complexity and decide whether they should be refactored to make them less complex.

Code metrics – Depth of inheritance (DIT)

Depth of inheritance, also called depth of inheritance tree (DIT), is defined as “the maximum length from the node to the root of the tree”.

High values for DIT mean a higher potential for errors; low values reduce it. At the same time, high values indicate greater potential for code reuse through inheritance, while low values suggest less code reuse through inheritance. Due to a lack of sufficient data, there is currently no accepted standard for DIT values.
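
The definition can be sketched in a few lines. The example below walks Python class hierarchies rather than .NET ones and treats the root type (object) as depth 0, which is an assumption; tools differ on whether the root itself is counted. The function name is hypothetical.

```python
def depth_of_inheritance(cls):
    """Return the longest path from cls up to the root type (object = 0)."""
    if cls is object:
        return 0
    # With multiple inheritance, follow the deepest base class.
    return 1 + max(depth_of_inheritance(base) for base in cls.__bases__)
```

For a chain such as Animal, Dog(Animal), Puppy(Dog), the function reports 1, 2 and 3 respectively under this convention.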

You can read the full article here:

https://learn.microsoft.com/en-us/visualstudio/code-quality/code-metrics-values?view=vs-2022

Add the following NuGet package to your project:

Microsoft.CodeAnalysis.NetAnalyzers

Integration in Azure DevOps

To integrate with Azure DevOps, follow this article:

https://secdevtools.azurewebsites.net/helpRoslynAnalyzers.html