SEM/EDX Particle Clustering and Analysis

Practical Internship

February 2025 – June 2025

Overview

As part of our third year of study, we had to find and complete a practical internship with a company. By working in a real-world environment with real, often imperfect data, we as students learned not only how to act in a professional environment but also how to pivot and adapt solutions to the available data and resources. To provide value, we had to apply data-driven solutions in various fields. A big part of the challenge was also becoming familiar with the domain of the project in order to find the most appropriate solutions.

Following the advice of Dr. Alican Noyan, I started applying and sending open applications to companies with headquarters at the High Tech Campus in Eindhoven. One of the open applications went to Eurofins EAG, a materials science lab. My previous university experience studying robotics, together with my experience working with NPEC and Specifix, made me very interested in finding data-driven solutions for materials science applications that combine machine learning with both physics and chemistry. The interview presented the opportunity to work with SEM/EDX datasets in order to classify particles from samples, a topic I was eager to research further.

Datasets

Each row in the dataset represents the unified results for an individual particle. The columns consist of the physical and chemical properties of that particle.

Physical properties include:

Area (μm²): Area of the particle
Aspect Ratio: Aspect ratio of the bounding box
Breadth (μm): Width of the bounding box
Direction (°): Angle of the electron beam
ECD (μm): Equivalent Circular Diameter
Length (μm): Length of the bounding box
Perimeter (μm): Perimeter of the bounding box
Shape: A quantified approximation to the shape of the feature

Chemical properties include:

C Wt%: Carbon weight percent
N Wt%: Nitrogen weight percent
O Wt%: Oxygen weight percent
F Wt%: Fluorine weight percent
Mg Wt%: Magnesium weight percent
Al Wt%: Aluminum weight percent
Si Wt%: Silicon weight percent
P Wt%: Phosphorus weight percent

A weight percent column of this form exists for every element of the periodic table except H, He and Li.
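
To make the column layout concrete, the following minimal sketch separates the physical columns from the chemical ones by their " Wt%" suffix. The column names are copied from the two tables above. The helper names `split_columns` and `element_symbol` are hypothetical and only serve as illustration; a real export may name or order its columns differently.

```python
# Column names copied from the tables above; a real export may contain more elements.
COLUMNS = [
    "Area (μm²)", "Aspect Ratio", "Breadth (μm)", "Direction (°)",
    "ECD (μm)", "Length (μm)", "Perimeter (μm)", "Shape",
    "C Wt%", "N Wt%", "O Wt%", "F Wt%", "Mg Wt%", "Al Wt%", "Si Wt%", "P Wt%",
]

# Assumption: every chemical column ends with this suffix.
WEIGHT_SUFFIX = " Wt%"


def split_columns(columns):
    """Split column names into (physical, chemical) lists by the ' Wt%' suffix."""
    physical = [name for name in columns if not name.endswith(WEIGHT_SUFFIX)]
    chemical = [name for name in columns if name.endswith(WEIGHT_SUFFIX)]
    return physical, chemical


def element_symbol(column):
    """Return the element symbol of a chemical column, e.g. 'Mg Wt%' -> 'Mg'."""
    return column[: -len(WEIGHT_SUFFIX)]


physical, chemical = split_columns(COLUMNS)
elements = [element_symbol(name) for name in chemical]
```

Keeping the two groups apart makes it easy to cluster or classify on chemical composition alone while still using the physical properties for filtering.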

These datasets are all tied to samples sent in by customers. The data comes from the SEM/EDX process and is first run through the AZTEC software suite for preliminary analysis. From then on, everything is done to order. Customers can request different analyses, for example the number of steel particles per mm², or checks for oxidation, contamination, etc. All of these requests come down to counting and classifying a large number of particles, depending on the sample and the measurement area, a process that is currently done manually.
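
Once every particle has a class label, a request such as "steel particles per mm2" reduces to a count divided by the measured area. The sketch below shows that arithmetic. The function name `particles_per_mm2`, the class names and the measured area are all hypothetical illustration values, not data from an actual sample.

```python
from collections import Counter


def particles_per_mm2(labels, measured_area_mm2):
    """Return the particle density per class, in particles per mm²."""
    if measured_area_mm2 <= 0:
        raise ValueError("measured_area_mm2 must be positive")
    counts = Counter(labels)
    return {label: count / measured_area_mm2 for label, count in counts.items()}


# Hypothetical example: five labelled particles found in a 2 mm² measurement area.
example_labels = ["steel", "steel", "oxide", "steel", "oxide"]
densities = particles_per_mm2(example_labels, measured_area_mm2=2.0)
```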

Approach

Initially, it seemed like a basic classification problem, with features and classes on which to train a multi-class model. Unfortunately, while many datasets had been manually labelled, the data storage solution at Eurofins EAG is not intended for long-term storage of client data, especially since these labelled datasets are only a preliminary stage of the final report that is handed over. As a result, the labelled datasets were not available as training data.

This changed the project from a supervised to an unsupervised machine learning problem, which poses different challenges. I made this discovery in the first weeks of the project, while exploring and understanding the data.

As a solution, I arrived at the idea of creating a labelling tool aided by clustering algorithms. The tool also helps scientists perform preliminary cleaning of the data, analyze the clusters found and label entire datasets with ease.
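
The core idea of cluster-aided labelling can be sketched in a few lines: a clustering algorithm assigns each particle a cluster id, the scientist names each cluster once, and that name is propagated to every particle in the cluster. This is a minimal sketch, not the tool's actual implementation. The function `propagate_labels`, the class names and the use of -1 for unclustered particles (a convention borrowed from density-based clusterers such as DBSCAN) are assumptions.

```python
def propagate_labels(cluster_ids, cluster_names, fallback="unlabelled"):
    """Give each particle the name the scientist assigned to its cluster.

    cluster_ids   -- one cluster id per particle, as produced by a clustering step
    cluster_names -- mapping from cluster id to the name chosen by the scientist
    fallback      -- label for particles whose cluster has not been named
    """
    return [cluster_names.get(cluster_id, fallback) for cluster_id in cluster_ids]


# Hypothetical example: six particles in two named clusters plus one outlier (-1).
cluster_ids = [0, 0, 1, 0, -1, 1]
cluster_names = {0: "steel", 1: "oxide"}
particle_labels = propagate_labels(cluster_ids, cluster_names)
```

Naming two clusters here labels five particles at once, which is where the time saving over particle-by-particle labelling comes from.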

In the long term, this speeds up the labelling process. Additionally, with a more standardized approach to labelling, enough data can be gathered to train a classification model based either on the chemical composition of particles or on the trained parametric UMAP model.
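
To illustrate what a classifier on chemical composition could look like once labelled data is available, the sketch below uses a nearest-centroid classifier. This technique is chosen here only because it is short and dependency-free; it is not the model used or planned in the project. The weight percent values, the class names and the function names are invented for illustration.

```python
import math
from collections import defaultdict


def fit_centroids(compositions, labels):
    """Average the weight percent vectors of each labelled class."""
    grouped = defaultdict(list)
    for vector, label in zip(compositions, labels):
        grouped[label].append(vector)
    return {
        label: [sum(values) / len(values) for values in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, composition):
    """Return the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], composition))


# Invented training data: (C Wt%, O Wt%, Si Wt%) per particle.
train_compositions = [[80.0, 15.0, 5.0], [76.0, 19.0, 5.0], [5.0, 50.0, 45.0]]
train_labels = ["organic", "organic", "silica"]

centroids = fit_centroids(train_compositions, train_labels)
predicted = predict(centroids, [10.0, 52.0, 38.0])
```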

Results