Categorization Engine: Main challenges of applying ML and AI
While helping its customers deal with the new variety and volatility of data sources, CRIF has identified some recurring questions that arise when considering a new ML classification project. Here are some of them:
How much data is needed to create a categorization engine?
This is the most frequently asked question, and the answer is not straightforward because it depends on many factors. The fact that the data the categorization process works with is highly regulated makes things even more difficult.
In general, to create a categorization engine, the data sample must be “representative”:
- It should contain all the types of cases that need to be detected. For example, transaction data at the end of the year is completely different from a mid-year sample: pensions/salaries/mortgages/taxes have a natural internal periodicity that the model needs to capture.
- Transaction proportions, as well as transaction amounts, must be properly sampled over the whole period you want to cover.
- It must cover the different types of customers that the bank has, so that the algorithm is not biased toward either high-spending or low-spending customers.
An initial statistical check is recommended to verify that the data meets these requirements, so that the engine performs as expected in each possible scenario.
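One simple form such a statistical check could take is sketched below. It compares how a candidate sample is distributed over an attribute (month or customer segment) with a reference population, using the total variation distance. The field names `month` and `segment`, the toy data, and the choice of distance are assumptions made for illustration only; they are not CRIF's actual procedure.

```python
from collections import Counter

def share_by_key(records, key):
    """Return the share of records that fall into each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0 means identical distributions, 1 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def missing_values(sample, population, key):
    """Values of `key` present in the population but absent from the sample."""
    return sorted({r[key] for r in population} - {r[key] for r in sample})

# Hypothetical toy data: a full year of transactions for two customer segments,
# and a sample that only covers the first half of the year.
population = [{"month": m, "segment": s} for m in range(1, 13) for s in ("low", "high")]
sample = [r for r in population if r["month"] <= 6]

month_tvd = total_variation_distance(
    share_by_key(sample, "month"), share_by_key(population, "month")
)
segment_tvd = total_variation_distance(
    share_by_key(sample, "segment"), share_by_key(population, "segment")
)
uncovered_months = missing_values(sample, population, "month")

print("month distance:", round(month_tvd, 3))
print("segment distance:", round(segment_tvd, 3))
print("months not covered:", uncovered_months)
```

In this toy case the sample is balanced across customer segments but misses the second half of the year, which is exactly the kind of gap (end-of-year taxes, periodic payments) that the requirements above warn about.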
How many transactions is a categorization engine able to enrich? What percentage of the results are correct?
Evaluating performance is essential for the continuous improvement of the categorization engine. CRIF has therefore put a lot of effort into studying and defining state-of-the-art metrics to inspect every corner of the system’s algorithm. It has presented a summary of the most important metrics for multiclass classification problems to the scientific community (for more information, see the CRIF paper Metrics for Multiclass Classification: an Overview) and has developed accountability tools to study the algorithm.
Among all the metrics, the two most important KPIs from a business perspective are Coverage and Accuracy:
- Model Coverage is the percentage of transactions that can be classified by the model. The Coverage level is normally higher than 95% and is measured using the most recent production data. The transactions that cannot be categorized are mostly those in which the description and other fields leveraged by the categorization engine are empty or filled only with random strings or series of numbers.
- Model Accuracy is the percentage of transactions classified in the most appropriate category included in the Taxonomy. The CRIF Categorization Engine Accuracy level is higher than 90% and is also measured using the most recent production data. It’s important to remember that a realistic top performance value is around 93%-94% due to the ambiguous nature of the data used by the model.
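The two KPIs can be computed along the following lines. This is a minimal sketch under stated assumptions: an unclassified transaction is represented by `None`, and Accuracy is computed only over the transactions the model did classify (the text does not specify the denominator, so this is one possible reading). The category names and data are made up for illustration.

```python
def coverage(predictions):
    """Share of transactions that received a category (None = not classified)."""
    if not predictions:
        return 0.0
    return sum(p is not None for p in predictions) / len(predictions)

def accuracy(predictions, labels):
    """Share of classified transactions whose category matches the reference label.

    Assumption: unclassified transactions are excluded from the denominator.
    """
    classified = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    if not classified:
        return 0.0
    return sum(p == y for p, y in classified) / len(classified)

# Hypothetical predictions and manually assigned reference labels.
predictions = ["groceries", "transport", None, "taxes", "groceries"]
labels = ["groceries", "transport", "salary", "utilities", "groceries"]

cov = coverage(predictions)          # 4 of 5 transactions classified
acc = accuracy(predictions, labels)  # 3 of 4 classified transactions correct
print(f"coverage={cov:.0%} accuracy={acc:.0%}")
```

Keeping the two measures separate makes it visible when a model gains Accuracy simply by declining to classify difficult transactions, which would show up as a drop in Coverage.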
Does a categorization engine require maintenance?
Transaction data, by its very nature, is constantly evolving: new merchants enter the market every day, and spending habits can change dramatically (think of the impact of the pandemic on food deliveries and, more generally, online shopping). For this reason, the categorization engine should not be thought of as a static model, but as a product that needs to be constantly tuned and maintained to keep a high level of performance. CRIF models are frequently monitored and fine-tuned: this constant evolution allows the algorithms used by the categorization engine to be kept at the cutting edge of technology.
What is the best analytics approach to classifying a banking transaction using a categorization engine?
At first glance, rule-based classification systems appear more effective: you have absolute certainty of the results and full explainability. In practice, defining these rules and their hierarchy is not an easy task: a rule that maps any description containing the keyword “tax” to the “taxes” category could incorrectly categorize “taxi” as a tax instead of transportation. A rule-based system also raises performance issues, since rules must be processed one by one until a match is found, so the more rules there are, the greater the computation time.
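The “tax” versus “taxi” pitfall can be reproduced with a few lines of code. The sketch below uses a hypothetical ordered rule list in which the first match wins, first with naive substring matching and then with word-boundary regular expressions. The rules and the sample descriptions are invented for illustration.

```python
import re

# Naive rules: (keyword, category), evaluated in order, first match wins.
NAIVE_RULES = [("tax", "taxes"), ("taxi", "transportation")]

def classify_naive(description, rules=NAIVE_RULES):
    """Substring matching: 'tax' also matches inside 'taxi'."""
    text = description.lower()
    for keyword, category in rules:
        if keyword in text:
            return category
    return None

# Same rules, but each keyword must appear as a whole word.
WORD_RULES = [
    (re.compile(r"\btax(es)?\b"), "taxes"),
    (re.compile(r"\btaxi\b"), "transportation"),
]

def classify_word_boundary(description, rules=WORD_RULES):
    """Whole-word matching: rules are still scanned one by one until a match."""
    text = description.lower()
    for pattern, category in rules:
        if pattern.search(text):
            return category
    return None

naive_result = classify_naive("CITY TAXI RIDE 0042")          # 'taxes' (wrong)
fixed_result = classify_word_boundary("CITY TAXI RIDE 0042")  # 'transportation'
tax_result = classify_word_boundary("Income tax payment")     # 'taxes'
print(naive_result, fixed_result, tax_result)
```

Word boundaries fix this particular case, but every such refinement adds to the rule set and its ordering constraints, and both functions still scan the rules sequentially, so the cost of a lookup grows with the number of rules.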
A machine learning model can differentiate between ambiguous cases by using the other elements of the transactions, such as the description and the amount. Therefore, better classification results can be achieved when the available structured data is limited. In addition, artificial intelligence allows automation and scaling of the solution, with continuous learning over hundreds of millions of transactions, which would be impossible with human-defined rules alone.
Finally, CRIF’s experience over the past few years suggests that the most effective approach is a hybrid one: rules are more effective when rich metadata is available and can be used to uniquely associate a category with a specific value of a variable, while machine learning excels when the available information is scarcer and unstructured.
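A hybrid classifier of this kind could be wired together roughly as follows. This is only a sketch: the `mcc` field (merchant category code), the rule table, the confidence threshold, and the `toy_model` stand-in for a trained model are all hypothetical, and the order in which rules and model are consulted is an assumption rather than a description of the CRIF implementation.

```python
from typing import Callable, Optional, Tuple

# Hypothetical deterministic rules: a metadata value that maps to one category.
MCC_RULES = {"5411": "groceries", "4121": "transportation"}

def classify_hybrid(
    transaction: dict,
    ml_predict: Callable[[dict], Tuple[str, float]],
    min_confidence: float = 0.5,
) -> Tuple[Optional[str], str]:
    """Return (category, source), where source is 'rule', 'ml' or 'unclassified'."""
    mcc = transaction.get("mcc")
    if mcc in MCC_RULES:
        # Rich metadata available: a rule settles the category deterministically.
        return MCC_RULES[mcc], "rule"
    # Otherwise fall back to the model, which works on unstructured fields.
    category, confidence = ml_predict(transaction)
    if confidence >= min_confidence:
        return category, "ml"
    return None, "unclassified"

def toy_model(transaction: dict) -> Tuple[str, float]:
    """Stand-in for a trained model; returns (category, confidence)."""
    text = transaction.get("description", "").lower()
    if "pizza" in text:
        return "restaurants", 0.9
    return "other", 0.2

by_rule = classify_hybrid({"mcc": "5411", "description": "STORE 17"}, toy_model)
by_model = classify_hybrid({"description": "PIZZA PLACE 123"}, toy_model)
unknown = classify_hybrid({"description": "0000 XYZ"}, toy_model)
print(by_rule, by_model, unknown)
```

Returning the source alongside the category keeps the rule-based decisions fully explainable while still letting the model handle transactions that carry no usable metadata.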
How is Artificial Intelligence used in the CRIF Categorization Engine?
- The CRIF Categorization Engine uses a hybrid combination of machine learning (ML) and a rules engine (RE) to understand and interpret the information contained in a banking transaction. The ML core does most of the work, leaving the rules to deal with specific and deterministic cases.
- The ML algorithms are based on supervised learning techniques, which learn to make predictions from a series of examples that are provided to the algorithm initially and then progressively. The learning process requires a training phase involving a user whose task is to read, understand and manually assign a category to a transaction. The user is guided through the process using an active learning approach that minimizes the human effort required.
- After this training phase, the algorithm can interpret the transactions and make predictions about the category to which they should be assigned, i.e., classifying them.
The CRIF Categorization Engine is made up of two separate components:
- Categorization Trainer: the component responsible for the machine learning training
- Categorization Classifier: the component responsible for making the model available for the production environment
How does the Categorization Trainer work?
The Categorization Trainer is a web application where the user can manually assign a category to a set of banking transactions. The labeled transactions are used by a supervised learning algorithm to create a prediction model.
Since labeling is the most time-consuming task, the Categorization Trainer provides a series of automated processes to reduce the labeling effort as much as possible. Banking transactions are usually similar to each other except for just a few fields, e.g., the transaction date. The first step in the training process is to identify similar transactions and group them.
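The grouping step can be illustrated with a small sketch. It assumes, purely for illustration, that two transactions are "similar" when their descriptions are identical once digits (dates, reference numbers) and punctuation are masked; the actual similarity criteria used by the Categorization Trainer are not described here. The `description` field and the sample transactions are hypothetical.

```python
import re
from collections import defaultdict

def normalize(description):
    """Mask the parts of a description that vary between similar transactions."""
    text = description.lower()
    text = re.sub(r"\d+", "#", text)        # dates, card numbers, references
    text = re.sub(r"[^a-z#]+", " ", text)   # punctuation and separators
    return " ".join(text.split())

def group_transactions(transactions):
    """Group transactions whose normalized descriptions are identical."""
    groups = defaultdict(list)
    for tx in transactions:
        groups[normalize(tx["description"])].append(tx)
    return dict(groups)

# Hypothetical transactions: the first two differ only in date and reference.
transactions = [
    {"id": 1, "description": "POS 12/03/2023 ACME MARKET 4411"},
    {"id": 2, "description": "POS 15/04/2023 ACME MARKET 9870"},
    {"id": 3, "description": "SALARY MARCH 2023 ACME CORP"},
]

groups = group_transactions(transactions)
group_sizes = sorted(len(members) for members in groups.values())
for key, members in groups.items():
    print(key, [tx["id"] for tx in members])
```

With grouping in place, a single manual label can be applied to every transaction in a group, which is where the reduction in labeling effort comes from.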
Once the transactions have been grouped, a set of groups is selected on the basis of predicted labels, if available, or of similar characteristics, and is passed to the users to be manually categorized. Once the groups have been manually categorized, the anomaly detection system analyzes the consistency between different transactions with similar features. The highlighted anomalies are sent back for an additional check. A new model is generated each time a selection process runs, using the categorized transactions. The model runs against the uncategorized transactions, and the ones categorized with the lowest confidence are sent back for the manual categorization step.
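Two pieces of this loop lend themselves to a short sketch: picking the lowest-confidence predictions for manual review, and flagging groups of similar transactions that received conflicting labels. The data layout (tuples of identifier, category and confidence), the function names, and the simple "more than one label per group" consistency test are assumptions for illustration, not the anomaly detection system actually used by the Categorization Trainer.

```python
from collections import defaultdict

def lowest_confidence(predictions, k):
    """Return the k predictions the model is least sure about.

    predictions: list of (transaction_id, category, confidence).
    """
    return sorted(predictions, key=lambda p: p[2])[:k]

def inconsistent_groups(labeled):
    """Return keys of groups whose members received different categories.

    labeled: list of (group_key, category) pairs from manual categorization.
    """
    categories_by_group = defaultdict(set)
    for group_key, category in labeled:
        categories_by_group[group_key].add(category)
    return sorted(k for k, cats in categories_by_group.items() if len(cats) > 1)

# Hypothetical model output on uncategorized transactions.
predictions = [
    ("t1", "groceries", 0.97),
    ("t2", "taxes", 0.41),
    ("t3", "transportation", 0.63),
    ("t4", "utilities", 0.35),
]
to_review = [tx_id for tx_id, _, _ in lowest_confidence(predictions, k=2)]

# Hypothetical manual labels, keyed by the group each transaction belongs to.
labeled = [
    ("pos # acme market", "groceries"),
    ("pos # acme market", "restaurants"),
    ("salary acme corp", "salary"),
]
anomalies = inconsistent_groups(labeled)
print(to_review, anomalies)
```

Sending only the least confident predictions and the conflicting groups back to the user concentrates manual effort on the cases where a new label is most likely to change the next model.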