Deep Feature Synthesis, the algorithm that will automate machine learning

Years ago, the obstacle to extracting knowledge from data, whether about the present or the future (predictions), was the lack of systems for storing and processing large amounts of information. That is no longer the case. The effort has therefore shifted to how to analyze the data in order to extract real value. Deep Feature Synthesis helps with this task: it is an algorithm that automates machine learning.

Machine learning is the set of processes whereby an algorithm makes predictions from data, and the result of each projection feeds back into the machine's own learning, improving future predictions: machines learn from their mistakes and successes. It is a branch of artificial intelligence applied in fields as diverse as banking (fraud detection), healthcare (hospital management) and retail (price optimization).

The algorithm's two creators, James Max Kanter and Kalyan Veeramachaneni, are prominent members of MIT's Computer Science and Artificial Intelligence Laboratory in Cambridge. They presented the project in a paper entitled 'Deep Feature Synthesis: Towards Automating Data Science Endeavors' (PDF), which summarizes the characteristics of their creation.

Deep Feature Synthesis does exactly what its name announces: it is an algorithm capable of automatically creating features from sets of relational data in order to synthesize the inputs to machine learning. The algorithm applies mathematical functions to the source data sets and transforms them into new groups with new, deeper features.

In this evolution of the source data, one can begin with simple variables such as gender or age and, at the end of the process applied by Deep Feature Synthesis, end up with features that enable deeper calculations, such as percentages.
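To make the idea concrete, here is a minimal sketch of depth-one feature synthesis using pandas. This is not the authors' implementation; the tables and column names are hypothetical, chosen only to show how aggregating a child table per parent row produces new, deeper features.

```python
import pandas as pd

# Hypothetical relational data: customers and their orders (illustrative only).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "age": [34, 45, 29],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13, 14],
    "customer_id": [1, 1, 2, 3, 3],
    "amount": [20.0, 35.0, 50.0, 10.0, 25.0],
})

# Depth-1 synthesis: aggregate the child table (orders) per parent row (customer).
per_customer = (
    orders.groupby("customer_id")["amount"]
    .agg(orders_count="count", amount_sum="sum", amount_mean="mean", amount_max="max")
    .reset_index()
)

# Join the synthesized features back onto the parent entity, turning raw
# variables (gender, age) into a richer, derived feature set.
features = customers.merge(per_customer, on="customer_id", how="left")
print(features)
```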

Deep Feature Synthesis and the Gaussian copula

This automated machine learning process is refined, according to its creators, using the Gaussian copula, a tool from probability theory. Many of the stages in a machine learning pipeline have parameters that require tuning to reach an appropriate result. The less tuning is done, the less predictive the model will be.

Small variations in that tuning can end in chaos: the enormous number of parameter combinations turns any minimal deviation at the beginning into a huge error at the end. In economic predictive models, this can mean billions of dollars. The Gaussian copula lets the algorithm's creators model the relationship between parameter choices and the performance of the model as a whole, and from there choose the parameters that optimize the result.
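The paper's actual machinery for this is a Gaussian Copula Process; as a rough illustration of the underlying idea only, the sketch below (toy data, NumPy/SciPy) transforms a hypothetical tuning history to normal scores, so that the dependence between a parameter and model performance collapses to a single correlation in the transformed space.

```python
import numpy as np
from scipy import stats

# Hypothetical tuning history: a parameter value and the score it produced.
rng = np.random.default_rng(0)
params = rng.uniform(0.0, 1.0, 50)
scores = np.sin(3 * params) + rng.normal(0, 0.1, 50)  # toy response surface

def to_normal_scores(x):
    """Gaussian copula idea: map each margin to normal scores via its ranks."""
    ranks = stats.rankdata(x) / (len(x) + 1)   # empirical CDF values in (0, 1)
    return stats.norm.ppf(ranks)               # inverse normal transform

u, v = to_normal_scores(params), to_normal_scores(scores)

# In the transformed (Gaussian) space, the parameter-performance dependence
# is summarized by one correlation coefficient.
rho = np.corrcoef(u, v)[0, 1]
print(f"copula correlation between parameter and score: {rho:.2f}")
```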

The Gaussian copula is also the statistical model that was supposed to help avert a major credit crisis like the one in 2008, and it plainly failed to do so. The method was used in VaR (Value at Risk) analyses, which measure the losses the market could sustain under normal conditions at a 95% confidence level. In other words, an investor with a one-million-euro portfolio would expect losses beyond the VaR threshold, say 25,000 euros, on only about one trading day in twenty (the remaining 5% of the confidence level). When the great international crisis broke out in 2008, losses soared far beyond those margins.
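As a worked illustration of that 95% confidence level (with made-up numbers, not real market data), a historical VaR is simply the 5th percentile of the daily profit-and-loss distribution:

```python
import numpy as np

# Illustrative daily P&L (in euros) for a hypothetical one-million-euro portfolio.
rng = np.random.default_rng(1)
daily_pnl = rng.normal(0, 15_000, 1_000)

# 95% VaR: the loss threshold exceeded on only ~1 trading day in 20.
var_95 = -np.percentile(daily_pnl, 5)
print(f"95% one-day VaR: {var_95:,.0f} EUR")
```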

Implementation of Deep Feature Synthesis

The Deep Feature Synthesis algorithm and its Data Science Machine are implemented on top of a MySQL database using InnoDB as the table engine, an open-source storage engine for this type of relational database. InnoDB replaces MyISAM, MySQL's previous table technology: it is more reliable, more consistent and more scalable, and therefore offers better performance.

All the data sets that Deep Feature Synthesis works with are converted into MySQL's data schema. The calculation logic and the management and handling of the features of all this information are done in Python, the most widely used programming language for designing and configuring data science processes.

Why did the creators of Deep Feature Synthesis use a relational database like MySQL? Because the algorithm's requirements match the way data are organized in this type of database. The Data Science Machine implements functions such as AVG(), MAX(), MIN(), SUM(), STD() and COUNT(). It also adds others for different kinds of operations on the data, such as LENGTH(), or WEEKDAY() and MONTH() to convert dates into the day of the week or the month on which they occurred.
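The sketch below shows rough pandas analogues of those functions; the event table and its columns are hypothetical, and this is an illustration rather than the Data Science Machine's actual MySQL code.

```python
import pandas as pd

# Hypothetical event log keyed to a parent entity (illustrative column names).
events = pd.DataFrame({
    "entity_id": [1, 1, 2, 2, 2],
    "value": [3.0, 7.0, 1.0, 4.0, 2.0],
    "note": ["ok", "retry", "ok", "fail", "ok"],
    "ts": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06",
                          "2016-01-11", "2016-01-12"]),
})

# Pandas analogues of AVG(), MAX(), MIN(), SUM(), STD() and COUNT().
agg = events.groupby("entity_id")["value"].agg(
    ["mean", "max", "min", "sum", "std", "count"]
)

# Row-level transforms analogous to LENGTH(), WEEKDAY() and MONTH().
events["note_length"] = events["note"].str.len()
events["weekday"] = events["ts"].dt.weekday   # 0 = Monday
events["month"] = events["ts"].dt.month
print(agg)
```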

These functions, plus the creation of filters, let the algorithm address two really important matters in predictive models (both are illustrated in the sketch after this list):

●      Applying functions only to the cases in which a given condition holds true, which is impossible without data filtering.

●      Constructing time-interval features, bounded both above and below by a cutoff date.

This makes it much easier to optimize database queries.
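A minimal sketch of both ideas, again with a hypothetical pandas event table rather than the machine's MySQL implementation:

```python
import pandas as pd

# Hypothetical event log, as in the previous sketch.
events = pd.DataFrame({
    "entity_id": [1, 1, 2, 2, 2],
    "value": [3.0, 7.0, 1.0, 4.0, 2.0],
    "status": ["ok", "retry", "ok", "fail", "ok"],
    "ts": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06",
                          "2016-01-11", "2016-01-12"]),
})

# 1) Apply a function only where a condition holds: sum of 'value'
#    restricted to rows whose status is "ok".
ok_sum = events[events["status"] == "ok"].groupby("entity_id")["value"].sum()

# 2) Build features over a time interval bounded above and below by
#    cutoff dates, e.g. the first week of January 2016.
window = events[(events["ts"] >= "2016-01-04") & (events["ts"] < "2016-01-11")]
weekly_count = window.groupby("entity_id")["value"].count()
print(ok_sum, weekly_count, sep="\n")
```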

The three processes of Deep Feature Synthesis

Data science based on Deep Feature Synthesis uses the three usual data processes to prepare predictive models (a minimal end-to-end sketch follows the list):

●      Data pre-processing: preliminary work with the data is essential before any machine learning. The parameters need to be reviewed in order to reject, for example, null values.

●      Feature selection and dimensionality reduction: the algorithm generates a large number of features for each entity, so a preliminary selection and reduction step is necessary.

●      Modeling: decision trees are used for data modeling.
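A minimal end-to-end sketch of the three steps, assuming scikit-learn and toy, randomly generated data (not the authors' pipeline):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthesized feature matrix with some null values (toy data).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12))
X[rng.random(X.shape) < 0.05] = np.nan
y = (X[:, 0] > 0).astype(int)  # toy target

# 1) Pre-processing: reject rows containing null values.
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# 2) Feature selection / dimensionality reduction: drop near-constant features.
X = VarianceThreshold(threshold=0.1).fit_transform(X)

# 3) Modeling: fit a decision tree, the model family named in the article.
model = DecisionTreeClassifier(max_depth=5).fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```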
