SAP Data Intelligence: a technical analysis

When SAP Data Intelligence (SAP DI) – formerly known as SAP Data Hub – started, I was part of the group that developed it. However, since then, I have tried my best to forget that time. After six years of tiptoeing around the subject, I decided to put my knowledge to good use instead.

I’ve said it before: a product must target at least one business problem. According to the SAP product page, SAP DI targets three:

  • to “integrate and orchestrate massive volumes and flows of data at scale”,
  • to “streamline, operationalize and govern machine learning-driven innovation”,
  • and “optimize governance and minimize compliance risk with comprehensive metadata management rules.”

SAP DI for Data Integration

To integrate data, SAP DI must obviously have connectivity to the various systems in an IT landscape. For example, a customer may want to extract SAP ERP data and load it into an Oracle database. From a technical perspective, a tool can support loading a database at four levels, each one faster than the one before.

  1. Prepared statements. The insert statement is created once and then called multiple times with the different row values. Without it, the database would have to parse, validate, and create an execution plan for each row, which takes about 0.01 seconds per row. This means that instead of loading 10,000 rows per second, only about one hundred rows per second can be loaded.
  2. Array processing. When executing a prepared statement, the database driver provides the ability to pass multiple rows at once. This makes better use of the network, lets the database allocate free space once per batch instead of once per row, and enables many other internal optimizations. This translates into another performance boost of a factor of ten.
  3. No data conversion. If a date value is supplied as a string and placed in a datetime column, the database must convert it. It doesn’t take long, but when processing huge volumes at scale, it becomes a problem. It is therefore essential to transmit the data using the correct data type. It would be strange to read a date/time, convert it to a string, and then provide that string to the database for it to convert back, wouldn’t it?
  4. Support for database-specific bulk loaders. For loading large amounts of data, all databases provide a vendor-specific method to get the data in quickly. After the table segment is locked, the database writes the data directly to the database file. Everything else is bypassed. In Oracle this is called the Direct Path API. The problem for any data integration tool is that there is no standard. It must be implemented individually for each database vendor.
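
The first two levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module with an in-memory database as a stand-in for a real target such as Oracle, and the table name, columns, and helper functions are hypothetical. Levels three and four depend on the driver and the database vendor and are not shown.

```python
import sqlite3
import time


def make_rows(n):
    # Hypothetical sample data: (order_id, amount) as native int/float values.
    return [(i, i * 1.5) for i in range(n)]


def new_table(conn):
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")


def load_literal_sql(conn, rows):
    # No prepared statement: a new SQL text per row, parsed again every time.
    for order_id, amount in rows:
        conn.execute(f"INSERT INTO orders VALUES ({order_id}, {amount})")
    conn.commit()


def load_prepared(conn, rows):
    # Level 1: one parameterized statement, executed once per row.
    sql = "INSERT INTO orders VALUES (?, ?)"
    for row in rows:
        conn.execute(sql, row)
    conn.commit()


def load_array(conn, rows, batch_size=1000):
    # Level 2: the same statement, but a whole batch of rows per call.
    sql = "INSERT INTO orders VALUES (?, ?)"
    for start in range(0, len(rows), batch_size):
        conn.executemany(sql, rows[start:start + batch_size])
    conn.commit()


def timed(loader, rows):
    # Load into a fresh in-memory database and report row count and seconds.
    conn = sqlite3.connect(":memory:")
    new_table(conn)
    started = time.perf_counter()
    loader(conn, rows)
    elapsed = time.perf_counter() - started
    count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    conn.close()
    return count, elapsed


if __name__ == "__main__":
    rows = make_rows(50_000)
    for loader in (load_literal_sql, load_prepared, load_array):
        count, elapsed = timed(loader, rows)
        print(f"{loader.__name__}: {count} rows in {elapsed:.3f}s")
```

An in-memory SQLite database has no network round trips, so the gaps between the three loaders will likely be much smaller here than against a remote database. The point is the shape of the calls, not the absolute timings.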

Every data integration tool in the world supports all four levels described above – with the exception of SAP DI, which doesn’t even have an Oracle table loader, let alone a table loader for any other non-SAP database. It allows users to execute single SQL statements (Flowagent SQL Executor), but in most cases people write their own loaders, for example in Python, with support for the second level, array processing, at most. SAP Data Services, on the other hand, supports all four levels, like any other commonly used tool. Interestingly, SAP Data Services has been integrated into SAP DI as an engine – for reading data!

With this in mind, claiming that SAP DI can “integrate [..] massive volumes [..] of data at scale” is decidedly audacious.

Machine learning models

A few years ago, the SAP machine learning team provided their predefined models to SAP Data Hub. This also resulted in the name change to SAP Data Intelligence. ML projects suffered in two areas at the time, both solved by SAP DI. First, out-of-the-box templates for typical problems were not available. They had to be created manually. Second, the deployment of models, especially from test to production, was problematic. Today, SAP DI has templates for image classification, OCR and more (full list here), and every SAP DI graph can be put into production, including ML models. The popular way to build ML models without SAP DI was Apache Spark, which didn’t support any of the move-to-production options.

Today, TensorFlow is the most popular method for creating ML logic. Since SAP DI supports Python as a programming language, using TensorFlow is also the most popular method here (here is an example). Any Python runtime can be used for this, but the move to production is already well solved in TensorFlow itself. During development, a model object is configured with all the different layers of ML methods and then trained. This model can be exported and contains all the information needed to run it anywhere. The model even adapts to the environment it is running on. For example, training can take place on high-speed GPUs while the model is deployed on a smartphone and thus executes its logic on the hardware of the user’s device.

From this perspective, SAP DI today is just a complicated and expensive way to run ML models.

Metadata management

Any data-driven project has many sources, targets, and intermediate transformations. Managing the entire graph is the goal of metadata management. It’s such a fundamental requirement that every data integration tool already had it built in 15 years ago. Here, SAP DI lags behind what is common in the industry.
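
To make “managing the entire graph” concrete, here is a minimal sketch of a lineage graph with an impact analysis query. It is an assumption-laden illustration, not how SAP DI or any other tool implements metadata management. The class `LineageGraph`, its methods, and all node names are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical lineage store: directed edges from a source to a target."""

    def __init__(self):
        self._downstream = defaultdict(set)

    def add_flow(self, source, target):
        # Record that data flows from source into target.
        self._downstream[source].add(target)

    def impacted_by(self, node):
        # Breadth-first search over everything downstream of the given node.
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for nxt in self._downstream.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


if __name__ == "__main__":
    graph = LineageGraph()
    graph.add_flow("erp.orders", "job.clean_orders")
    graph.add_flow("job.clean_orders", "dwh.orders")
    graph.add_flow("crm.customers", "job.join")
    graph.add_flow("dwh.orders", "job.join")
    graph.add_flow("job.join", "dwh.sales_report")
    # Which objects are affected if the ERP orders table changes?
    print(sorted(graph.impacted_by("erp.orders")))
```

A question such as “what breaks if this source column changes?” then becomes a simple traversal over the recorded flows.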

Customers

Given the issues described above, it seems reasonable to ask: are customers even using the product? Finding reference customers is very difficult. Fun fact: of the three reference customers mentioned on the SAP product page, the first one says, “Machine learning using SAP Data Intelligence Cloud will be deployed [..]”. The second says, “Based on a proof of concept project [..] plans to deploy the SAP Data Intelligence solution”. The third reference customer mentions SAP DI once without practical context.

Some customers use SAP DI. However, the more interesting question is how they use it and whether there are better and cheaper options, considering licensing costs as well as hardware and administration costs.

My biggest concern is that the current situation is not sustainable. SAP’s development costs for the product are too high for the revenue it generates. Then again, this wouldn’t be the first product that SAP discontinues.

Sam D. Gomez