Incremental data load from SAP to Snowflake is always a challenge in data warehouse projects. Now that many businesses have decided to move their on-premise data from SAP, SAP BW, SAP BW on HANA or S/4HANA to Snowflake for the various benefits Snowflake offers, incremental load is very much the need of the hour.
Traditional SAP connectors send the data in batches, covering both existing and new records. This approach is feasible for smaller datasets, but it is not an optimal solution for larger datasets, or for SAP tables that have no timestamp or changed-on date column. In such cases we need a connector built specifically for this purpose.
The SAP CDC connector, released for Azure Data Factory in 2022, addresses exactly this situation.
Architecture:
SAP CDC (Change Data Capture) is a connector available in Azure Data Factory that enables delta (incremental) loads. The SAP CDC solution consists of two parts.
First, on the SAP side, the ODP (Operational Data Provisioning) framework provides the extractors that deliver full or delta loads from SAP tables/ODP extractors.
Second, on the Azure side, ADF (Azure Data Factory) pipelines map the source to the sink. The sink can be Azure Blob Storage, Azure Data Lake or, in this case, Snowflake.
Fig: High-level Architecture of SAP CDC Solution
Steps:
SAP Side:
Run LTRC on the replication server to configure SLT.
Select RFC Connection
Scenario: ODP (Operational Data Provisioning)
Queue Alias: select the context for data extraction via ODP to Data Factory (e.g. SLTADP)
Run program RODPS_REPL_TEST to test the extraction for this RFC connection.
Azure side:
Set up a SHIR (Self-Hosted Integration Runtime). You will need the SAP source IP address and port numbers (3200, 3300).
Create a linked service and a source dataset (see: how to create a linked service).
Create an Azure Blob storage container (it will be used as temporary staging storage before the data is moved to Snowflake).
Create the sink as a Snowflake table (a sketch of the staged load into Snowflake follows this list).
Create a pipeline with a recurrence window of every 5 minutes.
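Under the hood, the Blob staging and Snowflake sink steps above amount to Snowflake copying the files that ADF drops into the staging container. The sketch below, written with the snowflake-connector-python package, is only meant to illustrate that staged load; the account details, SAS token, stage, table and column names are placeholder assumptions, and in the actual pipeline the ADF Snowflake connector issues the equivalent COPY commands for you.

```python
# Illustrative sketch of the staged load: Azure Blob staging -> Snowflake sink.
# Account, credentials, SAS token, stage/table/column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="MY_WH",
    database="MY_DB",
    schema="MY_SCHEMA",
)
cur = conn.cursor()

# External stage pointing at the Blob container that ADF uses for staging.
cur.execute("""
    CREATE STAGE IF NOT EXISTS sap_cdc_stage
      URL = 'azure://mystorageaccount.blob.core.windows.net/sap-cdc-staging'
      CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
      FILE_FORMAT = (TYPE = PARQUET)
""")

# Sink table for the extracted SAP data (columns are purely illustrative).
cur.execute("""
    CREATE TABLE IF NOT EXISTS sap_sales_orders (
        order_id     VARCHAR,
        order_date   DATE,
        amount       NUMBER(18, 2),
        _bic_zcustom VARCHAR
    )
""")

# Copy whatever files have been staged since the last run into the sink table.
cur.execute("""
    COPY INTO sap_sales_orders
    FROM @sap_cdc_stage
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

cur.close()
conn.close()
```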
The data load can be monitored on the SAP side via ODQMON, or on the Azure side through the pipeline run and activity logs.
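The same pipeline run information that the portal shows can also be queried programmatically, which is handy for alerting on failed delta loads. Below is a minimal sketch using the azure-mgmt-datafactory SDK; the subscription ID, resource group and factory name are placeholders.

```python
# Minimal sketch: list recent ADF pipeline runs to check the status of delta loads.
# Subscription ID, resource group and data factory name are placeholders.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Look at pipeline runs from the last 24 hours.
now = datetime.now(timezone.utc)
filters = RunFilterParameters(
    last_updated_after=now - timedelta(hours=24),
    last_updated_before=now,
)

runs = adf_client.pipeline_runs.query_by_factory(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    filter_parameters=filters,
)

for run in runs.value:
    # Status values such as Succeeded, Failed or InProgress help catch the case
    # where data looks staged in Blob storage but the pipeline actually failed.
    print(run.pipeline_name, run.run_id, run.status, run.run_start, run.run_end)
```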
Lessons Learnt:
The delta queue cannot be controlled from Azure Data Factory; it can only be managed on the SAP side.
The time taken to insert records for the initial full load (a one-time activity) is huge; the data needs to be loaded in smaller chunks.
LTRS may show that the data has been inserted into the Azure blob even though the pipeline has failed due to connection issues.
If the sink is a Snowflake table, manual mapping needs to be done in Azure Data Factory, where '/' needs to be replaced with '_' for custom fields coming from SAP (a helper sketch follows this list).
The job frequency of the pipeline needs to be optimized based on the expected data volumes.
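For the column-mapping lesson above, a small helper that derives the Snowflake column names from the SAP field names can reduce the manual effort in Azure Data Factory. This is only an illustrative sketch of the renaming rule; the field names used are hypothetical.

```python
# Illustrative sketch: derive Snowflake-safe sink column names from SAP field
# names by replacing '/' with '_' (SAP custom fields typically look like /BIC/ZFIELD).
def to_snowflake_column(sap_field: str) -> str:
    """Return a Snowflake-friendly column name for an SAP field name."""
    return sap_field.replace("/", "_")


def build_column_mapping(sap_fields: list[str]) -> dict[str, str]:
    """Map each SAP source field to its renamed Snowflake sink column."""
    return {field: to_snowflake_column(field) for field in sap_fields}


if __name__ == "__main__":
    # Hypothetical mix of standard and custom SAP fields.
    sample_fields = ["MATNR", "ERDAT", "/BIC/ZCUSTOM1", "/BIC/ZREGION"]
    for source, sink in build_column_mapping(sample_fields).items():
        print(f"{source} -> {sink}")   # e.g. /BIC/ZCUSTOM1 -> _BIC_ZCUSTOM1
```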
In case of any clarifications, do write to me at upgrade@rajeevjagatap.com