We are using Power Apps with Azure Synapse Jobs to extract data from Dynamics 365 and incrementally push it into Azure Storage. The data syncing works well, but we are encountering issues with data consolidation.
We have an Azure Function App on a timer trigger that runs hourly. Instead of using a blob trigger, we list all timestamp folders that contain a model.json file and track each folder's status in an Azure Table as "Processed" or "Unprocessed". Each hour, the function consolidates the data from unprocessed folders into a target container in Azure Storage, appending only the new data so that we never recompute every timestamp folder from scratch. The overall flow looks roughly like the sketch below.
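For reference, this is a minimal sketch of the hourly entry point, assuming the Azure Functions Python v1 programming model (function.json with an hourly schedule such as "0 0 * * * *"). The connection string, container names, and the four helper functions are illustrative; the helpers are sketched under the matching steps further down.

```python
import azure.functions as func
from azure.data.tables import TableClient
from azure.storage.blob import ContainerClient

CONN_STR = "<storage-connection-string>"  # assumption: injected via app settings

def main(timer: func.TimerRequest) -> None:
    status_table = TableClient.from_connection_string(CONN_STR, table_name="ProcessingStatus")
    src = ContainerClient.from_connection_string(CONN_STR, container_name="dataverse-export")
    dst = ContainerClient.from_connection_string(CONN_STR, container_name="consolidated")
    for folder in find_candidate_folders(src):        # step 1
        if not is_unprocessed(folder, status_table):  # step 2
            continue
        consolidate_folder(src, dst, folder)          # step 3
        mark_processed(status_table, folder)          # step 4
```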
However, we are experiencing issues with missed records. Below is the detailed logic we are using:
Identify Root-Level Folders with model.json:
Traverse Azure Blob Storage and identify the folders at the root level that contain a model.json file.
Confirm that model.json exists and is larger than 0 bytes; a non-empty file is our signal that the data in that folder is ready for processing (see the sketch after this step).
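A sketch of this step using the azure-storage-blob SDK; the function name is illustrative:

```python
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import ContainerClient

def find_candidate_folders(container: ContainerClient) -> list[str]:
    """Return root-level folder names whose model.json exists and is non-empty."""
    candidates = []
    # walk_blobs with a delimiter yields root-level "virtual directories"
    for item in container.walk_blobs(delimiter="/"):
        if not item.name.endswith("/"):
            continue  # a root-level file, not a folder
        blob = container.get_blob_client(item.name + "model.json")
        try:
            props = blob.get_blob_properties()
        except ResourceNotFoundError:
            continue  # no model.json yet, so the folder is not ready
        if props.size > 0:  # an empty model.json may mean the export is still running
            candidates.append(item.name.rstrip("/"))
    return candidates
```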
Verify Timestamp and Processing Status:
Extract the timestamp from the folder's name using a regex pattern.
Consult the Azure Table (ProcessingStatus) to determine whether the timestamp has already been processed.
To confirm the data is ready for consolidation, check that model.json's last-modified time is within one hour of the folder's creation time. This matches the hourly cycle of our Azure Function and guards against flagging a folder for processing before its export has finished (the status lookup is sketched below; the time check itself is sketched under the questions at the end).
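A sketch of the status lookup, assuming the azure-data-tables SDK, a folder-name pattern such as "2024-05-01T13.00.00Z" (adjust TIMESTAMP_RE to your actual naming convention), and the folder name doubling as the RowKey:

```python
import re
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient

TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}\.\d{2}\.\d{2}Z")  # assumption

def is_unprocessed(folder_name: str, table: TableClient) -> bool:
    """True if the folder has no 'Processed' row in the ProcessingStatus table."""
    if not TIMESTAMP_RE.search(folder_name):
        return False  # not a timestamp folder; ignore it
    try:
        entity = table.get_entity(partition_key="folders", row_key=folder_name)
    except ResourceNotFoundError:
        return True  # never recorded, so it has not been processed yet
    return entity.get("Status") != "Processed"
```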
Data Consolidation for Unprocessed Folders:
For each folder marked as unprocessed:
Scan the folder and collect all of its CSV files.
Load each file into a pandas DataFrame and concatenate DataFrames that share a schema or table name.
Append the new data to the existing dataset in the target container, de-duplicating on the first column (which we assume is unique).
After consolidation, upload the updated DataFrame back to the target Azure Blob container (see the sketch after this step).
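A sketch of the consolidation step, assuming one CSV per table in each timestamp folder and that each CSV carries a header row (Synapse Link exports often omit headers and keep the schema in model.json, so read_csv may need adjusting). Names here are illustrative:

```python
import io
import pandas as pd
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import ContainerClient

def consolidate_folder(src: ContainerClient, dst: ContainerClient, folder: str) -> None:
    for blob in src.list_blobs(name_starts_with=folder + "/"):
        if not blob.name.endswith(".csv"):
            continue
        table_file = blob.name.rsplit("/", 1)[-1]  # e.g. "account.csv"
        new_df = pd.read_csv(io.BytesIO(src.download_blob(blob.name).readall()))
        try:
            old_bytes = dst.download_blob(table_file).readall()
            merged = pd.concat([pd.read_csv(io.BytesIO(old_bytes)), new_df],
                               ignore_index=True)
        except ResourceNotFoundError:
            merged = new_df  # first batch ever seen for this table
        # De-duplicate on the first column; keep="last" prefers the newest rows.
        merged = merged.drop_duplicates(subset=merged.columns[0], keep="last")
        dst.upload_blob(table_file, merged.to_csv(index=False), overwrite=True)
```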
Update Processing Status:
After consolidation, update the folder's row in the Azure Table to "Processed", stamped with the current time, so that subsequent runs take no further action on this batch (see the sketch below).
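A sketch of the status update, assuming the same PartitionKey/RowKey convention as in the status-lookup sketch above:

```python
from datetime import datetime, timezone
from azure.data.tables import TableClient, UpdateMode

def mark_processed(table: TableClient, folder_name: str) -> None:
    entity = {
        "PartitionKey": "folders",
        "RowKey": folder_name,
        "Status": "Processed",
        "ProcessedAt": datetime.now(timezone.utc).isoformat(),
    }
    # Upsert so a missing row is created and an existing one is replaced.
    table.upsert_entity(entity, mode=UpdateMode.REPLACE)
```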
We are wondering whether additional checks on the model.json file's size and modification time would improve our data-readiness verification. Specifically, should we:
Only read a timestamp folder if its model.json file is larger than 0 bytes?
Consider a timestamp folder for processing only once model.json's modified time is at least one hour after the folder's creation time, matching the one-hour interval configured for the Azure Synapse jobs in Power Apps? (Both checks are sketched below.)
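A sketch of both proposed checks. One assumption worth verifying: virtual folders in flat-namespace Blob Storage have no creation time of their own, so the folder's "creation time" is approximated here by the earliest creation_time of any blob under the folder prefix (on ADLS Gen2 you could read the directory's own properties instead):

```python
from datetime import timedelta
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import ContainerClient

def folder_is_ready(container: ContainerClient, folder: str) -> bool:
    try:
        props = container.get_blob_client(folder + "/model.json").get_blob_properties()
    except ResourceNotFoundError:
        return False  # model.json must exist at all
    if props.size == 0:
        return False  # check 1: model.json must be non-empty
    # Check 2: model.json must have been modified at least one hour after the
    # folder appeared, matching the one-hour Synapse export interval.
    earliest = min(b.creation_time
                   for b in container.list_blobs(name_starts_with=folder + "/"))
    return props.last_modified >= earliest + timedelta(hours=1)
```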
Could these additional checks help ensure data completeness and prevent the missing-records issue? Please review the detailed logic above and let us know if there is anything we are missing.
