ContextAwareAI/Data Journey

Roberto Carlos Trujillo Torres
9 min readOct 5, 2024

--

As a data practitioner, over time I’ve seen many new hypes emerge: some destined to fade, and others poised to transcend. Last week, while chatting with a colleague, we touched on the point that data diversity remains a major pain point in the data journey for many companies, frustrating countless data teams around the globe. What if we could do something different? That was the question…

Problem statement:

Most of the time, in SMEs or Big Companies there should be at least two systems that need to be connected at some point to feed another business process that aims to create value for the organization. It’s not weird to see many systems in some companies, having information from the same context but in different foundation models, such as non-canonical models and governed by vendor models.

Imagine the next: Once upon a time, there was a company called Lorem Inc. They had two distinct data sources, creatively named DataSourceX and DataSourceY. Each came from different systems within the same company, updated every month like clockwork. Your task? Harmonize both data sources to feed critical processes. Sounds easy, right?

But as every end-of-month approaches, it quickly becomes a nightmare. New columns pop up, schemas no longer align, and the team’s energy starts to dwindle. The worst part? It’s all managed manually.

If that’s your reality — congratulations, you’re part of the club.

Eg.

DataSourceY — Contains information from transactions coming from a SAP system.

| BELNR   | BUZEI | BUKRS | BLART | BUDAT     | KUNNR  | LIFNR  | HKONT  | DMBTR  | WAERS | KOSTL  | SGTXT                                           |
|---------|-------|-------|-------|-----------|--------|--------|--------|--------|-------|--------|-------------------------------------------------|
| 2891078 | 4 | 1500 | KR | 2024-03-13| 72810 | | 432271 | 3824.24| GBP | CC845 | Diverse static knowledge user |
| 2627212 | 1 | 5293 | RE | 2024-09-21| | 73984 | 458561 | 587.53 | EUR | CC293 | Self-enabling contextually-based website |
| 2803796 | 9 | 2407 | AB | 2024-05-05| 21938 | | 424175 | 1082.17| AUD | CC370 | Horizontal analyzing time-frame |
| 2486874 | 10 | 9744 | SA | 2024-09-09| | 41989 | 483471 | 3763.78| GBP | CC404 | Re-engineered methodical policy |
| 2230769 | 10 | 2632 | KG | 2024-08-20| 63132 | 62019 | 488869 | 4414.74| AUD | CC668 | Organized object-oriented instruction set |
| 2522170 | 1 | 5436 | RE | 2024-04-11| | | 487091 | 3422.25| GBP | CC524 | Networked optimal projection |
| 2362248 | 10 | 5977 | KG | 2024-02-07| 44842 | 69882 | 477994 | 1281.21| GBP | CC654 | Down-sized bifurcated Graphic Interface |
| 2198062 | 7 | 3688 | RE | 2024-07-01| | | 496354 | 4832.83| GBP | CC214 | Fully-configurable fresh-thinking utilization |
| 2969508 | 8 | 2309 | KR | 2024-07-26| 63071 | | 451622 | 2668.59| AUD | CC544 | Secured zero-defect time-frame |
| 2819459 | 8 | 4213 | KR | 2024-10-01| 87264 | 55264 | 451552 | 1855.55| USD | CC123 | Automated upward-trending standardization |

DataSourceX — Contains information from transactions coming from a JDE system.

| DOCM  | DOCO   | DCT | AN8  | GLC  | TRDJ       | AID    | CRCD | CO   | AORG | SUB  | VR02                                          |
|-------|--------|-----|------|------|------------|--------|------|------|------|------|------------------------------------------------|
| 11001 | 558679 | RI | 7606 | 4992 | 2024-07-09 | 1974.79| GBP | 3230 | 5191 | 391 | extend viral relationships |
| 70268 | 565676 | DM | 4436 | 4325 | 2024-09-10 | 1769.1 | AUD | 5814 | 6607 | 546 | re-contextualize e-business e-business |
| 47745 | 561408 | RM | 9524 | 4582 | 2024-04-23 | 4387.62| GBP | 7238 | 1633 | 327 | transform next-generation architectures |
| 77543 | 594962 | DM | 7610 | 4246 | 2024-01-04 | 2787.3 | EUR | 2411 | 3691 | 637 | productize B2C web services |
| 65567 | 506442 | RI | 9411 | 4626 | 2024-07-12 | 2641.4 | USD | 8258 | 7913 | 912 | architect ubiquitous communities |
| 95974 | 592123 | PO | 8648 | 4962 | 2024-02-07 | 1395.62| AUD | 260 | 4031 | 274 | generate user-centric relationships |
| 96118 | 532650 | RI | 7785 | 4213 | 2024-09-04 | 3623.63| AUD | 961 | 6712 | 777 | deliver synergistic vortals |
| 18881 | 558523 | RI | 3410 | 4677 | 2024-09-26 | 3280.46| USD | 4728 | 6829 | 112 | syndicate holistic communities |
| 54833 | 586789 | PO | 7154 | 4433 | 2024-04-09 | 1965.28| USD | 1006 | 5015 | 144 | extend sticky methodologies |
| 32923 | 591835 | PO | 3226 | 4266 | 2024-04-29 | 3369.22| USD | 5421 | 4810 | 380 | scale virtual info-mediaries

Hypothesis

If there’s a data source coming from X system, and a data source coming from a Y system: we need to define a canonical model that will act as the glue between both dimensions.

We can manually do the mapping but, if you remember we want to explore the potential of the AI, in this case, to alleviate some of the load in this process and trust in the inference of an LLM or even better a RAG powered system that will make our life easier.

Experiment

  1. Generate some synthetic data for DataSourceY(SAP)
prompt: Generate some fake data for the dataSource x simulating a CSV coming 
from a SAP system. Be mindful of the desire, to have column naming according
to the most common implementation in the SAP ecosystem: do not go into the
canonical model.

2. Generate some synthetic data for DataSourceX(JDE)

prompt: Generate some fake data for the dataSource Y simulating a CSV coming 
from a JDE system. Be mindful of the desire, to have column naming according
to the most common implementation in the JDE ecosystem: do not go into the
canonical model.

3. Define the canonical model

Canonical Model for Transactions

| Field Name           | Data Type   | Description                                                                 | Example                 |
|----------------------|-------------|-----------------------------------------------------------------------------|-------------------------|
| transaction_id | String | Unique identifier for the transaction across systems (document number) | `TXN123456` |
| line_item_id | String | Line item identifier within a transaction | `001` |
| transaction_type | String | Type of transaction (e.g., invoice, payment, credit memo) | `Invoice` |
| company_code | String | Code identifying the company within the organization | `1000` |
| posting_date | Date | Date the transaction was posted to the accounting system | `2024-10-05` |
| document_date | Date | Date the document was created (if different from posting date) | `2024-10-01` |
| customer_vendor_id | String | Unique identifier for the customer or vendor | `CUST001` or `VEND001` |
| customer_vendor_type | String | Indicates whether the ID refers to a customer or a vendor | `Customer` or `Vendor` |
| gl_account | String | General ledger account associated with the transaction | `400100` |
| amount | Decimal | Monetary amount of the transaction | `1500.00` |
| currency | String | Currency in which the transaction is denominated | `USD` |
| cost_center | String | Cost center where the transaction is allocated | `CC100` |
| profit_center | String | Profit center where the transaction is allocated (if applicable) | `PC200` |
| tax_code | String | Tax code associated with the transaction | `T1` |
| payment_terms | String | Payment terms agreed upon for the transaction | `NET30` |
| reference_document | String | Reference to another related document (e.g., purchase order number) | `PO12345` |
| text | String | Free-form description or notes regarding the transaction | `Payment for goods` |
| status | String | Status of the transaction (e.g., pending, approved, completed) | `Approved` |
| created_by | String | User or system that created the transaction | `user123` |
| created_date | DateTime | Date and time the transaction was created | `2024-10-01 12:34:56` |
| updated_by | String | User or system that last updated the transaction | `user456` |
| updated_date | DateTime | Date and time the transaction was last updated | `2024-10-02 09:10:11` |

Magic comes here! Ask for the merge of two data sources based on the canonical model.

prompt: Based on the datasourceX, and the datasourceY i need from you to merge
both datasets in one single dataset. You should follow the objective to make
each column being compatible with the canonical model, eg. if there is a
column in x and y that matches the destiny column in the canonical you
should merge both just to end up in a final column with the header of the
canonical model. If the column does not match in the canonical model, do not
drop it, just put this to the end of the columns disregarding if this is coming
from x or y. Make sure you concatenate both sources into the a csv file,
while giving me the link to download the csv resultant.

Results — Et voila! Some seconds had passed and our friend chatGPT has nailed it…

ChatGPT 4o Console.

Just some processing after this and there you have!

| transaction_id | line_item_id | transaction_type | company_code | posting_date | customer_vendor_id | gl_account | amount  | currency | cost_center | text                                           | SUB_x |
|----------------|--------------|------------------|--------------|--------------|--------------------|------------|---------|----------|-------------|------------------------------------------------|-------|
| 11001 | 558679 | RI | 3230 | 2024-07-09 | 7606 | 4992 | 1974.79 | GBP | 5191 | extend viral relationships | 391 |
| 70268 | 565676 | DM | 5814 | 2024-09-10 | 4436 | 4325 | 1769.10 | AUD | 6607 | re-contextualize e-business e-business | 546 |
| 47745 | 561408 | RM | 7238 | 2024-04-23 | 9524 | 4582 | 4387.62 | GBP | 1633 | transform next-generation architectures | 327 |
| 77543 | 594962 | DM | 2411 | 2024-01-04 | 7610 | 4246 | 2787.30 | EUR | 3691 | productize B2C web services | 637 |
| 65567 | 506442 | RI | 8258 | 2024-07-12 | 9411 | 4626 | 2641.40 | USD | 7913 | architect ubiquitous communities | 912 |
| 95974 | 592123 | PO | 260 | 2024-02-07 | 8648 | 4962 | 1395.62 | AUD | 4031 | generate user-centric relationships | 274 |
| 96118 | 532650 | RI | 961 | 2024-09-04 | 7785 | 4213 | 3623.63 | AUD | 6712 | deliver synergistic vortals | 777 |
| 18881 | 558523 | RI | 4728 | 2024-09-26 | 3410 | 4677 | 3280.46 | USD | 6829 | syndicate holistic communities | 112 |
| 54833 | 586789 | PO | 1006 | 2024-04-09 | 7154 | 4433 | 1965.28 | USD | 5015 | extend sticky methodologies | 144 |
| 32923 | 591835 | PO | 5421 | 2024-04-29 | 3226 | 4266 | 3369.22 | USD | 4810 | scale virtual info-mediaries | 380 |
| 2891078 | 4 | KR | 1500 | 2024-03-13 | 72810 | 432271 | 3824.24 | GBP | CC845 | Diverse static knowledge user | |
| 2627212 | 1 | RE | 5293 | 2024-09-21 | | 458561 | 587.53 | EUR | CC293 | Self-enabling contextually-based website | |
| 2803796 | 9 | AB | 2407 | 2024-05-05 | 21938 | 424175 | 1082.17 | AUD | CC370 | Horizontal analyzing time-frame | |
| 2486874 | 10 | SA | 9744 | 2024-09-09 | | 483471 | 3763.78 | GBP | CC404 | Re-engineered methodical policy | |
| 2230769 | 10 | KG | 2632 | 2024-08-20 | 63132 | 488869 | 4414.74 | AUD | CC668 | Organized object-oriented instruction set | |
| 2522170 | 1 | RE | 5436 | 2024-04-11 | | 487091 | 3422.25 | GBP | CC524 | Networked optimal projection | |
| 2362248 | 10 | KG | 5977 | 2024-02-07 | 44842 | 477994 | 1281.21 | GBP | CC654 | Down-sized bifurcated Graphic Interface | |
| 2198062 | 7 | RE | 3688 | 2024-07-01 | | 496354 | 4832.83 | GBP | CC214 | Fully-configurable fresh-thinking utilization | |
| 2969508 | 8 | KR | 2309 | 2024-07-26 | 63071 | 451622 | 2668.59 | AUD | CC544 | Secured zero-defect time-frame | |
| 2819459 | 8 | KR | 4213 | 2024-10-01 | 87264 | 451552 | 1855.55 | USD | CC123 | Automated upward-trending standardization | |

To be continued….

--

--