
I'm streaming data into a BigQuery table by building an InsertAllRequest, which is then inserted using the insertAll method from com.google.cloud.bigquery.BigQuery. I got it all to work in the sense that I can insert data into the table, but I'm after a specific behavior: I'd like to implement some kind of a "composite key" in the table.

Here's what the table looks like:

Field name      | Type      | Mode
--------------------------------------
order_id        | STRING    | REQUIRED
modified_ts     | TIMESTAMP | REQUIRED
order_sum       | INTEGER   | NULLABLE
order_reference | STRING    | NULLABLE

So, I'd like the key to be order_id and modified_ts; in other words, I'd like to be able to track changes of an order over time. If an existing key is inserted again, I'd hope for some error, or just ignoring this new row (regarding it as a duplicate) would work fine for me as well.

Unfortunately, I didn't yet succeed in telling BigQuery to do so. Here's the code I tested:

String rowId = String.valueOf("order_id, modified_ts");

InsertAllRequest req = InsertAllRequest.newBuilder(ORDER)
        .addRow(rowId, mapOrder(o, modifiedTs))
        .build();

InsertAllResponse resp = bigQuery.insertAll(req);
log.info("response was: {}", resp.toString());

ORDER in newBuilder is a TableId object and mapOrder(o, modifiedTs) maps the incoming object to a Map<String, Object>. All works fine if I define rowId as String.valueOf("order_id"), but then all updates of an order just update the existing row without generating any history. The solution above with comma-separated column names behaves the same way, simply ignoring modified_ts.
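
Simplified, mapOrder does something along these lines (the accessor names here are just placeholders, and the keys must match the schema above):

import java.util.HashMap;
import java.util.Map;

// Simplified sketch of mapOrder: builds the row content for insertAll.
private Map<String, Object> mapOrder(Order o, String modifiedTs) {
    Map<String, Object> row = new HashMap<>();
    row.put("order_id", o.getOrderId());
    row.put("modified_ts", modifiedTs);            // e.g. "2020-12-15 10:00:00 UTC"
    row.put("order_sum", o.getOrderSum());
    row.put("order_reference", o.getOrderReference());
    return row;
}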

So, my question is simply: how can I get this to work? What I want is - somewhat simplified - the following:

order_id | modified_ts | data
------------------------------------------
    1    | 2020-12-10  | some data
    1    | 2020-12-15  | some changed data
    2    | 2020-12-15  | some more data
Daniel.H

2 Answers


The composite key or UNIQUE concept doesn't exist in BigQuery. There are no keys or indexes.

1. Engineer your app so that it allows duplicates to be inserted.
2. On top of your table, create a view that reads the most recent row per record, based on the concept you already laid out.

This way you have access to versioned data as well, and you always have the up-to-date version by using the view in the FROM clause of a query.
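
A sketch of what such a view could look like, created with the same Java client (project, dataset and table names are placeholders); the base table keeps the full history, while the view always exposes the latest state of each order:

import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.ViewDefinition;

// One row per order_id: the one with the most recent modified_ts.
String viewQuery =
        "SELECT * EXCEPT(rn) FROM ("
        + " SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY modified_ts DESC) AS rn"
        + " FROM `my-project.my_dataset.orders`"
        + ") WHERE rn = 1";

ViewDefinition viewDefinition = ViewDefinition.newBuilder(viewQuery)
        .setUseLegacySql(false)   // the query above is standard SQL
        .build();
bigQuery.create(TableInfo.of(TableId.of("my_dataset", "orders_latest"), viewDefinition));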

Pentium10
  • Had a meeting with Google about this yesterday which basically confirms what you say. Given the character limit on comments, I'll write an answer to my own question with the information I got. Thanks! – Daniel.H Dec 18 '20 at 08:03

As written in my comment on Pentium10's answer, a meeting with a Google representative confirmed its content.

Basically, I misunderstood the functionality of adding a "rowId" to my row as a way of indicating its key: String rowId = String.valueOf("order_id, modified_ts"); The rowId is nothing more than what Google calls "best-effort de-duplication", and it's just that: a best effort, with no guarantee whatsoever. I mistook it for a technique to rely on, my bad.
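
To illustrate, if you do want to use that best-effort mechanism anyway, the insertId passed to addRow has to be the concrete key value of that particular row, not a list of column names; a sketch (assuming an accessor like getOrderId()):

// Best effort only: BigQuery de-duplicates identical insertIds for a short
// window of time; this is NOT a uniqueness guarantee.
String rowId = o.getOrderId() + "#" + modifiedTs;

InsertAllRequest req = InsertAllRequest.newBuilder(ORDER)
        .addRow(rowId, mapOrder(o, modifiedTs))
        .build();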

The recommended way to deal with this is in your own code, either before or after streaming into BigQuery. "Before" would mean implementing logic in your app that handles duplicates before writing data into BQ, which includes some way of keeping what you identify as keys in memory. "After" is what Pentium10 suggests: stream all the data into BigQuery, persist it, and take care of the rest afterwards.
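
A very simplified sketch of the "before" variant (single process, unbounded memory, so it only illustrates the idea; a real implementation would need a shared or persistent key store):

import java.util.HashSet;
import java.util.Set;

// Composite keys already streamed by this process.
private final Set<String> seenKeys = new HashSet<>();

void insertOrder(Order o, String modifiedTs) {
    String key = o.getOrderId() + "#" + modifiedTs;
    if (!seenKeys.add(key)) {
        return; // duplicate composite key, ignore it
    }
    InsertAllRequest req = InsertAllRequest.newBuilder(ORDER)
            .addRow(mapOrder(o, modifiedTs))
            .build();
    bigQuery.insertAll(req);
}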

There are three ways to solve this problem "after":

1. Views with (the very handy!) window functions may be a way, but remember that the processing power of the whole underlying query is needed every time you query the view.
2. Materialized views might be a solution, if/when Google supports window functions in them.
3. You create and update a table with the desired data yourself, managing some kind of scheduling.
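
As an illustration of the last option, a scheduled job could periodically merge new composite keys from the streaming table into a curated table (table names are placeholders):

import com.google.cloud.bigquery.QueryJobConfiguration;

// Inserts only rows whose composite key (order_id, modified_ts) is not yet
// present in the curated table; run this on whatever schedule suits you.
String mergeSql =
        "MERGE `my-project.my_dataset.orders_curated` T"
        + " USING (SELECT DISTINCT * FROM `my-project.my_dataset.orders`) S"
        + " ON T.order_id = S.order_id AND T.modified_ts = S.modified_ts"
        + " WHEN NOT MATCHED THEN INSERT ROW";

bigQuery.query(QueryJobConfiguration.newBuilder(mergeSql).build());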

I hope this answer helps clear things up a bit and serves as a complement to the one already provided.

Daniel.H