
Enhanced discovery through ElasticSearch

Open paulu-aws opened this issue 5 years ago • 7 comments

paulu-aws avatar May 21 '20 17:05 paulu-aws

Hi @paulu-aws, what would the architecture look like for this? Do you have a diagram?

martyn-swift avatar Dec 12 '22 10:12 martyn-swift

@martyn-swift, Originally I had in mind an S3 trigger on the data-lake bucket that would invoke a Lambda to insert records into Elasticsearch. However, the cheat code for this is Quilt: https://quiltdata.com/
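For reference, the roll-your-own path would be roughly this shape. This is only a sketch; the endpoint, index name, and the assumption that the Elasticsearch Python client is packaged with the function are all placeholders, and auth/request signing for an Amazon-managed domain is omitted:

```python
import boto3
from elasticsearch import Elasticsearch  # assumed to be bundled with the Lambda package

# Placeholder endpoint; signing/auth for an Amazon-managed domain omitted for brevity.
es = Elasticsearch("https://search-my-domain.us-east-1.es.amazonaws.com")
s3 = boto3.client("s3")


def handler(event, context):
    """S3 put-event trigger: index basic object metadata for discovery."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        head = s3.head_object(Bucket=bucket, Key=key)
        es.index(
            index="data-lake-objects",  # placeholder index name
            id=f"{bucket}/{key}",
            document={  # older 7.x clients use body= instead of document=
                "bucket": bucket,
                "key": key,
                "size": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
            },
        )
```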

Full disclosure, I don't work for them; this is my own opinion and not my employer's: Quilt is an awesome product for this kind of use case. I'm not sure I'd try to implement my own indexing and Elasticsearch architecture if Quilt were an option.

paulu-aws avatar Dec 12 '22 17:12 paulu-aws

@paulu-aws, does the Glue job write to S3 in large batches? Would the job trigger per object? Is EventBridge an alternative for batching up S3 put events?

martyn-swift avatar Jan 17 '23 10:01 martyn-swift

@martyn-swift, part of the beauty of Glue is how dynamic frames abstract away this kind of detail, UNLESS you really want to control it. The format and format_options parameters of glueContext.write_dynamic_frame.from_options() (like block size) give you some control over how the write to S3 behaves. However, I'd advise against trying to outsmart it. The dynamic frame is going to do a much more efficient job mapping writes across your DPUs in parallel into S3 than anything someone might cook up. Live the dream. Let the dynamic frame do its job. That lets you focus on enrolling more datasets and less on plumbing.
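A minimal sketch of that write call, with the path and option values as placeholders:

```python
# Inside a Glue (PySpark) job script; glueContext and dynamic_frame assumed to exist.
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-bucket/curated/dataset/"},  # placeholder path
    format="parquet",
    format_options={
        "compression": "snappy",
        "blockSize": 128 * 1024 * 1024,  # optional tuning; the defaults are usually fine
    },
)
```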

paulu-aws avatar Jan 17 '23 14:01 paulu-aws

@martyn-swift, I'll also mention that if you are trying to plug the Glue job into an event-driven architecture, it's probably not a good idea to rely on S3 triggers as the message bus. S3 triggers describe write activity against an S3 bucket, not logical processing steps. Multi-file outputs, failed writes, object versions, etc. all become challenges when you rely on S3 triggers as an eventing mechanism. Better options, in order of most to least complex (IMO), would be AWS EventBridge, Amazon MQ, AWS Step Functions, and Amazon SNS. All of those services can be called directly from inside your Glue job using the Python (or Java) APIs to send messages over the duration of your job. For example, directly after the write_dynamic_frame call you'll know the files are written, which bucket, the key names, the format, and the options used, and you can pass that along to your preferred messaging bus, as in the sketch below.
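With EventBridge it would look something like this; the bus name, source, detail type, and detail fields are all placeholders:

```python
import json
import boto3

events = boto3.client("events")

# Publish a custom event right after the write_dynamic_frame call completes.
events.put_events(
    Entries=[
        {
            "EventBusName": "data-lake-events",        # placeholder bus name
            "Source": "data-lake.glue.enrollment",     # placeholder source
            "DetailType": "DatasetWriteCompleted",     # placeholder detail type
            "Detail": json.dumps(
                {
                    "bucket": "my-data-lake-bucket",
                    "prefix": "curated/dataset/",
                    "format": "parquet",
                }
            ),
        }
    ]
)
```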

paulu-aws avatar Jan 17 '23 14:01 paulu-aws

@paulu-aws, can you give me an example of the SNS call? Is it using boto3 with a call after the job.commit()?

martyn-swift avatar Jan 18 '23 08:01 martyn-swift

@martyn-swift,

The SNS call would look like any other Boto3 call you might see in the API basics examples. You just need to make sure your Glue job's execution role has IAM permission to publish to the SNS topic. You may eventually find yourself peppering in several .publish() calls to SNS topics over the course of your Glue job's write operations to provide updates or metrics.
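Something along these lines, dropped in after a write step; the topic ARN and message fields are placeholders:

```python
import json
import boto3

sns = boto3.client("sns")

# Publish a status message after a write step; the Glue job's execution role
# needs sns:Publish on this topic.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:data-lake-updates",  # placeholder ARN
    Subject="Dataset write completed",
    Message=json.dumps(
        {
            "bucket": "my-data-lake-bucket",
            "prefix": "curated/dataset/",
            "format": "parquet",
        }
    ),
)
```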

Just keep in mind that doing this starts you down a path of blending business workflow state with your data engineering code, which are probably best kept separate. As a one-off or short-term solution, SNS inside the Glue job is fine. But anything more really deserves a more sophisticated business workflow framework like AWS Step Functions. Here is a quick Step Functions state machine I mocked up in a few minutes that triggers a Glue workflow and publishes to SNS topics WITHOUT requiring any changes to the Glue job itself.

(screenshot of the mocked-up Step Functions state machine)
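Not the exact definition from the screenshot, but a simplified sketch of the same idea; the job name, topic ARN, and account details are placeholders, and it starts a Glue job directly via the optimized startJobRun integration rather than a full Glue workflow:

```python
import json

# Rough equivalent of the state machine: run a Glue job, then notify over SNS.
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "enroll-dataset"},  # placeholder job name
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:data-lake-updates",  # placeholder
                "Message": "Glue job finished",
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))  # paste into a Step Functions state machine
```

Retry/Catch blocks and a failure-notification branch are the natural next additions, and none of it touches the Glue job code.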

paulu-aws avatar Jan 18 '23 15:01 paulu-aws