OLake 4th Community Meetup

Summary

The community meetup focused on recent developments and features within the OLake ecosystem, with participants sharing their experiences and insights. Priyansh Khodiyar provided an overview of new functionalities, including a faster target writer for normalization, a new stats file for performance metrics, and the implementation of Docker Compose for MongoDB replica sets. He highlighted the introduction of the Split Vector Strategy, which allows users to fetch data based on chunk size or timestamp, and encouraged attendees to review the extensive documentation available for these updates. Further discussions included the addition of MySQL and Postgres as data sources in OLEC, driven by user demand. Shubham Baldava elaborated on the Iceberg Writer, currently integrated with the Debezium Iceberg Server for schema evolution, and noted that its basic features include upserts and concurrency improvements with future enhancements planned. The conversation shifted to CDC implementation strategy, where a write-ahead log approach is favored. Tushar Kesarwani proposed a federated cataloging system to manage multiple Glue catalogs efficiently, and Shubham agreed to explore it further. Additionally, Pankaj Kaushik introduced the idea of integrating blockchain technology for immutable data transaction records. Priyansh concluded the meetup with a demo on MongoDB replica sets and data syncing, urging community members to contribute to documentation and share feedback on schema evolution challenges in Iceberg.

Chapters & Topics

Community Introductions and Updates

Pankaj Kaushik shared his background as a software architect at Hitachi Venterra, focusing on data solutions for on-prem customers. Priyansh Khodiyar led the introductions, highlighting the global participation of community members and engaging them in conversation. The group discussed their locations and experiences while waiting for additional attendees.

New Features and Updates Overview

Priyansh Khodiyar discussed several new features released in the last two weeks, such as an improved parquet writer and a stats file that tracks various performance metrics. He highlighted the addition of Docker Compose for MongoDB replica settings and explained the Split Vector Strategy, which allows for data fetching based on chunk size or timestamp. These updates enhance the functionality and usability of the platform.

Updates on OLEC Features and Iceberg Writer Development

Priyansh Khodiyar discussed the progress of new features, including the integration of MySQL and Postgres as data sources, which were prioritized due to user requests. He also provided an update on the Iceberg Writer, which is currently developed in Java, with plans to transition to Rust for better memory management. Shubham Baldava clarified details about the releaser for creating Docker images and provided insights into the Iceberg Writer's functionality.

Catalog Support and CDC Implementation in Iceberg

Shubham Baldava outlined the support for REST catalogs, Hive, and Glue, indicating that documentation will be published soon. In response to Pankaj Kaushik's question about CDC implementation, Shubham explained that they are using a write-ahead log (WAL) approach for performance reasons, rather than notification methods that could overload the database.

Discussion on Federated Cataloging System

Shubham Baldava highlighted the Debezium server's support for REST catalog, Glue, and Hive. Tushar Kesarwani proposed the idea of implementing a federated cataloging system similar to Polaris, which would allow for seamless integration of multiple Glue catalogs. Shubham welcomed this suggestion and encouraged Tushar to create an improvement ticket for further exploration.

Compaction Strategy and GDPR Compliance

Shubham Baldava outlined plans to start researching data compaction in March, with a potential feature release around June. Tushar Kesarwani raised concerns about GDPR regulations, suggesting that instead of deleting snapshots, they should be moved to cold storage to avoid legal issues. Shubham agreed on the necessity of considering these compliance requirements in the compaction design.

Blockchain Solutions in Data Transactions

Pankaj shared insights on using blockchain to enhance data transaction security, referencing his work with IBM's Hyperledger Fabric. He highlighted how blockchain can prevent disputes in financial transactions by providing an immutable record of file transfers. Shubham expressed the need to familiarize himself with blockchain concepts before contributing further.

MongoDB Replica Set Demo

Priyansh Khodiyar provided a demo on configuring a MongoDB replica set with a docker-compose command, emphasizing the use of local builds for testing. He clarified the function of the dockerfile in creating a local environment to understand how OLIC operates. Ankit Kumar was acknowledged for his contributions as a backend developer.

Data Sync and Catalog Configuration Overview

Priyansh Khodiyar provided an overview of generating a streams.json file and syncing data streams. He demonstrated how to format the catalog, select collections to sync, and execute commands for dumping data into local storage or S3. Additionally, he highlighted the configuration requirements for S3 and showcased the speed of the data sync process.

Community Engagement and Contribution Opportunities

Priyansh Khodiyar opened the floor for questions and highlighted the resources available for community members, including benchmarking documentation for MongoDB and upcoming comparisons with Flink. Sandip Kumar Dey, a new participant, shared his experience with open source contributions and asked about the tech stack, to which Priyansh responded that they primarily use Go for the backend.

Discussion on MySQL Support and Schema Evolution Challenges

Priyansh Khodiyar confirmed that the MySQL implementation will support RDS and Aurora, addressing the growing demand for AWS compatibility. Tushar Kesarwani highlighted issues with schema evolution, noting that Iceberg struggles with major changes, such as renaming columns or altering nested data structures, which can disrupt data pipelines.

Action Items

Pankaj Kaushik will follow up on the discussion regarding GDPR and legal compliance related to data movements across different systems.
Priyansh Khodiyar will share the recording of the meeting on the community Slack channel for reference later.
Shubham Baldava will create tickets for the suggestions discussed during the meeting to facilitate further discussion and implementation.
Tushar Kesarwani will create issues on GitHub regarding the suggestions discussed in the meeting by next week.

Details

Summary

Slides

4th Community Meetup

Hosted By

Priyansh Khodiyar

Shubham Satish Baldava

Ankit Kumar

Summary

Chapters & Topics

Community Introductions and Updates

New Features and Updates Overview

Updates on OLEC Features and Iceberg Writer Development

Catalog Support and CDC Implementation in Iceberg

Discussion on Federated Cataloging System

Compaction Strategy and GDPR Compliance

Blockchain Solutions in Data Transactions

MongoDB Replica Set Demo

Data Sync and Catalog Configuration Overview

Community Engagement and Contribution Opportunities

Discussion on MySQL Support and Schema Evolution Challenges

Action Items

Ready to Join our next OLake community meetup?