Learning Modules
If you are new to the technologies and tools that OLake relies on, start here before setting up a development environment or contributing. This page hosts learning modules covering the essential concepts and technologies you need to get started with OLake development.
Required Learning Modules
These modules are mandatory and should be completed before you start contributing to OLake. They are listed in the recommended order of priority.
1. Golang (Go)
OLake is primarily written in Golang, so having a solid grasp of Go makes it much easier to understand the codebase, contribute features, and debug issues.
- If you prefer videos: Watch this YouTube playlist for a step‑by‑step Go learning path (from basics to advanced topics):
- If you prefer reading and already know basic programming concepts: Follow this structured Go tutorial series:
2. Docker
OLake relies heavily on Docker for local development and for its Docker CLI workflow. Understanding Docker helps you:
- Run OLake containers correctly
- Understand volumes, networks, and images used in the examples
- Debug environment‑related issues
Recommended video:
3. ETL and ELT
OLake is fundamentally a data migration / ETL tool, so understanding ETL/ELT concepts is critical:
- What it means to extract, transform, and load data
- How data flows from operational systems into warehouses or lakehouses
- Why transformation might happen before (ETL) or after loading (ELT)
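The difference between the two orderings can be sketched in a few lines of Go. This is a minimal illustration, not OLake code: `Row`, `Warehouse`, `extract`, `transform`, `runETL`, and `runELT` are hypothetical names, and the "warehouse" is just an in-memory map.

```go
package main

import (
	"fmt"
	"strings"
)

// Row is a hypothetical record pulled from a source system.
type Row struct {
	Name string
}

// Warehouse is an in-memory stand-in for the destination.
type Warehouse struct {
	Tables map[string][]Row
}

// extract stands in for reading rows from an operational database.
func extract() []Row {
	return []Row{{Name: " alice "}, {Name: "BOB"}}
}

// transform normalizes names by trimming whitespace and lowercasing.
func transform(rows []Row) []Row {
	out := make([]Row, 0, len(rows))
	for _, r := range rows {
		out = append(out, Row{Name: strings.ToLower(strings.TrimSpace(r.Name))})
	}
	return out
}

// load writes rows into the named table.
func (w *Warehouse) load(table string, rows []Row) {
	w.Tables[table] = rows
}

// runETL transforms before loading, so only cleaned rows reach the destination.
func runETL(w *Warehouse) {
	w.load("users", transform(extract()))
}

// runELT loads raw rows first, then transforms them inside the destination.
func runELT(w *Warehouse) {
	w.load("users_raw", extract())
	w.load("users", transform(w.Tables["users_raw"]))
}

func main() {
	etl := &Warehouse{Tables: map[string][]Row{}}
	runETL(etl)

	elt := &Warehouse{Tables: map[string][]Row{}}
	runELT(elt)

	fmt.Println("ETL tables:", len(etl.Tables), "ELT tables:", len(elt.Tables))
}
```

In the ELT variant the raw table stays in the destination, which is why that pattern suits systems where the destination has enough compute to run transformations itself.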
Recommended reading:
4. Apache Iceberg
OLake writes data into Apache Iceberg tables; Iceberg is the core table format behind many OLake destinations. Understanding Iceberg helps you reason about:
- How OLake writes snapshots, data files, and metadata
- How schema evolution and partitioning work
- Why Iceberg is chosen over traditional table formats
Recommended video:
Key concepts to grasp:
- Snapshots and time‑travel
- Partitioning
- Manifest files and metadata
5. OLake Docs
Familiarizing yourself with OLake's official documentation is essential for understanding how to set up, configure, and work with OLake effectively. The documentation provides comprehensive guides on:
- Setting up a development environment
- Understanding OLake's architecture and data flow
- Configuring sources, destinations, and sync modes
- Using OLake CLI commands and flags
- Debugging and troubleshooting
Recommended starting point:
6. Change Data Capture (CDC)
OLake supports Change Data Capture (CDC) and incremental syncs. Knowing CDC concepts helps you understand:
- How OLake tracks inserts, updates, and deletes over time
- Why resume tokens, log positions, and state files (state.json) are important
- How incremental syncs differ from full refreshes
Recommended reading:
Focus on:
- The high‑level CDC flow (source logs → captured changes → target)
- Common CDC implementation patterns (log‑based CDC, triggers, etc.)
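The high-level flow (source log, captured changes, target, saved position) can be sketched in Go as follows. This is an illustrative toy under stated assumptions, not OLake's implementation: the `Change` and `State` types, the `last_position` field, and the `apply` function are hypothetical, and the JSON shown here is not the actual layout of OLake's state.json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Change is a hypothetical captured change event read from a source log.
type Change struct {
	Position int64  // monotonically increasing log position
	Op       string // "insert", "update", or "delete"
	Key      string
	Value    string
}

// State records how far the sync has progressed so a later run can resume.
type State struct {
	LastPosition int64 `json:"last_position"`
}

// apply replays changes onto the target, skipping anything at or before the
// saved position, and advances the position after each applied change.
func apply(target map[string]string, st *State, changes []Change) {
	for _, c := range changes {
		if c.Position <= st.LastPosition {
			continue // already applied in an earlier run
		}
		switch c.Op {
		case "insert", "update":
			target[c.Key] = c.Value
		case "delete":
			delete(target, c.Key)
		}
		st.LastPosition = c.Position
	}
}

func main() {
	changes := []Change{
		{Position: 1, Op: "insert", Key: "u1", Value: "alice"},
		{Position: 2, Op: "insert", Key: "u2", Value: "bob"},
		{Position: 3, Op: "update", Key: "u1", Value: "alicia"},
		{Position: 4, Op: "delete", Key: "u2"},
	}
	target := map[string]string{}

	// First run sees only the first two changes, then persists its state.
	st := &State{}
	apply(target, st, changes[:2])
	saved, err := json.Marshal(st)
	if err != nil {
		panic(err)
	}

	// Second run restores the state and replays the full log; positions
	// 1 and 2 are skipped, so only the new changes are applied.
	var resumed State
	if err := json.Unmarshal(saved, &resumed); err != nil {
		panic(err)
	}
	apply(target, &resumed, changes)

	fmt.Println(target, resumed.LastPosition)
}
```

The saved position is what makes the second run incremental: without it, the sync would have to start from the beginning, which is effectively a full refresh.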
Additional Resources (Optional)
The following tutorials and resources are not mandatory but are recommended for a deeper understanding of related technologies and concepts that can enhance your OLake development experience.
1. Data Lakehouse Architecture
Understanding the data lakehouse concept helps you appreciate how OLake fits into modern data architectures. A lakehouse combines the best of data lakes and data warehouses, enabling both structured and unstructured data processing.
Recommended video:
2. Apache Parquet File Format
OLake supports Parquet as a destination format. Knowing Parquet helps you understand:
- How columnar storage works and why it's efficient for analytics
- How OLake writes data in Parquet format
- The relationship between Parquet and Iceberg (Iceberg can use Parquet files)
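The idea behind columnar storage can be illustrated with a small Go sketch. It does not read or write real Parquet files and uses no Parquet library; `Order`, `OrderColumns`, `toColumns`, and `totalAmount` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// Order is a row-oriented record: all fields of one order sit together.
type Order struct {
	ID      int64
	Country string
	Amount  float64
}

// OrderColumns is the column-oriented layout: all values of one field
// sit together, which is the basic idea behind columnar formats.
type OrderColumns struct {
	IDs       []int64
	Countries []string
	Amounts   []float64
}

// toColumns pivots row records into per-column slices.
func toColumns(rows []Order) OrderColumns {
	cols := OrderColumns{
		IDs:       make([]int64, 0, len(rows)),
		Countries: make([]string, 0, len(rows)),
		Amounts:   make([]float64, 0, len(rows)),
	}
	for _, r := range rows {
		cols.IDs = append(cols.IDs, r.ID)
		cols.Countries = append(cols.Countries, r.Country)
		cols.Amounts = append(cols.Amounts, r.Amount)
	}
	return cols
}

// totalAmount touches only the Amounts column; the other columns are
// never read, which is why analytic scans benefit from this layout.
func totalAmount(cols OrderColumns) float64 {
	var sum float64
	for _, a := range cols.Amounts {
		sum += a
	}
	return sum
}

func main() {
	rows := []Order{
		{ID: 1, Country: "IN", Amount: 10},
		{ID: 2, Country: "DE", Amount: 25.5},
	}
	fmt.Println(totalAmount(toColumns(rows)))
}
```

An aggregate over one field only needs that field's values, so a columnar file lets a query engine skip the bytes of every other column.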
Recommended video:
3. PostgreSQL Fundamentals
While not required if you're only working with other sources, understanding PostgreSQL is valuable since it's one of the most commonly used sources with OLake. This knowledge helps you:
- Understand source database concepts and structures
- Better configure PostgreSQL connections and CDC settings
- Debug source-related issues
Recommended reading: