EngineerJobs.io

Job Description

Bitus Labs invites applications for a Mandarin-speaking Senior Data Engineer to architect and scale an AWS-based data lakehouse, develop production-grade data pipelines in Java and Python, and steer data quality, governance, and platform decisions while mentoring engineers.

Responsibilities

  • Architect and scale a medallion-pattern data lakehouse on AWS S3 using Apache Iceberg, covering Bronze, Silver, and Gold layers.
  • Design and maintain high-throughput ETL and ELT pipelines with AWS Glue, EMR (Spark), and Lambda.
  • Implement schema evolution, partitioning strategies, and Iceberg table compaction to optimize storage and query performance.
  • Write production-ready pipeline code in Java and Python, selecting the language based on performance and maintainability needs.
  • Build and operate event-driven data pipelines using Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
  • Define stream-processing delivery semantics (exactly-once or at-least-once) using Apache Flink or Spark Structured Streaming on EMR.
  • Manage infrastructure as code with AWS CDK or Terraform to enable repeatable, auditable deployments.
  • Optimize cost and performance across AWS services including S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
  • Apply data security best practices: IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation access controls.
  • Develop and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or equivalent tools.
  • Implement data quality frameworks such as Great Expectations or Deequ and integrate validation steps into pipeline orchestration.
  • Define and enforce data contracts between producers and consumers.
  • Contribute to data cataloguing and lineage tracking via AWS Glue Data Catalog or Apache Atlas.
  • Collaborate with data scientists, ML engineers, and analysts to translate data needs into performant, well-documented datasets.
  • Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming sessions.
  • Document architectural decisions (ADRs) and contribute to the internal engineering knowledge base.

Requirements

  • Minimum 5 years of professional data engineering experience, including at least 3 years on AWS cloud platforms.
  • Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
  • Experience with data lakehouse architectures using the medallion pattern and open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).
  • Java proficiency (8+) for Spark jobs, Iceberg connectors, and performance-critical components; familiarity with Maven or Gradle.
  • Python proficiency (Python 3) for AWS Glue scripts, orchestration logic, data quality checks, and automation; experience with pandas, PySpark, boto3, and packaging best practices.
  • Storage and compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
  • Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK.
  • Orchestration: Step Functions, MWAA, or EventBridge Scheduler.
  • Querying: Athena, Redshift, or Redshift Spectrum.
  • Security and governance: IAM, KMS, Lake Formation, Secrets Manager, VPC.
  • DevOps: AWS CDK or CloudFormation; CodePipeline or equivalent CI/CD tools.
  • Experience with Apache Spark (PySpark and/or Spark Java API), including distributed transformations, performance tuning, and memory management.
  • Knowledge of Apache Iceberg features such as time travel, snapshot management, and partition evolution.
  • Strong SQL skills for data transformation, including window functions, CTEs, and query optimization.

Technologies

Core tools and platforms include Java and Python; AWS services (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation); IaC and CI/CD (CDK, Terraform, GitHub Actions, CodePipeline); big data engines (Spark, Flink, Spark Structured Streaming); table formats (Iceberg, plus familiarity with Delta Lake and Hudi); streaming and messaging (Kinesis, MSK, Firehose); orchestration (MWAA, Airflow, EventBridge); security and governance tools; and SQL, pandas, boto3, Apache Atlas, Great Expectations, Deequ, and related data tooling.

Benefits

  • 401(k) plan with matching contributions
  • Health, dental, and vision insurance
  • Life insurance coverage
  • Paid time off and parental leave
  • Retirement plan offerings

Compensation

Salary is USD 130,000 per year, plus the benefits listed above.

Location and Work Setup

  • Location: Irvine, California, onsite
  • Work arrangement: In person; onsite presence is required
  • Ability to commute: Irvine, CA 92618 (required)

Language

Mandarin Chinese proficiency is required for this role.

Tech Stack at a Glance

  • Languages: Java (8+), Python 3
  • Cloud: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
  • Processing: Apache Spark, Apache Flink, Spark Structured Streaming
  • Table formats: Apache Iceberg (primary); familiarity with Delta Lake, Hudi
  • Streaming: Kinesis, MSK, Kinesis Firehose
  • Orchestration: MWAA, Step Functions, Apache Airflow
  • IaC & CI/CD: AWS CDK, Terraform, GitHub Actions, CodePipeline
  • Related tooling: Maven, Gradle, PySpark, boto3, Great Expectations, Deequ, Apache Atlas
  • SQL: advanced querying with window functions, CTEs, and optimization
