Bitus Labs invites applications for a Mandarin Chinese-speaking Senior Data Engineer to architect and scale an AWS-based data lakehouse, develop production-grade data pipelines in Java and Python, and steer data quality, governance, and platform decisions while mentoring engineers.

Responsibilities

Architect and scale a medallion-pattern data lakehouse on AWS S3 using Apache Iceberg, covering the Bronze, Silver, and Gold layers.
Design and maintain high-throughput ETL and ELT pipelines with AWS Glue, EMR (Spark), and Lambda.
Implement schema evolution, partitioning strategies, and Iceberg table compaction to optimize storage and query performance.
Write production-ready pipeline code in Java and Python, selecting the language based on performance and maintainability needs.
Build and operate event-driven data pipelines using Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
Define stream-processing delivery semantics (exactly-once or at-least-once) using Apache Flink or Spark Structured Streaming on EMR.
Manage infrastructure as code with AWS CDK or Terraform to enable repeatable, auditable deployments.
Optimize cost and performance across AWS services including S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
Apply data security best practices: IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation access controls.
Develop and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or equivalent tools.
Implement data quality frameworks such as Great Expectations or Deequ and integrate validation steps into pipeline orchestration.
Define and enforce data contracts between producers and consumers.
Contribute to data cataloguing and lineage tracking via AWS Glue Data Catalog or Apache Atlas.
Collaborate with data scientists, ML engineers, and analysts to translate data needs into performant, well-documented datasets.
Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming sessions.
Document architectural decisions (ADRs) and contribute to the internal engineering knowledge base.

Requirements

Minimum 5 years of professional data engineering experience, including at least 3 years on AWS cloud platforms.
Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
Experience with data lakehouse architectures using the medallion pattern and open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).
Java proficiency (8+) for Spark jobs, Iceberg connectors, and performance-critical components; familiarity with Maven or Gradle.
Python proficiency (Python 3) for AWS Glue scripts, orchestration logic, data quality checks, and automation; experience with pandas, PySpark, boto3, and packaging best practices.
Storage and compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK.
Orchestration: Step Functions, MWAA, or EventBridge Scheduler.
Querying: Athena, Redshift, or Redshift Spectrum.
Security and governance: IAM, KMS, Lake Formation, Secrets Manager, VPC.
DevOps: AWS CDK or CloudFormation; CodePipeline or equivalent CI/CD tools.
Experience with Apache Spark (PySpark and/or Spark Java API), including distributed transformations, performance tuning, and memory management.
Knowledge of Apache Iceberg features such as time travel, snapshot management, and partition evolution.
Strong SQL skills for data transformation, including window functions, CTEs, and query optimization.

Technologies

Core tools and platforms include Java and Python; AWS services (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation); IaC and CI/CD (CDK, Terraform, GitHub Actions, CodePipeline); big data engines (Spark, Flink, Spark Structured Streaming); table formats (Iceberg, plus familiarity with Delta Lake and Hudi); streaming and messaging (Kinesis, MSK, Firehose); orchestration (MWAA, Airflow, EventBridge); security and governance tools; and SQL, pandas, boto3, Apache Atlas, Great Expectations, Deequ, and related data tooling.

Benefits

401(k) plan with matching contributions
Health, dental, and vision insurance
Life insurance coverage
Paid time off and parental leave
Retirement plan offerings

Compensation

Salary: USD 130,000 per year, plus the benefits listed above.

Location and Work Setup

Location: Irvine, California, onsite
Work arrangement: In person; onsite presence is required
Ability to commute: Irvine, CA 92618 (required)

Language

Mandarin Chinese proficiency is required for this role.

Tech Stack at a Glance

Languages: Java (8+), Python 3
Cloud: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
Processing: Apache Spark, Apache Flink, Spark Structured Streaming
Table formats: Apache Iceberg (primary); familiarity with Delta Lake, Hudi
Streaming: Kinesis, MSK, Kinesis Firehose
Orchestration: MWAA, Step Functions, Apache Airflow
IaC & CI/CD: AWS CDK, Terraform, GitHub Actions, CodePipeline
Related tooling: Maven, Gradle, PySpark, boto3, Great Expectations, Deequ, Apache Atlas
SQL: advanced querying with window functions, CTEs, and optimization
