Cloud-Native Kubernetes Platform

Enterprise-grade Kubernetes infrastructure hosting 10+ critical applications for a major real estate group - from on-premise clusters to AWS EKS, spanning 5 years of continuous operations, incident management, and cloud migration.

2019 - 2024
~5 years
Technical Lead then Engineering Manager - Cloud Infrastructure
Kubernetes · AWS EKS · Docker · Helm · GitLab CI · Nginx Ingress · Varnish · Memcached · Tyk API Gateway · Cert-Manager · Let's Encrypt · Centreon · New Relic · MySQL · PostgreSQL · MongoDB · AWS S3 · AWS NLB · Bash · Linux

Applications Hosted

10+

Websites, PIM, APIs, batch jobs, gateways

Incidents Managed

40+

INC/SRQ/ISS tracked & resolved

Cluster Generations

2

On-premise K8s then AWS EKS

Infrastructure Lifecycle

5+

Jan 2019 to Mar 2024

Presentation

An enterprise Kubernetes platform for the digital transformation of real estate

The Kubernetes infrastructure of a major French real estate group hosted the entire portfolio of digital applications: the Akeneo PIM for product data management, the Export Ligneurs batch processing pipeline, all corporate websites (branded as PWR - Pichet Web Resources), the company Intranet, the PSR partner leads API, the Tyk API Gateway, and various microservices including a connected-housing IoT platform. The project spanned two major phases: on-premise Kubernetes clusters managed by Claranet (formerly Oxalide) from 2019 to 2021, followed by a full migration to AWS EKS (Elastic Kubernetes Service) in the eu-west-3 region from 2022 to 2024.

Project Nature

Multi-application Kubernetes infrastructure - 10+ containerized applications deployed across 2 cluster generations (on-premise then AWS EKS), with industrialized CI/CD pipelines, proactive monitoring, and managed hosting through Claranet.

Business Domain

Real Estate - the Pichet Group websites (pichet.fr, pichet-immobilier.fr, stock-invest.pichet.com, monespace.pichet.com) are the commercial front doors of the group. Any downtime directly impacts business revenue and brand reputation.

Platform Architecture
10+ applications orchestrated on AWS EKS with Nginx Ingress routing
Applications Hosted (by criticality)

Objectives, Context, Stakes & Risks

Understanding the strategic vision behind the infrastructure

Objectives
  • Ensure high availability of all critical applications (websites, PIM, APIs) with zero unplanned downtime during business hours
  • Industrialize deployments through GitLab CI + Helm, reducing deployment time from hours to minutes
  • Migrate to AWS EKS for improved scalability, resilience and managed Kubernetes benefits
  • Proactively monitor resources (CPU, memory, disk, SSL certificates) to prevent incidents
  • Manage secure access through SSH bastions, VPN tunnels, and certificate management
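The "industrialize deployments through GitLab CI + Helm" objective can be sketched as the deploy step such a CI job might run. App names, chart paths, and the values-file layout below are hypothetical, not the project's actual repository structure:

```shell
#!/usr/bin/env bash
# Sketch of the Helm deploy step a GitLab CI job could run.
# App names, chart paths, and values-file names are illustrative.
set -euo pipefail

deploy() {
  local app="$1" env="$2"
  local cmd="helm upgrade --install ${app}-${env} ./charts/${app}"
  cmd+=" --namespace ${env} -f ./charts/${app}/values-${env}.yaml"
  cmd+=" --atomic --timeout 5m"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"        # print the command instead of executing it
  else
    eval "$cmd"
  fi
}

# Example: show what a preprod deployment of the PWR sites would run
DRY_RUN=1 deploy pwr preprod
```

The `--atomic` flag makes Helm roll back automatically on a failed upgrade, which is one way a pipeline like this moves deployments "from hours to minutes" without sacrificing safety.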
Context & Stakes

The organization's entire digital presence depended on this infrastructure. With 5+ commercial websites serving thousands of daily visitors, a PIM system feeding product data to all channels, and batch processes synchronizing data with external partners, the stakes were considerable. The infrastructure team operated in a managed hosting model with Claranet, requiring close coordination between internal development teams and external operations engineers.

Identified Risks

Data loss during migration

Moving from on-premise to AWS EKS required careful data migration strategies, especially for persistent volumes and database connections.

Prolonged service interruptions

The commercial websites were critical business assets. Any extended downtime would directly impact sales and customer trust.

Certificate expiry & SSL misconfiguration

Multiple incidents (INC2009171, SRQ0409264) showed SSL was a recurring risk area, with misconfigured certificates causing HTTPS failures.

Resource exhaustion on cluster nodes

Recurring critical alerts for node memory, CPU load, swap, and inode exhaustion threatened cluster stability (July-October 2019 crisis).
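The certificate-expiry risk above can be guarded with a standalone check like the following sketch. In the project, cert-manager and Let's Encrypt handled renewal and Centreon carried the real alerting; the paths and thresholds here are illustrative, and GNU `date -d` is assumed:

```shell
#!/usr/bin/env bash
# Minimal certificate-expiry check (GNU date assumed).
# Paths and thresholds are illustrative, not the project's actual setup.
set -euo pipefail

days_left() {
  local cert="$1" end
  end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

check_cert() {
  local cert="$1" warn_days="${2:-14}" left
  left=$(days_left "$cert")
  if [ "$left" -lt "$warn_days" ]; then
    echo "WARN: $cert expires in $left day(s)"
    return 1
  fi
  echo "OK: $cert valid for $left more day(s)"
}
```

Run periodically (cron or a Kubernetes CronJob), a check like this surfaces expiring certificates well before the HTTPS failures seen in INC2009171 and SRQ0409264.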

On-premise vs AWS EKS Comparison

The Steps - What I Did

A concrete, phase-by-phase journey through the infrastructure lifecycle

Phase 1
Kubernetes On-Premise Setup & Operations
2019 - 2021
  • Gained Maintainer access on GitLab Docker/Nginx repositories and configured CI pipelines for containerized builds
  • Resolved first major incidents: Varnish PWR crashes (INC2009335), Memcached failures, rsync volume issues on Intranet (SRQ0345307)
  • Debugged a critical CI error deploying production config to preprod (SRQ0389097) - implemented environment safeguards
  • Configured Tyk API Gateway on Kubernetes: MongoDB databases setup (SRQ0506433), API routing for PIM (PIMUP23-168)
  • Managed the July-October 2019 node resource crisis: critical memory, CPU load, swap, and inode alerts across k8s.prod.kariba.fr
  • Created S3 bucket "kariba-assets" with read-only IAM user for asset storage and backup (November 2019)
  • Optimized PHP-FPM Docker image OPcache configuration for PIM Akeneo performance
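The environment safeguard added after the SRQ0389097 incident (production config deployed to preprod) can be sketched as a small guard that a CI deploy job runs first. The function and variable names are illustrative, not the project's actual CI configuration:

```shell
#!/usr/bin/env bash
# Sketch of an environment guard for a CI deploy job: refuse to proceed
# when the Helm values file does not match the target environment.
# Names and file layout are illustrative, not the actual CI config.
set -euo pipefail

guard_env() {
  local target_env="$1" values_file="$2"
  case "$values_file" in
    *"values-${target_env}.yaml")
      echo "OK: ${values_file} matches environment '${target_env}'"
      ;;
    *)
      echo "ABORT: ${values_file} does not match environment '${target_env}'" >&2
      return 1
      ;;
  esac
}

# Example: a preprod job must only ever see preprod values
guard_env preprod charts/pim/values-preprod.yaml
```

Because the guard exits non-zero on a mismatch, the CI job fails fast instead of shipping the wrong configuration to an environment.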
Phase 2
AWS EKS Migration & Stabilization
2022 - 2024
  • Requested and managed SSH bastion provisioning for AWS PROD & PREPROD environments (ISS-423267)
  • Resolved first EKS incidents: node-exporter heartbeat CRIT alerts (ISS-316691) indicating monitoring gaps in the new platform
  • Managed recurring EKS platform incidents during the stabilization period (ISS-329644, ISS-346412, ISS-346473)
  • Fixed CI/CD pipeline failures for Intranet & PSR on EKS preprod (ISS-392190) - adapted Helm charts for EKS compatibility
  • Addressed EKS AMI version warnings and coordinated node group updates with Claranet
  • Integrated with Azure DevOps organization "groupepichet" for cross-platform collaboration
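The bastion access pattern requested in ISS-423267 boils down to an SSH jump-host configuration. The host names, user, and 10.0.0.0/16 private subnet below are hypothetical stand-ins for the actual AWS PROD/PREPROD topology:

```shell
#!/usr/bin/env bash
# Sketch of the bastion (jump host) access pattern from ISS-423267.
# Host names, users, and the private subnet are hypothetical.
set -euo pipefail

cfg="${SSH_CONFIG_SNIPPET:-$(mktemp)}"
cat > "$cfg" <<'EOF'
Host bastion-prod
    HostName bastion.prod.example.com
    User ops
    IdentityFile ~/.ssh/id_ed25519

# Anything in the private subnet is reached through the bastion
Host 10.0.*.*
    ProxyJump bastion-prod
    User ops
EOF
echo "SSH bastion snippet written to $cfg"
```

With a snippet like this in `~/.ssh/config`, `ssh 10.0.12.34` transparently hops through the bastion, and every production access goes through one auditable entry point; a second `Host` block covers PREPROD the same way.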
CI/CD Deployment Pipeline
GitLab CI + Helm Charts - from merge request to production with manual validation gate
Infrastructure Lifecycle Timeline
5-year infrastructure lifecycle from on-premise K8s to AWS EKS
Migration Progress Over Time

The Actors - Interactions

A complex ecosystem of internal teams and external providers

The infrastructure management required constant coordination between the internal development team at Groupe Pichet and the managed hosting team at Claranet (formerly Oxalide). As the primary consumer of the Kubernetes platform (PIM, Export Ligneurs) and deployment supervisor, I served as the bridge between development needs and infrastructure operations, receiving all monitoring alerts and participating directly in incident resolution.

Franck C.

N+1, Web SI Lead

Infrastructure coordination, deployment validation, CI/CD industrialization strategy. Key quote: "We industrialized the deployment like other K8S projects (GitLab CI and Helm)."

Antoine D.

Claranet/Oxalide

Monitoring incident resolution, health pages implementation, infrastructure troubleshooting during on-premise phase.

Kevin W.

Claranet/Oxalide

Tyk Gateway incidents, infrastructure alert resolution, platform maintenance coordination.

Rémi P.

SI Marketing Project Manager

Tyk/MongoDB requests, infrastructure coordination for marketing platform needs.

Thomas R.

Kariba Developer

PIM contributions, K8s job verification, collaborative debugging of deployment issues.

Sébastien B.

Kariba Team

Export Ligneurs job verification on K8s prod/preprod, batch processing validation.

Results

Measurable impact for the organization and personal growth

Personal Growth
  • Deep expertise in Kubernetes operations: pod lifecycle management, resource limits, horizontal pod autoscaling, persistent volume claims, and ingress controllers
  • Hands-on AWS cloud skills: EKS cluster management, S3 configuration, IAM policies, NLB load balancers, and bastion architecture
  • Incident management maturity: developed systematic approaches to triaging, escalating, and resolving infrastructure incidents under pressure
  • CI/CD pipeline design: GitLab CI + Helm Charts deployment automation, environment-specific configurations, and deployment safeguards
  • Evolution from developer to infrastructure supervisor: this project fundamentally changed my understanding of the full software delivery lifecycle
Business Impact

Continuous availability

10+ critical applications maintained at high availability over 5 years

Incident resolution

40+ documented incidents (INC/SRQ/ISS) tracked and resolved, reducing mean time to recovery

Cloud migration

Successful transition from on-premise K8s to AWS EKS with minimal business disruption

Deployment industrialization

Fully automated CI/CD pipeline via GitLab CI + Helm, enabling reproducible and consistent deployments

Cost efficiency

Optimized resource allocation through monitoring, reducing unnecessary infrastructure spending

Incident Distribution by Category
Infrastructure Metrics Radar

Project Aftermath

Beyond migration - the long-term evolution of the platform

Immediate Aftermath

After the AWS EKS migration was fully stabilized, the infrastructure entered a mature operational phase. The monitoring setup with Centreon and New Relic provided proactive alerting, and the Helm-based deployment pipeline enabled teams to deploy with confidence. The bastion access management, while initially challenging (ISS-423267 remained open for months), eventually provided a secure and auditable access path to production systems.

Long-Term Evolution

The Kubernetes platform continued operating beyond my departure from the group in March 2024. The architectural decisions made during the initial setup - standardized Helm charts, automated cert-manager for SSL/TLS, clear namespace separation between environments - proved durable and enabled the infrastructure to scale with the organization's growing digital needs. The migration from on-premise to AWS EKS validated the cloud-first strategy and set the foundation for future cloud-native initiatives.

Today

Today, the infrastructure principles established during this project - containerization, orchestration, automated deployments, proactive monitoring - are industry standards. The experience of managing a 5-year infrastructure lifecycle, from initial setup through a major cloud migration, provides a unique perspective on the long-term implications of infrastructure decisions. The lessons learned directly inform my current approach to infrastructure-as-code and cloud architecture.

Critical Reflection

Honest retrospective on 5 years of infrastructure management

What Worked Well
  • The GitLab CI + Helm industrialization was a major success. Standardized deployment pipelines across all applications brought consistency and reliability that dramatically reduced deployment-related incidents after the initial setup period.
  • The dual-cluster strategy (preprod + prod) with clear environment separation prevented many potential production issues. The incident where production config was deployed to preprod (SRQ0389097) actually reinforced the importance of environment guardrails.
  • Proactive monitoring with Centreon + New Relic caught many issues before they impacted end users, transforming infrastructure management from reactive firefighting to proactive prevention.
Areas for Improvement
  • The resource planning during Phase 1 was insufficient. The July-October 2019 crisis with recurring node memory/load/swap alerts could have been avoided with better capacity planning and resource requests/limits on pods.
  • The bastion access management (ISS-423267) dragged on for too long. A more structured access management process with pre-provisioned SSH keys and automated rotation would have saved significant coordination overhead.
  • Documentation of infrastructure decisions and runbooks was inconsistent. Creating comprehensive runbooks for common incident patterns earlier would have accelerated onboarding and reduced resolution times.
What I Would Do Differently

With hindsight, I would have pushed for the AWS EKS migration earlier. The on-premise phase, while educational, consumed significant operational effort that managed Kubernetes would have eliminated. I would also have implemented GitOps practices (ArgoCD or Flux) from the start, and established infrastructure-as-code with Terraform for all cloud resources rather than relying on manual Claranet requests. Finally, I would have invested more in automated testing of Helm charts and Kubernetes manifests before deployment, catching configuration errors in CI rather than in production.
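The "catch configuration errors in CI rather than in production" idea above can be sketched as a lint step that runs before any deployment. The `charts/<app>/Chart.yaml` layout is an assumption about the repository structure:

```shell
#!/usr/bin/env bash
# Sketch of a CI step that lints every Helm chart before deployment, so
# chart errors surface in the pipeline rather than in production.
# The charts/<app>/Chart.yaml layout is an assumption.
set -euo pipefail

lint_charts() {
  local root="${1:-charts}" failed=0 chart
  for chart in "$root"/*/Chart.yaml; do
    [ -e "$chart" ] || continue          # glob matched nothing
    if ! helm lint "$(dirname "$chart")"; then
      failed=1                           # keep linting, fail at the end
    fi
  done
  return "$failed"
}
```

Pairing this with `helm template ... | kubectl apply --dry-run=client -f -` would additionally validate rendered manifests against the API schema, which is the kind of pre-deployment check this retrospective argues for.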

Lessons Learned
  • Infrastructure is never "done" - it requires continuous attention, monitoring, and evolution. The 5-year lifecycle taught me that the initial architectural decisions have compound effects over time.
  • The managed hosting model (Claranet) has both advantages (expertise, 24/7 support) and limitations (dependency, slower iteration). Understanding where to draw the line between managed and self-managed is a critical skill.
  • Incident management is a skill that can only be developed through real practice. The pressure of production incidents with business impact taught me composure, systematic thinking, and clear communication under stress.

Related journey

Professional experience linked to this achievement

Skills applied

Technical and soft skills applied

Image gallery

Project screenshots and visuals

CI/CD pipeline architecture diagram
CI/CD pipeline from GitLab to Kubernetes deployment
Monitoring stack dashboard overview
Prometheus, Grafana and alerting stack for cluster monitoring