ICONIQ Capital · San Francisco, CA

Mujtaba Jafri — IT Engineer · AI & Infrastructure

Currently building AI-powered cybersecurity and automated access management at ICONIQ Capital. Previously embedded with Meta's Physics AI research team — running a 64-node, 512-GPU NVIDIA DGX cluster and executing dual 10G→100G WAN/LAN upgrades while researchers kept shipping.

Years: 11+ · GPUs: 512 · Backbone: 100G · Meta rank: Top 5%
// experience

Where I've kept the lights on.

Eleven-plus years across AI research, fintech, enterprise, and healthcare. Each role chosen for the same reason: complex infrastructure, no excuse for downtime.

ICONIQ Capital

IT Engineer

Mar 2026 — Present
San Francisco, CA
Status: Current
  • Building custom automation for distribution list and access management with Slack approval integration — replacing manual IAM workflows with self-service tooling.
  • Integrating AI into cybersecurity operations for phishing detection and prevention — leveraging ML models to analyze email patterns and flag threats in real time.
  • Developing AI-powered documentation and knowledge base integration for Jira — enabling intelligent search and auto-suggested resolutions from historical tickets.
  • Full-spectrum IT engineering across a premier investment firm — endpoint management, network operations, security tooling, and infrastructure automation.
Automation · AI/ML · Cybersecurity · Slack API · Jira · DL/Access Management
Meta

Network & Infrastructure Engineer

Physics AI Research · via Insight Global
Aug 2025 — Feb 2026
Sausalito, CA
512 GPUs maintained · 64 DGX nodes · 175 QSFP modules terminated · 13+ unplanned outage hrs avoided
  • Primary on-site infra engineer for Meta's Physics AI research team — kept researchers running simulations without interruption.
  • Architected dual 10G→100G WAN and LAN upgrades: 14 switches, 175 QSFP modules, multi-team downtime orchestration.
  • Administered 64-node NVIDIA DGX fleet (512 GPUs): drive swaps, kernel rollbacks, crash diagnostics, RMA pipelines.
  • Delivered full data center PDU upgrade with zero risk to active research workloads.
  • Built cross-site network for motion capture + event production teams supporting hundreds of real-time users.
  • Authored Chef remediation scripts that eliminated recurring driver and OS-level failures across the compute fleet.
  • Led Windows 11 compliance migration for 117 lab machines — on time, zero workflow disruption.
  • Hardened air-gapped lab perimeters: compliance check-in, patch enforcement across cameras, 3D printers, lab gear.
  • Closed 464+ tickets spanning switch imaging, hardware swaps, on-call response, and storage provisioning.
NVIDIA DGX · 100G WAN/LAN · Chef · Cisco · Palo Alto · Juniper
Prosper Marketplace

Desktop Support Engineer

Sep 2023 — Jan 2025
San Francisco, CA
40%+ tickets handled · −15% resolution time · +20% team efficiency gain · 20-30 offboards / mo
  • Automated IAM workflows across AD and Azure AD with RBAC + scripted provisioning — +20% team efficiency.
  • Built AI-powered and logic-driven Slack bots from scratch for ticket triage and self-service — −15% resolution time.
  • Owned 40%+ of all company tickets; top-3 in productivity; 100% SLA on PCI-compliant offboards.
  • Led PCI compliance audits with security; owned full Google Workspace + M365 user lifecycle.
  • Full Mac/Windows support, imaging, encryption, and A/V production for board meetings and all-hands.
Active Directory · Azure AD · Slack bots · PCI-DSS · Okta
Meta

Enterprise Support Technician

May 2022 — May 2023
Menlo Park, CA
2,504 cases closed · 98% CSAT · Top 5% global rank · 6,000+ technicians trained
  • Closed 2,504 cases at 98% CSAT — ranked top 5% of 6,000+ Meta technicians worldwide.
  • Designed enterprise training on Zero Trust, DevOps automation, and the software lifecycle — −18% ramp time.
  • Frontline incident commander during outages — cut unplanned downtime by 5-8 hours / month.
  • Ran enterprise mobile ops across AT&T, Verizon, T-Mobile, Sprint — +12% efficiency, −10% carrier costs.
  • Partnered with security to deploy patches, MDM tooling, and automation across the fleet.
Zero Trust · DevOps · MDM · iOS/Android fleet
Stanford HealthCare

Senior Field Service Technician

Sep 2018 — May 2022
Palo Alto, CA
1,000+ users supported · 500+ user relocation cutover · +20% helpdesk efficiency · 24/7 critical uptime
  • Built and maintained advanced lab environments with Abbott, Thermo Scientific, Illumina, and Bio-Rad for gene splicing, DNA analysis, and vaccine testing.
  • Directed end-to-end relocation for 500+ users — 100% on-time cutover, zero first-day disruption.
  • Engineered ServiceNow automation for device deployment, signage, and ticketing — +20% efficiency.
  • Enforced HIPAA full-disk encryption + Imprivata EAM across ORs, ERs, and patient rooms — 24/7 uptime.
  • Scaled Mac/Windows, VoIP, wireless printers, and barcode scanners across 1,000+ users.
HIPAA · ServiceNow · Imprivata EAM · Lab infra
Box IT

IT Engineer

Apr 2017 — Jun 2018
San Francisco, CA
3,500+ tickets completed · 300+ companies supported · 6,000+ users supported · 87% same-day resolution
  • Provided multi-tenant IT support for 6,000+ users across 300+ companies — remote and onsite.
  • Deployed servers, VMs, and workstations; installed and configured operating systems at scale.
  • Completed 3,500+ tickets with 87% same-day resolution and 94% CSAT.
  • Configured Cisco and Ubiquiti networking gear: access points, routers, switches, and gateways.
MSP · Cisco · Ubiquiti · VMware · Multi-tenant
SunRun

IT Specialist

Mar 2017 — Oct 2017
San Francisco, CA
  • Desktop and phone support for 5,000+ employees — remote and onsite.
  • Imaged machines and enrolled users in Active Directory and Okta SSO.
  • Managed site inventory: asset tagging, AD binding, and documentation.
  • Resolved Tier I–III escalations and supported network admins on time-sensitive projects.
  • Enhanced organization-wide security measures while raising customer satisfaction by 25%.
Active Directory · Okta SSO · Imaging · Asset management
Cernx

Network & Systems Administrator

Sep 2015 — Apr 2017
Emeryville, CA
350 users managed
  • Administered Active Directory, workstations, and network security for 350 users across multiple office and warehouse locations.
  • Created network topologies and implemented servers and networking equipment — provided VPN access to remote workstations.
  • Engineered new servers to optimize IT infrastructure, minimize hardware load, and reduce software issues.
  • Configured, installed, and maintained routers, switches, and firewalls.
  • Advised the board of directors on IT needs, planning future expansion and establishing trust between management and IT.
Active Directory · Networking · Server admin · VPN
Saweedo

Lead Web Developer

Apr 2012 — Jun 2013
El Cerrito, CA
  • Built and maintained web pages in HTML/CSS and JavaScript.
  • Engineered layouts and UIs; integrated data from back-end services and databases.
  • Delivered installation, configuration, and support for web-based applications.
HTML/CSS · JavaScript · Web apps · Backend integration
// selected work

Projects that shipped without drama.

A traditional resume can't show the scale. These are the things I'm proudest of.

ICONIQ Capital

AI Cybersecurity — Phishing Detection

Integrating AI/ML models into cybersecurity operations for real-time phishing detection and prevention — analyzing email patterns and flagging threats before they reach users.

AI/ML · Cybersecurity · Email security
ICONIQ Capital

Automated DL & Access Management

Custom-built automation for distribution list and access management with Slack approval integration — replacing manual IAM workflows with self-service tooling.

Automation · Slack API · IAM
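
The source does not describe how the Slack approval step is built, so the following is only a minimal sketch of what an approval prompt could look like as a Slack Block Kit payload. The `AccessRequest` shape, the function name, and the `dl_approve` / `dl_deny` action IDs are all hypothetical. The function only builds the message body; posting it and handling the button click are out of scope.

```typescript
// Hypothetical request shape; field names are assumptions for illustration.
interface AccessRequest {
  requester: string; // who is asking
  list: string;      // distribution list being requested
  reason: string;    // free-text justification
}

// Builds a message body with one section block and one actions block
// holding Approve / Deny buttons. Nothing is sent to Slack here.
function buildApprovalMessage(channel: string, req: AccessRequest) {
  // The button value carries enough context to act on the click later.
  const value = JSON.stringify({ requester: req.requester, list: req.list });
  return {
    channel,
    // Plain-text fallback for notifications and clients without blocks.
    text: `${req.requester} requests access to ${req.list}`,
    blocks: [
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text: `*${req.requester}* requests access to *${req.list}*\n>${req.reason}`,
        },
      },
      {
        type: "actions",
        elements: [
          {
            type: "button",
            style: "primary",
            action_id: "dl_approve", // hypothetical action ID
            text: { type: "plain_text", text: "Approve" },
            value,
          },
          {
            type: "button",
            style: "danger",
            action_id: "dl_deny", // hypothetical action ID
            text: { type: "plain_text", text: "Deny" },
            value,
          },
        ],
      },
    ],
  };
}
```

An approver's click would arrive as an interaction payload carrying the same `action_id` and `value`, which is where the actual list change would be triggered.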
ICONIQ Capital

AI Knowledge Base for Jira

AI-powered documentation and knowledge base integration for Jira — enabling intelligent search and auto-suggested resolutions from historical ticket data.

AI · Jira · Knowledge base
Meta · Physics AI

10G → 100G WAN/LAN Upgrade

Dual-path backbone uplift across multiple sites. 14 switches, 175 QSFPs, choreographed downtime — 13+ outage hours eliminated.

100G · QSFP28 · Cisco · Juniper
Meta · Physics AI

64-Node DGX Cluster Operations

Hands-on for 512 GPUs powering AI simulation. Diagnostics, RMA, kernel ops, and patch discipline that kept researchers shipping.

NVIDIA DGX · Linux · RMA · Kernel
Meta · Physics AI

Chef Auto-Remediation

Fleet-wide recipes that detect and fix driver conflicts, package failures, and OS instability before they wake a human.

Chef · Bash · Reliability
Prosper Marketplace

AI-Powered Slack Bots

Triage + self-service bots that cut resolution time 15% and absorbed the boring half of L1 support.

AI · Slack API · Automation
Meta · Physics AI

Air-Gapped Lab Hardening

Compliance check-in and patching across cameras, 3D printers, and lab gear inside isolated research networks.

Security · Compliance · Lab Infra
Stanford HealthCare

500-User Relocation Cutover

Full network reconfig, hardware deploy, printer + drive maps — 100% on time with zero first-day tickets.

Project lead · Network · Endpoints
Personal

Self-Hosted AI Portfolio

This website — a full-stack SSR app with a self-hosted LLM chatbot running on bare metal. Docker, Nginx, custom Node.js server, streaming AI responses.

React · Node.js · LLM · Docker
// stack

The toolchain.

From rack-and-stack to RBAC. I pick the tool that keeps things boring and uptime high.

Networking
Cisco · Meraki · Palo Alto · Juniper · Ubiquiti · DHCP · DNS · VLANs · VPN · 10G/100G WAN/LAN · Fiber termination
Infra & Data Center
NVIDIA DGX · PDU management · NAS / SAN · Drive replacement · RMA pipelines · Rack & stack · Fiber cabling
Cloud & Virtualization
AWS · Azure · GCP · VMware · Docker · VirtualBox · Citrix
Operating Systems
Linux (CentOS, Ubuntu, Fedora, Kali) · Windows · macOS · iOS / iPadOS · Android
Automation & Scripting
Chef · Bash · Python · PowerShell · AI-powered Slack bots · ServiceNow workflows · Jira integrations
Security & Compliance
Okta · SSO · CrowdStrike · Carbon Black · FileVault · BitLocker · HIPAA · PCI-DSS · Zero Trust · Imprivata EAM · FIDO2 · Duo
AI & Cybersecurity
AI phishing detection · ML-based threat analysis · AI knowledge base integration · LLM deployment · Self-hosted inference
Tools & Platforms
Active Directory · Azure AD · Google Workspace · Microsoft 365 · ServiceNow · Jira · IBM BigFix · MySQL
Web Development
HTML/CSS · JavaScript · React · Node.js · Docker · Nginx · SSR
certifications & education
Google IT Support Professional
Google / Coursera · Apr 2020
Systems Networking & Management
West Contra Costa USD · May 2015
Digital Media & Web Design
West Contra Costa USD · May 2015
High School Diploma
El Cerrito High School · Jul 2013
// ask the AI

Talk to a self-hosted AI about me.

This isn't a third-party widget. The model runs on my own hardware, with my resume injected as a system prompt. Ask it anything — experience, projects, skills, or whether I'd be a fit for your team.

Resume-Grounded
My full work history, projects, and skills are injected as the system prompt. Every answer is factually tied to real experience.
Self-Hosted
Runs on my own GPU node behind a local vLLM/llama.cpp server. No OpenAI keys, no data leaving my network.
End-to-End Encrypted
TLS 1.3 from Cloudflare edge to origin. API keys are server-side secrets — never exposed to the browser.
Streaming Tokens
Real-time SSE streaming. You see the model think word-by-word, just like ChatGPT — but on my metal.
Try: "What's his experience with DGX?" · "Does he know OSPF?" · "Why hire him?"
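
For readers curious how token-by-token streaming is typically consumed, here is a minimal sketch of parsing server-sent events in the format OpenAI-compatible servers emit (`data: {json}` events ending with `data: [DONE]`). The function name and result shape are hypothetical, and the sketch assumes `\n` line endings and events separated by a blank line. It is not the site's actual client code.

```typescript
// Result of parsing one buffered slice of the SSE stream.
interface SseParseResult {
  tokens: string[]; // text fragments found in complete events
  done: boolean;    // true once the [DONE] sentinel was seen
  rest: string;     // trailing partial event to prepend to the next read
}

// Splits the buffer into complete events, extracts delta text from each,
// and returns whatever incomplete tail is left over.
function parseSseBuffer(buffer: string): SseParseResult {
  const tokens: string[] = [];
  let done = false;
  // Events are separated by a blank line; the last piece may be incomplete.
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? "";
  for (const event of parts) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue; // skip comments and other fields
      const data = line.slice(5).trim();
      if (data === "[DONE]") {
        done = true;
        continue;
      }
      try {
        const chunk = JSON.parse(data);
        const piece = chunk?.choices?.[0]?.delta?.content;
        if (typeof piece === "string") tokens.push(piece);
      } catch {
        // Ignore malformed chunks rather than aborting the stream.
      }
    }
  }
  return { tokens, done, rest };
}
```

A reader loop would append each network chunk to `rest`, call the parser again, and render the returned `tokens` as they arrive.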
// running this site

The stack behind this page.

This site, the AI chatbot, and the network it rides on — all self-hosted on my own metal. Same operational discipline I bring to production infra at work.

metal

Compute

Bare-metal that hosts both this site and the LLM serving the chatbot.

  • GPU node: NVIDIA RTX-class GPU for inference (CUDA + cuDNN)
  • CPU host: Multi-core x86_64, ECC RAM, NVMe storage
  • Platform: Unraid, with Docker containers for web, model, observability
  • OS: Unraid OS (Slackware-based), Docker-managed services
inference

AI Serving

Self-hosted OpenAI-compatible model server — the brain behind the chat widget.

  • Runtime: vLLM / llama.cpp, exposing an OpenAI-compatible /v1 endpoint
  • Model: Open-weight LLM (Llama/Qwen class), quantized for the GPU
  • Context injection: Resume + projects passed as system prompt server-side
  • Streaming: Server-sent events, token-by-token render
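
The context-injection step above can be sketched as a small server-side helper that builds the request body for an OpenAI-compatible `/v1/chat/completions` endpoint. This is an illustration under assumptions, not the site's code: the function name, the prompt wording, and the `local-model` placeholder are hypothetical.

```typescript
// Message shape used by OpenAI-compatible chat endpoints.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatRequest {
  model: string;
  stream: boolean;
  messages: ChatMessage[];
}

// Prepends the resume as the system prompt and asks for a streamed reply.
function buildChatRequest(
  resumeText: string,
  history: ChatMessage[],
  model: string = "local-model", // placeholder model name, an assumption
): ChatRequest {
  const system: ChatMessage = {
    role: "system",
    content: `Answer questions using only this resume:\n\n${resumeText}`,
  };
  // Drop client-supplied system messages so the browser cannot override the prompt.
  const safeHistory = history.filter((m) => m.role !== "system");
  return { model, stream: true, messages: [system, ...safeHistory] };
}
```

Building the body on the server keeps the resume text and the endpoint details out of the browser, which matches the "server-side" wording in the list above.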
delivery

Web & Edge

TanStack Start app, running as a Docker container behind Nginx on Unraid.

  • Framework: TanStack Start (React 19, Vite 7, server functions)
  • Styling: Tailwind v4 + custom OKLCH design tokens
  • Reverse proxy: Nginx for TLS termination, gzip, proxy to Node.js container
  • Containerized: Docker with multi-stage build, Node 22 Alpine, port 3000
wire

Network

Same discipline I run at work — segmented, monitored, no flat networks.

  • Reverse proxy: Nginx for SSL termination, routing to Docker containers
  • Firewall: OPNsense / UniFi with stateful filtering, IDS/IPS, geo-blocks
  • Segmentation: VLANs for lab / IoT / mgmt; AI server in restricted VLAN
  • DNS: Internal Pi-hole + Unbound; public via Cloudflare
hardening

Security

Defense in depth. API keys live in secrets, never in the browser.

  • Secrets: Endpoint URL + API key stored as server-side env vars in Docker
  • Transport: TLS 1.3 end-to-end, Nginx → Node.js container
  • App: Strict input caps, message-history trim, no PII logged
  • Access: SSH key-only, fail2ban, MFA on every admin surface
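
As an illustration of the "input caps, message-history trim" item, here is a minimal sketch of a helper that bounds what reaches the model. The limits (`MAX_CHARS`, `MAX_TURNS`) and the function name are hypothetical values chosen for the example; the real thresholds are not stated in the source.

```typescript
// Chat turn as received from the browser (no system role allowed here).
interface Msg {
  role: "user" | "assistant";
  content: string;
}

const MAX_CHARS = 2000; // hypothetical per-message character cap
const MAX_TURNS = 12;   // hypothetical number of most recent messages kept

// Keeps only the newest messages and truncates each one to the cap.
// Returns new objects; the caller's history is left untouched.
function capAndTrim(
  history: Msg[],
  maxChars: number = MAX_CHARS,
  maxTurns: number = MAX_TURNS,
): Msg[] {
  if (maxTurns <= 0) return []; // slice(-0) would return everything
  return history
    .slice(-maxTurns)
    .map((m) => ({ role: m.role, content: m.content.slice(0, maxChars) }));
}
```

Capping both dimensions bounds the prompt size per request, which also keeps inference latency predictable on a single GPU.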
running it

Ops & Observability

If I can't see it, I can't fix it before users notice.

  • Metrics: Prometheus + Grafana dashboards (GPU temp, tokens/sec, latency)
  • Logs: Loki, with structured logs from web + model server
  • Uptime: Uptime Kuma for public status, multi-region probes
  • Backups: Restic → off-site encrypted snapshots, weekly restore tests
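
To make the tokens/sec metric concrete, here is a minimal sketch that computes throughput and renders it in the Prometheus text exposition format. The metric name, label, and helper names are hypothetical; a real deployment would more likely use a Prometheus client library than hand-rolled formatting.

```typescript
// One gauge sample to expose on a /metrics endpoint.
interface GaugeSample {
  name: string;
  help: string;
  value: number;
  labels?: Record<string, string>;
}

// Escapes backslash, double quote, and newline in a label value.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders HELP, TYPE, and sample lines for a single gauge.
function renderGauge(s: GaugeSample): string {
  const pairs = Object.entries(s.labels ?? {})
    .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
    .join(",");
  const labelPart = pairs ? `{${pairs}}` : "";
  return (
    [
      `# HELP ${s.name} ${s.help}`,
      `# TYPE ${s.name} gauge`,
      `${s.name}${labelPart} ${s.value}`,
    ].join("\n") + "\n"
  );
}

// Throughput in tokens per second; returns 0 for a non-positive duration.
function tokensPerSecond(tokens: number, elapsedMs: number): number {
  return elapsedMs > 0 ? (tokens * 1000) / elapsedMs : 0;
}
```

For example, 120 tokens generated in 4,000 ms gives 30 tokens/sec, which a Grafana panel could then chart alongside latency.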
// initiate connection

Got an infrastructure problem that can't go down?

Open to staff / senior infra and network roles in the Bay Area, remote-friendly. Fastest way to reach me is email.

© 2026 Mujtaba Jafri
system status: nominal