
How to Keep Google Colab from Disconnecting During Training 2025

Complete guide to preventing Google Colab disconnections during AI training. JavaScript keep-alive scripts, checkpointing strategies, Colab Pro comparison, and reliable workflows.


Your LoRA training hits the 3-hour mark when Google Colab suddenly disconnects. Hours of GPU compute vanish. Your training progress disappears without saved checkpoints. Google Colab's 90-minute idle timeout and 12-hour maximum runtime create constant disconnection anxiety. Combining JavaScript keep-alive techniques with robust checkpointing strategies enables reliable long-duration training on Colab's free and Pro tiers.

Quick Answer: Prevent Google Colab disconnections using browser console JavaScript to bypass the 90-minute idle timeout, implement model checkpointing every 15-30 minutes to preserve training progress, upgrade to Colab Pro for 24-hour runtimes, and structure training sessions in resumable segments that automatically save state and continue from interruptions.

TL;DR: Keeping Colab Connected
  • Idle Timeout Solution: JavaScript console scripts simulate activity preventing 90-minute disconnection
  • Progress Protection: Checkpoint every 15-30 minutes to Google Drive preserving training state
  • Colab Pro Benefits: 24-hour runtime (vs 12 hours free), better GPU availability, longer idle timeouts
  • Best Practice: Combine keep-alive scripts with checkpointing for maximum reliability
  • Alternative: Split training into multiple shorter sessions with automatic resumption from checkpoints

You started training at 10 PM expecting to wake up to a completed LoRA model. Instead you find "Runtime disconnected" with zero progress saved. The frustration compounds when you realize this happens repeatedly, wasting free GPU hours and preventing completion of training projects. You need reliable methods that actually work in 2025 rather than outdated scripts broken by Colab interface changes.

Google Colab provides valuable free GPU access, but its disconnection policies create challenges for serious AI training projects. Understanding both the disconnection mechanisms and proven mitigation strategies transforms Colab from an unreliable experimentation platform into a viable training environment. While dedicated solutions like Apatero.com eliminate disconnection concerns entirely through stable infrastructure, mastering Colab techniques enables budget-conscious training and a general understanding of cloud training workflows.

What This Complete Colab Reliability Guide Covers
  • Understanding Google Colab's disconnection mechanisms and timeout policies
  • Implementing JavaScript keep-alive scripts that work in 2025
  • Building robust checkpointing systems that preserve training state
  • Comparing Colab Free vs Pro vs Pro+ for training reliability
  • Structuring resumable training workflows that survive disconnections
  • Troubleshooting common keep-alive script failures and CAPTCHA issues
  • Optimizing Google Drive integration for fast checkpoint saving
  • Monitoring session health and predicting disconnections before they occur

Why Does Google Colab Disconnect During Training?

Before implementing solutions, understanding Colab's disconnection mechanisms helps you choose appropriate countermeasures and set realistic expectations.

The Two Types of Colab Disconnections

Google Colab enforces two distinct timeout policies that affect training sessions differently. According to the official Colab documentation, these limits exist to ensure fair resource distribution across all users.

Idle Timeout (90 Minutes):

The idle timeout triggers when no user interaction occurs for approximately 90 minutes. User interaction means clicking buttons, running cells, or moving your mouse over the notebook interface. Your training script can be processing data continuously, and your notebook will still disconnect after 90 minutes of zero user interaction.

This timeout exists because idle sessions consume GPU resources other users could utilize. A notebook left open but inactive wastes expensive compute capacity. The 90-minute window gives generous time for active development work while preventing indefinite resource occupation.

Maximum Runtime Limit:

Colab Free imposes a 12-hour absolute runtime limit. After 12 consecutive hours, the session terminates regardless of activity or training status. Colab Pro extends this to 24 hours. Colab Pro+ provides up to 36 hours for certain GPU types.

This hard limit prevents individual users from monopolizing compute resources indefinitely. It also reflects the business model where extended runtimes encourage Pro subscriptions.

Colab Tier | Idle Timeout | Max Runtime | GPU Priority | Cost
-----------|--------------|-------------|--------------|----------
Free       | ~90 minutes  | 12 hours    | Low          | $0/month
Pro        | ~90 minutes  | 24 hours    | High         | $10/month
Pro+       | ~90 minutes  | 36 hours    | Highest      | $50/month

Understanding these limits helps set realistic training session lengths and checkpoint frequency.

What Triggers the Idle Detection?

Colab's idle detection monitors user interaction with the notebook interface rather than code execution. Your GPU churning at 100 percent utilization doesn't prevent idle timeout if you haven't clicked anything in the browser window recently.

Monitored Activities:

The system tracks mouse movements over the notebook, clicks on cells or buttons, keyboard input in cells or interface elements, and cell execution initiated manually by the user. Automated cell execution triggered from code doesn't count as user interaction.

Not Monitored:

Training script output printing to cells doesn't register as activity. GPU utilization percentage doesn't affect idle detection. Network requests from your code to external services don't count. Progress bars updating automatically within running cells provide no protection.

This distinction is critical: even compute-heavy training that runs for hours registers as idle if you don't manually interact with the interface.

Common Misconceptions About Colab Disconnections

Several widespread misconceptions cause confusion about why disconnections occur and how to prevent them.

Misconception 1: Active code execution prevents disconnection

Many users believe that actively running code protects against the idle timeout. This is false. According to Stack Overflow discussions from 2024-2025, training scripts that have been running for 6 hours still hit the idle timeout 90 minutes after the last user interaction.

Misconception 2: Colab Pro eliminates disconnections

Colab Pro extends maximum runtime and improves GPU availability but maintains the 90-minute idle timeout. Pro subscribers still need keep-alive solutions for training sessions exceeding 90 minutes without manual interaction.

Misconception 3: Printing output prevents idle detection

Generating console output through print statements or progress bars doesn't register as user activity. The idle timer continues counting down regardless of output generation.

Misconception 4: Opening multiple tabs shares the timeout

Each Colab notebook tab has independent idle timeouts. Interacting with one notebook doesn't reset idle timers for other open notebooks. Each requires separate attention to prevent disconnection.

How Do JavaScript Keep-Alive Scripts Work?

JavaScript executed in your browser console can simulate user interaction, preventing the idle timeout from triggering. This is the most common approach for keeping Colab sessions alive during training.

Understanding Browser Console JavaScript Execution

Modern browsers allow running JavaScript code in developer consoles. This code executes in the context of the current webpage and can interact with page elements just like manual user actions.

Colab's notebook interface runs in your browser as a JavaScript application. Browser console JavaScript can trigger the same interface interactions that manual clicking would, effectively simulating user activity that resets the idle timer.

Why This Approach Works:

From Colab's perspective, JavaScript-triggered interactions are indistinguishable from manual interactions. The system tracks mouse events, clicks, and keyboard input at the browser event level. JavaScript generating these events appears identical to human-generated events.

This technique works entirely client-side in your browser. Your training code running on Google's servers remains unmodified. The keep-alive logic exists separately in your browser maintaining the connection.

Implementing the Basic Keep-Alive Script

Open your browser's developer console while viewing your Colab notebook. Press F12 on Windows and Linux or Cmd+Option+I on Mac. Alternatively right-click anywhere on the Colab page and select Inspect, then click the Console tab.

Current Working Script (2025):

Create a function called KeepClicking that logs a message to the console and uses document.querySelector to find the colab-connect-button element. Navigate through the shadowRoot to access the connect button's ID and trigger a click event on it. Wrap this function in setInterval with a 60000 millisecond delay so it repeats every 60 seconds. Paste this code into the console and press Enter to start execution.
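As a concrete sketch, the script described above looks like this. The `colab-connect-button` and `connect` selectors reflect the Colab interface at the time of writing and are the parts most likely to need updating after a UI change:

```javascript
// Keep-alive sketch for the Colab browser console.
// The selectors below are assumptions about the current Colab DOM and
// may break when Google updates the interface.
function KeepClicking() {
  console.log("Keep-alive: clicking connect button");
  const btn = document
    .querySelector("colab-connect-button")
    .shadowRoot.getElementById("connect");
  btn.click();
}

// Only start the interval when running inside a browser page.
if (typeof document !== "undefined") {
  setInterval(KeepClicking, 60000); // repeat every 60 seconds
}
```

To stop it later, note the interval ID that `setInterval` returns and pass it to `clearInterval`, or simply close the tab.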

The script runs continuously as long as the browser tab remains open and the console stays active. Closing the console or browser tab stops execution and idle timeout resumes normal counting.

How the Script Functions:

The querySelector finds the Colab connection button element in the page. The shadowRoot.getElementById navigates through the shadow DOM where Colab's custom elements hide. The click() method triggers a click event on the button. setInterval repeats this action every 60 seconds indefinitely.

In community keep-alive implementations, clicking every 60 seconds has proven sufficient to register activity without overwhelming Colab's systems with excessive requests.

Alternative Keep-Alive Script Approaches

Different JavaScript approaches offer variations in reliability and complexity. Some methods prove more resilient to Colab interface changes than others.

Mouse Movement Simulation:

Create a function called simulateMouseActivity that creates a new MouseEvent with type mousemove. Configure the event with view set to window, bubbles set to true, and cancelable set to true. Dispatch this event to the document and log a message confirming the simulation. Wrap this in setInterval with 60000 millisecond intervals. This script simulates mouse movement events. It's more resilient to interface changes since it doesn't depend on specific button selectors. However, recent Colab updates sometimes ignore simulated mouse movements, making this less reliable than button clicking.

Keyboard Activity Simulation:

Create a function called simulateKeyPress that generates a new KeyboardEvent of type keydown with the key property set to Shift. Dispatch this event to the document and log a confirmation message. Use setInterval to repeat this every 60000 milliseconds. Simulating Shift key presses provides another activity signal. This method avoids clicking buttons or moving the mouse but Colab's idle detection may not register keyboard events as reliably as mouse interactions.

Combined Approach:

Create a keepAlive function that first logs a keep-alive ping message. Inside a try-catch block, attempt to find the colab-connect-button using querySelector, access its shadowRoot, get the connect element by ID, and trigger a click. If this fails and throws an error, the catch block logs the failure message and dispatches a MouseEvent with type mousemove as a fallback. Set this function to run every 60000 milliseconds using setInterval. This combined script attempts button clicking and falls back to mouse movement if the button selector fails. The try-catch error handling makes the script more robust against Colab interface changes.
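A sketch of the combined approach, with the same caveat that the selectors are assumptions about the current Colab DOM:

```javascript
// Combined keep-alive sketch: try the connect-button click first, and
// fall back to a synthetic mousemove event if the selector fails.
function keepAlive() {
  console.log("Keep-alive ping at", new Date().toISOString());
  try {
    document
      .querySelector("colab-connect-button")
      .shadowRoot.getElementById("connect")
      .click();
  } catch (err) {
    console.log("Button click failed, simulating mouse movement instead");
    document.dispatchEvent(
      new MouseEvent("mousemove", {
        view: window,
        bubbles: true,
        cancelable: true,
      })
    );
  }
}

// Only start the interval when running inside a browser page.
if (typeof document !== "undefined") {
  setInterval(keepAlive, 60000); // repeat every 60 seconds
}
```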

Troubleshooting Keep-Alive Script Failures

Keep-alive scripts occasionally fail due to Colab interface updates, browser security changes, or CAPTCHA challenges. Systematic troubleshooting identifies and resolves issues.

Script Not Running:

If pasting the script into console produces no output or errors, verify you're in the correct console tab. Some browsers have multiple console contexts. Ensure you're in the main page console, not an iframe or extension console.

Check for JavaScript errors displayed in red text. Syntax errors prevent script execution. Copy the script carefully without adding extra characters or missing code segments.

Button Selector Not Found:

If the console shows "Cannot read property of null" errors, the button selector failed. Colab interface updates change element IDs and class names, breaking existing scripts.

Inspect the connect button element using browser developer tools. Right-click the connect button, select Inspect, and examine the element structure. Update the querySelector path to match current element hierarchy.

Google updates Colab's UI periodically, which requires script adjustments. Join Colab user communities to find updated scripts when interface changes break existing solutions.

CAPTCHA Challenges:

Google occasionally presents CAPTCHA challenges even with keep-alive scripts running. The system detects suspicious patterns and requires human verification.

CAPTCHAs are manual interventions that automated scripts cannot solve. You must personally complete the CAPTCHA to continue the session. Keep-alive scripts cannot bypass this security measure.

To minimize CAPTCHA frequency, avoid running excessive scripts, use moderate keep-alive intervals (60-90 seconds rather than every 5 seconds), and don't run multiple Colab sessions simultaneously with keep-alive scripts. Responsible script usage reduces security flag triggers.


What Is Robust Checkpointing and Why Is It Essential?

Keep-alive scripts mitigate idle timeouts but don't prevent hard runtime limits or unexpected crashes. Checkpointing provides the essential safety net preserving training progress regardless of disconnection cause.

Understanding Training Checkpoints

Checkpoints are complete snapshots of training state enabling resumption from specific points. According to machine learning best practices, robust checkpointing is more important than keep-alive scripts for production training workflows.

What Checkpoints Include:

Complete checkpoints save model weights (current neural network parameters), optimizer state (Adam, SGD momentum and learning rate values), training step counter (current epoch and batch numbers), random number generator state (ensuring reproducible continuation), and training loss history (enabling monitoring across disconnections).

Partial checkpoints saving only model weights can't fully resume training. Optimizer state is critical because optimizers like Adam maintain momentum that affects learning trajectory. Resuming without optimizer state continues training but loses optimization momentum.

Checkpoint Frequency Trade-offs:

More frequent checkpoints provide better progress protection but consume more time and storage. Checkpointing every epoch works well for slow training with few epochs. Checkpointing every 100-200 steps suits fast training with thousands of steps.

According to practical testing, checkpointing every 15-30 minutes provides optimal balance for Colab training. This protects against idle timeouts (90 minutes) while limiting checkpoint overhead to 5-10 percent of training time.

Implementing PyTorch Checkpointing in Colab

PyTorch provides simple checkpointing through torch.save() and torch.load() functions. Implementing robust checkpointing requires careful state management and error handling.

Basic PyTorch Checkpoint Saving:

Save checkpoints during training loops:

After each epoch, or every N steps, build a checkpoint dictionary containing all training state, save it to Google Drive so it persists across sessions, and handle potential I/O errors gracefully.

The checkpoint dictionary should include:

model.state_dict() for model parameters, optimizer.state_dict() for optimizer state, epoch number, training loss history, and any custom training variables.

Checkpoint Loading for Resumption:

At training start, check whether a checkpoint exists. If one is found, load it, restore all saved state, and continue training from the saved point.

Handle the case where no checkpoint exists (first training run) versus checkpoint available (resuming training). The code should work correctly in both scenarios without manual intervention.
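Because the exact training code varies, here is a framework-agnostic sketch of the save/resume pattern using plain `pickle` and dictionaries; with PyTorch you would substitute `torch.save`/`torch.load` and `model.state_dict()`/`optimizer.state_dict()` for the stand-in dicts:

```python
# Framework-agnostic sketch of the save/resume pattern described above.
# With PyTorch you would use torch.save/torch.load and the state_dict()
# methods; plain pickle and dicts stand in here so the pattern is clear.
import os
import pickle
import tempfile

# On Colab this would point into Drive, e.g.
# "/content/drive/MyDrive/checkpoints/latest.pkl"; a temp dir is used
# here so the sketch runs anywhere.
CHECKPOINT_PATH = os.path.join(tempfile.mkdtemp(), "latest.pkl")

def save_checkpoint(model_state, optimizer_state, epoch, loss_history):
    os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
    checkpoint = {
        "model_state": model_state,
        "optimizer_state": optimizer_state,
        "epoch": epoch,
        "loss_history": loss_history,
    }
    # Write to a temp file first so a disconnection mid-write can never
    # corrupt the previous good checkpoint.
    tmp_path = CHECKPOINT_PATH + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(checkpoint, f)
    os.replace(tmp_path, CHECKPOINT_PATH)

def load_checkpoint():
    """Return the saved checkpoint dict, or None on a fresh run."""
    if not os.path.exists(CHECKPOINT_PATH):
        return None
    with open(CHECKPOINT_PATH, "rb") as f:
        return pickle.load(f)

# Resume-or-start logic: works both on a first run and after a disconnect.
ckpt = load_checkpoint()
start_epoch = ckpt["epoch"] + 1 if ckpt else 0
```

The temp-file-then-rename trick is worth keeping even in the real PyTorch version: `os.replace` is atomic on the same filesystem, so a crash mid-save leaves the last good checkpoint intact.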

Google Drive Integration:

Mount Google Drive to persist checkpoints beyond session lifetime. Without Drive mounting, checkpoints save to temporary session storage that vanishes with disconnection.

Mount Drive early in your notebook before training starts. All checkpoint paths should write to /content/drive/MyDrive/checkpoints/ or similar Drive locations.

Implementing TensorFlow/Keras Checkpointing

TensorFlow and Keras provide ModelCheckpoint callback for automatic checkpointing during training. This high-level interface simplifies checkpoint management.

Keras ModelCheckpoint Setup:

Create a ModelCheckpoint callback specifying the checkpoint file path, the metric to monitor (validation loss or training loss), whether to save only the best model or every epoch, and the save frequency (every epoch or every N batches).

Pass the checkpoint callback to model.fit() which handles checkpoint saving automatically during training.

Custom TensorFlow Checkpointing:

For custom training loops, use tf.train.Checkpoint() and CheckpointManager for more control. This approach enables checkpointing of custom training variables beyond standard model weights and optimizer state.


CheckpointManager handles checkpoint rotation keeping only the N most recent checkpoints. This prevents unlimited checkpoint accumulation consuming excessive Drive storage.
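For readers not using TensorFlow's CheckpointManager, the rotation idea can be sketched with the standard library alone (the `.pkl` extension and keep-3 default are illustrative choices):

```python
# Sketch of checkpoint rotation in plain Python: keep only the N most
# recently modified checkpoint files in a directory. TensorFlow's
# CheckpointManager does the equivalent automatically.
import os

def rotate_checkpoints(ckpt_dir, keep=3):
    """Delete all but the `keep` most recently modified .pkl checkpoints."""
    files = [
        os.path.join(ckpt_dir, name)
        for name in os.listdir(ckpt_dir)
        if name.endswith(".pkl")
    ]
    # Newest first by modification time.
    files.sort(key=os.path.getmtime, reverse=True)
    for stale in files[keep:]:
        os.remove(stale)
```

Call this right after each successful checkpoint save so Drive storage stays bounded.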

Optimizing Checkpoint Save Speed

Checkpoint saving speed matters because slow I/O creates training bottlenecks. Saving 5GB checkpoints every 15 minutes that take 3 minutes to write wastes 20 percent of training time.

Checkpoint Size Optimization:

Save only essential state rather than redundant information. Don't save training data or validation data in checkpoints (reload from source). Don't save generated samples or visualization images in checkpoints. Only save model parameters, optimizer state, and minimal training metadata.

Use efficient serialization formats. PyTorch's torch.save() uses pickle by default which is reasonably efficient. For extremely large models, consider safetensors format which provides faster loading and better security properties.

Parallel Checkpoint Saving:

Save checkpoints in background threads allowing training to continue immediately. Python's threading module enables parallel I/O operations.

Be careful with thread safety. Checkpoint dictionaries should be created in the main thread before background saving begins. Don't modify state dictionaries while background save operates.
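A minimal sketch of that pattern using stdlib threading and `pickle` (with PyTorch you would serialize the state dicts the same way):

```python
# Sketch of background checkpoint saving: snapshot the state in the main
# thread, then hand the write to a worker thread so training continues.
import copy
import pickle
import threading

def save_in_background(state, path):
    # Deep-copy BEFORE starting the thread so training updates to `state`
    # cannot race with the serialization below.
    snapshot = copy.deepcopy(state)

    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # join() this before the next save or at shutdown
```

Joining the returned thread before starting the next save prevents two writers from racing on the same file.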

Google Drive Write Performance:

Google Drive write speeds from Colab vary from 10-50 MB/s depending on current load. Large checkpoints naturally take longer.

Monitor actual checkpoint save times and adjust frequency accordingly. If 15-minute checkpoints take 5 minutes to save, reduce frequency to 30-minute intervals or optimize checkpoint size.
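One way to monitor this is to time each save and flag excessive overhead; a small sketch (the 10 percent warning threshold is an illustrative choice):

```python
# Sketch: time each checkpoint save and warn when the overhead relative
# to the checkpoint interval gets too high.
import time

def timed_save(save_fn, interval_seconds):
    t0 = time.monotonic()
    save_fn()
    elapsed = time.monotonic() - t0
    overhead = elapsed / interval_seconds
    if overhead > 0.10:  # more than 10% of the interval spent saving
        print(f"Checkpoint took {elapsed:.1f}s "
              f"({overhead:.0%} of interval); consider saving less often")
    return elapsed
```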

How Does Colab Pro Compare for Training Reliability?

Colab Pro and Pro+ subscriptions provide improvements that affect training reliability. Understanding what you get helps evaluate if the subscription is worthwhile for your projects.

Colab Pro Features and Benefits

Colab Pro costs $10 monthly and provides several improvements over free tier according to official Colab pricing.

Extended Runtime Limits:

Pro provides 24-hour maximum runtime versus 12 hours on free tier. This doubles available training time before forced disconnection. For projects requiring 15-20 hours training, Pro becomes essential rather than optional.

Note that Pro still enforces the 90-minute idle timeout. Keep-alive scripts remain necessary for unattended training sessions exceeding 90 minutes.

Better GPU Availability:

Pro users receive priority GPU access. During peak usage when free tier users can't access GPUs, Pro subscribers typically get immediate GPU allocation. This eliminates waiting and enables starting training when needed rather than when capacity happens to be available.

Pro provides access to better GPU types. While free tier users typically get T4 GPUs, Pro users can access V100 or A100 GPUs providing 2-4x training speed improvements. Faster training means completion within runtime limits becomes more feasible.

Increased Resource Limits:

Pro provides more RAM (up to 52GB vs 13GB free tier) and more disk space (up to 225GB vs 78GB free tier). For training with large datasets or models, these increased limits prevent out-of-memory errors that plague free tier users.

Is Colab Pro Worth It?:

For casual experimentation and learning, free tier suffices. For serious projects requiring regular training sessions, Pro provides valuable reliability improvements justifying the $10 monthly cost. Consider that a single wasted training session due to disconnection represents hours of lost time worth far more than $10 to most professionals.

Colab Pro+ Features and Benefits

Colab Pro+ costs $50 monthly and targets professional users requiring maximum resources. According to practical user reports, the value proposition is less clear than regular Pro.

Extended Runtime to 36 Hours:

Pro+ theoretically provides 36-hour runtimes for certain GPU types. However, users report inconsistent enforcement and many sessions still disconnect at 24 hours. The 36-hour limit appears to apply only under specific conditions not always clearly communicated.


Better GPU Options:

Pro+ provides access to premium GPUs, including A100s and in some cases 32GB V100 models. These GPUs significantly outperform T4 and standard V100 options. An A100 trains approximately 4x faster than a T4 for many workloads.

Background Execution:

Pro+ promises background execution allowing closure of browser tabs while training continues. However, implementation is spotty and users report mixed results. This feature doesn't work reliably enough to depend on currently.

Is Pro+ Worth It?:

For most users, Pro+ doesn't provide $50 worth of value compared to $10 Pro. The primary benefit is A100 GPU access. If your training workloads can leverage A100 performance, Pro+ becomes worthwhile. For training that runs fine on V100, regular Pro offers better value.

Many users find better value in dedicated GPU cloud providers like Vast.ai or RunPod for projects justifying Pro+ costs. These alternatives provide more predictable access and performance at comparable or lower pricing. Platforms like Apatero.com provide another alternative with managed infrastructure eliminating disconnection concerns entirely.

Comparing Colab Tiers for Specific Training Projects

Different training project types benefit differently from Colab tier features. Match your tier choice to project requirements.

Short Training (Under 6 Hours):

Free tier handles short training fine with keep-alive scripts and checkpointing. The 12-hour limit provides ample margin. GPU availability may frustrate during peak hours but patience usually gets access.

Medium Training (6-20 Hours):

Colab Pro becomes valuable in this range. Free tier's 12-hour limit cuts training short requiring restart and continuation. Pro's 24-hour limit allows single-session completion with margin for unexpected slowdowns.

Better GPU access through Pro significantly reduces frustration waiting for compute availability. Priority access means starting training when ready rather than checking back repeatedly hoping for capacity.

Long Training (20+ Hours):

Projects requiring more than 24 hours face challenges even with Pro. Pro+ theoretically helps but unreliable 36-hour limits make planning difficult.

Consider restructuring training into multiple resumable segments. Train 20 hours, save final checkpoint, start new session, load checkpoint, continue another 20 hours. This approach works across any Colab tier but requires proper checkpoint implementation.

Alternatively, use dedicated GPU cloud providers for very long training jobs. Colab works best for training completing within 12-24 hour windows with proper checkpointing.

How Do You Structure Resumable Training Workflows?

Proper workflow structure transforms training from fragile single-session jobs into robust multi-session projects that survive any disconnection.

Designing Auto-Resume Training Scripts

Auto-resume capabilities enable training to continue automatically after disconnection without manual intervention. This provides the ultimate reliability for Colab training.

Core Auto-Resume Components:

Check for existing checkpoint at training start. If checkpoint exists, load full training state and continue from last saved point. If no checkpoint exists, initialize new training from scratch. This logic runs automatically every time the notebook executes.

Implementation Pattern:

Structure your training initialization code to follow this pattern. Define checkpoint path in Google Drive, attempt loading checkpoint with error handling, extract loaded state if successful, initialize fresh training if no checkpoint found, and start training loop from correct position.

This structure means you can restart your notebook anytime and training automatically continues from last checkpoint. Disconnection becomes inconvenience rather than disaster.

Managing Training Across Multiple Sessions

Long training projects spanning multiple Colab sessions require careful state management and logging to maintain continuity.

Persistent Training Logs:

Save training logs to Google Drive alongside checkpoints. Include training loss history, validation metrics, learning rate schedule, and generation timestamps for each logged metric.

When loading checkpoints, also load training history allowing you to plot complete training curves across multiple sessions. This unified view helps identify learning issues and optimal stopping points.

Tracking Total Training Time:

Maintain cumulative training time across sessions. Each checkpoint should include total training time elapsed. When resuming, add current session time to loaded cumulative time.

This enables accurate tracking of actual training cost and helps planning future training budgets. Knowing a LoRA required 8 hours total across 3 sessions helps estimate similar future projects.
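A small sketch of that bookkeeping (function names are illustrative):

```python
# Sketch of cumulative training-time tracking across sessions: each
# checkpoint stores total elapsed seconds, and a resumed session adds
# its own running time on top.
import time

def start_session(checkpoint):
    """Return (session_start, previous_total_seconds) for a new session."""
    previous_total = (
        checkpoint.get("total_train_seconds", 0.0) if checkpoint else 0.0
    )
    return time.monotonic(), previous_total

def total_seconds_now(session_start, previous_total):
    """Cumulative training time: prior sessions plus the current one."""
    return previous_total + (time.monotonic() - session_start)
```

Store `total_seconds_now(...)` under `total_train_seconds` in each checkpoint so the next session picks it up.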

Session Metadata Recording:

Log each training session's details to Drive including session start time, session end time, GPU type used, initial checkpoint loaded, final checkpoint saved, training steps completed, and any errors or issues encountered.

This metadata proves valuable for debugging inconsistent training results and understanding which sessions contributed most to final model quality.

Implementing Graceful Shutdown Procedures

Training scripts should detect impending disconnections and save state gracefully rather than abruptly terminating mid-update.

Detecting Runtime Warnings:

Colab occasionally displays warnings before disconnecting. While you can't reliably catch these in code, you can implement periodic checkpoint checks that ensure recent checkpoints always exist.

Checkpoint at regular intervals (every 15-30 minutes as discussed) rather than only at epoch boundaries. This ensures maximum progress preservation even if disconnection occurs mid-epoch.

Handling Interrupt Signals:

Python signal handlers can catch some termination events, enabling a final checkpoint save before the process exits.

Register signal handlers that save checkpoints when receiving termination signals. This provides last-chance state saving during some disconnection scenarios.
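A sketch of the idea using Python's `signal` module, where `save_fn` stands in for whatever your checkpoint-saving routine is:

```python
# Sketch of a last-chance checkpoint on termination signals. SIGTERM and
# SIGINT are catchable in Python; hard kills (SIGKILL) are not, so
# periodic checkpointing remains the primary protection.
import signal
import sys

def install_shutdown_handler(save_fn):
    def _handler(signum, frame):
        print(f"Signal {signum} received, saving final checkpoint...")
        save_fn()
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
```

Call `install_shutdown_handler(save_checkpoint_fn)` once, before the training loop starts, from the main thread (CPython only allows setting signal handlers there).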

However, not all Colab disconnections send catchable signals. Hard runtime limit disconnections may terminate abruptly without signal handlers executing. Periodic checkpointing remains essential regardless of signal handling.

Frequently Asked Questions

Does running code prevent Google Colab from disconnecting?

No, active code execution doesn't prevent idle timeout disconnection. Colab's idle detection monitors user interaction with the interface rather than code execution. Your training script can run at 100 percent GPU utilization and still trigger idle timeout after 90 minutes without manual mouse or keyboard interaction. This is why keep-alive scripts that simulate user activity are necessary for unattended training sessions.

Can Colab detect and ban accounts using keep-alive scripts?

Google's terms of service prohibit "abusive use" of Colab resources including running indefinite background scripts. However, using keep-alive scripts for legitimate training projects during reasonable hours falls in a gray area. Most users report no issues with moderate keep-alive usage. Excessive use like running 24/7 scripts across multiple accounts or cryptocurrency mining attracts attention and potential bans. Use keep-alive responsibly for actual training projects to minimize risk.

Why does my keep-alive script stop working after Colab updates?

Colab's interface updates change HTML element IDs, classes, and structure that keep-alive scripts depend on. When Google updates the interface, querySelector selectors in scripts break causing click attempts to fail. This requires updating scripts to match new interface structure. Join Colab user communities on GitHub, Reddit, or Stack Overflow where users share updated scripts when interface changes break existing ones.

Is checkpointing necessary if I use keep-alive scripts?

Yes, checkpointing remains essential even with working keep-alive scripts. Keep-alive prevents idle timeout but doesn't protect against the hard runtime limit (12 hours free, 24 hours Pro), unexpected Colab crashes or maintenance, network disconnections breaking the session, or browser crashes killing the keep-alive script. Robust checkpointing provides protection against all disconnection causes and is considered best practice for any serious training project.

How often should I save checkpoints during training?

Checkpoint every 15-30 minutes for optimal balance between progress protection and training efficiency. More frequent checkpointing (every 5 minutes) wastes time on I/O overhead. Less frequent checkpointing (every 2 hours) risks losing substantial progress to unexpected disconnections. Monitor your checkpoint save times and adjust frequency accordingly. If checkpoints take 3 minutes to save, 20-30 minute intervals prevent spending excessive time on checkpointing relative to training.

Will Colab Pro prevent all disconnections?

No. Colab Pro still enforces the 90-minute idle timeout, so keep-alive scripts remain necessary for unattended training. Pro extends the maximum runtime from 12 to 24 hours but doesn't eliminate disconnections entirely. Pro provides better reliability through priority GPU access and longer runtimes, but keep-alive scripts and checkpointing remain necessary for long training sessions on any Colab tier, including Pro and Pro+.

Can I run multiple Colab notebooks with keep-alive scripts simultaneously?

Technically yes, but this increases CAPTCHA likelihood and account restriction risk. Each notebook requires its own keep-alive script since idle timeouts are per-notebook. Running many simultaneous notebooks with keep-alive scripts looks suspicious to Google's abuse detection systems. For legitimate needs, running 2-3 notebooks simultaneously is generally acceptable, but 10+ concurrent notebooks with keep-alive scripts invites problems. Consider alternatives like Vast.ai or RunPod for large-scale parallel training.

How much Google Drive storage do training checkpoints consume?

Checkpoint size depends on your model. Small models (SD 1.5 LoRA) create 50-200MB checkpoints. Medium models (SDXL LoRA) create 200-800MB checkpoints. Large models (full SDXL fine-tune) create 5-7GB checkpoints. Multiply checkpoint size by the number of checkpoints you retain to estimate total storage. Implement checkpoint rotation keeping only the 3-5 most recent checkpoints to prevent unlimited storage growth. Free Google Drive provides 15GB, which handles LoRA training comfortably but may be insufficient for full-model fine-tuning without aggressive checkpoint rotation.
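A rotation policy like the one described can be sketched with the standard library. The `checkpoint-*.pt` filename pattern and the folder layout are assumptions for illustration; adapt them to whatever your training script actually writes to Drive.

```python
import glob
import os

# Hypothetical sketch: keep only the `keep` newest checkpoint files
# in a folder (e.g. a mounted Google Drive directory), newest by mtime.
def rotate_checkpoints(folder, keep=3, pattern="checkpoint-*.pt"):
    """Delete all but the `keep` most recent checkpoints; return survivors."""
    paths = sorted(
        glob.glob(os.path.join(folder, pattern)),
        key=os.path.getmtime,
        reverse=True,  # newest first
    )
    for stale in paths[keep:]:
        os.remove(stale)
    return paths[:keep]
```

Calling `rotate_checkpoints(drive_folder, keep=3)` right after each save caps storage at roughly three checkpoints' worth of space, regardless of how long training runs.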

What happens to training if my browser closes while using keep-alive scripts?

Closing the browser tab running the keep-alive script stops its JavaScript execution, so the idle timer resumes counting normally. Your training code continues running on Colab's servers but disconnects roughly 90 minutes after the keep-alive stops. This is why checkpointing is essential. When you realize the browser closed, immediately reopen the notebook, restart the keep-alive script, and check whether a disconnection occurred. If it did, restart the notebook and training auto-resumes from the last checkpoint.

Does Colab Pro+ background execution work reliably?

User reports indicate Pro+ background execution is unreliable in 2025. The feature promises to let you close the browser tab while training continues, but the implementation is inconsistent. Many users report training still disconnecting even with Pro+ when closing tabs. Don't depend on this feature. Use keep-alive scripts and checkpointing even with a Pro+ subscription. Google may improve background execution reliability in future updates, but for now treat it as experimental rather than dependable.

Building Reliable Training Workflows on Colab

You now understand the complete strategy for preventing Colab disconnections and protecting training progress. Successful Colab training combines multiple techniques in layered defense against disconnection causes.

Implement keep-alive JavaScript scripts to mitigate idle timeouts. Use the current working script variations shared in this guide and monitor Colab user communities for updated scripts when interface changes break existing ones. Run scripts responsibly at reasonable intervals (60-90 seconds) to minimize CAPTCHA triggers and account restriction risk.

Build robust checkpointing into every training project. Save complete training state including model weights, optimizer state, step counters, and training logs every 15-30 minutes to Google Drive. Implement auto-resume logic so restarting your notebook automatically continues from last checkpoint without manual intervention.
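The auto-resume logic described above can be sketched with the standard library. This is a minimal illustration, not a framework API: `save_state`, `load_state`, and the single-file `training_state.pkl` layout are hypothetical names, and the path under a mounted Google Drive is an assumption. Real training state would include model weights and optimizer state via your framework's own serialization rather than a bare pickle.

```python
import os
import pickle

# Assumed location; on Colab this would live under a mounted Drive,
# e.g. /content/drive/MyDrive/run1/training_state.pkl
STATE_PATH = "training_state.pkl"

def save_state(state, path=STATE_PATH):
    # Write to a temp file then rename, so a crash mid-write
    # never corrupts the previous good checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_state(path=STATE_PATH):
    """Return saved state, or a fresh state when no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

def train(total_steps, path=STATE_PATH):
    state = load_state(path)  # auto-resume: picks up where we left off
    for step in range(state["step"], total_steps):
        # ... real training work for this step goes here ...
        state["step"] = step + 1
        save_state(state, path)
    return state["step"]
```

Because `load_state` runs unconditionally at startup, restarting the notebook after any disconnection continues from the last saved step with no manual intervention, which is the core of the resumable-segment workflow.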

Consider a Colab Pro subscription for projects requiring 12-24 hour training sessions. The $10 monthly cost provides extended runtimes, better GPU availability, and increased resource limits, justifying the investment for serious projects. Evaluate Pro+ carefully, as most users find better value in regular Pro or dedicated GPU cloud providers at that price point.

Structure training in resumable segments that survive multiple disconnections. Maintain persistent logs across sessions to provide a unified view of training progress. Track cumulative training time and session metadata to support project planning and debugging.

Remember that Colab provides valuable free and low-cost GPU access but wasn't designed for long unattended training jobs. The platform excels at interactive development, experimentation, and training completing within 12-24 hour windows with proper checkpointing. For production training requiring guaranteed uptime and resources, consider dedicated alternatives.

While platforms like Apatero.com eliminate these disconnection challenges through stable managed infrastructure, mastering Colab techniques provides valuable cloud training experience and budget-conscious access to GPU resources. The skills you develop working within Colab's constraints transfer to understanding any cloud-based training environment.

Your layered approach combining keep-alive scripts, robust checkpointing, an appropriate tier subscription, and resumable workflow design transforms Colab from a frustrating source of disconnections into a reliable training platform suitable for serious AI projects within its intended use cases.

Ready to Create Your AI Influencer?

Join 115 students mastering ComfyUI and AI influencer marketing in our complete 51-lesson course.
