Inside Outpost’s computer vision architecture

A technical look at how we built a vision system that achieves 98%+ accuracy across millions of gate events and keeps getting better.

If you manage truck terminals, drop yards, or distribution facilities, your operations and security depend on accurate recognition at the gate: the equipment entering and exiting the property, the driver operating the vehicle, the condition of the truck and trailer, seal presence and integrity, all accurately timestamped and digitally recorded. When equipment reads are inaccurate or missed, the result can be a costly security incident, disputed damage claim, or simply unusable facility data.

Building a system that gets all of this right, reliably, across every weather condition and equipment type, day or night, was the first problem we had to solve at Outpost’s 30+ terminals and drop yards. When we set out to build Outpost Gate Automation, we knew that recognition accuracy would be the foundation our own operation depends on.

This post takes a technical deep dive into how our computer vision system works. We cover the model architecture, the training approach, edge-cloud hybrid processing, and the continuous improvement loop that keeps the system getting smarter over time.

The challenge: Real-world freight recognition

Gate recognition in freight logistics is harder than it might appear. Unlike the controlled environments where vision systems thrive, terminal gates present several challenges at once:

Marking variability. Trailer IDs might be stenciled, painted, magnetic placards, or faded remnants of previous markings. They appear in different fonts, sizes, colors, and positions. Some equipment has overlapping IDs from previous owners or rentals.

Environmental conditions. Gates operate 24/7 and must recognize equipment accurately in daylight, dusk, darkness, rain, snow, fog, and blinding sun. Cameras get dirty. Lighting changes constantly.

Equipment diversity. A busy terminal might see dry vans, reefers, flatbeds, tankers, intermodal containers, chassis, straight trucks, and all forms of specialized equipment, each with different marking conventions and visual characteristics.

Speed and angles. Vehicles don’t stop and pose for photos. They pass through gates at varying speeds, from multiple angles, often partially obscured by other equipment or infrastructure.

Degraded markings. Numbers get scratched, faded, covered in dirt, or obscured by damage. The system has to read whatever’s there, not what should be there.

Traditional optical character recognition (OCR) approaches struggle with this variability. They’re designed for clean, standardized text, not “JBHU 456789” off a rust-stained, dust-covered container that’s been in service for 15 years.

Our approach: Vision language models

Outpost’s recognition system is a custom-trained vision language model (VLM). Unlike traditional OCR, which looks for character patterns, VLMs understand images more holistically. They can identify what they’re looking at (say, a trailer, container chassis, or license plate), understand context (where IDs typically appear on different equipment types), and extract information even when individual characters are ambiguous.

Our VLM is specifically trained for freight and logistics applications. While general-purpose vision models have broad capabilities, they lack the domain knowledge needed to excel at gate recognition. Our model knows:

Standard Carrier Alpha Codes (SCACs) follow specific patterns, so similar-looking characters can be distinguished based on valid code structures.
Container numbers have check digits that can validate or correct uncertain readings.
Certain trailer ID formats are associated with specific carriers or lessors.
Equipment types have characteristic marking locations and conventions.

This domain knowledge is encoded in the model through training on hundreds of millions of images from Outpost’s network of terminal deployments.
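
To make the check digit point concrete, here is a minimal sketch of container number validation under the public ISO 6346 scheme (four letters, six digits, one check digit). It illustrates the rule itself, not how our model applies it internally; the function names are assumptions chosen for this example.

```python
import string


def _letter_values():
    """ISO 6346 letter values: start at 10 and skip multiples of 11."""
    values, n = {}, 10
    for ch in string.ascii_uppercase:
        if n % 11 == 0:  # 11, 22 and 33 are never assigned to a letter
            n += 1
        values[ch] = n
        n += 1
    return values


LETTER_VALUES = _letter_values()


def iso6346_check_digit(first_ten):
    """Return the check digit for the first 10 characters of a container number."""
    chars = first_ten.replace(" ", "").upper()
    letters, digits = chars[:4], chars[4:]
    if (len(chars) != 10
            or any(c not in LETTER_VALUES for c in letters)
            or any(c not in string.digits for c in digits)):
        raise ValueError("expected 4 letters followed by 6 digits")
    total = 0
    for position, ch in enumerate(chars):
        value = LETTER_VALUES[ch] if position < 4 else int(ch)
        total += value * (2 ** position)  # weights 1, 2, 4, ..., 512
    return (total % 11) % 10  # a remainder of 10 maps to check digit 0


def is_valid_container_number(number):
    """True if an 11-character reading is consistent with its own check digit."""
    chars = number.replace(" ", "").upper()
    if len(chars) != 11 or chars[10] not in string.digits:
        return False
    try:
        return iso6346_check_digit(chars[:10]) == int(chars[10])
    except ValueError:
        return False
```

If a reader is unsure whether one character is a Q or an O, running each candidate string through `is_valid_container_number` will usually rule one of them out, because a single substituted character almost always changes the expected check digit.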

Multi-model consensus architecture

We don’t rely on a single model for recognition. Instead, we run multiple specialized models in parallel, each optimized for different aspects of the recognition task:

Primary VLM. Our core vision language model handles holistic scene understanding and text extraction.

License plate specialist. This model is trained on North American license plate formats and optimized for the angles and conditions typical of gate cameras.

Container code specialist. This model focuses on intermodal container markings, including ISO codes, size and type indicators, and check digit validation.

Logo and marking detector. This model identifies carrier logos, DOT numbers, and regulatory markings that provide additional context for identification.

These models process the same images independently, then a consensus layer combines their outputs. When multiple models agree, we have high confidence in the result. When they disagree, the system flags the discrepancy for resolution through additional analysis or human review.

This architecture provides several benefits:

Higher accuracy. Consensus across models catches errors that any single model might make.

Graceful degradation. If one model struggles with particular conditions (unusual lighting, for example), others may still succeed.

Uncertainty quantification. Disagreement between models provides a natural confidence signal that drives escalation decisions.
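
A simple way to picture a consensus layer is a normalized majority vote with an agreement threshold. The sketch below is an illustration under stated assumptions, not our production logic: the `Reading` and `ConsensusResult` types, the model names, and the 0.75 threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    model: str  # hypothetical model name, e.g. "primary_vlm"
    value: str  # the ID this model extracted ("" if it found nothing)


@dataclass(frozen=True)
class ConsensusResult:
    value: Optional[str]  # winning ID, or None if no model produced one
    agreement: float      # share of non-empty votes backing the winner
    needs_review: bool    # True when the result should be escalated


def normalize(value):
    """Drop whitespace and uppercase so 'abcd 123456' equals 'ABCD123456'."""
    return "".join(value.split()).upper()


def combine_readings(readings, min_agreement=0.75):
    """Majority vote across models; weak or tied votes are flagged for review."""
    votes = Counter(normalize(r.value) for r in readings if r.value.strip())
    if not votes:
        return ConsensusResult(None, 0.0, True)
    ranked = votes.most_common()
    top_value, top_count = ranked[0]
    agreement = top_count / sum(votes.values())
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    return ConsensusResult(top_value, agreement, tied or agreement < min_agreement)
```

With three models where two read “ABCD 123456” and one reads “ABCD 123458”, the agreement is 2/3, which falls below the assumed threshold, so the event is flagged rather than accepted silently. The agreement ratio doubles as the confidence signal described above.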

Multimodal analysis that goes beyond vision

While vision is our primary modality, we’ve found that incorporating other data streams significantly improves recognition accuracy.

Audio integration. Our kiosks capture audio from driver interactions. When a driver verbally provides their trailer number to the AI voice agent or through a call, that information is correlated with the visual recognition result. If the driver says “ABCD 123456” and the vision system is torn between “ABCD 123456” and “ABCD 123458” because the last character is ambiguous, the audio confirmation resolves the ambiguity.

Driver-provided data. Through our mobile web interface, drivers can enter or confirm equipment information, creating another data stream that corroborates or questions visual recognition results.

Historical context. If we’ve seen a particular tractor-trailer combination before, that history informs current recognition. A tractor that always pulls the same dedicated trailer creates a strong prior that influences how we interpret ambiguous readings.

Appointment and dispatch data. Integration with TMS and dispatch systems tells us what equipment is expected. If dispatch data says trailer XYZ should be arriving, and we see something that could be XYZ or XVZ, the contextual data helps resolve the ambiguity.

This multimodal approach is particularly valuable for edge cases where pure vision might struggle but where other signals can tip the balance toward correct recognition.
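
As a rough illustration of how corroborating signals can tip the balance, the sketch below adds a fixed bonus to any vision candidate that another source also reports. The additive scoring, the 0.2 bonus, and the source names are assumptions made for this example, not a description of our actual fusion logic.

```python
def normalize(value):
    """Drop whitespace and uppercase so spoken and printed IDs compare equal."""
    return "".join(value.split()).upper()


def resolve_with_context(vision_candidates, context, bonus=0.2):
    """Pick the best ID from vision candidates, boosted by corroborating sources.

    vision_candidates: dict mapping candidate ID -> vision score
    context: dict mapping source name (e.g. "audio", "dispatch") -> claimed ID
    Only IDs the vision system already proposed can win, so context can
    break a tie but cannot introduce an ID the cameras never saw.
    """
    if not vision_candidates:
        raise ValueError("need at least one vision candidate")
    scores = {normalize(k): v for k, v in vision_candidates.items()}
    for claimed in context.values():
        key = normalize(claimed)
        if key in scores:
            scores[key] += bonus
    best = max(scores, key=scores.get)
    return best, scores


# Usage: vision slightly prefers the wrong final digit; the driver's spoken
# trailer number matches the runner-up and moves it into first place.
best, _ = resolve_with_context(
    {"ABCD123456": 0.48, "ABCD123458": 0.52},
    {"audio": "ABCD 123456"},
)
```

In this usage example `best` is `"ABCD123456"`. Without the audio entry, the vision score alone would have selected `"ABCD123458"`.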

Edge-cloud hybrid processing

Our processing architecture splits work between edge devices at the gate and cloud-based services:

Edge processing (Nvidia Jetson platform)

Each gate deployment includes edge computing hardware based on Nvidia Jetson modules. These devices handle:

Real-time video processing. Capturing frames, detecting vehicles, and tracking movement through the gate zone.

Initial recognition. Running lightweight models that can make immediate decisions. Is this a vehicle we recognize? Should the gate open?

Local caching and resilience. Storing recent events and maintaining operation even if cloud connectivity is temporarily unavailable.

Edge processing ensures low latency for time-sensitive decisions. When a truck approaches the gate, you can’t wait for a round-trip to the cloud to decide whether to open the barrier. The edge handles these real-time requirements.

Cloud processing

More computationally intensive work happens in our cloud infrastructure:

Deep model inference. Our most sophisticated models run in the cloud, where we have access to powerful graphics processing unit (GPU) clusters. These models analyze images in detail, catching what lighter edge models might miss.

Cross-reference and validation. Cloud services check recognition results against databases, validate formats, and correlate with external data sources.

Continuous learning. Model training and updates happen in the cloud, with improved models pushed to edge devices automatically.

Multi-site coordination. For customers with multiple terminals, cloud services provide unified visibility and cross-site analytics.

Synchronization and resilience

The edge-cloud architecture is designed for resilience. If cloud connectivity is interrupted:

Edge devices continue operating with cached models and local processing.
Events are logged locally with full video and image capture.
When connectivity returns, events sync automatically with cloud services.
No data is lost, and gate operations continue uninterrupted.

This hybrid approach gives us the best of both worlds: edge speed and resilience combined with cloud sophistication and scale.
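
The store-and-forward behavior described above can be sketched with a small local queue. This is a simplified illustration using SQLite from the Python standard library; the `EventBuffer` class, the table layout, and the `upload` callback are hypothetical, and a real edge device would also need to handle video files, retries, and disk limits.

```python
import json
import sqlite3


class EventBuffer:
    """Durable local queue: record events now, upload them when the link is up."""

    def __init__(self, path=":memory:"):
        # Pass a file path on a real device so events survive a restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "synced INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def record(self, event):
        """Persist an event locally and return its row id."""
        cur = self.db.execute(
            "INSERT INTO events (payload) VALUES (?)", (json.dumps(event),)
        )
        self.db.commit()
        return cur.lastrowid

    def pending(self):
        """Return unsynced events, oldest first, as (row id, event) pairs."""
        rows = self.db.execute(
            "SELECT id, payload FROM events WHERE synced = 0 ORDER BY id"
        ).fetchall()
        return [(row_id, json.loads(payload)) for row_id, payload in rows]

    def sync(self, upload):
        """Upload pending events in order; stop at the first connection failure."""
        sent = 0
        for row_id, event in self.pending():
            try:
                upload(event)
            except ConnectionError:
                break  # still offline; the remaining events stay queued
            self.db.execute("UPDATE events SET synced = 1 WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

Marking an event as synced only after its upload succeeds means an interrupted sync can resend an event but never drop one, so the receiving service should tolerate duplicates.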

The training data advantage

Machine learning models are only as good as their training data. This is where Outpost’s position as a truck terminal operator (not just a software vendor) provides a significant advantage.

We operate gate automation across 40+ terminals (at the time of this article), processing millions of gate events. Every event generates training data: images, recognition results, and (critically) human validation of those results.

Our in-house remote operations team reviews recognition outputs, correcting errors and confirming successes. These corrections feed directly into our training pipeline:

Active learning. The system identifies cases of uncertainty and prioritizes those for human review, focusing effort on the examples that will most improve the model.

Error analysis. When the system makes mistakes, we analyze why. Was it a lighting issue, an unusual marking format, or a new equipment type? The answers tell us what data to collect and which models to improve.

Continuous retraining. We don’t train models once and deploy them forever. Our models are continuously updated with new data, adapting to new equipment types, new marking conventions, and new edge cases as they emerge.

The result: More deployments generate more data, which improves models, which improves future deployments. Customers benefit from the collective learning across our network.
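
One common way to prioritize uncertain cases is least-confidence sampling: send the events the system is least sure about to reviewers first. The sketch below shows that generic technique under the assumption that each event carries a single confidence score between 0 and 1; the function name and the review budget are hypothetical.

```python
import heapq


def select_for_review(events, budget):
    """Return the IDs of the `budget` least-confident events, least confident first.

    events: iterable of (event_id, confidence) pairs, confidence in [0, 1]
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    lowest = heapq.nsmallest(budget, events, key=lambda item: item[1])
    return [event_id for event_id, _ in lowest]


# Usage: with room to review two events, the two shakiest reads are chosen.
queue = select_for_review(
    [("evt-1", 0.99), ("evt-2", 0.55), ("evt-3", 0.80), ("evt-4", 0.61)],
    budget=2,
)
```

Here `queue` is `["evt-2", "evt-4"]`. High-confidence reads such as `evt-1` are left out, which keeps reviewer time focused on the examples most likely to teach the model something new.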

Accuracy metrics: How we measure success

We track recognition performance across several dimensions:

Capture rate. What percentage of gate events result in successful identification? Our target is 99%+ capture, meaning we get a reading on virtually every vehicle that passes through.

Accuracy rate. Of the identifications we make, what percentage are correct? We consistently achieve 98%+ accuracy across our deployment base. This is a fully validated score: any error, and any ID that fails cross-validation in our multi-model, multimodal consensus architecture, counts as a miss. That makes it a more rigorous and trustworthy benchmark than the self-reported claims typical in our industry.

Latency. How quickly do we produce results? For stopless entry decisions, we target sub-second response times at the edge.

Confidence calibration. When the system says it’s 95% confident, is it right 95% of the time? Well-calibrated confidence enables appropriate escalation decisions.

These metrics are monitored continuously across our deployment base, with automatic alerting when performance degrades at any site.
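
Confidence calibration is often summarized with expected calibration error (ECE): group predictions into confidence bins, then compare each bin’s average confidence with its observed accuracy. The sketch below shows that standard metric; the ten equal-width bins and the function name are assumptions for illustration, not a description of our monitoring stack.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: list of floats in [0, 1], one per identification
    correct: list of bools, True where the identification was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 joins the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A system that reports 95% confidence on 20 reads and gets 19 of them right scores an ECE of approximately zero. One that reports 90% confidence and is wrong every time scores about 0.9.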

Our roadmap for what’s next

Computer vision capabilities continue to rapidly advance, and we’re investing in several areas:

Damage detection refinement. Moving beyond detecting obvious damage to identifying subtle issues like hairline cracks, early rust, and component wear that might indicate maintenance needs.

Load status inference. Can we determine whether a trailer is loaded or empty based on visual cues? Early results are promising.

Predictive maintenance signals. Equipment condition observations over time could predict when maintenance will be needed, enabling proactive intervention.

Real-time video understanding. Moving from frame-by-frame analysis to true video understanding, capturing motion patterns and behaviors that static images miss.

The foundation we’ve built (the model architecture, training pipeline, and edge-cloud infrastructure) positions us to incorporate these advances as they mature.

Conclusion

Building a vision system that works reliably in the demanding environment of freight terminal gates required rethinking traditional approaches. Our combination of domain-specific vision language models, consensus across multiple models and modalities, and continuous learning from operational data has produced a system that achieves the accuracy needed to automate gate operations with confidence.

For IT leaders evaluating gate automation, the underlying technical architecture matters. The questions to ask are: How is the system trained? What happens when it’s uncertain? How does it improve over time? The answers reveal whether you’re getting a system that will perform reliably in production or a demo that falls apart when confronted with real-world complexity.

Want to learn more about our technical approach?

Download our Technical Architecture Whitepaper for detailed specifications, or schedule a technical deep dive with our engineering team.