robbyant/lingbot-map


Add pipeline tag, license metadata and improve model card

by nielsr HF Staff - opened Apr 17

base: refs/heads/main ← from: refs/pr/1


README.md +13 -52


@@ -1,3 +1,8 @@
 <div align="center">
   <img src="assets/teaser.png" width="100%">
@@ -9,7 +14,7 @@ Robbyant Team
 <div align="center">
-[![Paper](https://img.shields.io/static/v1?label=Paper&message=arXiv&color=red&logo=arxiv)](https://arxiv.org/abs/2604.14141)
 [![PDF](https://img.shields.io/static/v1?label=Paper&message=PDF&color=red&logo=adobeacrobatreader)](lingbot-map_paper.pdf)
 [![Project](https://img.shields.io/badge/Project-Website-blue)](https://technology.robbyant.com/lingbot-map)
 [![HuggingFace](https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Model&message=HuggingFace&color=orange)](https://huggingface.co/robbyant/lingbot-map)
@@ -24,8 +29,9 @@ https://github.com/user-attachments/assets/fe39e095-af2c-4ec9-b68d-a8ba97e505ab
 ### 🗺️ Meet LingBot-Map! We've built a feed-forward 3D foundation model for streaming 3D reconstruction! 🏗️🌍
-LingBot-Map has focused on:
 - **Geometric Context Transformer**: Architecturally unifies coordinate grounding, dense geometric cues, and long-range drift correction within a single streaming framework through anchor context, pose-reference window, and trajectory memory.
 - **High-Efficiency Streaming Inference**: A feed-forward architecture with paged KV cache attention, enabling stable inference at ~20 FPS on 518×378 resolution over long sequences exceeding 10,000 frames.
 - **State-of-the-Art Reconstruction**: Superior performance on diverse benchmarks compared to both existing streaming and iterative optimization-based approaches.
@@ -49,8 +55,6 @@ conda activate lingbot-map
 pip install torch==2.9.1 torchvision==0.24.1 --index-url https://download.pytorch.org/whl/cu128
 ```
-> For other CUDA versions, see [PyTorch Get Started](https://pytorch.org/get-started/locally/).
 **3. Install lingbot-map**
 ```bash
@@ -66,21 +70,6 @@ FlashInfer provides paged KV cache attention for efficient streaming inference:
 pip install flashinfer-python -i https://flashinfer.ai/whl/cu128/torch2.9/
 ```
-> For other CUDA/PyTorch combinations, see [FlashInfer installation](https://docs.flashinfer.ai/installation.html).
-> If FlashInfer is not installed, the model falls back to SDPA (PyTorch native attention) via `--use_sdpa`.
-**5. Visualization dependencies (optional)**
-```bash
-pip install -e ".[vis]"
-```
-# 📦 Model Download
-| Model Name | Huggingface Repository | ModelScope Repository | Description |
-| :--- | :--- | :--- | :--- |
-| lingbot-map | [robbyant/lingbot-map](https://huggingface.co/robbyant/lingbot-map) | [Robbyant/lingbot-map](https://www.modelscope.cn/models/Robbyant/lingbot-map) | Base model checkpoint (4.63 GB) |
 # 🎬 Demo
 ### Streaming Inference from Images
@@ -99,37 +88,23 @@ python demo.py --model_path /path/to/checkpoint.pt \
 ### Streaming with Keyframe Interval
-Use `--keyframe_interval` to reduce KV cache memory by only keeping every N-th frame as a keyframe. Non-keyframe frames still produce predictions but are not stored in the cache. This is useful for long sequences
-which excesses 320 frames.
 ```bash
 python demo.py --model_path /path/to/checkpoint.pt \
     --image_folder /path/to/images/ --keyframe_interval 6
 ```
-### Windowed Inference (for long sequences, >3000 frames)
-```bash
-python demo.py --model_path /path/to/checkpoint.pt \
-    --video_path video.mp4 --fps 10 \
-    --mode windowed --window_size 64
-```
 ### Sky Masking
-Sky masking uses an ONNX sky segmentation model to filter out sky points from the reconstructed point cloud, which improves visualization quality for outdoor scenes.
 **Setup:**
 ```bash
-# Install onnxruntime (required)
-pip install onnxruntime        # CPU
-# or
-pip install onnxruntime-gpu    # GPU (faster for large image sets)
 ```
-The sky segmentation model (`skyseg.onnx`) will be automatically downloaded from [HuggingFace](https://huggingface.co/JianyuanWang/skyseg/resolve/main/skyseg.onnx) on first use.
 **Usage:**
 ```bash
@@ -137,15 +112,6 @@ python demo.py --model_path /path/to/checkpoint.pt \
     --image_folder /path/to/images/ --mask_sky
 ```
-Sky masks are cached in `<image_folder>_sky_masks/` so subsequent runs skip regeneration.
-### Without FlashInfer (SDPA fallback)
-```bash
-python demo.py --model_path /path/to/checkpoint.pt \
-    --image_folder /path/to/images/ --use_sdpa
-```
 # 📜 License
 This project is released under the Apache License 2.0. See [LICENSE](LICENSE.txt) file for details.
@@ -163,12 +129,7 @@ This project is released under the Apache License 2.0. See [LICENSE](LICENSE.txt
 # ✨ Acknowledgments
-We thank Shangzhan Zhang, Jianyuan Wang, Yudong Jin, Christian Rupprecht, and Xun Cao for their helpful discussions and support.
-This work builds upon several excellent open-source projects:
 - [VGGT](https://github.com/facebookresearch/vggt)
 - [DINOv2](https://github.com/facebookresearch/dinov2)
-- [Flashinfer](https://github.com/flashinfer-ai/flashinfer)
----

+---
+license: apache-2.0
+pipeline_tag: image-to-3d
+---
 <div align="center">
   <img src="assets/teaser.png" width="100%">
 <div align="center">
+[![Paper](https://img.shields.io/static/v1?label=Paper&message=arXiv&color=red&logo=arxiv)](https://huggingface.co/papers/2604.14141)
 [![PDF](https://img.shields.io/static/v1?label=Paper&message=PDF&color=red&logo=adobeacrobatreader)](lingbot-map_paper.pdf)
 [![Project](https://img.shields.io/badge/Project-Website-blue)](https://technology.robbyant.com/lingbot-map)
 [![HuggingFace](https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Model&message=HuggingFace&color=orange)](https://huggingface.co/robbyant/lingbot-map)
 ### 🗺️ Meet LingBot-Map! We've built a feed-forward 3D foundation model for streaming 3D reconstruction! 🏗️🌍
+LingBot-Map is a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture.
+Key features include:
 - **Geometric Context Transformer**: Architecturally unifies coordinate grounding, dense geometric cues, and long-range drift correction within a single streaming framework through anchor context, pose-reference window, and trajectory memory.
 - **High-Efficiency Streaming Inference**: A feed-forward architecture with paged KV cache attention, enabling stable inference at ~20 FPS on 518×378 resolution over long sequences exceeding 10,000 frames.
 - **State-of-the-Art Reconstruction**: Superior performance on diverse benchmarks compared to both existing streaming and iterative optimization-based approaches.
 pip install torch==2.9.1 torchvision==0.24.1 --index-url https://download.pytorch.org/whl/cu128
 ```
 **3. Install lingbot-map**
 ```bash
 pip install flashinfer-python -i https://flashinfer.ai/whl/cu128/torch2.9/
 ```
 # 🎬 Demo
 ### Streaming Inference from Images
 ### Streaming with Keyframe Interval
+Use `--keyframe_interval` to reduce KV cache memory by only keeping every N-th frame as a keyframe.
 ```bash
 python demo.py --model_path /path/to/checkpoint.pt \
     --image_folder /path/to/images/ --keyframe_interval 6
 ```
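
As a rough illustration of what "keeping every N-th frame as a keyframe" means for cache size, the Python sketch below lists which frame indices would be retained. It assumes keyframes fall on indices that are multiples of N, starting at frame 0. The function name `keyframe_indices` is hypothetical and is not part of lingbot-map, and the selection logic in `demo.py` may differ.

```python
def keyframe_indices(num_frames: int, keyframe_interval: int) -> list[int]:
    """Return the frame indices that would be kept as keyframes.

    Assumption (not confirmed by the README): frame 0 is a keyframe and
    every keyframe_interval-th frame after it is kept as well.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be a positive integer")
    return [i for i in range(num_frames) if i % keyframe_interval == 0]


# With an interval of 6, a 20-frame sequence keeps 4 frames in the cache.
kept = keyframe_indices(20, 6)
print(kept)       # [0, 6, 12, 18]
print(len(kept))  # 4
```

Under this assumption the number of cached frames shrinks by roughly a factor of N, which is where the memory saving comes from.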
 ### Sky Masking
+Sky masking filters out sky points from the reconstructed point cloud.
 **Setup:**
 ```bash
+pip install onnxruntime
 ```
 **Usage:**
 ```bash
     --image_folder /path/to/images/ --mask_sky
 ```
 # 📜 License
 This project is released under the Apache License 2.0. See [LICENSE](LICENSE.txt) file for details.
 # ✨ Acknowledgments
+This work builds upon several open-source projects:
 - [VGGT](https://github.com/facebookresearch/vggt)
 - [DINOv2](https://github.com/facebookresearch/dinov2)
+- [Flashinfer](https://github.com/flashinfer-ai/flashinfer)
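
The main change in this PR is the metadata block added at the top of README.md (`license` and `pipeline_tag` between two `---` lines). As a minimal sketch of how such a block can be read back, the Python snippet below extracts flat `key: value` pairs using only the standard library. The helper name `read_front_matter` and the `README_HEAD` string are hypothetical, and this is not a full YAML parser: it assumes one flat `key: value` pair per line, as in the block added here.

```python
def read_front_matter(text: str) -> dict[str, str]:
    """Parse a flat 'key: value' front-matter block delimited by '---' lines.

    Returns an empty dict when the text does not start with '---' or when
    the closing '---' is missing. Nested YAML and lists are not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter found


# Hypothetical excerpt mirroring the lines this PR adds to README.md.
README_HEAD = """---
license: apache-2.0
pipeline_tag: image-to-3d
---
<div align="center">
"""

meta = read_front_matter(README_HEAD)
print(meta)  # {'license': 'apache-2.0', 'pipeline_tag': 'image-to-3d'}
```

For real model cards, a proper YAML parser is the safer choice, since card metadata can contain lists and nested fields.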