[1] HJELM R D, FEDOROV A, LAVOIE-MARCHILDON S,
et al. Learning deep representations by mutual information
estimation and maximization[EB/OL]. [2024-12-19].
https://arxiv.org/abs/1808.06670.
[2] 张丽英,裴韬,陈宜金,等.基于街景图像的城市环境评价
研究综述[J].地球信息科学学报, 2019, 21(1):13.
ZHANG L Y, PEI T, CHEN Y J, et al. A Review of Urban
Environmental Assessment based on Street View
Images[J]. Journal of Geo-Information Science, 2019,
21(1): 13. (in Chinese)
[3] DOERSCH C, SINGH S, GUPTA A, et al. What makes
Paris look like Paris?[J]. Communications of the ACM,
2015, 58(12): 103-110.
[4] NGUYEN Q C, SAJJADI M, MCCULLOUGH M, et al.
Neighbourhood looking glass: 360º automated
characterisation of the built environment for
neighbourhood effects research[J]. J Epidemiol
Community Health, 2018, 72(3): 260-266.
[5] ZHANG F, ZHANG D, LIU Y, et al. Representing place
locales using scene elements[J]. Computers, Environment
and Urban Systems, 2018, 71: 153-164.
[6] DEWI C, CHEN R C, ZHUANG Y C, et al. Image
Enhancement Method Utilizing YOLO Models to Recognize
Road Markings at Night[J]. IEEE Access, 2024.
[7] ZHAO H, SHI J, QI X, et al. Pyramid scene parsing
network[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017: 2881-2890.
[8] LIANG X, ZHAO T, BILJECKI F. Revealing
spatio-temporal evolution of urban visual environments
with street view imagery[J]. Landscape and Urban
Planning, 2023, 237: 104802.
[9] XU S, ZHANG C, FAN L, et al. AddressCLIP: Empowering
vision-language models for city-wide image address
localization[C]//European Conference on Computer Vision.
Springer, Cham, 2025: 76-92.
[10] NGIAM J, KHOSLA A, KIM M, et al. Multimodal deep
learning[C]//Proceedings of the International Conference
on Machine Learning (ICML). 2011: 689-696.
[11] HE W, MA H, LI S, et al. Using augmented small
multimodal models to guide large language models for
multimodal relation extraction[J]. Applied Sciences,
2023, 13(22): 12208.
[12] OUYANG T, ZHANG X, HAN Z, et al. HealthCLIP:
Depression rate prediction using health-related features in
satellite and street view images[C]//Companion
Proceedings of the ACM on Web Conference 2024. 2024:
1142-1145.
[13] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4
technical report[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2303.08774.
[14] LIU H, LI C, WU Q, et al. Visual instruction tuning[J].
Advances in Neural Information Processing Systems, 2024,
36.
[15] ZHANG Y, ZHANG F, CHEN N. Migratable urban street
scene sensing method based on vision language pre-trained
model[J]. International Journal of Applied Earth
Observation and Geoinformation, 2022, 113: 102989.
[16] JI Y, GAO S. Evaluating the effectiveness of large
language models in representing textual descriptions of
geometry and spatial relations[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2307.03678.
[17] CHIANG W L, LI Z, LIN Z, et al. Vicuna: An open-source
chatbot impressing GPT-4 with 90%* ChatGPT
quality[EB/OL]. (2023-03-30) [2024-12-19].
https://lmsys.org/blog/2023-03-30-vicuna/.
[18] RADFORD A, KIM J W, HALLACY C, et al. Learning
transferable visual models from natural language
supervision[C]//International Conference on Machine
Learning. PMLR, 2021: 8748-8763.
[19] HU E J, SHEN Y, WALLIS P, et al. LoRA: Low-Rank
Adaptation of Large Language Models[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2106.09685.
[20] BAI Y, ZHAO Y, SHAO Y, et al. Deep learning in different
remote sensing image categories and applications: status
and prospects[J]. International Journal of Remote Sensing,
2022, 43(5): 1800-1847.
[21] 徐永智.基于街景影像的建筑物底部轮廓提取[D].北京:
北京建筑大学,2017.
XU Y Z. Extracting Building Footprints from Digital
Measurable Images[D]. Beijing: Beijing University of
Civil Engineering and Architecture, 2017.(in Chinese)
[22] LIU J, YANG J, BATRA D, et al. Neural baby
talk[C]//Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2018: 7219-7228.
[23] CAMPBELL A, BOTH A, SUN Q C. Detecting and
mapping traffic signs from Google Street View images
using deep learning and GIS[J]. Computers, Environment
and Urban Systems, 2019, 77: 101350.
[24] QIU W, ZHANG Z, LIU X, et al. Subjective or objective
measures of street environment, which are more effective
in explaining housing prices?[J]. Landscape and Urban
Planning, 2022, 221: 104358.
[25] CHEN L C, PAPANDREOU G, SCHROFF F, et al.
Rethinking atrous convolution for semantic image
segmentation[EB/OL]. [2024-12-19].
https://arxiv.org/abs/1706.05587.
[26] SHIHAB I F, ALVEE B I, BHAGAT S R, et al. Precise and
Robust Sidewalk Detection: Leveraging Ensemble
Learning to Surpass LLM Limitations in Urban
Environments[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2405.14876.
[27] REDMON J, FARHADI A. YOLO9000: better, faster,
stronger[C]//Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2017:
7263-7271.
[28] CORDTS M, OMRAN M, RAMOS S, et al. The
Cityscapes Dataset for Semantic Urban Scene
Understanding[C]//Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 2016:
3213-3223.
[29] ZHOU B, ZHAO H, PUIG X, et al. Semantic
understanding of scenes through the ADE20K dataset[J].
International Journal of Computer Vision, 2019, 127:
302-321.
[30] ZHANG Y, LIU P, BILJECKI F. Knowledge and topology:
A two-layer spatially dependent graph neural network to
identify urban functions with time-series street view
image[J]. ISPRS Journal of Photogrammetry and Remote
Sensing, 2023, 198: 153-168.
[31] WU M, HUANG Q, GAO S, et al. Mixed land use
measurement and mapping with street view images and
spatial context-aware prompts via zero-shot multimodal
learning[J]. International Journal of Applied Earth
Observation and Geoinformation, 2023, 125: 103591.
[32] ZHAO Y, ZHONG E, YUAN C, et al. TG-LMM:
Enhancing Medical Image Segmentation Accuracy through
Text-Guided Large Multi-Modal Model[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2409.03412.
[33] PICARD C, EDWARDS K M, DORIS A C, et al. From
Concept to Manufacturing: Evaluating Vision-Language
Models for Engineering Design[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2311.12668.
[34] YANG Y, WANG S, LI D, et al. GeoLocator: A
Location-Integrated Large Multimodal Model (LMM) for
Inferring Geo-Privacy[J]. Applied Sciences, 2024, 14(16):
7091.
[35] JAYATI S, CHOI E, BURTON H, et al. Leveraging Large
Multimodal Models to Augment Image-Based Building
Damage Assessment[C]//Proceedings of the 7th ACM
SIGSPATIAL International Workshop on AI for
Geographic Knowledge Discovery. 2024: 79-85.
[36] HAO X, CHEN W, YAN Y, et al. UrbanVLP:
Multi-Granularity Vision-Language Pretraining for Urban
Region Profiling[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2403.16831.
[37] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2010.11929.
[38] CHEN X, FAN H, GIRSHICK R, et al. Improved
Baselines with Momentum Contrastive Learning[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2003.04297.
[39] TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al.
MLP-Mixer: An all-MLP Architecture for Vision[J].
Advances in Neural Information Processing Systems, 2021,
34: 24261-24272.
[40] AYUPOV S, CHIRKOVA N. Parameter-efficient
finetuning of transformers for source code[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2212.05901.
[41] ZHANG Z, SABUNCU M. Generalized cross-entropy loss
for training deep neural networks with noisy labels[J].
Advances in Neural Information Processing Systems, 2018,
31.
[42] 杨冬菊,黄俊涛.基于大语言模型的中文科技文献标注方
法[J].计算机工程, 2024, 50(9):113-120.
YANG D J, HUANG J T. Chinese Scientific Literature
Annotation Method Based on Large Language Model[J].
Computer Engineering, 2024, 50(9): 113-120.(in Chinese)
[43] YASEEN M. What is YOLOv9: An In-Depth
Exploration of the Internal Features of the
Next-Generation Object Detector[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2409.07813.
[44] KHANAM R, HUSSAIN M. YOLOv11: An Overview of
the Key Architectural Enhancements[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2410.17725.
[45] WU Z, LIU X, GILITSCHENSKI I. EventCLIP: Adapting
CLIP for Event-based Object Recognition[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2306.06354.
[46] CHEN Z, WANG W, CAO Y, et al. Expanding
Performance Boundaries of Open-Source Multimodal
Models with Model, Data, and Test-Time Scaling[EB/OL].
[2024-12-19]. https://arxiv.org/abs/2412.05271.
[47] DUBEY A, JAUHRI A, PANDEY A, et al. The Llama 3
Herd of Models[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2407.21783.
[48] WANG P, BAI S, TAN S, et al. Qwen2-VL: Enhancing
Vision-Language Model's Perception of the World at Any
Resolution[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2409.12191.
[49] AGRAWAL P, ANTONIAK S, HANNA E B, et al. Pixtral
12B[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2410.07073.
[50] TEAM G, GEORGIEV P, LEI V I, et al. Gemini 1.5:
Unlocking multimodal understanding across millions of
tokens of context[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2403.05530.
[51] CHIANG W L, ZHENG L, SHENG Y, et al. Chatbot
Arena: An Open Platform for Evaluating LLMs by Human
Preference[EB/OL]. [2024-12-19].
https://arxiv.org/abs/2403.04132.