A fragment-based spatio-temporal video analysis method for detecting anomalous violent events
Abstract
In this work, we address the problem of violent event detection and classification in video streams under realistic computational constraints. Many safety-critical events, such as violent interactions or abnormal behavior, are characterized by short-term, spatially localized motion patterns, while the majority of video content remains static or irrelevant. Conventional deep learning approaches typically process full video frames or dense spatio-temporal representations, which leads to high computational cost and inefficient use of resources. We propose a fragment-based spatio-temporal video analysis method inspired by principles of video coding. Each video frame is divided into fragments, and motion activity is estimated using dense optical flow between consecutive frames. Only fragments exhibiting significant temporal changes are selected for further processing, while static regions are suppressed at an early stage. The fragment size is adaptively adjusted according to local motion intensity, allowing finer spatial resolution in dynamic regions and a coarser representation in static areas. The selected fragments form a compact representation that is subsequently used for event classification via lightweight temporal aggregation. By reducing spatio-temporal redundancy prior to feature extraction, the proposed method significantly lowers computational complexity while preserving discriminative motion cues. The method is evaluated on the UBI-Fights dataset, with the training data additionally augmented using the Video Fight Detection (VFD2000) dataset. Experimental results demonstrate that the method achieves competitive performance, with an area under the receiver operating characteristic curve of up to 0.72, an area under the precision-recall curve of up to 0.63, and a binary F1-score of up to 0.60, while maintaining efficient inference speed.
These results indicate a favorable trade-off between accuracy and efficiency compared to dense frame-based baselines, making the method suitable for real-time and resource-constrained video analysis systems.
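The fragment selection stage described in the abstract can be illustrated with a minimal sketch. The function below splits a frame into square fragments, scores each by its mean temporal change, drops static fragments early, and refines highly dynamic fragments into smaller quadrants. Simple frame differencing stands in for the dense optical flow used in the paper, and all names, sizes, and thresholds (`base_size`, `threshold`, the 4x refinement factor) are illustrative assumptions rather than the authors' actual parameters.

```python
import numpy as np

def select_active_fragments(prev_frame, frame, base_size=32, threshold=2.0):
    """Keep only fragments with significant temporal change.

    Returns a list of (x, y, size, score) tuples. Frame differencing is
    used here as a lightweight stand-in for dense optical flow; the
    threshold and refinement rule are illustrative assumptions.
    """
    # Per-pixel temporal change magnitude (proxy for flow magnitude).
    motion = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    h, w = motion.shape
    selected = []
    for y in range(0, h, base_size):
        for x in range(0, w, base_size):
            block = motion[y:y + base_size, x:x + base_size]
            score = float(block.mean())
            if score < threshold:
                continue  # static region: suppressed at an early stage
            if score > 4 * threshold and base_size >= 16:
                # Highly dynamic fragment: refine into quadrants for
                # finer spatial resolution (adaptive fragment size).
                half = base_size // 2
                for dy in (0, half):
                    for dx in (0, half):
                        sub = motion[y + dy:y + dy + half,
                                     x + dx:x + dx + half]
                        if sub.size and sub.mean() >= threshold:
                            selected.append((x + dx, y + dy, half,
                                             float(sub.mean())))
            else:
                selected.append((x, y, base_size, score))
    return selected

# Usage: synthetic 64x64 frames with motion only in the top-left block.
prev = np.zeros((64, 64), dtype=np.uint8)
cur = prev.copy()
cur[0:32, 0:32] = 200  # strong change confined to one base fragment
frags = select_active_fragments(prev, cur)
```

In this toy example only the dynamic top-left fragment survives, and because its score far exceeds the threshold it is split into four 16x16 sub-fragments, while the three static fragments are discarded before any feature extraction would run.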

