This paper investigates acoustic modeling for recognition of bird species from audio field recordings. First, the acoustic scene is decomposed into isolated segments, corresponding to detected sinusoids. Each segment is represented by a sequence of the frequency and normalized magnitude values of the sinusoid. The temporal evolution of these features is modeled using hidden Markov models (HMMs). A novel method for unsupervised modeling of individual bird vocalization elements is proposed. The element models are initialized using HMM-based clustering and then further trained using an iterative maximum likelihood label re-assignment procedure. State duration modeling, performed in a post-recognition stage, is explored. Finally, a hybrid deep neural network-hidden Markov model (DNN-HMM) is developed. The developed acoustic models are employed for bird species identification, detection of specific species, and recognition of multiple bird species vocalizing in a given recording. The detection system employs score normalization. Recognition of multiple bird species is performed by maximizing the likelihood of a set of segments on a subset of bird species models, with a penalty based on the Bayesian information criterion. Experimental evaluations are performed on more than 37 h of sound field recordings, containing vocalizations of 48 bird species, plus more than 16 h of non-bird sound recordings. Using 3 s of the detected signal, the best system achieved an identification accuracy of 98.7%, detection with an equal error rate of 2.7%, and recognition accuracies of 97.3% and 95.4% when vocalizations of multiple bird species are present, with the number of bird species known and estimated, respectively.
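To illustrate the multiple-species recognition step, the following is a minimal sketch of likelihood maximization over a subset of species models with a BIC-style penalty. It is not the implementation evaluated in this paper. The function name `select_species`, the parameter `penalty_weight`, the greedy forward search, the per-segment best-model scoring, and the exact form of the penalty (half the logarithm of the number of segments per added species) are all assumptions made for illustration.

```python
import math


def select_species(loglik, penalty_weight=1.0):
    """Greedily choose a subset of species models (hypothetical helper).

    loglik[i][k] is the log-likelihood of segment i under species model k.
    Returns the sorted indices of the selected species.
    """
    n_segments = len(loglik)
    n_species = len(loglik[0])
    # Assumed BIC-style penalty: every added species costs a fixed amount
    # that grows with the logarithm of the number of segments.
    cost_per_species = 0.5 * penalty_weight * math.log(n_segments)

    def score(subset):
        # Assumption: each segment is scored by its best model in the subset.
        fit = sum(max(row[k] for k in subset) for row in loglik)
        return fit - cost_per_species * len(subset)

    selected = []
    best = -math.inf
    while len(selected) < n_species:
        candidates = [k for k in range(n_species) if k not in selected]
        top_score, top_k = max((score(selected + [k]), k) for k in candidates)
        if top_score <= best:
            # Adding another species no longer pays for its penalty.
            break
        selected.append(top_k)
        best = top_score
    return sorted(selected)


# Toy example: segments 0-1 fit species 0, segments 2-3 fit species 2.
toy = [
    [-1.0, -10.0, -12.0],
    [-2.0, -11.0, -10.0],
    [-12.0, -9.0, -1.0],
    [-11.0, -10.0, -2.0],
]
print(select_species(toy))
```

In this sketch the penalty is what estimates the number of species: the search stops as soon as an additional model improves the fit by less than its cost. When the number of species is known, the loop could instead run for a fixed number of iterations.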