Air pollution has altered the Earth's radiation balance, disturbed ecosystems, and increased human morbidity and mortality. Accordingly, a full-coverage, high-resolution air pollutant dataset with timely updates and long-term historical records is essential to support both research and environmental management. Here, for the first time, we develop a near real-time air pollutant database known as Tracking Air Pollution in China (TAP, tapdata.org) that combines information from multiple data sources, including ground measurements, satellite retrievals, dynamically updated emission inventories, operational chemical transport model simulations and other ancillary data. The daily full-coverage PM2.5 dataset at a spatial resolution of 10 km is our first near real-time product. The TAP PM2.5 concentrations are estimated with a two-stage machine learning model coupled with the synthetic minority oversampling technique and a tree-based gap-filling method. Our model achieves an average out-of-bag cross-validation R2 of 0.83 across years, which is comparable to other studies, while improving performance at high pollution levels and filling the gaps caused by missing AOD on a daily scale. The full coverage and near real-time updates of the daily PM2.5 data allow us to track the day-to-day variations in PM2.5 concentrations over China in a timely manner. The long-term records of PM2.5 data since 2000 will also support policy assessments and health impact studies. The TAP PM2.5 data are publicly available through our website for sharing with the research and policy communities.
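As a rough illustration of the kind of pipeline described above, the sketch below uses synthetic data, scikit-learn random forests, and a simple duplication-based oversampling as a stand-in for SMOTE; none of these choices are claimed to match the TAP implementation. Stage 1 gap-fills missing AOD from always-available predictors, and stage 2 predicts PM2.5 from the gap-filled features.

```python
# Minimal sketch (illustrative assumptions, not the TAP model).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 6))                       # meteorology, emissions, CTM output, etc.
aod = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.1, n)
pm25 = 35 + 50 * aod + 10 * X[:, 2] + rng.normal(0, 5, n)
missing = rng.random(n) < 0.4                     # ~40% of satellite AOD retrievals missing

# Stage 1: tree-based gap filling of missing AOD from the other predictors.
gap_filler = RandomForestRegressor(n_estimators=100, random_state=0)
gap_filler.fit(X[~missing], aod[~missing])
aod[missing] = gap_filler.predict(X[missing])

# Crude oversampling of high-pollution days (a stand-in for SMOTE-style
# synthetic oversampling of the sparse high-concentration regime).
high = pm25 > np.percentile(pm25, 90)
idx = np.concatenate([np.arange(n), np.repeat(np.where(high)[0], 3)])
features = np.column_stack([X, aod])

# Stage 2: full-coverage PM2.5 prediction from gap-filled AOD and the other predictors.
X_tr, X_te, y_tr, y_te = train_test_split(features[idx], pm25[idx], random_state=0)
model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X_tr, y_tr)
print("out-of-bag R2:", round(model.oob_score_, 3))
```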
A large number of datasets for organ segmentation exist that are partially annotated and sequentially constructed. A typical dataset is constructed at a certain time by curating medical images and annotating the organs of interest. In other words, new datasets with annotations of new organ categories are built over time. To unleash the potential behind these partially labeled, sequentially constructed datasets, we propose to incrementally learn a multi-organ segmentation model. In each incremental learning (IL) stage, we lose access to the previous data and annotations, whose knowledge is assumed to be captured by the current model, and gain access to a new dataset with annotations of new organ categories, from which we learn to update the organ segmentation model to include the new organs. While IL is notorious for its `catastrophic forgetting' weakness in the context of natural image analysis, we experimentally discover that such a weakness mostly disappears for CT multi-organ segmentation. To further stabilize the model performance across the IL stages, we introduce a light memory module and loss functions that constrain the representations of different categories in feature space, aggregating feature representations of the same class and separating feature representations of different classes. Extensive experiments on five open-source datasets are conducted to illustrate the effectiveness of our method.
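The following toy sketch shows what such aggregation/separation losses could look like; it is an assumption for illustration, not the paper's exact memory module or loss design. It keeps one prototype vector per organ class, pulls features toward their class prototype, and pushes prototypes of different classes at least a margin apart.

```python
# Illustrative prototype-based aggregation and separation losses (assumed form).
import torch
import torch.nn.functional as F

def prototype_losses(features, labels, prototypes, margin=1.0):
    """features: (N, D) voxel embeddings; labels: (N,) organ ids;
    prototypes: (C, D) learnable per-class memory."""
    # Aggregation: each feature should be close to the prototype of its own class.
    agg = F.mse_loss(prototypes[labels], features)
    # Separation: prototypes of different classes should stay at least `margin` apart.
    dists = torch.cdist(prototypes, prototypes)                  # (C, C) pairwise distances
    off_diag = ~torch.eye(len(prototypes), dtype=torch.bool)
    sep = F.relu(margin - dists[off_diag]).mean()
    return agg, sep

# Toy usage with random embeddings for 3 organ classes.
feats = torch.randn(128, 32)
labels = torch.randint(0, 3, (128,))
protos = torch.nn.Parameter(torch.randn(3, 32))
agg, sep = prototype_losses(feats, labels, protos)
loss = agg + 0.1 * sep
loss.backward()
```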
The limited bandwidth of optical wireless communication (OWC) front-end devices motivates the use of multiple-input multiple-output (MIMO) techniques to enhance data rates. It is known that very high multiplexing gains can be achieved by spatial multiplexing (SMX) at the cost of exhaustive detection complexity. Alternatively, in spatial modulation (SM), a single light emitting diode (LED) is activated per time instant, and information is carried by both the signal and the LED index. Since only one LED is active, both transmitter (TX) and receiver (RX) complexity are reduced significantly while retaining information transmission in the spatial domain. However, SM incurs significant spectral efficiency losses compared to SMX. In this paper, we propose a technique that combines the advantages of both systems. Accordingly, the proposed flexible LED index modulation (FLIM) technique harnesses the inactive state of the LEDs as a transmit symbol. Therefore, the number of active LEDs changes in each transmission, unlike in conventional techniques. Moreover, the system complexity is reduced by employing a linear minimum mean squared error (MMSE) equalizer and an angle-perturbed receiver at the RX. Numerical results show that FLIM outperforms the reference systems by at least 6 dB in the low and medium/high spectral efficiency regions.
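For context, a linear MMSE equalizer of the kind mentioned above can be sketched in a few lines; the channel model, intensity levels, and dimensions below are illustrative assumptions and do not reproduce the FLIM detector or the angle-perturbed receiver.

```python
# Minimal sketch of MMSE equalization for a 4x4 optical MIMO link (assumed setup).
import numpy as np

rng = np.random.default_rng(1)
n_tx, n_rx = 4, 4
H = rng.uniform(0.2, 1.0, size=(n_rx, n_tx))       # non-negative optical channel gains
levels = np.array([0.0, 1.0, 2.0, 3.0])             # 0 models an inactive LED, as in FLIM
x = rng.choice(levels, size=n_tx)                   # transmitted intensities
noise_var = 0.01
y = H @ x + rng.normal(0.0, np.sqrt(noise_var), n_rx)

# MMSE equalizer W = (H^T H + sigma^2 I)^{-1} H^T, followed by nearest-level slicing.
W = np.linalg.solve(H.T @ H + noise_var * np.eye(n_tx), H.T)
x_hat = W @ y
detected = levels[np.argmin(np.abs(x_hat[:, None] - levels[None, :]), axis=1)]
print("sent:    ", x)
print("detected:", detected)
```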
Public transportation commuters are often interested in accurate travel time information to plan their daily activities. However, this information is difficult to predict accurately due to the irregularities of road traffic, caused by factors such as weather conditions, road accidents, and traffic jams. In this study, two neural network models, namely a multi-layer perceptron (MLP) and a long short-term memory (LSTM) network, are developed for predicting link travel time on a busy route, with inputs generated from an origin-destination travel time matrix derived from a historical GPS dataset. The experimental results showed that both models can make near-accurate predictions; however, the LSTM is more susceptible to noise as the time step increases.
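A minimal sketch of the LSTM variant follows, assuming a window of 12 past travel time observations per link and synthetic data; the layer sizes, window length, and training setup are illustrative, not the paper's.

```python
# Illustrative LSTM for sequence-to-one link travel time prediction (assumed setup).
import torch
import torch.nn as nn

class TravelTimeLSTM(nn.Module):
    def __init__(self, n_links=1, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_links, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_links)

    def forward(self, x):                   # x: (batch, time_steps, n_links)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])        # predict the next travel time per link

# Toy training loop on synthetic windows (batch of 64, 12 past steps, 1 link).
model = TravelTimeLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 12, 1)
y = x[:, -1] + 0.1 * torch.randn(64, 1)     # synthetic target: next value near the last one
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```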
We designed and built a game called \textit{Immersive Text Game}, which allows the player to choose a story and a character, and to interact with other characters in the story through immersive dialogue. The game is built on several recent models, including a text generation language model, an information extraction model, a commonsense reasoning model, and a psychology evaluation model. In the past, similar text games usually let players choose from a limited set of actions rather than respond freely, and what the characters say is not always determined by the player. Through the combination of these models with elaborate game mechanics and modes, the player enjoys a novel experience while being driven through the storyline.
K-means clustering plays a vital role in data mining. However, its performance drops drastically when applied to huge amounts of data. We propose a new heuristic built on top of regular K-means for faster and more accurate big data clustering, using the "less is more" approach and a decomposition of the minimum sum-of-squares clustering (MSSC) problem. The main advantage of the proposed algorithm is that it naturally turns the K-means local search into a global one through the decomposition of the MSSC problem. On one hand, decomposing the MSSC problem into smaller subproblems reduces the computational complexity and allows them to be processed in parallel. On the other hand, the MSSC decomposition provides a new method for the natural, data-driven shaking of the incumbent solution while introducing a new neighborhood structure for the MSSC problem. This leads to a new heuristic that improves K-means in big data conditions. The scalability of the algorithm to big data can be easily adjusted by choosing the appropriate number of subproblems and their size. The proposed algorithm is both scalable and accurate; in our experiments it outperforms all recent state-of-the-art algorithms for the MSSC in terms of both time and solution quality.
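A minimal sketch of the decompose-then-recombine idea follows; it is an illustration under our own assumptions, not the authors' exact heuristic. It solves K-means on several random subproblems, clusters the pooled centers, and refines on the full data from that starting point.

```python
# Illustrative decomposition heuristic for MSSC (assumed, simplified scheme).
import numpy as np
from sklearn.cluster import KMeans

def decomposed_kmeans(X, k, n_subproblems=8, subsample=2000, seed=0):
    rng = np.random.default_rng(seed)
    centers = []
    for _ in range(n_subproblems):                    # subproblems could run in parallel
        idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
        km = KMeans(n_clusters=k, n_init=3, random_state=seed).fit(X[idx])
        centers.append(km.cluster_centers_)
    # Recombine: cluster the pooled centers, then one refinement pass over the full data.
    pooled = np.vstack(centers)
    init = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(pooled).cluster_centers_
    return KMeans(n_clusters=k, init=init, n_init=1, random_state=seed).fit(X)

X = np.random.default_rng(0).normal(size=(50000, 10))
result = decomposed_kmeans(X, k=20)
print("MSSC objective (inertia):", round(result.inertia_, 1))
```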
Wind speed forecasting models and their application to wind farm operations are attracting remarkable attention in the literature because of wind's benefits as a clean energy source. In this paper, we propose a time series machine learning approach based on random forest regression for predicting wind speed variations. The computed values of mutual information and auto-correlation show that wind speed depends on past data up to 12 hours. The random forest model was trained on two weeks of data, using the previous 12 hours of values as input for each prediction. The computed root mean square error shows that a model trained with two weeks of data can be employed to make reliable short-term predictions up to three years ahead.
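A minimal sketch of this setup on synthetic hourly data follows (the data generator and hyperparameters are illustrative assumptions): build 12-hour lag features from two weeks of observations and fit a random forest regressor.

```python
# Illustrative random forest forecaster on 12-hour lag features (assumed setup).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
t = np.arange(24 * 14)                                  # two weeks of hourly data
wind = 8 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.8, len(t))

lags = 12
X = np.array([wind[i - lags:i] for i in range(lags, len(wind))])
y = wind[lags:]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# One-step-ahead prediction from the most recent 12 hours.
next_hour = model.predict(wind[-lags:].reshape(1, -1))
rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))
print(f"in-sample RMSE: {rmse:.2f} m/s, next-hour forecast: {next_hour[0]:.2f} m/s")
```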
In today's modern world, the use of technology is unavoidable, and rapid advances in the Internet and communication fields have led to the expansion of Wireless Sensor Network (WSN) technology. A huge number of sensing devices collect and/or generate numerous sensory data over time for a wide range of fields and applications. However, WSNs have been proven to be vulnerable to security breaches; the harsh and unattended deployment of these networks, combined with their constrained resources and the volume of data generated, introduces a major security concern. Since WSN applications are extremely critical, it is essential to build reliable solutions that involve fast and continuous mechanisms for online data stream analysis, enabling the detection of attacks and intrusions. In this context, our aim is to develop an intelligent, efficient, and updatable intrusion detection system by applying an important machine learning concept known as ensemble learning in order to improve detection performance. Although ensemble models have been proven to be useful in offline learning, they have received less attention in streaming applications. In this paper, we examine the application of different homogeneous and heterogeneous online ensembles in sensory data analysis on a specialized wireless sensor network detection system (WSN-DS) dataset in order to classify four types of attacks: Blackhole, Grayhole, Flooding, and Scheduling attacks among normal network traffic. Among the proposed novel online ensembles, the heterogeneous ensemble consisting of an Adaptive Random Forest (ARF) combined with the Hoeffding Adaptive Tree (HAT) algorithm and the homogeneous HAT ensemble made up of 10 models achieved the highest detection rates of 96.84% and 97.2%, respectively. These models are efficient and effective in dealing with concept drift while taking into account the resource constraints of WSNs.
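A rough sketch of a heterogeneous online ensemble with prequential (test-then-train) evaluation is shown below. Here two scikit-learn models with partial_fit stand in for the ARF and HAT learners, and the stream is synthetic rather than WSN-DS, so this only illustrates the evaluation loop, not the paper's system.

```python
# Illustrative heterogeneous online ensemble with majority voting (assumed setup).
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
classes = np.array([0, 1, 2, 3, 4])          # normal + Blackhole, Grayhole, Flooding, Scheduling

def make_record():
    y = rng.integers(0, len(classes))
    x = rng.normal(size=18) + y              # 18 WSN-DS-like features, shifted by class
    return x.reshape(1, -1), y

members = [SGDClassifier(loss="log_loss"), GaussianNB()]

# Warm-up on a small initial batch so every member has seen every class.
Xw, yw = zip(*(make_record() for _ in range(200)))
Xw, yw = np.vstack(Xw), np.array(yw)
for m in members:
    m.partial_fit(Xw, yw, classes=classes)

# Prequential (test-then-train) evaluation over the rest of the stream.
correct, seen = 0, 0
for _ in range(5000):
    x, y = make_record()
    votes = [m.predict(x)[0] for m in members]
    correct += int(np.bincount(votes, minlength=len(classes)).argmax() == y)
    for m in members:
        m.partial_fit(x, [y])
    seen += 1
print(f"prequential accuracy: {correct / seen:.3f}")
```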
We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.
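For reference, the Hubbard-Stratonovich transform mentioned above is the standard Gaussian-integral identity, stated here for a positive-definite matrix $A$ (how the paper handles the indefinite part of $J$ is beyond this sketch):
\[
\exp\!\Big(\tfrac{1}{2}\, x^\top A\, x\Big) \;=\; \frac{1}{\sqrt{(2\pi)^n \det A}} \int_{\mathbb{R}^n} \exp\!\Big(-\tfrac{1}{2}\, y^\top A^{-1} y + y^\top x\Big)\, dy,
\]
so that, at a fixed auxiliary field $y$, the sum over spin configurations $x \in \{-1,+1\}^n$ factorizes coordinate-wise, which is what makes working on the extended state space attractive.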
We study learning contextual MDPs using function approximation for both the rewards and the dynamics. We consider both the case where the dynamics is known and the case where it is unknown, as well as the case where the dynamics depends on the context and the case where it does not. For all four models, we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDPs to supervised learning.