Research projects


Social search and social networks

Online social networks (OSNs) such as Facebook and Twitter are new generation of Internet applications and account for a great amount of user dwell time on the Web. Meanwhile, social users also contribute huge amount of content, often known as user generated content (UGC), such as blogs, digital photos and videos, etc. My research in this area includes three aspects:
  1. User behavior analysis and security/privacy issues in online social networks;
  2. Algorithms and systems on social search;
  3. Algorithms and models on social content triggering and placement for Web search engines.
Most of my works are based on industry systems and large scale user data.

Users are basic elements in OSNs. Understanding how users consume and generate UGC objects is imperative to social network industry for the success of their business. My research on social user analysis focus on the statistical properties of user behavior patterns in online social networks, such as patterns of user content generation and online dwell time. We find that user posting behavior in social networks follows stretched exponential distributions instead of commonly assumed power law distributions, where the stretch factor has strong negative correlation with the effort to create an UGC object. Our studies on instant messaging (IM) traffic also shows similar distribution patterns on user activities in IM networks. Based on this model, we have proposed a statistical model to identify top users or top contributors in a social network system.

Online social networks have become an important vehicle for the distribution of spams, malicious content, and phishing sites in Web and social media. Current abnormality detection-based security approaches respond slowly with newly generated spams and malwares. With advanced information retrieval technologies and graph theory methods, it is possible to analyze the social connections of spam/malware distributors and their customers that are “hidden” in the Web and social media. Our purpose is to proactively crawl the “customer networks” of spams and malwares, build information systems to index and process the crawled graphs, and identify malicious content before it is widely spread and prevent its distribution in the early stage. With the analysis of user behavior features and patterns, we are working on spam and malicious content detection in social media now.

My research on social search algorithms and systems focus on UGC content search and high quality knowledge discovery in social media. UGC and social content are usually less structured than Web pages, so that link based algorithms such as PageRank and HITS are not effective. However, social users often organize and rate/comment the content that they consume or produce on the Web through a number of methods, such as bookmarking, tagging, and liking. We have developed ISID, a tag-based social interest discovery system to discover the common user interests and manage users and their created content by different interest topics. Currently, we are working on algorithms and models for high quality knowledge mining and information retrieval in social media, such as query/answer, named entity extraction, etc.

My most recent work on social networks is to serve users a social and personalized experience in Bing Search, based on the Likes of their Facebook friends. Facebook contains huge amount of information about web pages liked by it users, which are useful clues to help a user make decision when search. We have developed machine learning models and algorithms for personalized search result triggering/placement and like farm filtering of Facebook Likes in Bing Search, which was launched in October 2010 (See Bing announcement, New York Times report, and Wall Street Journal report for this system).


Traffic modeling and measurements

With the explosive increase of video content on the web, video traffic has dominated the Internet and keeps increasing, generating a number of problems to both academia and industry, from system design, traffic engineering, to data management:

  1. Streaming media delivery has high demands on CPU and bandwidth resources. What is the most important issue to design a scalable streaming system? Is caching an effective way to improve performance of a media system?
  2. To provide high quality streaming services to users, content delivery networks (CDNs) have been widely used, but the cost is high. P2P networks are scalable and of low cost, however, the service quality is often unstable. Can P2P networks provide high quality streaming services?
  3. Social networks such as YouTube have generated huge amount of video content and video traffic. How to manage these video objects and search related content in the network?

All these problems can be attributed to or highly related to the activity patterns of Internet users: in today’s Internet, not only media traffic is driven by user requests, but also media content is created by common users. Furthermore, the majority of Internet traffic is conveyed via overlay networks self-organized by common users, i.e., peer-to-peer networks. Thus, understanding traffic patterns, especially user activity patterns, is essential to design, manage, and evaluate Internet media delivery systems as well as content systems.

Analyzing workloads collected with large scale Internet measurements, we have proposed a general model of Internet media access patterns and studied the performance of media caching systems. We found unlike Web objects, whose access pattern is Zipf-like, the requests of media objects follow the stretched exponential distribution, whose parameters are determined by file size, client request rate, object birth rate, and time related factors in a media system. Our model indicates the performance of media caching systems is far less effective than that of Web caching systems, unless the cache size is extremely large. However, in a long-term, there is great potential to improve the caching performance, but it may take a long time and consume a great amount of storage.

The service quality of P2P systems, such as BitTorrent, is often unstable. Although the total amount of media traffic on the Internet always keeps increasing, the request rate of individual media files decreases with time. With the analysis of representative BitTorrent traffic, we have modeled the evolution of single torrent systems, and get the torrent lifespan constrained by file popularity decaying. Modeling interactions among multiple torrents, we find inter-torrent collaboration is much more effective than stimulating seeds to serve longer for addressing the service quality problems. We have also proposed an approach of inter-torrent collaboration under a “tit-for-tat” based instant incentive mechanism, in order to make BitTorrent a reliable and efficient content delivery vehicle.

A number of techniques for streaming media have been proposed and utilized in commercial media systems. In order to gain insights into current streaming services and thus provide guidance on designing resource-efficient and high quality streaming media systems, we have collected a large streaming media workload from thousands of broadband home users and business users hosted by a major ISP, and analyzed the most commonly used streaming techniques such as pseudo streaming, automatic protocol switch, Fast Streaming, MBR encoding and rate adaptation. Our measurement and analysis results show that with these techniques, current streaming systems tend to over-utilize CPU and bandwidth resources to provide better services to end users, which may not be a desirable and effective way to improve the quality of streaming media delivery.


Energy efficient and high performance wireless Internet

With the advancement of mobile technologies, WiFi and cellular networks have become an important part of Internet access, demanding high performance, energy efficient data communication protocols. We have systematically studied the traffic characteristics of streaming video and large data transmission in both client side and server side, and the quality requirement for user satisfaction. We conducted large scale Internet measurements in the packet level, and analyzed the key factors that cause streaming jitter and throughput downgrading. Based on the measurements and analysis, we further studied the limitations of current wireless protocols and mobile systems on power conservation, and proposed a number of technologies for different applications.

For client-server mode bulk data transmission, such as large file downloading and flash video streaming, we have proposed PSM-throttling, a power saving protocol for 802.11 networks. We found that due to the bandwidth increase of Internet backbone and high volume of user requests on the IDC server, the throughput of TCP streams is often constrained by the server capacity rather than the network capacity. By synchronizing the sleep period of network interface on a mobile client to the data sending rate on the server, PSM-throttling minimizes the energy consumption on large data transmission without compromising communication performance. PSM-throttling has been re-implemented to support different wireless devices (Ahmad Nazir Raja, Energy Efficient Client-centric Shaping of Multi-flow TCP Traffic).

For peer-to-peer and multi-source video streaming, we have proposed BlueStreaming, a technology that utilizes Bluetooth interface to deliver control traffic and utilizes WiFi interface to transmit content traffic for streaming video in a mobile system. In P2P and multi-source video streaming, a mobile client can frequently receive control packets to synchronize data transmission among different streams, reducing the sleep opportunities for power saving. With a low power, low bandwidth Bluetooth interface transmitting the large number of small control packets, BlueStreaming can shape the remaining streaming traffic into packet bursts so that the high power WiFi interface can utilize the burst interval to sleep, saving up to 46% battery power compared to the PSM scheme in commodity systems.

For multi-rate WLANs, we have proposed Cooperative Relay Service (CRS), a protocol that utilizes the idle communication power of high channel rate stations to relay data frames between a low channel rate mobile client and Access Point. Improving the throughput and energy per bit of both client and proxy stations, CRS can maximize the channel utilization in a high contention WLAN with large coverage. CRS is a win-win solution to all wireless stations in the network.

We have also proposed CUBS, an application independent and ISP transparent system to coordinately utilize the idle upload bandwidth of neighbors in a residential network.


Streaming media systems

Streaming media accounts for the majority of network traffic on the Internet. Based on our research on Internet media traffic modeling and analysis, we have studied current streaming systems on the Internet, including media proxy systems, streaming servers, and live streaming systems. To efficiently deliver streaming media in large scale, we have proposed PROP, a scalable and reliable P2P-assisted media proxy system, which utilizes P2P sharing to provide redundancy and scalability, and a dedicated proxy to provide reliability. In order to support interactive access of a streaming media object, we have proposed DISC, a dynamic interleaved segment caching algorithm on a media proxy to speed up the jump operation of streaming media clients. Analyzing PPLive data exchange patterns, we have proposed IDEA, an Improved peer Data Exchange Algorithm, which significantly improves the efficiency of live video delivery by randomizing chunk requests and reducing data contention.


Topology-aware overlay on the Internet

In a P2P system, two neighbors in the overlay can be far from each other in the underlying network, causing extra message delays and redundant cross domain traffic on the Internet. We have also studied topology-ware P2P overlay on the Internet for VoIP and file sharing. We have investigated the performance of the Skype VoIP system with intensive Internet measurements and found relay peer selections do not take Autonomous System (AS) topology into consideration, resulting in long waiting time and unnecessary probing. We propose an AS-aware peer-relay protocol called ASAP, which significantly improves VoIP quality and system scalability with low overhead. For large file distribution, we have proposed TopBT, a topology-aware and infrastructure-independent BitTorrent client to minimize redundant traffic for data transmission, the software has become an open-sourced project and free for download.


Peer-to-peer search

Peer-to-peer networks are self-organized systems without a central point for global peer information. Flooding is the most commonly used method for content search and message propagation. We have proposed LightFlood, a lightweight broadcast scheme to minimize redundant traffic in the overlay level, without compromising the coverage of reached peers. We have also proposed CAP-SPIRP, a fast and low-cost P2P search, a fast and low-cost P2P searching algorithm by exploiting content localities in peer communities.


[Research projects] [Publications] [Industrial systems] [Patents] [Selected and representative papers]