Spam Detection in Social Bookmarking Systems
With the growing popularity of social bookmarking systems, spammers have discovered this kind of service as a playground for their activities. They usually pursue two goals. First, they place links in the system to attract people to advertising sites. Second, they try to raise the PageRank of their sites by placing links on as many popular Web 2.0 sites as possible, which increases their visibility in Google and other search engines. Common countermeasures such as captchas are not effective enough to prevent misuse of the system. Over the last year, we collected data on more than 2,000 active users and more than 25,000 spammers by manually labeling spammers and non-spammers. The provided dataset consists of these users and all their posts, including all public information such as the URL, the description and all tags of each post. The goal of this challenge is to learn a model that predicts whether a user is a spammer or not. To detect spammers as early as possible, the model should make good predictions for a user as soon as they submit their first post.
Dataset description
A general description of the dataset can be found here. For the spam detection task all provided files are relevant.
Evaluation
All participants can use the training dataset to fit their model. The training dataset contains flags that identify users as spammers or non-spammers. The test dataset will have the same format as the training dataset and can be downloaded two days before the end of the competition. It will contain users from a future period. All participants must send a sorted file containing one line for each user, consisting of the user number and a confidence value separated by a tab. The higher the confidence value, the higher the probability that the user is a spammer. The line with the highest confidence should come first.
user spam
1234 1
1235 0.987
1236 0.765
1239 0
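A minimal sketch of how such a file could be produced, assuming the predictions are held in a dict from user number to confidence. The function name `format_submission` is hypothetical, and whether a header line is expected is not stated, so this sketch omits it.

```python
def format_submission(scores):
    """Return submission lines 'user<TAB>confidence', highest confidence first.

    `scores` maps a user number to a spam confidence (higher = more spam-like).
    """
    # Sort by confidence, descending; ties are broken by user number so the
    # output is deterministic (the tie-breaking rule is an assumption).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [f"{user}\t{confidence:g}" for user, confidence in ranked]


if __name__ == "__main__":
    example = {1239: 0.0, 1235: 0.987, 1234: 1.0, 1236: 0.765}
    print("\n".join(format_submission(example)))
```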
If no prediction is provided we assume the user is not a spammer. The evaluation criterion is the AUC (the Area under the ROC Curve) value. We compare the submitted spammer predictions of the participants with the manually assigned labels on a user basis.
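To check a model locally before submitting, the AUC can be computed directly from its pairwise definition: the probability that a randomly chosen spammer gets a higher confidence than a randomly chosen non-spammer. The sketch below is not the organizers' evaluation script. It assumes that a user without a prediction is scored with confidence 0.0, which is one reading of "we assume the user is not a spammer". The quadratic loop is only meant for small samples.

```python
def auc(labels, predictions):
    """AUC as the chance that a random spammer outranks a random non-spammer.

    labels: dict user -> 1 (spammer) or 0 (non-spammer)
    predictions: dict user -> confidence; missing users get 0.0 (assumption)
    """
    scored = [(predictions.get(user, 0.0), label) for user, label in labels.items()]
    positives = [score for score, label in scored if label == 1]
    negatives = [score for score, label in scored if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one spammer and one non-spammer")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # spammer ranked above non-spammer
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))
```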
-----------------------------------------------
1. Build an evaluation model of users based on their behavior over time; from this model we can calculate the probability that a user is a spammer.
2. Some sites are spam websites, and spam users interact with them: spam users recommend spam websites.
3. Spammers resemble one another in their working hours, their posting frequency, the websites they recommend, etc.
4. I think artificial neural networks are not well suited to this problem; a regression model may work. The aim is to summarize some rules that identify potential spammers.
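Idea 2 can be tried out quickly. The sketch below estimates, for each domain, the fraction of labeled users posting to it who are spammers, and then scores a user by the mean ratio over their URLs. This is only one possible rule, not a tested method; the function names and the fallback value `prior` for unseen domains are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_spam_ratios(posts, spam_flags):
    """Fraction of labeled users posting to each domain who are spammers.

    posts: iterable of (user, url) pairs; spam_flags: dict user -> 0 or 1.
    """
    users_by_domain = defaultdict(set)
    for user, url in posts:
        domain = urlparse(url).netloc.lower()
        # Skip URLs without a recognizable host and users without a label.
        if domain and user in spam_flags:
            users_by_domain[domain].add(user)
    return {
        domain: sum(spam_flags[u] for u in users) / len(users)
        for domain, users in users_by_domain.items()
    }


def score_user(urls, ratios, prior=0.5):
    """Mean spam ratio over a user's URLs; unseen domains fall back to `prior`."""
    values = [ratios.get(urlparse(url).netloc.lower(), prior) for url in urls]
    return sum(values) / len(values) if values else prior
```

Because the score only needs the URLs of a single post, it can already be computed when a user submits their first post.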
The dataset is somewhat complex; it consists of 4 tables (tag assignments, bookmarks, BibTeX entries and the user spam mapping).
The dataset consists of seven files:
These are tab-separated files which have the following columns:
tas (Tag ASsignments): fact table; who attached which tag to which resource/content
- user (number; user names are anonymized)
- tag
- content_id (matches bookmark.content_id or bibtex.content_id)
- content_type (1 = bookmark, 2 = bibtex)
- date
bookmark: dimension table for bookmark data
- content_id (matches tas.content_id)
- url_hash (the URL as md5 hash)
- url
- description
- extended description
- date
bibtex: dimension table for BibTeX data
- content_id (matches tas.content_id)
- journal
- volume
- chapter
- edition
- month
- day
- booktitle
- howPublished
- institution
- organization
- publisher
- address
- school
- series
- bibtexKey (the bibtex key (in the @... line))
- url
- type
- description
- annote
- note
- pages
- bKey (the "key" field)
- number
- crossref
- misc
- bibtexAbstract
- simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
- simhash1 (hash for duplicate detection among users -- sloppy --)
- simhash2 (hash for duplicate detection within a user -- strict --)
- entrytype
- title
- author
- editor
- year
user_spam: mapping of non-spammer / spammer for each user. This file can be used for spam classification.
- user (matches tas.user)
- spam flag (0 = non-spammer, 1 = spammer)
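As a rough sketch of how these tables fit together, the snippet below reads tab-separated rows and attaches the spam flag from user_spam to every tag assignment. It assumes the files have no header row and that the columns appear in the order listed above; the helper names are hypothetical.

```python
import csv
import io

# Column order as listed in the description (assumed to match the files).
TAS_COLUMNS = ["user", "tag", "content_id", "content_type", "date"]
USER_SPAM_COLUMNS = ["user", "spam"]


def read_tsv(handle, columns):
    """Yield one dict per tab-separated line, keyed by the given column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(columns, row))


def label_tag_assignments(tas_handle, user_spam_handle):
    """Attach the spam flag from user_spam to every tag assignment."""
    flags = {
        row["user"]: int(row["spam"])
        for row in read_tsv(user_spam_handle, USER_SPAM_COLUMNS)
    }
    for row in read_tsv(tas_handle, TAS_COLUMNS):
        row["spam"] = flags.get(row["user"])  # None if the user has no label
        yield row


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the data.
    tas = io.StringIO("7\tpython\t100\t1\t2008-01-01\n8\tcheap\t101\t1\t2008-01-02\n")
    user_spam = io.StringIO("7\t0\n8\t1\n")
    for row in label_tag_assignments(tas, user_spam):
        print(row["user"], row["tag"], row["spam"])
```

For the real files, the `io.StringIO` objects would be replaced by opened file handles. With more than 13 million lines in tas_spam, streaming row by row as done here avoids loading everything into memory.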
Size of Files
Number of lines in files:
- tas 816,197 / tas_spam 13,258,759
- bookmark 181,833 / bookmark_spam 2,059,991
- bibtex 219,417 / bibtex_spam 716
- user_spam 31,715
Deadline:
| May 5, 2008 | Tasks and datasets available online. |
| July 30th, 2008 | Test dataset will be released (by midnight CEST). |
| August 1st, 2008 | Result submission deadline (by midnight CEST). |
| August 4th, 2008 | Workshop paper submission deadline. |
Submit URL: http://www.kde.cs.uni-kassel.de/ws/rsdc08/upload/