July 31, 2008

Found this rather interesting. Original link: http://zhiqiang.org/blog/posts/rotate-coin-games.html

--
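Alice and Bob play a coin game. The game is played on a 2×2 board, with one coin on each square. In each round, Alice chooses one or two coins to flip; Bob may then rotate the board by 90, 180, or 270 degrees, or do nothing.

The rounds continue until all coins on the board are heads up or all are tails up, at which point Alice wins.

Suppose Alice cannot see the coins on the board during the game, does not know the initial state, and does not even know whether Bob rotated the board in each round. Does Alice have a strategy that guarantees a win? What is her optimal strategy?

Next we generalize the game. There are n coins, placed at the vertices of a board shaped like a regular n-gon. In each round Alice may flip any subset of the coins, and Bob may rotate the board in any one of n different ways (by a multiple of 360/n degrees). The game continues until all coins are heads up or all are tails up, at which point Alice wins.

Can Alice still win in this case?

The solution is here, but I strongly recommend thinking about the problem independently, especially the case n=4.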
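For readers who want to test a candidate strategy without looking at the solution, here is a minimal sketch of a brute-force checker. It is not from the original post: the function names and the encoding (coins as a tuple of 0/1 around the board, a move as a tuple of 0/1 flip flags) are my own assumptions. It tracks every configuration Bob could still be holding and reports whether a fixed, blind sequence of moves is guaranteed to win.

```python
from itertools import product

def rotations(state):
    """All cyclic rotations of a tuple of coin faces (Bob's options)."""
    n = len(state)
    return {state[k:] + state[:k] for k in range(n)}

def is_uniform(state):
    """True when all coins show the same face, i.e. Alice has won."""
    return len(set(state)) == 1

def sequence_wins(n, moves):
    """Return True if the blind move sequence wins from every start.

    Each move is a tuple of n bits; bit i = 1 means "flip position i".
    Positions are relative to Alice, who never sees the board.
    """
    # Every non-uniform start is possible (uniform starts are already won).
    possible = {s for s in product((0, 1), repeat=n) if not is_uniform(s)}
    for move in moves:
        nxt = set()
        for s in possible:
            flipped = tuple(c ^ m for c, m in zip(s, move))
            if is_uniform(flipped):
                continue  # the game ends here with a win for Alice
            nxt |= rotations(flipped)  # Bob may rotate any way he likes
        possible = nxt
        if not possible:
            return True
    return not possible
```

For example, with n=2 the single move (1, 0) already wins, so `sequence_wins(2, [(1, 0)])` is true, while `sequence_wins(3, [(1, 0, 0)])` is false. For the 2×2 board, read the four squares in cyclic order and restrict moves to one or two flipped coins.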


Spam Detection in Social Bookmarking Systems

With the growing popularity of social bookmarking systems, spammers discovered this kind of service as a playground for their activities. Usually they pursue two goals: On the one hand, they place links in the system to attract people to advertising sites. On the other hand, they increase the PageRank of their sites by placing links in as many popular web 2.0 sites as possible, in order to increase their visibility in Google and other search engines. Usual counter-measures like captchas are not efficient enough to effectively prevent the misuse of the system. In the last year, we were able to collect data of more than 2,000 active users and more than 25,000 spammers by manually labeling spammers and non-spammers. The provided dataset consists of these users and of all their posts. This includes all public information such as the url, the description and all tags of the post. The goal of this challenge is to learn a model which predicts whether a user is a spammer or not. In order to detect spammers as early as possible, the model should make good predictions for a user when he submits his first post.

Dataset description

A general description of the dataset can be found here. For the spam detection task all provided files are relevant.

Evaluation

All participants can use the training dataset to fit the model. The training dataset contains flags that identify users as spammers or non-spammers. The test dataset will have the same format as the training dataset and can be downloaded two days before the end of the competition. It will contain users from a future period. All participants must send a sorted file containing one line per user, consisting of the user number and a confidence value separated by a tab. The higher the confidence value, the higher the probability that the user is a spammer. The highest confidence should come first.

    user    spam
    1234    1
    1235    0.987
    1236    0.765
    1239    0
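A minimal sketch of producing such a file body, assuming predictions are held in a dict from user number to confidence. The function name `format_submission` is hypothetical, and whether the real file needs the `user spam` header line is not stated here, so this sketch leaves it out.

```python
def format_submission(confidences):
    """Return submission text: one "user<TAB>confidence" line per user.

    confidences: dict mapping user number -> confidence value.
    Lines are sorted by descending confidence; ties are broken by
    user number (the tie-break rule is an assumption).
    """
    rows = sorted(confidences.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{user}\t{conf:g}" for user, conf in rows)
```

Calling it on the four users from the example above, in any order, gives the same four lines with the highest confidence first.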

If no prediction is provided we assume the user is not a spammer. The evaluation criterion is the AUC (the Area under the ROC Curve) value. We compare the submitted spammer predictions of the participants with the manually assigned labels on a user basis.
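The AUC can be read as the probability that a randomly chosen spammer gets a higher confidence than a randomly chosen non-spammer, with ties counting one half. The sketch below computes it from ranks; it is an illustration, not the organizers' evaluation script. The function name and the choice to give users without a prediction a confidence of 0.0 are my assumptions about how "not a spammer" is applied.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum formula.

    labels: dict user -> 1 (spammer) or 0 (non-spammer).
    scores: dict user -> confidence; missing users get 0.0 (assumption).
    """
    pairs = sorted((scores.get(u, 0.0), y) for u, y in labels.items())
    n = len(pairs)
    n_pos = sum(y for _, y in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one spammer and one non-spammer")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores and give each its average rank.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A perfect ranking gives 1.0, a fully reversed one gives 0.0, and identical scores for everyone give 0.5.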
-----------------------------------------------

1. Build an evaluation model for users based on their behavior over time; with this model we can calculate the probability that a user is a spammer.
2. Some sites are spam websites, and the spam users are connected to them, so spam users recommend spam websites.
3. Spammers resemble one another: their working hours, their posting frequency, the websites they recommend, etc.
4. I think artificial neural networks are not a good fit for this question; a regression model may work. The aim is to summarize some rules that identify potential spammers.
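Idea 2 can be sketched as follows: estimate, for each web domain, what fraction of the users who posted it are labeled spammers, then score a user by the average rate of the domains they post. This is only a rough illustration of the idea; the function names, the smoothing constant, and the default of 0.5 for unseen domains are all assumptions of mine.

```python
from collections import defaultdict
from urllib.parse import urlparse

def domain_spam_rates(posts, is_spammer, smoothing=1.0):
    """Smoothed fraction of spammers among the users posting each domain.

    posts: iterable of (user, url) pairs from the training data.
    is_spammer: dict user -> 1 (spammer) or 0 (non-spammer).
    """
    users_by_domain = defaultdict(set)
    for user, url in posts:
        users_by_domain[urlparse(url).netloc.lower()].add(user)
    rates = {}
    for domain, users in users_by_domain.items():
        spam = sum(is_spammer[u] for u in users)
        # Pull rates for rarely seen domains toward 0.5.
        rates[domain] = (spam + 0.5 * smoothing) / (len(users) + smoothing)
    return rates

def user_score(urls, rates, default=0.5):
    """Mean domain spam rate over a user's URLs; unseen domains get default."""
    if not urls:
        return default
    vals = [rates.get(urlparse(u).netloc.lower(), default) for u in urls]
    return sum(vals) / len(vals)
```

Because the score needs only the URLs of a single post, it can be computed for a user's first post, which is what the challenge asks for.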

The dataset is a little complex: it consists of 4 tables (tas, bookmark, bibtex, and user).

The dataset consists of seven files: tas, tas_spam, bookmark, bookmark_spam, bibtex, bibtex_spam, and user.

These are tab-separated files which have the following columns:

Files tas and tas_spam

Tag ASsignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date
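A minimal sketch of reading tas rows and deriving simple per-user activity counts, of the kind ideas 1 and 3 above would need. The helper names are hypothetical, and the date column is kept as a plain string because its exact format is not given here.

```python
import csv
from collections import defaultdict

TAS_COLUMNS = ("user", "tag", "content_id", "content_type", "date")

def read_tas(lines):
    """Yield one dict per tab-separated tas line (all values as strings)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(TAS_COLUMNS, row))

def per_user_counts(rows):
    """Per user: tag assignments, distinct posts, and tags per post."""
    tag_count = defaultdict(int)
    posts = defaultdict(set)
    for r in rows:
        tag_count[r["user"]] += 1
        posts[r["user"]].add(r["content_id"])
    return {
        u: {
            "tag_assignments": tag_count[u],
            "posts": len(posts[u]),
            "tags_per_post": tag_count[u] / len(posts[u]),
        }
        for u in tag_count
    }
```

`read_tas` accepts any iterable of lines, so an open file handle works as well as a list. That matters for tas_spam, which is too large to load casually.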

Files bookmark and bookmark_spam

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as md5 hash)
  3. url
  4. description
  5. extended description
  6. date

Files bibtex and bibtex_spam

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal volume
  3. chapter
  4. edition
  5. month
  6. day
  7. booktitle
  8. howPublished
  9. institution
  10. organization
  11. publisher
  12. address
  13. school
  14. series
  15. bibtexKey (the bibtex key (in the @... line))
  16. url
  17. type
  18. description
  19. annote
  20. note
  21. pages
  22. bKey (the "key" field)
  23. number
  24. crossref
  25. misc
  26. bibtexAbstract
  27. simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
  28. simhash1 (hash for duplicate detection among users -- sloppy --)
  29. simhash2 (hash for duplicate detection within a user -- strict --)
  30. entrytype
  31. title
  32. author
  33. editor
  34. year

File user

Mapping of non-spammer / spammer for each user. This file can be used for spam classification.

  1. user (matches tas.user)
  2. spam flag (0 = non-spammer, 1 = spammer)
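A minimal sketch of loading the two columns above into a dict, so the labels can be joined to the other tables on the user number. The function name is hypothetical, and the sketch assumes there is no header line.

```python
def read_user_labels(lines):
    """Map user number (kept as a string) -> spam flag (0 or 1)."""
    labels = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        user, flag = line.split("\t")
        labels[user] = int(flag)
    return labels
```

Counting the values of the returned dict gives the class balance, which is worth checking first because spammers heavily outnumber regular users in this data.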

Size of Files

Number of lines in files:

  1. tas 816,197 / tas_spam 13,258,759
  2. bookmark 181,833 / bookmark_spam 2,059,991
  3. bibtex 219,417 / bibtex_spam 716
  4. user_spam 31,715

Deadlines:

May 5, 2008: Tasks and datasets available online.
July 30, 2008: Test dataset will be released (by midnight CEST).
August 1, 2008: Result submission deadline (by midnight CEST).
August 4, 2008: Workshop paper submission deadline.

Submit URL: http://www.kde.cs.uni-kassel.de/ws/rsdc08/upload/

