4

I want to generate a popularity distribution for a small data set, which should follow Zipf law.

The available parameters are:
Total number of viewers : 1 Million
Total number of videos : 36

I want to associate total number of viewers to each video according to Zipf law. For instance, how many of viewers will watch video1, video2 .. and so on.
Can anyone please tell me the formula or methodology?

Anil_M
  • 10,893
  • 6
  • 47
  • 74
yours
  • 51
  • 1
  • Your parameters seem strange. Please, explain they better. Also, explain what do you mean with the Zipf Law. On https://en.wikipedia.org/wiki/Zipf's_law, there are plenty of formulas, why they do not fit to you? Do you do not understand they? What you do not understand? – Evandro Coan Jan 03 '17 at 08:26
  • 1
    @addons_zz I want to generate a synthetic workload, where I have 36 videos, and around 1 million viewers. Now how many times each video is viewed according to zipf law is what I want to find. So if N = 1 Million (means there are 1 million viewers) k= 1..30, mean K1 be the rank of video V1 and so on .. then what would be the s ??? as mentioned in the wikipedia N be the number of elements; k be their rank; **s be the value of the exponent characterizing the distribution.** How can I calculate or choose "s" in this scenario ?? – yours Jan 03 '17 at 11:29
  • Can you edit the question providing these new information, instead of adding them at once as a comment? – Evandro Coan Jan 03 '17 at 12:10

1 Answers1

0

The Wikipedia article on Zipf's law includes a few descriptions of the distribution, including some ways to calculate it:

https://en.wikipedia.org/wiki/Zipf%27s_law

The first equation under the Theoretical review section might help. Using this, we can write a short Python script to associate the total number of viewers for each of 36 videos:

N_elements = 1000000
video_exponent = 1

distribution_sum = 0
total_viewers = 0


# First, add up the relative number of viewers across all 36 movie ranks
for k_rank in range(1,36):

    sum = 0
    for n in range(1, N_elements):
        sum = sum + 1/(n**video_exponent)

    distribution_sum = distribution_sum + (1/(k_rank**video_exponent))/sum


# Next, distribute the number of viewers so that the total comes to 1,000,000
print("Movie Rank | # of Viewers")
for k_rank in range(1,36):

    sum = 0
    for n in range(1, N_elements):
        sum = sum + 1/(n**video_exponent)

    viewers_at_k_rank = round((N_elements/(k_rank**video_exponent))/(sum * distribution_sum))

    print(k_rank, end="|")
    print(viewers_at_k_rank)

    total_viewers = total_viewers + viewers_at_k_rank


print("\nSum of all viewers accounted for so far, to make sure we're at 1,000,000")
print(total_viewers)

The result adds up to 1,000,002 viewers, but that's not a big deal. Why isn't it a big deal, you ask? It seems that while many different things follow a general Zipfian distribution, they tend to be slightly different based on what type of thing it is. The video_exponent variable can be adjusted so that the Zipfian distribution simulated above can more closely match actual video statistics. The difference is usually much greater than 2 in 1,000,000.

You can get an idea of what the real-world video_exponent is by finding some real ranked videos, and adjusting video_exponent and N_elements until the code matches the real numbers. Then, reset N_elements back to 1,000,000, and you'll have a realistic video-watching data set.

bey
  • 74
  • 5