Estimate Cache Size when using Nearline

Last updated: May, 2011

Multimedia companies serve large numbers of media files (audio, video) in various formats (mp3, mp4, wav, etc.) through their content production system. It is vital that this media content can be retrieved quickly, is highly available, and is secured against data loss (e.g. accidental deletion or damage). High availability can be achieved by replicating the data in real time across multiple storage locations. However, this does not address the data loss problem: any (human) error will be propagated in real time too.

Traditionally, data security meant creating a copy (backup) and storing this backup in a place that could not be used or accessed by the production system. A drawback of this approach is that the storage size for your data doubles. Furthermore, consumption of media content usually follows a predictable pattern. On any day, ~80% of the media requests (from website visitors) are for media from the last few days, ~90% are for media from the last week, ~95% are for media from the last month, and so on. Figure 1 depicts this "long tail" behaviour (see also the Wikipedia article on the long tail). "Data freshness" refers to how recently the data was created. This long tail behaviour can be described with the Bradford or Pareto distribution. We will call the most recent data the head.

Figure 1. Data consumption versus creation date of data

The long tail behaviour does not apply to all media companies. For example, if a media company has an archive website, media requests from website visitors might follow a different pattern, because visitors are not searching for the latest news but for specific events in the past.
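Whether your own traffic follows the long tail pattern can be checked directly from the request logs. The snippet below is a minimal sketch, assuming the age of each requested file (request date minus creation date, in days) has already been extracted; the function name `share_within` and the sample ages are hypothetical and only illustrate the idea.

```python
def share_within(ages_in_days, thresholds):
    """Return, per threshold, the fraction of requests for files at most that old."""
    if not ages_in_days:
        raise ValueError("no requests to analyse")
    total = len(ages_in_days)
    return {
        t: sum(1 for age in ages_in_days if age <= t) / total
        for t in thresholds
    }


# Hypothetical sample: age in days of the file behind each request.
ages = [0, 1, 2, 5, 10, 40, 3, 1, 0, 6]
shares = share_within(ages, thresholds=(3, 7, 30))
for days, share in shares.items():
    print(f"requests for files at most {days} days old: {share:.0%}")
```

If the shares rise steeply for small thresholds and flatten out afterwards, the traffic is head-heavy and a nearline setup is a plausible fit; a flat curve suggests an archive-style access pattern.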

Another aspect of media content production is that media is usually produced once, in several transcoded formats. Sometimes, when a new format is introduced, the media company might decide to re-transcode old files, but typically the files follow the WORM model (Write Once Read Many).

To address the data storage/size costs described above, there are systems available that offer a hybrid approach to data security (also called nearline). Data can be moved to an archive system (backup) and be made available as read-only. Whenever a website visitor requests a media file that is in the archive, this file is copied to the production system, and this copy stays there for a predefined number of days. The first website visitor experiences reduced performance, but the next request (if made within the predefined number of days) has no performance loss. The production system acts as a cache for the archive system. Data that has not yet been moved to nearline (e.g. data that has been recently created) still has to be backed up in the classical way.

Let S(P) be the size of the production system, S(A) the size of the archive system, S(C) the size of the cache needed on the production system, S(B) the size of the backup system, and S(D) be the size of the data.

Traditional backup approach:

Size needed: 2 x S(P), where S(P) = S(D) and S(P) = S(B), which implies size needed: 2 x S(D)

Nearline approach:

Size needed: S(P) + S(C) + S(A) + S(B), where S(P) + S(A) = S(D), which implies size needed: S(D) + S(C) + S(B)

If S(C) + S(B) << S(D) you can significantly reduce the storage costs.
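The two size formulas can be compared side by side. This is a minimal sketch; the function names and the sizes used in the example (100 TB of data, 8 TB of cache, 2 TB of backup) are made up for illustration and do not come from the analysis.

```python
def traditional_total(data_size):
    """Traditional backup: production plus a full backup, 2 x S(D)."""
    return 2 * data_size


def nearline_total(data_size, cache_size, backup_size):
    """Nearline: S(D) + S(C) + S(B)."""
    return data_size + cache_size + backup_size


# Hypothetical sizes in TB, chosen only to illustrate the comparison.
s_d, s_c, s_b = 100, 8, 2
saved = traditional_total(s_d) - nearline_total(s_d, s_c, s_b)
print(f"traditional: {traditional_total(s_d)} TB")
print(f"nearline:    {nearline_total(s_d, s_c, s_b)} TB")
print(f"saved:       {saved} TB")
```

The saving equals S(D) - (S(C) + S(B)), so nearline only pays off when the cache and the remaining backup together are smaller than the data itself.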

The question a multimedia company needs to answer when choosing such a solution is: how big does S(C) need to be without reducing the overall experience for our website visitors? To answer this question we can use the log entries from the production system. The logs show when users requested certain media files. If we merge these logs and order them chronologically, we can use the result as the input for a discrete event simulator and simulate the nearline system with various archive rules.
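Merging the logs into one chronological stream could look like the sketch below. It assumes each log is a list of `(timestamp, file_id)` tuples that is already sorted by timestamp; this entry format and the name `merge_logs` are assumptions for illustration, not the format of the actual logs.

```python
import heapq


def merge_logs(*logs):
    """Lazily merge several logs, each sorted by timestamp, into one ordered stream."""
    # heapq.merge requires every input to be sorted already;
    # an unsorted log has to be sorted before it is passed in.
    return heapq.merge(*logs, key=lambda entry: entry[0])


# Hypothetical logs from two servers: (timestamp in days, file id).
server_a = [(1.0, "x"), (4.0, "y")]
server_b = [(2.0, "z"), (3.0, "x")]
merged = list(merge_logs(server_a, server_b))
for timestamp, file_id in merged:
    print(timestamp, file_id)
```

Because `heapq.merge` is lazy, the merged stream can be fed to a simulator without loading all log entries into memory at once.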

Nearline systems enable the specification of many different rules for moving data between the production and archive system, including different rules for different file formats. In the analysis below we use a simple approach that gives sufficient insight into the performance loss and cost savings when using a nearline system.

The following input parameters are used for the simulation:

  • days: the number of days of log files that are analyzed.
  • first-time: the time between the creation of new content and when it is moved to nearline. E.g. if first-time=5 then 5 days after creation the file will be moved to nearline.
  • next-time: the time between content being copied from nearline to production and its removal from the production cache.

Assuming that we have a chronologically ordered list of log entries, the simulation will do the following:

# events is a set of events. These events are either file requests (from the log entries) or cache purge events (when a file has expired).
while events is not empty:
    get next event
    if event is log entry:
        cd = creation date of the file being requested in log entry
        rd = request date for the file (log entry timestamp)
        if (rd - cd) < first-time:
            * file on production system (production hit)
        else:
            if file in cache and file has not expired (expired = in cache longer than next-time days):
                * file on production system (production hit)
            else:
                * retrieve file from nearline and put it in cache
                * cache size is incremented with the size of the file
                * insert cache purge event in event list
    if event is cache purge event:
        * purge this file from cache
        * reduce the size of the cache by the size of the file
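The pseudocode above can be turned into a small runnable simulator. The following is a sketch under stated assumptions, not the simulator used for the analysis: times are expressed as floating-point days, sizes in GB, and the `Request` record and `simulate` function are hypothetical names. Purge events are kept in a heap and processed as soon as the next request reaches their timestamp.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    time: float     # request date (rd), in days (assumed unit)
    file_id: str
    created: float  # creation date (cd), in days
    size: float     # file size in GB (assumed unit)


def simulate(requests, first_time, next_time):
    """Replay chronologically ordered requests; return (hit rate, peak cache size, series)."""
    purges = []       # heap of (expiry time, file_id) cache purge events
    cache = {}        # file_id -> size of the cached copy
    cache_size = 0.0
    peak = 0.0
    hits = total = 0
    series = []       # (time, cache size) after every increment or purge
    for req in requests:
        # Handle all purge events that are due before this request.
        while purges and purges[0][0] <= req.time:
            when, file_id = heapq.heappop(purges)
            cache_size -= cache.pop(file_id)
            series.append((when, cache_size))
        total += 1
        if req.time - req.created < first_time:
            hits += 1  # file has not been moved to nearline yet
        elif req.file_id in cache:
            hits += 1  # file is in the production cache and has not expired
        else:
            # Miss: copy the file from nearline into the cache.
            cache[req.file_id] = req.size
            cache_size += req.size
            peak = max(peak, cache_size)
            heapq.heappush(purges, (req.time + next_time, req.file_id))
            series.append((req.time, cache_size))
    hit_rate = hits / total if total else 0.0
    return hit_rate, peak, series


# Hypothetical example: one 2 GB file created on day 0 and requested four times.
log = [Request(t, "a", 0.0, 2.0) for t in (0.0, 6.0, 7.0, 10.0)]
hit_rate, peak, series = simulate(log, first_time=5, next_time=3)
print(hit_rate, peak, series)
```

In this example the requests on day 0 (file still on production) and day 7 (file cached since day 6) are hits, while the requests on day 6 and day 10 each have to fetch the file from nearline. Purge events that fall after the last request are not processed, which does not affect the hit rate or the peak cache size.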
        

The "hit rate" is the number of production hits relative to the total number of requests. The simulation also records the time series of the cache increment and purge events, which gives the cache size over time. The number of log entries that were processed in the simulation ranged from ~4,000,000 to ~20,000,000 depending on the "days" parameter. The total data size analyzed was ~12 TB (video) and ~4 TB (audio).

Unfortunately, the supplied log entries were not always ordered correctly, and the creation date of files was not always available because certain files are removed a few days after they have been created. To correct for these errors, we first ran the simulation with first-time = next-time = 0. With these values every request should miss, so any hits reflect errors in the data. The "hit rate" from these simulations is therefore our uncertainty (error margin). The error margin (for various numbers of days) was typically between 3% and 4%. Several simulations were run for both audio and video media files. First-time = next-time = 1 day for video files gives a "hit rate" of ~98%. This number does not change much if the first-time value is increased to 7. For radio, first-time = 14 and next-time = 7 yielded a 98% "hit rate". The figures below show the evolution of the cache size for the various values of first-time and next-time, for radio and television.

Figure 2. Cache size audio during simulation
Figure 3. Cache size video during simulation

Assuming that a "hit rate" of ~98% is acceptable, we can expect to need a cache size of approximately 700 GB (video) + 600 GB (radio) = ~1.3 TB if we use first-time = 7, next-time = 3 (video) and first-time = 14, next-time = 7 (radio). If on average 10 GB of video and 5 GB of radio is created every day, this results in an S(B) of 7 x 10 GB + 14 x 5 GB = 140 GB. If the total amount of data is approximately 16 TB, this results in a ratio of (S(B) + S(C)) / S(D) =~ 1.44 / 16 = 0.09.
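The arithmetic above can be reproduced in a few lines, which makes it easy to try other parameter values. This is a minimal sketch; the variable names are hypothetical and 1 TB is assumed to be 1000 GB.

```python
# Values taken from the text; 1 TB = 1000 GB is an assumption.
daily_gb = {"video": 10, "radio": 5}          # average new data per day, in GB
first_time_days = {"video": 7, "radio": 14}   # days before a file moves to nearline
cache_gb = {"video": 700, "radio": 600}       # cache size suggested by the simulation
data_gb = 16 * 1000                           # total data size S(D)

# S(B): data that is not yet on nearline still needs a classical backup.
backup_gb = sum(daily_gb[m] * first_time_days[m] for m in daily_gb)
total_cache_gb = sum(cache_gb.values())
ratio = (backup_gb + total_cache_gb) / data_gb

print(f"S(B) = {backup_gb} GB")
print(f"S(C) = {total_cache_gb} GB")
print(f"(S(B) + S(C)) / S(D) = {ratio:.2f}")
```

With these numbers the cache and the remaining backup together take roughly 9% of the space that a second full copy of the data would need.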