
Abstract: Cache memory is a small, very fast memory that sits between a computer's main memory and its processor; it may be located on the CPU chip or outside it. Caching speeds up data access in a computer system and is a technique for reducing bus traffic and the latency of data access, but there is always a need for further improvement in efficiency, and the work in this paper is in that direction. Since the cache is a high-speed memory of small size, the goal is to keep only the most required data items in the cache, where the requirement depends on how frequently a data item is reused. The work identifies the frequency of each data item from its previous access history; an improved inverted list is suggested here for this analysis. It also predicts the group of data items that should be kept in sequence for better optimization. To do this I use a new trend policy, called the Taylor Series Prediction (TSP) policy, which provides more accurate prediction of future access trends when the access patterns vary greatly. Because the most required data items are available in the cache itself, this gives a better hit ratio. The proposed work will improve efficiency and reliability.

                                                                                                                                                              
I. INTRODUCTION

A. History

Earlier work followed sequential-access-based replacement policies. Moore's Law [1] tells us that the number of transistors doubles about every 24 months, which means the performance and capabilities of semiconductors are growing exponentially. As a kind of semiconductor device, main memory has gained a significant improvement in access speed, of more than one order of magnitude, over the past decade. However, the hard disk has gained only about 10% over the same period, due to the burden of its many mechanical parts [2]. The huge gap between the speeds of main memory and hard disk will continue to widen in the future unless hard disk manufacturers abandon out-of-date mechanical technology.

Figure 1. Comparison between the burst and sustained data rate.

The major problem for past researchers designing a caching algorithm is how to balance recency and frequency and how to improve the cache hit ratio. However, the non-uniform access rate, which is an inherent property of hard disks, has not attracted researchers' attention. Take Seagate's Barracuda 7200.11 (Model Number: ST31000340AS) [3] as an example. The above problem concerns sequential-access-based cache replacement; the drawback of sequential replacement is that it cannot predict the access frequencies of the data elements held in the cache.

 

                                                                                                                                                             
II. BACKGROUND

Many replacement policies have been proposed over the years, including the LRU-K algorithm [4], which evicts the least recently used page based on its most recent K accesses; the GD-size policy [5], which considers access costs and varying page sizes; and an enhancement of the GD-size algorithm known as GDSF [6], [7], which incorporates frequency information. The basic idea is that the trend, or page reference rate, of a web page can be treated in the same way as the concept of "speed", which captures how fast an object moves in classical mechanics. Although this rate is an adequate measure of how likely a page is to be referenced next, it fails to capture the trend in which the reference rate itself changes in the near future. In the analogy with classical mechanics, this reference-rate trend corresponds to the concept of "acceleration." The CMP (chip multiprocessor) has become more and more important in both academic research and industrial applications. Small-scale CMPs, with two or four cores per chip, are already commercially available [8], [9]. With the increase in transistor integration density, more and more cores will be integrated on a single chip [10], [11]. At the same time, on-chip memory hierarchies are calling for innovative designs.

 

A. Prediction-based cache migration policy

Many hardware prediction techniques have been developed for data prefetch [8], [9], [10], [11]. Generally, they can be classified into sequential prefetch, related prefetch, and stride prefetch. Next-line or one-block look-ahead prefetch is one kind of sequential prefetch: on a cache miss, it prefetches both the missed block and the next block, exploiting the spatial locality of the program. Related prefetch techniques use the history of addresses to prefetch data, but this method requires a very large related-prediction table. Stride prefetch examines the memory access addresses to find an available stride; if a stride is found, a prefetch request is sent. This method can improve the precision of prediction. Most of the prediction techniques mentioned above are used for data prefetch only when an L2 miss occurs. In a traditional cache architecture, a block has to be replaced by another one on an L2 miss. In a NUCA architecture, L2 cache data also move on an L2 miss, as well as during migration. The prediction techniques proposed in this paper can therefore also be used to design a migration policy, called prediction-based migration (PBM): when a block is accessed and migrated toward the processor core, the predicted block should also be migrated toward the same core.
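To make the sequential and stride prediction described above concrete, the following Python sketch detects a constant stride in the recent block addresses and otherwise falls back to next-line prediction; in a PBM-style policy, the predicted block would then be migrated toward the same core. The class name, history length, and fallback rule are illustrative assumptions rather than the paper's implementation.

# Illustrative sketch of stride-based next-block prediction, as used by
# sequential/stride prefetchers and by a PBM-style migration policy.
# The history length and fallback rule are assumptions for illustration only.
from collections import deque

class StridePredictor:
    def __init__(self, history_len=3):
        self.history = deque(maxlen=history_len)   # recent block addresses

    def predict_next(self, block_addr):
        """Record an access and return the predicted next block address."""
        self.history.append(block_addr)
        if len(self.history) < 3:
            return block_addr + 1                  # next-line (sequential) prediction
        strides = [b - a for a, b in zip(self.history, list(self.history)[1:])]
        if strides[-1] == strides[-2]:             # a stable stride was found
            return block_addr + strides[-1]
        return block_addr + 1                      # otherwise fall back to next-line

if __name__ == "__main__":
    p = StridePredictor()
    for addr in [100, 104, 108, 112]:
        print(addr, "->", p.predict_next(addr))    # predicts 116 once stride 4 is stable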

 

                                                                                                                                             
III. PROBLEM STATEMENT

 

A.  Model of Problem

In order to analyze and evaluate different applications' performance on a NUCA-based L2 cache, the L2 Cache Accessing Performance (CL2AP) model is provided. It can be defined as a tuple of five elements:

CL2AP = (T_location, T_bank_to_network, T_wire, T_off_chip, T_contention)

in which: T_location is the searching time to identify the location of a given memory address in the L2 cache; T_bank_to_network is the time to access a given bank and put the referenced L2 cache block onto the network; T_wire is the time to transfer the given cache block over the wires from the source bank to the target core; T_off_chip is the time to get missed data from off-chip memory into the L2 cache; and T_contention is the waiting time when the referenced bank is being accessed by another processor core.

Given an L2 cache implementation, the values of T_bank_to_network and T_off_chip cannot be changed, and the value of T_location depends on the hardware implementation of the tag array. For a centralized tag array T_location cannot be changed, while for a distributed tag array T_location depends on the locating algorithm [12], [13]. In my experiment, the simulation uses a centralized tag array, so T_location is constant. The value of T_contention depends on when the previous access from another core to the same bank finishes.
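As a concrete illustration of the CL2AP components, the Python sketch below combines the five timing terms into a single-access latency. How the terms are combined on a hit versus a miss, and the cycle counts in the example, are assumptions made only for illustration.

# Illustrative sketch of the CL2AP timing model for one L2 access.
# The hit/miss composition rule and the example cycle counts are assumptions.
from dataclasses import dataclass

@dataclass
class CL2AP:
    t_location: float          # time to locate the address in the L2 cache
    t_bank_to_network: float   # time to read the bank and put the block on the network
    t_wire: float              # wire transfer time from source bank to target core
    t_off_chip: float          # time to fetch missed data from off-chip memory
    t_contention: float        # waiting time when another core occupies the bank

    def access_time(self, hit: bool, contended: bool = False) -> float:
        t = self.t_location + self.t_bank_to_network + self.t_wire
        if contended:
            t += self.t_contention
        if not hit:
            t += self.t_off_chip
        return t

if __name__ == "__main__":
    model = CL2AP(2, 4, 6, 200, 3)                        # example cycle counts only
    print(model.access_time(hit=True))                    # 12
    print(model.access_time(hit=False, contended=True))   # 215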

 

B. Indexing Taylor Series Prediction

Both the GD-size and GDSF formulas are similar to the LRU algorithm in that they keep track of the most recently accessed pages: higher priority is given to objects that were accessed most recently. In contrast, I wish to design a policy based on a "look forward" estimate. Consider the time difference ΔT = T_p − T_c, where T_p is the predicted time of the next access and T_c is the current time. In essence, objects with a large ΔT should be penalized and objects with a small ΔT should be rewarded. There is therefore an opportunity to enhance the GDSF formula, K_i = L + F_i * C_i / S_i, even further by including the ΔT factor in the ranking:

 

K_i = (F_i * C_i) / (S_i * ΔT)

 

One of the most important factors in this new formula is ΔT, which can be made more accurate by including not just the first-order estimate but also the second. In the first-order estimate, only the current time T_c and the last access time T_{c-1} are used to calculate ΔT:

 

ΔT = (T_c − T_{c-1}) / 1

 

This difference estimate corresponds to the concept of speed, where the difference in time is divided by a unit distance. To obtain a better estimate of the next access time, I use the concept of "acceleration", which gives rise to the Taylor Series Prediction.
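A small Python sketch of the modified GDSF key with a Taylor-series ΔT estimate is given below. The 1/2 weight on the second-order ("acceleration") term follows the usual Taylor expansion; that weighting, and the clamping of ΔT to a small positive value, are assumptions for illustration, not the exact coefficients used in the paper.

# Illustrative sketch: GDSF ranking key with a Taylor-series estimate of ΔT.
# The second-order weighting and the minimum-ΔT clamp are assumptions.
def delta_t(access_times, order=2):
    """Estimate ΔT, the predicted gap until the next access, from past access times."""
    t = access_times
    first = t[-1] - t[-2]                      # first-order term: the "speed"
    if order == 1 or len(t) < 3:
        return max(first, 1e-9)
    second = first - (t[-2] - t[-3])           # second-order term: the "acceleration"
    return max(first + 0.5 * second, 1e-9)

def gdsf_tsp_key(freq, cost, size, access_times):
    """K_i = (F_i * C_i) / (S_i * ΔT): a small predicted gap raises an object's priority."""
    return (freq * cost) / (size * delta_t(access_times))

if __name__ == "__main__":
    # An object whose access gaps are shrinking gets a higher key than one slowing down.
    print(gdsf_tsp_key(freq=5, cost=1.0, size=4, access_times=[0, 10, 16]))  # gaps shrinking
    print(gdsf_tsp_key(freq=5, cost=1.0, size=4, access_times=[0, 4, 12]))   # gaps growing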

 

IV. EVALUATION

 

A. Inverted List Access-Mode Predictions for TSP

The effectiveness of our scheme mainly depends on the accuracy of predicting cache hits or misses. Our motivation for using access-mode predictions in low-power cache design with an inverted list comes from the observation that neither way-prediction nor phased caches are energy efficient for both cache misses and hits. I use a simplified timing and energy model to quantify this observation. Based on the CL2AP model, each L2 cache access in which Core_i gets data from Bank_j takes time T_single(i,j).

 

For a given application or task, let R_i be the average L2 cache hit rate of Core_i. Let C_i be the average bank contention rate between Core_i and the other processor cores sharing the same L2 cache. Let NA_{i,j} be the total number of L2 cache accesses to Block_j by Core_i, and let NA_i be the total number of L2 cache accesses by Core_i. Let NB be the total number of L2 cache blocks accessed during the application's runtime, so that NA_i = Σ_{j=1..NB} NA_{i,j}.

For a given application or task running on Core_i, the total L2 cache access time T_total(i) will be:

 

 

 

To obtain the optimal solution of the problem

Min: T_total(i)

there are two methods to reduce the value of T_total(i). The first is to make Core_i perform its accesses to Block_j in the nearest banks. The second is to increase the value of R_i as much as possible. Let T_wire(i) be the least time for Core_i to access any bank; then we can obtain the lower bound of the problem:

LB_i = NA_i * (T_location + T_bank_to_network + T_wire(i))

For all the cores, the optimization problem becomes

Min: max(T_total(i)), 1 ≤ i ≤ C

and the lower bound will be

LB = max(LB_i), 1 ≤ i ≤ C

in which C is the total number of cores.
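The lower bound above is simple to compute once the per-core access counts and timing terms are known; the short Python sketch below evaluates LB_i for each core and takes the maximum. The numeric values are arbitrary examples, not measured parameters.

# Illustrative sketch: LB_i = NA_i * (T_location + T_bank_to_network + T_wire(i)),
# LB = max(LB_i). The example numbers are arbitrary, not measured parameters.
def lower_bound(accesses_per_core, t_location, t_bank_to_network, t_wire_per_core):
    lbs = [na * (t_location + t_bank_to_network + t_wire)
           for na, t_wire in zip(accesses_per_core, t_wire_per_core)]
    return max(lbs), lbs

if __name__ == "__main__":
    lb, per_core = lower_bound(accesses_per_core=[1000, 1500],
                               t_location=2, t_bank_to_network=4,
                               t_wire_per_core=[3, 5])
    print(per_core, lb)   # [9000, 16500] 16500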

 

B. LGR Algorithm

The algorithm grades each in-cache document based on its past history, namely its recency, frequency, perfect-history, and size. When the set is full, the document with the least grade is replaced, but its grade is stored in a perfect-history grade depository (PHGD) for future reference. From the survey, I conclude that the relatively most important factor is recency, followed by frequency, perfect-history, and size. Only these four factors are considered for grading because they are the most important factors for real network traffic. An n-way set-associative cache is used here. The least-grade page replacement algorithm is given below in pseudocode, followed by an illustrative runnable sketch.

 

ALGORITHM: ReplaceLeastGradePage (
    /* In-cache document records, based on each document's past history */
    ICDR: In-Cache Document's Records;
    /* Grade of a document previously discarded from the cache */
    PDDG: Previously Discarded Document's Grade;
    /* The de facto hit ratio begins at 0 */
    F: Frequency set of ICDR;
    R: Recency set of ICDR;
    L: Length of document;
    BG: Bonus Grade of document )
{
    IF (document k in cache) THEN
        FOREACH doc in ICDR DO
            WHILE (doc.F in F
                   & doc.R in R
                   & Size(doc) in L)
                doc ← newValue;
                PDDG ← weight * λ, λ ∈ (0, 1);
    ELSE
        FetchDocFromOrigin();
        DiscardInCachedDoc();
}

/* Fetch document k from its original site */
PROCEDURE: FetchDocFromOrigin()
{
    IF (L is NOT NULL) THEN
        INSERT k into Cache;
        UPDATE each ICDR;
        k.BG = 0;
    ELSE
        FOREACH g in grade DO
            g = ω1/R + ω2 * F + ω3 * C + ω4 * BG;
}

/* Discard the in-cache document with the least grade */
PROCEDURE: DiscardInCachedDoc()
{
    PDDG ← PHGD.grade;
    INSERT k into Cache;
    IF (PDDG of k in PHGD) THEN
        k.BG ← PDDG;
        DELETE its PDDG in PHGD;
    ELSE
        k.BG = 0;
    UPDATE each ICDR and PDDG;
}
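The pseudocode above can be turned into a small runnable sketch. The Python version below implements the least-grade replacement idea with a perfect-history grade depository; the concrete weights, the use of 1/size as a stand-in for the cost term C, and the bonus-grade restoration rule are illustrative assumptions rather than tuned values from the paper.

# Illustrative Python sketch of least-grade replacement (LGR) with a
# perfect-history grade depository (PHGD). Weights and the grade formula
# follow the pseudocode above; the concrete values are assumptions.
import time

class Doc:
    def __init__(self, key, size):
        self.key = key
        self.size = size          # L: document length
        self.freq = 0             # F: access frequency
        self.last_access = 0.0    # used to derive recency R
        self.bonus = 0.0          # BG: bonus grade restored from the PHGD

class LGRCache:
    def __init__(self, capacity, w=(0.5, 0.3, 0.1, 0.1)):
        self.capacity = capacity
        self.docs = {}            # in-cache document records (ICDR)
        self.phgd = {}            # perfect-history grade depository
        self.w = w                # w1..w4: assumed weights for the grade formula

    def _grade(self, doc, now):
        recency = max(now - doc.last_access, 1e-9)   # R: time since the last access
        w1, w2, w3, w4 = self.w
        # grade = w1/R + w2*F + w3*C + w4*BG, with C approximated by 1/size here
        return w1 / recency + w2 * doc.freq + w3 / doc.size + w4 * doc.bonus

    def access(self, key, size=1):
        now = time.monotonic()
        if key in self.docs:                         # hit: update the document's records
            doc = self.docs[key]
            doc.freq += 1
            doc.last_access = now
            return True
        if len(self.docs) >= self.capacity:          # miss on a full set: evict least grade
            victim = min(self.docs, key=lambda k: self._grade(self.docs[k], now))
            self.phgd[victim] = self._grade(self.docs[victim], now)
            del self.docs[victim]
        doc = Doc(key, size)                         # fetch the document from its origin
        doc.freq = 1
        doc.last_access = now
        doc.bonus = self.phgd.pop(key, 0.0)          # restore its bonus grade if seen before
        self.docs[key] = doc
        return False

if __name__ == "__main__":
    cache = LGRCache(capacity=2)
    print([cache.access(k) for k in ["a", "b", "a", "c", "a", "b"]])
    # e.g. [False, False, True, False, True, False]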

 

 

 

 

C. Simulation
Implementation

Step 1: Initialize 2- and 4-way set-associative caches.

Step 2: Determine the cache entry and the output link.

Step 3: Tokenize the whole source IP of a document into 4 integers split by dots; let them be I0, I1, I2, I3, and let the document's length be L.

Step 4: Employ the indexing algorithm (in order to distribute the index as broadly as possible, the full components of the IP address and its length are used) to obtain the cache set index for the packet; a sketch of one such computation is given after this list.

Step 5: To simplify the simulation, let link = I3 + L/100 + 1.

Step 6: Compare the results of LFU and LRU.
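The index computation referenced in Step 4 is not reproduced in this text; the Python sketch below shows one plausible way to hash the four IP octets and the length into a set index, together with the simplified link rule of Step 5. The XOR-based hash is purely an illustrative assumption.

# Illustrative sketch for Steps 3-5: derive a cache set index and an output link
# from a document's source IP. The XOR hash is an assumption; the paper's exact
# indexing algorithm is not reproduced here.
def set_index_and_link(ip: str, length: int, num_sets: int = 4):
    i0, i1, i2, i3 = (int(octet) for octet in ip.split("."))   # Step 3: tokenize the IP
    # Step 4: mix all four octets and the length so indices spread across the sets
    index = (i0 ^ i1 ^ i2 ^ i3 ^ length) % num_sets
    # Step 5 (simplified, as in the paper): link = I3 + L/100 + 1
    link = i3 + length // 100 + 1
    return index, link

if __name__ == "__main__":
    print(set_index_and_link("192.168.10.37", 250))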

 

V. SIMULATION
EXPERIMENT AND RESULTS

 

A. Experiment
setup

In order to evaluate the proposed PBM policy, I have set up a NUCA-based L2 cache model, as shown in figure 2; the system parameters listed in Table 1 have been verified to be optimal in [14].

Figure 2. The
simulation model for experiment

 

To simulate real applications' L2 cache behavior, I ran the SPEC2000 benchmark on an SMP computer (its parameters are shown in Table 2) and captured the memory trace with the HMTT toolkit [15].

The HMTT toolkit is designed especially for memory trace capture. It is composed of a Memory Trace Board (MTB) and a Packet Capture and Analysis Toolkit (PCAT). The MTB is a hardware monitoring board. While SPEC2000 is running on the SMP machine, the MTB is plugged into an idle DIMM slot, where it can snoop on the memory command signals sent from the memory controller to the DDR SDRAM. I shut down the L2 cache of the processors, so the requested memory addresses in the command signals leak directly from the L1 cache. The MTB then forwards the commands to a Recorder machine, which extracts the memory addresses from the command signals and organizes them into trace items, each of which is a tuple (see figure 3). After SPEC2000 finishes running on the SMP machine, the Recorder machine holds the whole memory trace.

 

Figure 3.
The format of memory trace items

 

 

TABLE 1. SYSTEM
PARAMETERS

 

TABLE 2. PARAMETERS OF COMPUTER USED FOR CAPTURING MEMORY TRACE

 

B.  Experimental Results

In this part, the results of the simulation experiments show the effect of the PBM policy.

 

1) Sequential Prediction based Cache Migration

Our PBM policy is based on the sequential prediction, which claims
that the block after the presently accessed block will be referenced right
away. I have recorded the memory stride of the sequential memory trace. The result
is show in figure 4. As it can be seen from figure 4 that for both integer and
floating point applications the memory stride of one block accounts for the
largest proportion. So sequential prediction is accurate for most applications
to foretell which block will be referenced before long. However, we can also
see that for some applications (e.g. bzip2, vpr and apsi), the one-block stride
only accounts for over 30% (but still the largest proportion). For such
applications, the memory stride’s distribution is quite out of order. So how to
get more accurate prediction will be our next research.

 

2) Average Transfer distance

The target of our PBM policy is to reduce the transfer distance
between the source bank to the target processor core, which will reduce the T wire and consequently reduce the
average L2 access latency. Figure 4 shows the comparison of average transfer
distance between TMP and PBM policy. We can see that for each application the
average transfer distance is reduced in a certain proportion. In the best
situation, the average transfer distance is reduced by 16.9%. Figure 5 shows
the improvement on average access latency owing to the reduction of average
transfer distance. We can see the reduction of L2 access latency is achieved in
all the applications. And the improvement can reach up to 8.4%. Figure 5 shows
the difference of hit ratio and byte hit ratio for the EPA data between first
and second order TSP inverted list predictions. As can be seen, the difference
is rather large

 

Figure 4. Spatial locality frequency feature of SPEC
2000

Figure 5. Comparison of the average
transfer distance between TMP and PBM policy.

Figure 6. Comparison of the L2 cache's average inverted-list access frequency (cycles)

Figure 7. CSSFU hit ratio and byte hit ratio comparison

VI. RELATED WORK

The increasing wire delay makes the physical position of data in the cache very important to the access latency: if a core in a CMP wants to use data located far from it, the access latency will be significantly higher than for data near the core. The NUCA architecture [16] was first presented to hide the wire delay. In NUCA, the cache is divided into several banks; if a core wants to use data in a remote bank, the data is provided to it and at the same time moved to a bank near the core for the next access. As a development, Huh et al. [14] design a CMP cache that supports a spectrum of sharing degrees, denoting the number of processors sharing a pool of their local L2 banks. The average access latency can be decreased by partitioning the aggregate on-chip cache into disjoint pools to fit the running application's capacity requirements and sharing patterns.

VII. CONCLUSION & FUTURE SCOPE

In this paper I conclude that keeping in the cache the most frequently reused data items, and the groups of data items that should be kept in sequence, gives better optimization. To do this I use the new trend policy called the Taylor Series Prediction (TSP) policy, which provides more accurate prediction of future access trends when the access patterns vary greatly. Because the most required data items are then available in the cache itself, it gives a better hit ratio, and the proposed work improves efficiency and reliability. Based on this model, a prediction-based migration policy is proposed and an active data migration algorithm is designed. The proposed policy is based on the principle of locality and employs sequential prediction to indicate the data that will be accessed in the near future.

In the future, it is possible to take other statistical features into account, such as the data transmission rates observed in applications on the Internet, when predicting the data access frequency.