06 October 2008

unique_vector

Occasionally it's fun to share a programming experience from work. Last week I found myself having to make changes to a tool that builds a particular client data file for EverQuest II. I discovered that this tool would build unique arrays by searching linearly for the content and adding it if it didn't exist. Some of these arrays would get very large (on the order of 30,000 items or more) and the number of checks even larger (200,000+). There are much faster ways to search for a unique item, but we had the constraint of needing to assign a permanent index for the item (i.e. its position in an array).

I wrote out a quick nifty STL-like helper class to accomplish the same behavior, but much faster. The class definition looks like this (note that I used unsigned instead of defining my own size_type for brevity):
template < typename T >
class unique_vector
{
typedef std::vector< const T* > Tvector;
typedef std::map< T, unsigned > Tmap;

Tmap m_map;
mutable Tvector m_vector;
mutable bool m_dirty;

void m_build() const;

public:
unique_vector();

unsigned size() const;
unsigned add( const T& );
const T& operator [] ( unsigned ) const;
};


This stitches together a map (to achieve uniqueness and provide faster lookup time) and a vector to provide index-based lookups. This is mainly achieved through two functions. Let's take a look at the add() function:
template < typename T >
inline unsigned unique_vector<T>::add( const T& t )
{
unsigned index = m_map.size();
std::pair< Tmap::iterator, bool > result =
m_map.insert( std::make_pair( t, index ) );
m_dirty = true;
return (*result.first).second;
}


This function relies on the fact that std::map::insert() fails if the key (t in this case) is already present in the map, but returns an iterator to the existing item. This function basically tries to add the item to the map with the next index (using m_map.size() as the next index so the first item will have index 0, next will have index 1 and so on). If it's not there, insert() will succeed and return an iterator to the new record (using the provided index). If it's already there, insert() fails but returns the existing item. The function returns the index associated with the record, whether it's the new index from a freshly inserted item or the old index from a pre-existing item.

Notice that the m_dirty flag is set to true. This indicates that the vector needs to be rebuilt. The operator[]() calls m_build() if m_dirty is set to true:
template < typename T >
inline void unique_vector<T>::m_build() const
{
m_vector.resize( m_map.size() );
for ( Tmap::const_iterator iter( m_map.begin() );
iter != m_map.end();
++iter )
{
m_vector[ (*iter).second ] = &(*iter).first;
}
m_dirty = false;
}

This function quickly builds the vector using the unique indexes stored in the map as array indexes and pointers to the objects used as map keys. The function is const (and data members mutable) to allow it to be called from the const operator[]() function.

So, how much of an improvement was it? The tool uses complex structures as T for this template, but I ran a simple test using 10,000 pre-allocated std::string objects and 100,000 attempted adds. On my AMD 64 4000+ with 2GB RAM, the linear version took 5,077.160 ms (5 seconds). The unique_vector ran the same test in 49.972 ms. Algorithm changes almost always trump when optimizing.

P.S. In case you were wondering, here are the remaining very simple functions:
template < typename T >
inline unique_vector<T>::unique_vector() :
m_dirty( false )
{}

template < typename T >
inline unsigned unique_vector<T>::size() const
{
return m_map.size();
}

template < typename T >
inline const T& unique_vector<T>::operator [] ( unsigned u ) const
{
if ( m_dirty ) m_build();
return *m_vector[ u ];
}