shijiang1130 posted on 2014-11-02 11:35

a-short-visit-to-common-data-structures

For small amounts of data, there are basically two data structures that can be used. The first one is called a proplist. A proplist is any list of tuples of the form [{Key,Value}]. They're a weird kind of structure because there is no other rule than that. In fact, the rules are so relaxed that the list can also contain booleans, integers and whatever you want. We're rather interested in the idea of a tuple with a key and a value in a list here, though. To work with proplists, you can use the proplists module. It contains functions such as proplists:delete/2, proplists:get_value/2, proplists:get_all_values/2, proplists:lookup/2 and proplists:lookup_all/2.
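As a quick illustration (the Menu proplist here is my own made-up example, not from the original text), those lookup functions behave like so:

```erlang
%% A proplist is just a list of {Key, Value} tuples; nothing enforces
%% key uniqueness, so duplicate keys are perfectly legal.
Menu = [{drink, coffee}, {price, 3}, {drink, tea}],
%% get_value/2 returns the first match, or 'undefined' when absent.
coffee = proplists:get_value(drink, Menu),
undefined = proplists:get_value(dessert, Menu),
%% get_all_values/2 collects every value stored under the key.
[coffee, tea] = proplists:get_all_values(drink, Menu),
%% lookup/2 returns the whole tuple ('none' when absent).
{price, 3} = proplists:lookup(price, Menu),
%% delete/2 removes all entries associated with the key.
[{price, 3}] = proplists:delete(drink, Menu).
```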

You'll notice there is no function to add or update an element of the list. This shows how loosely defined proplists are as a data structure. To get these functionalities, you must cons your element manually ([NewElement|OldList]) and use functions such as lists:keyreplace/4. Using two modules for one small data structure is not the cleanest thing, but because proplists are so loosely defined, they're often used to deal with configuration lists and the general description of a given item. Proplists are not exactly complete data structures. They're more of a common pattern that appears when using lists and tuples to represent some object or item; the proplists module is a bit of a toolbox over such a pattern.
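A minimal sketch of that manual cons plus lists:keyreplace/4, using a hypothetical Opts list of my own:

```erlang
Opts = [{port, 8080}, {debug, true}],
%% "Adding" an element is just consing a new tuple at the head.
Opts2 = [{timeout, 5000} | Opts],
5000 = proplists:get_value(timeout, Opts2),
%% "Updating" goes through lists:keyreplace/4: replace the first
%% tuple whose 1st element matches the key.
[{port, 9090}, {debug, true}] = lists:keyreplace(port, 1, Opts, {port, 9090}).
```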

If you do want a more complete key-value store for small amounts of data, the orddict module is what you need. Orddicts (ordered dictionaries) are proplists with a taste for formality. Each key can be there only once, the whole list is sorted for faster average lookup, etc. Common functions for CRUD usage include orddict:store/3, orddict:find/2 (when you do not know whether the key is in the dictionary), orddict:fetch/2 (when you know it is there or that it must be there) and orddict:erase/2.
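For instance, a small orddict session might look like this (the keys are invented for illustration; note that an orddict is represented as a plain sorted list):

```erlang
D0 = orddict:new(),
%% store/3 inserts or replaces; the list stays sorted by key.
D1 = orddict:store(b, 2, orddict:store(a, 1, D0)),
[{a, 1}, {b, 2}] = D1,
%% find/2 is safe: {ok, Value} or 'error'.
{ok, 1} = orddict:find(a, D1),
error = orddict:find(z, D1),
%% fetch/2 assumes the key is there (crashes otherwise).
2 = orddict:fetch(b, D1),
%% erase/2 removes a key.
[{b, 2}] = orddict:erase(a, D1).
```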

[Image: a dictionary with the definition of 'Awesome' being 'it's you!']
Orddicts are a generally good compromise between complexity and efficiency up to about 75 elements (see my benchmark). After that amount, you should switch to different key-value stores.

There are basically two key-value structures/modules to deal with larger amounts of data: dicts and gb_trees. Dictionaries have the same interface as orddicts: dict:store/3, dict:find/2, dict:fetch/2, dict:erase/2 and every other function, such as dict:map/2 and dict:fold/2 (pretty useful to work on the whole data structure!) Dicts are thus very good choices to scale orddicts up whenever it is needed.
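A short sketch of the same CRUD calls on a dict, plus a fold over the whole structure (the fruit keys are my own example; dicts are opaque, so you inspect them through the module's functions rather than pattern matching):

```erlang
D1 = dict:store(apples, 4, dict:new()),
D2 = dict:store(pears, 2, D1),
%% Same find/fetch contract as orddicts.
{ok, 4} = dict:find(apples, D2),
error = dict:find(plums, D2),
4 = dict:fetch(apples, D2),
%% fold/3 visits every {Key, Value} pair; traversal order is undefined.
6 = dict:fold(fun(_Key, Value, Acc) -> Value + Acc end, 0, D2).
```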

General Balanced Trees, on the other hand, have a bunch more functions leaving you more direct control over how the structure is to be used. There are basically two modes for gb_trees: the mode where you know your structure in and out (I call this the 'smart mode'), and the mode where you can't assume much about it (I call this one the 'naive mode'). In naive mode, the functions are gb_trees:enter/3, gb_trees:lookup/2 and gb_trees:delete_any/2. The related smart functions are gb_trees:insert/3, gb_trees:get/2, gb_trees:update/3 and gb_trees:delete/2. There is also gb_trees:map/2, which is always a nice thing when you need it.
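The difference between the two modes shows up in a sketch like this one (my own example keys and values):

```erlang
T0 = gb_trees:empty(),
%% 'Naive' mode: enter/3 works whether or not the key exists,
%% lookup/2 returns {value, V} or 'none'.
T1 = gb_trees:enter(a, 1, T0),
T2 = gb_trees:enter(a, 2, T1),
{value, 2} = gb_trees:lookup(a, T2),
none = gb_trees:lookup(z, T2),
%% 'Smart' mode: insert/3 assumes the key is absent (crashes if not),
%% update/3 and get/2 assume it is present.
T3 = gb_trees:insert(b, 10, T2),
T4 = gb_trees:update(b, 11, T3),
11 = gb_trees:get(b, T4),
%% delete_any/2 tolerates a missing key; delete/2 does not.
T5 = gb_trees:delete_any(nope, T4),
2 = gb_trees:get(a, T5).
```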

The disadvantage of 'naive' functions over 'smart' ones is that because gb_trees are balanced trees, whenever you insert a new element (or delete a bunch), the tree might need to rebalance itself. This can take time and memory (even in useless checks just to make sure). The 'smart' functions all assume that the key is present in the tree: this lets you skip all the safety checks and results in faster times.

When should you use gb_trees over dicts? Well, it's not a clear decision. As the benchmark module I have written will show, gb_trees and dicts have somewhat similar performances in many respects. However, the benchmark demonstrates that dicts have the best read speeds while the gb_trees tend to be a little quicker on other operations. You can judge based on your own needs which one would be the best.

Oh and also note that while dicts have a fold function, gb_trees don't: they instead have an iterator function, which returns a bit of the tree on which you can call gb_trees:next(Iterator) to get the following values in order. What this means is that you need to write your own recursive functions on top of gb_trees rather than use a generic fold. On the other hand, gb_trees let you have quick access to the smallest and largest elements of the structure with gb_trees:smallest/1 and gb_trees:largest/1.
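A small sketch of the iterator-based traversal (using gb_trees:from_orddict/1, which builds a tree from an already-sorted key-value list, to set up the example):

```erlang
T = gb_trees:from_orddict([{1, a}, {2, b}, {3, c}]),
%% Direct access to the extremes of the key range.
{1, a} = gb_trees:smallest(T),
{3, c} = gb_trees:largest(T),
%% Walk the tree in key order: next/1 yields {Key, Value, NewIterator}
%% until the iterator is exhausted, at which point it yields 'none'.
I0 = gb_trees:iterator(T),
{1, a, I1} = gb_trees:next(I0),
{2, b, I2} = gb_trees:next(I1),
{3, c, I3} = gb_trees:next(I2),
none = gb_trees:next(I3).
```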

I would therefore say that your application's needs are what should govern which key-value store to choose. Different factors such as how much data you've got to store, what you need to do with it and whatnot all have their importance. Measure, profile and benchmark to make sure.

Note: some special key-value stores exist to deal with resources of different size. Such stores are ETS tables, DETS tables and the mnesia database. However, their use is strongly related to the concepts of multiple processes and distribution. Because of this, they'll only be approached later on. I'm leaving this as a reference to pique your curiosity and for those interested.

Update:
Starting with version 17.0, the language supports a new native key-value data type, described in Postscript: Maps. They should be the new de-facto replacement for dicts.
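A minimal taste of the map syntax, assuming Erlang/OTP 17 or later (the keys here are invented for the example):

```erlang
%% #{K => V} creates or replaces; inside an update expression,
%% K := V requires the key to already exist (and crashes otherwise).
M0 = #{name => "Erlang", version => 17},
"Erlang" = maps:get(name, M0),
M1 = M0#{version := 18},
18 = maps:get(version, M1),
%% maps:find/2 follows the same safe contract as dict:find/2.
{ok, 18} = maps:find(version, M1),
error = maps:find(author, M1).
```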

http://learnyousomeerlang.com/a-short-visit-to-common-data-structures

shijiang1130 posted on 2014-11-02 11:45

Usage:

-module(keyval_benchmark).
-compile(export_all).

%% Runs all benchmarks with Reps number of elements.
bench(Reps) ->
    io:format("Base Case:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(base_case(Reps)),
    io:format("~nNaive Orddict:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(naive_orddict(Reps)),
    io:format("~nSmart Orddict:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(smart_orddict(Reps)),
    io:format("~nNaive Dict:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(naive_dict(Reps)),
    io:format("~nSmart Dict:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(smart_dict(Reps)),
    io:format("~nNaive gb_trees:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(naive_gb_trees(Reps)),
    io:format("~nSmart gb_trees:~n"),
    io:format("Operation\tTotal (µs)\tAverage (µs)~n"),
    print(smart_gb_trees(Reps)).

%% formats the benchmark results cleanly.
print([]) -> ok;
print([{Op, Total, Avg} | Rest]) ->
    io:format("~8s\t~10B\t~.6f~n", [Op, Total, Avg]),
    print(Rest).

%% Example of a base benchmark function. This one actually does
%% nothing and can be used as a control for all the benchmark - it
%% will measure how much time it takes just to loop over data and
%% apply functions.
base_case(Reps) ->
    benchmark(Reps,               % N repetitions
            [],                   % Empty data structure
            fun ?MODULE:null/3,   % Create
            fun ?MODULE:null/2,   % Read
            fun ?MODULE:null/3,   % Update
            fun ?MODULE:null/2).  % Delete

%% Ordered dictionaries. Assumes that the value is present on reads.
smart_orddict(Reps) ->
    benchmark(Reps,
            orddict:new(),
            fun orddict:store/3,
            fun orddict:fetch/2,
            fun orddict:store/3,
            fun orddict:erase/2).

%% Ordered dictionaries. Doesn't know whether a key is there or not on
%% reads.
naive_orddict(Reps) ->
    benchmark(Reps,
            orddict:new(),
            fun orddict:store/3,
            fun orddict:find/2,
            fun orddict:store/3,
            fun orddict:erase/2).

%% Dictionary benchmark. Assumes that the value is present on reads.
smart_dict(Reps) ->
    benchmark(Reps,
            dict:new(),
            fun dict:store/3,
            fun dict:fetch/2,
            fun dict:store/3,
            fun dict:erase/2).

%% Dictionary benchmark. Doesn't know if the value exists at read time.
naive_dict(Reps) ->
    benchmark(Reps,
            dict:new(),
            fun dict:store/3,
            fun dict:find/2,
            fun dict:store/3,
            fun dict:erase/2).

%% gb_trees benchmark. Uses the most general functions -- i.e.: it never
%% assumes that the value is not in a tree when inserting and never assumes
%% it is there when updating or deleting.
naive_gb_trees(Reps) ->
    benchmark(Reps,
            gb_trees:empty(),
            fun gb_trees:enter/3,
            fun gb_trees:lookup/2,
            fun gb_trees:enter/3,
            fun gb_trees:delete_any/2).

%% gb_trees benchmark. Uses specific function: it assumes that the values
%% are not there when inserting and assumes it exists when updating or
%% deleting.
smart_gb_trees(Reps) ->
    benchmark(Reps,
            gb_trees:empty(),
            fun gb_trees:insert/3,
            fun gb_trees:get/2,
            fun gb_trees:update/3,
            fun gb_trees:delete/2).

%% Empty functions used for the 'base_case/1' benchmark. They must do
%% nothing interesting.
null(_, _) -> ok.
null(_, _, _) -> ok.

%% Runs all the functions of 4 formats: Create, Read, Update, Delete.
%% Create: it's a regular insertion, but it goes from an empty structure
%%         to a filled one. Requires an empty data structure,
%% Read: reads every key from a complete data structure.
%% Update: usually, this is the same as the insertion from 'create',
%%         except that it's run on full data structures. In some cases,
%%         like gb_trees, there also exist operations for updates when
%%         the keys are known that act differently from regular inserts.
%% Delete: removes a key from a tree. Because we want to test the
%%         efficiency of it all, we will always delete from a complete
%%         structure.
%% The function returns a list of all times averaged over the number
%% of repetitions (Reps) needed.
benchmark(Reps, Empty, CreateFun, ReadFun, UpdateFun, DeleteFun) ->
    Keys = make_keys(Reps),
    {TimeC, Struct} = timer:tc(?MODULE, create, [Keys, CreateFun, Empty]),
    {TimeR, _} = timer:tc(?MODULE, read, [Keys, Struct, ReadFun]),
    {TimeU, _} = timer:tc(?MODULE, update, [Keys, Struct, UpdateFun]),
    {TimeD, _} = timer:tc(?MODULE, delete, [Keys, Struct, DeleteFun]),
    [{create, TimeC, TimeC/Reps},
     {read, TimeR, TimeR/Reps},
     {update, TimeU, TimeU/Reps},
     {delete, TimeD, TimeD/Reps}].

%% Generate unique random numbers. No repetition allowed
make_keys(N) ->
    %% The trick is to generate all numbers as usual, but match them
    %% with a random value in a tuple of the form {Random, Number}.
    %% The idea is to then sort the list generated that way; done in
    %% this manner, we know all values will be unique and the randomness
    %% will be done by the sorting.
    Random = lists:sort([{random:uniform(N), X} || X <- lists:seq(1, N)]),
    %% it's a good idea to then filter out the index (the random index)
    %% to only return the real numbers we want. This is simple to do
    %% with a list comprehension where '_' removes the extraneous data.
    %% The keys are then fit into a tuple to make the test a bit more
    %% realistic for comparison.
    [{some, key, X} || {_, X} <- Random].

%% Loop function to apply the construction of a data structure.
%% The parameters passed are a list of all keys to use and then the
%% higher order function responsible of the creation of a data
%% structure. This is usually a function of the form
%% F(Key, Value, Structure).
%% In the first call, the structure has to be the empty data structure
%% that will progressively be filled.
%% The only value inserted for each key is 'some_data'; we only care
%% about the keys when dealing with key/value stuff.
create([], _, Acc) -> Acc;
create([Key | Rest], Fun, Acc) ->
    create(Rest, Fun, Fun(Key, some_data, Acc)).

%% Loop function to apply successive readings to a data structure.
%% The parameters passed are a list of all keys, the complete data
%% structure and then a higher order function responsible for
%% fetching the data. Such functions are usually of the form
%% F(Key, Structure).
read([], _, _) -> ok;
read([Key | Rest], Struct, Fun) ->
    Fun(Key, Struct),
    read(Rest, Struct, Fun).

%% Loop function to apply updates to a data structure.
%% Takes a list of keys, a full data structure and a higher order
%% function responsible for the updating, usually of the form
%% F(Key, NewValue, Structure).
%% All values for a given key are replaced by 'newval', again because
%% we don't care about the values, but merely the operations with
%% the keys.
update([], _, _) -> ok;
update([Key | Rest], Struct, Fun) ->
    Fun(Key, newval, Struct),
    update(Rest, Struct, Fun).

%% Loop function to apply deletions to a data structure.
%% Each deletion is made on a full data structure.
%% Takes a list of keys, a data structure and a higher order function
%% to do the deletion. Usually of the form
%% F(Key, Structure).
delete([], _, _) -> ok;
delete([Key | Rest], Struct, Fun) ->
    Fun(Key, Struct),
    delete(Rest, Struct, Fun).

shijiang1130 posted on 2014-11-02 11:45

Source: http://learnyousomeerlang.com/static/erlang/keyval_benchmark.erl

shijiang1130 posted on 2014-11-02 11:50

2> keyval_benchmark:bench(3).
Base Case:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Naive Orddict:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Smart Orddict:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Naive Dict:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Smart Dict:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Naive gb_trees:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Smart gb_trees:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000
ok
3> keyval_benchmark:bench(1000).
Base Case:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Naive Orddict:
Operation       Total (μs)      Average (μs)
create             40000      40.000000
    read             30000      30.000000
update             40000      40.000000
delete             40000      40.000000

Smart Orddict:
Operation       Total (μs)      Average (μs)
create             20000      20.000000
    read             20000      20.000000
update             40000      40.000000
delete             50000      50.000000

Naive Dict:
Operation       Total (μs)      Average (μs)
create             10000      10.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Smart Dict:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Naive gb_trees:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update             10000      10.000000
delete               0      0.000000

Smart gb_trees:
Operation       Total (μs)      Average (μs)
create               0      0.000000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000
ok
4> keyval_benchmark:bench(100000).
Base Case:
Operation       Total (μs)      Average (μs)
create             10000      0.100000
    read               0      0.000000
update               0      0.000000
delete               0      0.000000

Naive Orddict:
Operation       Total (μs)      Average (μs)