Erlang peculiarities


While working on my WorkerNet post, I stumbled across a weird behaviour with start_links, trap_exit and slave nodes.

Long Story (sorry, there is no short one)

As I was setting up a distributed test with slave nodes, I also wanted one gen_server to trap exits for the sake of its offspring, which I did not wish to put under a supervisor (shame on me ;). Suddenly, all of the tests stopped working! All of them were either timing out or failing right away with noproc. Bewildered and wide-eyed at 23:40, I gave it a go with the dbg tracer and even went through some of the gen_server source.

No answer.

I chalked it up to the rpc calls to the remote nodes and tried printing out the process ids at each step. But no, it was a fact: my gen_servers died the instant they were created… Brooding over it, I tried some more but finally went to sleep. By then, I knew that the problem was caused by the following two snippets in combination with rpc calls to my local slave nodes:

start_link() ->
    gen_server:start_link({local,?MODULE},?MODULE,[],[]).    

init([]) ->
    process_flag(trap_exit,true),
    {ok, ok}.

The version without trap_exit worked like a charm. Not wanting to waste more time on it, I just covered it up, like a cheap rug over a very dark, very deep and embarrassing hole in the floor, with

start_link(succeed) ->
    {ok,Pid} = gen_server:start({local, ?MODULE}, ?MODULE, [], []),
    link(Pid),
    {ok,Pid}.

init([]) ->
    process_flag(trap_exit,true),
    {ok, ok}.

But I couldn’t leave it at that. I had to seek help, so I showed it to my senior colleague Nicolas. By then I had devised a test which would reproduce this neatly. He cut it down a bit, and I boiled it down to the broth you see here, which you can compile and run for yourself.

Just for the record: the seemingly expected behaviour would be to see the exit signals arrive as messages in handle_info/2, not to see them make the process die.

%%%-------------------------------------------------------------------
%%% @author Gianfranco <zenon@zen.local>
%%% @copyright (C) 2011, Gianfranco
%%% Created : 17 Jan 2011 by Gianfranco <zenon@zen.local>
%%%-------------------------------------------------------------------
-module(test).

%% API
-export([start_link/1]).
-export([test/1,init/1,handle_info/2,terminate/2]).

-spec(test(fail|succeed) -> term()).
test(Mode) ->
    io:format("Current 0 ~p~n",[self()]),
    spawn(fun() -> io:format("Current 1 ~p~n",[self()]),
                  {ok, _P} = ?MODULE:start_link(Mode)
          end).

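%% fail: start linked, so the spawner becomes the gen_server's parent.
start_link(fail) ->
    gen_server:start_link({local,?MODULE},?MODULE,[],[]);
%% succeed: start unlinked, then link by hand from the spawner.
start_link(succeed) ->
    {ok,Pid} = gen_server:start({local, ?MODULE}, ?MODULE, [], []),
    link(Pid),
    {ok,Pid}.

init([]) ->
    process_flag(trap_exit,true),
    {ok, ok}.

%% Stop when the 5 second timeout set below fires.
handle_info(timeout,State) -> {stop,normal,State};
%% Print anything else (such as trapped exits) and arm the timeout.
handle_info(Info, State) ->
    io:format("info ~p~n",[Info]),
    {noreply, State,5000}.

terminate(Reason, _State) ->
    io:format("reason ~p~n",[Reason]),
    ok.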

Compiling and running it, we see both the expected and the unexpected. I chose to call the two modes fail and succeed, based on whether the process dies right away (fail) or succeeds in trapping the exit (succeed).

zen:Downloads zenon$ erlc test.erl
zen:Downloads zenon$ erl
Erlang R14B (erts-5.8.1) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.1  (abort with ^G)
1> test:test(fail).
Current 0 <0.31.0>
Current 1 <0.33.0>
<0.33.0>
reason normal
2> test:test(succeed).
Current 0 <0.31.0>
Current 1 <0.36.0>
<0.36.0>
info {'EXIT',<0.36.0>,normal}
                             (5 seconds later)
reason normal
3>

As you can see, in the succeed case the process did not die after initialization: it trapped the spawner’s exit. One possible explanation is the one stated in the module gen_server.erl (read the source, Luke!):

%%% ---------------------------------------------------
%%%
%%% The idea behind THIS server is that the user module
%%% provides (different) functions to handle different
%%% kind of inputs.
%%% If the Parent process terminates the Module:terminate/2
%%% function is called.
%%%

After some more digging, Nicolas came up with the idea of calling sys:get_status/1 on the processes. What was revealed can be seen below: the parent of the process started with gen_server:start/4 is itself!

sys:get_status(<0.37.0>) = {status,<0.37.0>,
                               {module,gen_server},
                               [[{'$ancestors',[<0.36.0>]},
                                 {'$initial_call',{test,init,1}}],
                                running,<0.37.0>,[],
                                [{header,"Status for generic server test"},
                                 {data,
                                     [{"Status",running},
                                      {"Parent",<0.37.0>},
                                      {"Logged events",[]}]},
                                 {data,[{"State",ok}]}]]}
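
To pick the parent out of the status term without reading it by eye, something like the following could be used. It is only a minimal sketch: the module name parent_probe and the helpers parent_of/1 and demo/0 are hypothetical, and it assumes the {status, Pid, {module, Mod}, [PDict, SysState, Parent, Dbg, Misc]} layout seen in the output above.

```erlang
-module(parent_probe).
-behaviour(gen_server).

-export([parent_of/1, demo/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Return the parent that the behaviour recorded for Pid.
parent_of(Pid) ->
    {status, Pid, {module, _Mod},
     [_PDict, _SysState, Parent, _Dbg, _Misc]} = sys:get_status(Pid),
    Parent.

%% Compare an unlinked and a linked start. Returns
%% {UnlinkedIsOwnParent, LinkedParentIsCaller}.
demo() ->
    {ok, Unlinked} = gen_server:start(?MODULE, [], []),
    {ok, Linked}   = gen_server:start_link(?MODULE, [], []),
    Result = {parent_of(Unlinked) =:= Unlinked,
              parent_of(Linked) =:= self()},
    ok = gen_server:call(Unlinked, stop),
    ok = gen_server:call(Linked, stop),
    Result.

%% Minimal callbacks, just enough to start and stop the server.
init([]) -> {ok, ok}.

handle_call(stop, _From, State) -> {stop, normal, ok, State};
handle_call(_Request, _From, State) -> {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.

handle_info(_Info, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.

code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

If the observation above holds, parent_probe:demo() should return {true,true}: the server started with gen_server:start/3 is its own parent, and the one started with gen_server:start_link/3 has the caller as parent.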

/G


5 Comments

  1. Interesting, but how can this be? I would assume the Erlang VM would prevent that from happening. I understand it looks like a bug, but I’m not sure how that bug could even exist.

  2. @Sylvain: You must realise that in Erlang the language the process space is flat, all processes are equal, and there is no built-in process hierarchy with the concepts of parent or child processes. Once a process has been started it has no implicit knowledge of who started it or which processes it may later start. All such information must be explicitly maintained in the code the process runs. Process links have no relationship with process creation. This was done intentionally when Erlang was designed.

    OTP imposes its process hierarchy on top of Erlang. It uses this to implement its supervision trees, where one of the basic principles is that if a process detects that its parent has died then it should terminate itself in the proper way. What the proper way is depends on which type of process it is.

    So this has really nothing to do with the Erlang VM and there is nothing for it to prevent. There is no bug.

  3. @Robert: I see. To be honest, when I invite people to try Erlang these days, I make sure they start without OTP first. I’ve found the hard way that OTP really complicates a very straightforward and simple language.

    • @Sylvain: Yes, OTP does add a lot of functionality. It gives you one way of building robust and fault-tolerant systems with many support tools, but it is not mandatory that you use it. I agree that you should know what is Erlang and what is OTP to make better use of both parts.

  4. I have to say that both your cases are behaving exactly as they should. OTP behaviours are meant to be run in supervision trees, which impose a hierarchy on processes. One of the basic principles of a supervision tree is that if a process’s parent dies then the process should terminate itself in the appropriate way. In the case of a gen_server, it first calls the Mod:terminate/2 callback to allow specific server clean-up and after that kills itself. A supervisor, however, would first terminate its children before killing itself. Behaviours use links to detect process termination, and gen_server:start_link creates a link and sets up the gen_server to check this link from its parent to detect if the parent dies.

    In your ‘fail’ example you spawn a process which starts the gen_server with a gen_server:start_link. After this the spawner process terminates. So the linked gen_server detects that its parent has died, calls Mod:terminate/2 and then dies. Exactly as it should!

    The ‘succeed’ case is a bit different, and here I am sort of guessing as I have not really looked at the actual code. Hopefully an educated guess. It seems that because you start it with gen_server:start (no link), it not only does not create a link to the parent but also tries to avoid any other process being mistaken for its parent. This it does by putting itself as parent. When it is trapping exits and the spawner explicitly links to it and then dies, the gen_server receives an exit message from the spawner. As this is not recognised as coming from the parent, it doesn’t cause the server to terminate; instead it becomes an unrecognised message which is handled by Mod:handle_info/2. As it should be. Mod:handle_info/2 then gets the timeout and returns that the gen_server should stop, which it then does, calling Mod:terminate/2. The only “strange” bit is how it handles its parent.

    N.B. The process hierarchy is done by OTP on top of Erlang and is not a part of Erlang the language so there is no direct support for it in the Erlang VM.
